bug-coreutils

bug#8871: Bug with "sort -i" ?


From: Eric Blake
Subject: bug#8871: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 14:08:49 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110419 Red Hat/3.1.10-1.el6_0 Mnenhy/0.8.3 Thunderbird/3.1.10

retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
> 
> this looks like a bug for me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

> 
> var="φθινόπωρο,κισσός,Φύλλο"
> 
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change commas into newlines.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]
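
If the comma-separated form has to stay, a single tr call is probably enough.  A minimal sketch, assuming the only goal is one word per line (it skips the lowercasing that the sed \L step did, since sort -f folds case anyway); the name `lines` is just for illustration:

```shell
var="φθινόπωρο,κισσός,Φύλλο"

# Translate every comma straight into a newline; no sed pass and no
# intermediate '_' placeholder are needed.  The comma byte never occurs
# inside a multi-byte UTF-8 sequence, so this is safe in any locale.
lines=$(printf '%s\n' "$var" | tr ',' '\n')

printf '%s\n' "$lines"
```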

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).  First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________


Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable characters)?
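
The same byte-versus-character difference can be seen with wc.  This is only a sketch: it assumes the input is UTF-8 encoded and that an en_US.UTF-8 locale is installed (if it is not, wc -m falls back to counting bytes).  The variable names are made up for the example.

```shell
word='Φύλλο'

# Byte count: each of these Greek letters takes two bytes in UTF-8,
# which matches the 10-column underline in the C locale output.
bytes=$(printf '%s' "$word" | LC_ALL=C wc -c)

# Character count: locale-dependent; a UTF-8 locale decodes the
# two-byte sequences into single characters.
chars=$(printf '%s' "$word" | LC_ALL=en_US.UTF-8 wc -m)

echo "bytes=$bytes chars=$chars"
```

With the UTF-8 locale available, the character count should line up with the 5-column underline shown for the en_US.UTF-8 run.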

Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
coreutils/src/sort: using simple byte comparison
φθινόπωρο
__________________

When all of the bytes are ignored as non-printable, every line is left
with an empty sort key, so all three lines compare as identical, hence
-u prints only one line.
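
A rough way to see this outside of sort is to delete the non-printable bytes by hand.  This sketch assumes that tr's [:print:] class in the C locale classifies bytes the same way as the per-byte test that -i applies; that is an assumption, not something the transcript above proves.

```shell
var="φθινόπωρο
κισσός
Φύλλο"

# Delete every byte that is neither printable nor a newline, then
# count what is left.  In the C locale all bytes of the UTF-8 encoded
# Greek letters are non-printable, so only the newlines survive.
remaining=$(printf '%s\n' "$var" | LC_ALL=C tr -cd '[:print:]\n' | wc -c)

echo "bytes left after stripping: $remaining"
```

If only the three newlines remain, each line has nothing left to compare, which is consistent with -u collapsing the input to a single line.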

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes for φ were being excluded from the
particular sort field that I requested, because -i was in effect.  The
same goes for other sort options, such as 'sort -k1n' ignoring
characters after the end of the first parsed number.
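
For the -k1n case, a small experiment along these lines shows which part of each line the numeric key covers.  It is a sketch with made-up input, and it assumes a GNU sort new enough to have --debug; the annotated output is not reproduced here.

```shell
# Numeric sort on the first key: only the leading number takes part in
# the comparison, so "9 pears" sorts before "10 apples".
sorted=$(printf '%s\n' '10 apples' '9 pears' | LC_ALL=C sort -k1n)
printf '%s\n' "$sorted"

# Same input with --debug, to see how much of each line is underlined
# as the numeric key versus the whole-line last-resort comparison.
printf '%s\n' '10 apples' '9 pears' | LC_ALL=C sort --debug -k1n
```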

-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
