bug#8871: Bug with "sort -i" ?

bug-coreutils

From:	Eric Blake
Subject:	bug#8871: Bug with "sort -i" ?
Date:	Wed, 15 Jun 2011 14:08:49 -0600
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110419 Red Hat/3.1.10-1.el6_0 Mnenhy/0.8.3 Thunderbird/3.1.10

retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
> 
> this looks like a bug for me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

> 
> var="φθινόπωρο,κισσός,Φύλλο"
> 
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change comma into newline.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]
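
If the comma-separated form really is the starting point, a single tr should be enough to split it. A minimal sketch (the variable name `lines` is only my choice for illustration, and the lowercasing step from the original pipeline is left out on purpose):

```shell
var='φθινόπωρο,κισσός,Φύλλο'
# tr maps each comma directly to a newline; no intermediate '_' is needed.
lines=$(printf '%s\n' "$var" | tr ',' '\n')
printf '%s\n' "$lines"
```

The result can then be piped into sort exactly as in the examples that follow.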

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).  First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________

Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable characters)?
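
The longer underlines in the C locale are simply byte counts. A rough way to check this, assuming the input is UTF-8 with precomposed Greek characters (two bytes each); `byte_len` is a hypothetical helper name, not something coreutils provides:

```shell
# Count the bytes in a string; wc -c counts bytes regardless of locale.
byte_len() {
  printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

for word in 'κισσός' 'φθινόπωρο' 'Φύλλο'; do
  printf '%s bytes: %s\n' "$(byte_len "$word")" "$word"
done
```

Under that assumption the counts come out as 12, 18 and 10, matching the number of underscores in the C locale output above.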

Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
coreutils/src/sort: using simple byte comparison
φθινόπωρο
__________________

When every byte is ignored as non-printable, all three lines reduce to
the same (empty) sort key and compare equal, hence -u prints only one of
them.
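
The effect can be reproduced without --debug by counting output lines. This is only a sketch: it assumes, as on my machine, that the C locale treats every byte >= 0x80 as non-printable, and `count_unique` is a made-up helper name:

```shell
# Sort the three words in the C locale with -f -u, plus any extra options
# passed in, and report how many lines survive.
# Assumption: the C locale deems all non-ASCII bytes non-printable.
count_unique() {
  printf '%s\n' 'φθινόπωρο' 'κισσός' 'Φύλλο' |
    LC_ALL=C sort -f -u "$@" | wc -l | tr -d ' '
}

echo "without -i: $(count_unique) lines"
echo "with -i:    $(count_unique -i) lines"
```

Without -i the three byte sequences differ, so all three lines are kept; with -i nothing is left to compare and -u collapses them into one.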

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes for φ were ignored within the
sort field that I requested, because -i was in effect.  The same goes
for other sort options, such as 'sort -k1n' ignoring the characters
that follow the first parsed number.
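
The -k1n case can be seen with -u as well. A small sketch with made-up input lines (`numeric_unique` is a hypothetical helper name): both lines parse as the number 3, and the text after the number takes no part in the numeric comparison.

```shell
# Both lines start with the number 3; under -n the trailing text is not
# part of the comparison, so the two keys compare equal and -u keeps one.
numeric_unique() {
  printf '%s\n' '3abc' '3xyz' | LC_ALL=C sort -k1n -u
}

numeric_unique
```

Only a single line is printed, which is the kind of silent ignoring that a distinct marker in the --debug output would make visible.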

-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

