bug-coreutils

Re: Sort order bug in GNU sort


From: Luke Hutchison
Subject: Re: Sort order bug in GNU sort
Date: Thu, 29 Oct 2009 19:30:06 -0700

On Thu, Oct 29, 2009 at 5:51 PM, Eric Blake <address@hidden> wrote:
> [please don't top-post on technical lists]

Sorry about the lack of mailing list etiquette; the sort manpage
doesn't make it clear that address@hidden is a mailing list.

> Well, that looks correct to me, if your current locale specifies that
> punctuation is ignored during collation (that is, you are getting: 101000
> < 101006 < 101010, after ignoring , and .).
>
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
> Try 'LC_ALL=C sort' to see the difference.

I don't know why punctuation is not treated as a space in the en_US
locale, or for that matter why the decision was made to ignore spaces
in en_US.  I would love to see the background thinking that went into
that decision: the sorted order "San Juan, Santa Clara, San Teodoro"
doesn't make intuitive sense to me.  I note that the Wikipedia page
on Collation says that sorting is done both ways (with or without
spaces) but that ignoring spaces is supposedly more common.  Anyway,
thanks for explaining, and sorry that I didn't see the explanation in
the FAQ.

Given that (according to the FAQ) "This one question arises almost
more often than any other", and given the inconvenience of changing
locales in a script just so that sort produces byte-value order,
wouldn't it make sense to add an optional switch that effectively
sets LC_ALL=C for the sort?
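
In the meantime, the environment setting can be scoped to the sort
invocation alone, so the rest of a script keeps its locale.  A minimal
sketch, assuming a POSIX shell; `csort` is a hypothetical helper name
used only for illustration, not an existing command or sort option:

```shell
#!/bin/sh
# csort: hypothetical wrapper that runs sort with byte-value collation.
# The LC_ALL=C prefix applies only to this one sort process; the
# calling script's own locale settings are left untouched.
csort() {
    LC_ALL=C sort "$@"
}

# In the C locale, uppercase ASCII letters sort before lowercase ones.
printf '%s\n' 'b' 'A' 'a' 'B' | csort
# A
# B
# a
# b
```

Any sort options can be passed through unchanged, e.g. `csort -r file`.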

I now note the warning in the man page: "*** WARNING *** The locale
specified by the environment affects sort order.  Set LC_ALL=C to get
the traditional sort order that uses native byte values."  Before
hitting this, I had no idea the locale would affect the ordering of
non-accented characters.  Could the manpage please be extended to give
a simple example comparing the sort order in the en_US locale with the
C locale, to make this much clearer?
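
A sketch of the kind of comparison meant here, using the three place
names from earlier in this message as sample data.  The `cities`
function and the locale name `en_US.UTF-8` are assumptions for
illustration; the second command's output depends on which locales are
installed and on the collation tables of the C library, so no output
is shown for it:

```shell
#!/bin/sh
# Sample input (hypothetical helper): the place names discussed above.
cities() {
    printf '%s\n' 'San Teodoro' 'Santa Clara' 'San Juan'
}

# Byte-value order: a space (0x20) sorts before every letter, so both
# "San ..." entries come before "Santa Clara".
cities | LC_ALL=C sort
# San Juan
# San Teodoro
# Santa Clara

# Locale-aware order: if the locale ignores spaces at the first
# comparison level, "Santa Clara" can land between the other two.
# If en_US.UTF-8 is not installed, sort may fall back to C ordering.
cities | LC_ALL=en_US.UTF-8 sort
```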

Thanks,
Luke



