bug#36674: Sort Suggestion
From: Assaf Gordon
Subject: bug#36674: Sort Suggestion
Date: Mon, 15 Jul 2019 13:23:52 -0600
User-agent: Mutt/1.11.4 (2019-03-13)
tag 36674 notabug
close 36674
stop
Hello,
On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.
(General suggestions and discussions are better suited for the
address@hidden mailing list; that way the system won't open a new
bug item.)
>
> Re: SORT Command from GNU coreutils 8.25
>
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
>
> As an example, in attempting to sort an index ...
>
> Abbott, William 259
>
> sorts before:
>
> Abbot, William 099
>
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
>
> Abbot, William 099
> Abbott, William 259
>
>
There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).
Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.
In essence, you are comparing "AbbottWilliam" (two 't's) to
"AbbotWilliam" (one 't'), so the second 't' is compared to a 'W'
and is determined to come first.
If you force a POSIX/C locale, then all characters are considered,
and the result will be as you requested.
Observe the following:
$ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
AbbottWilliam
AbbotWilliam
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
Abbott William
Abbot William
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
Abbot William
Abbott William
$ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
Abbot, William
Abbott, William
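Applied to the two index records from the original report, the same
approach might look like the sketch below. The index_entries helper is
hypothetical and only stands in for the real index file; LC_ALL=C is set
for the sort command alone, so the rest of the session keeps its locale.

```shell
# Hypothetical helper: emits the two records from the original report.
index_entries() {
  printf '%s\n' "Abbott, William 259" "Abbot, William 099"
}

# LC_ALL=C applies to this one sort invocation only.
# In the C locale every byte counts, and ',' sorts before 't'.
sorted=$(index_entries | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

With byte-wise comparison the "Abbot," record should come out ahead of
"Abbott," because the comma is compared against the second 't'.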
Note that 'sort' already has an option for dictionary style sorting:
-d, --dictionary-order: consider only blanks and alphanumeric characters.
However, locale rules take precedence over it, so effectively it only
works in "C" locale:
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
Ab,,b,,ott William
Abbot William
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
Abbot William
Ab,,b,,ott William
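If the goal is an ordering in the C locale that skips punctuation and
also disregards upper/lower case, one possible approach (a sketch, not
something tested against the reporter's data) is to combine -d with
-f (--ignore-case). The records below are made up for illustration.

```shell
# Made-up records for illustration; note the lower-case "abbott".
records() {
  printf '%s\n' "abbott, William 259" "Zimmer, Carl 012" "Abbot, William 099"
}

# Plain C-locale sort: upper-case letters sort before all lower-case ones.
byte_order=$(records | LC_ALL=C sort)

# -d considers only blanks and alphanumerics; -f folds lower case to upper.
folded=$(records | LC_ALL=C sort -d -f)

printf 'byte order:\n%s\n\nwith -d -f:\n%s\n' "$byte_order" "$folded"
```

In the plain C-locale run the lower-case "abbott" record lands after
"Zimmer"; with -d -f it should sort right after "Abbot".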
You can read past discussion about the confusion resulting from locale
sorting rules here:
https://debbugs.gnu.org/11621
https://debbugs.gnu.org/12783
As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.
-assaf