Hello Karl!
On 05/31/2016 02:32 PM, Karl Berry wrote:
> I run
> LC_ALL=en_US.UTF-8 sort --debug -k 2 /tmp/foo # or -k 2,2 et al.
> And get the nicely explanatory output for the "surprising" result:
[...]
Just to verify: is the surprising result the one in the C locale?
Here is what I'm seeing. For "en_US.UTF-8" the order is the one I'd expect, but
the "C" result is surprising:
$ cat -A k.txt
M Build/zfile$
M Master/mfile$
MM Build/afile$
$ LC_ALL=en_US.UTF-8 sort -k2 k.txt
MM Build/afile
M Build/zfile
M Master/mfile
$ LC_ALL=C sort -k2 k.txt
M Build/zfile
M Master/mfile
MM Build/afile
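To make the C-locale result reproducible, here is a small sketch. It assumes (this is my guess, not something visible above) that the file is column-aligned, so the "M" lines have two blanks before the path and the "MM" line has one. The second run uses exactly one blank everywhere, purely for comparison:

```shell
# Assumption: column-aligned input, two blanks after "M", one after "MM".
aligned=$(printf 'M  Build/zfile\nM  Master/mfile\nMM Build/afile\n' | LC_ALL=C sort -k2)

# Same data with exactly one blank on every line, for comparison.
single=$(printf 'M Build/zfile\nM Master/mfile\nMM Build/afile\n' | LC_ALL=C sort -k2)

# In the C locale the key for -k2 includes the blanks in front of the
# second field, so the two-blank lines sort ahead of the one-blank line.
printf 'aligned:\n%s\n\nsingle blank:\n%s\n' "$aligned" "$single"
```

With the aligned input the "MM" line ends up last; with single blanks it comes first, since only the path text then differs.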
> But the information is just as valid in C as in UTF-8, so far as I can
> see. Thus it would be nice for it to be present.
If I understand correctly, one could argue the warning is even more important
in the C locale than in UTF-8 locales,
because the collation rules of UTF-8 locales make leading spaces less significant.
For example:
$ cat -A s.txt
M A$
M B$
M D$
M C$
UTF-8 makes leading spaces less important:
$ LC_ALL=en_US.UTF-8 sort -k2 s.txt
M A
M B
M C
M D
In the C locale, spaces (compared as plain bytes) do matter:
$ LC_ALL=C sort -k2 s.txt
M D
M B
M C
M A
-b skips leading spaces:
$ LC_ALL=C sort -k2b s.txt
M A
M B
M C
M D
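A self-contained version of the comparison above, as a sketch. The blank counts in the test data are hypothetical (chosen by me so that each line has a different number of blanks before the letter); only the C locale is used, so no extra locale needs to be installed:

```shell
# Hypothetical stand-in for s.txt: a different number of blanks per line.
tmp=$(mktemp)
printf 'M A\nM   B\nM    D\nM  C\n' > "$tmp"

# Without -b, field 2 begins right after "M" and includes the blanks,
# so in the C locale the line with the most blanks sorts first.
plain=$(LC_ALL=C sort -k2 "$tmp")

# With the b modifier, leading blanks are skipped when locating the key.
skipped=$(LC_ALL=C sort -k2b "$tmp")

printf 'sort -k2:\n%s\n\nsort -k2b:\n%s\n' "$plain" "$skipped"
rm -f "$tmp"
```

The first listing is ordered by blank count (D, B, C, A), the second by the letters themselves (A, B, C, D).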
> More importantly, I urge that the documentation for sort give an example
> of this. The idea that following blanks after the first become part of
> the next field is highly counter-intuitive.
I agree.
I can add the above example to the documentation (and possibly also to the FAQ
or Gotchas pages).
What do you think?
The condition to print this message is here:
http://lingrok.org/xref/coreutils/src/sort.c#2435
I can try to suggest a patch that prints it in the C locale as well (hopefully
tonight).