bug-coreutils

Re: uniq/sort documentation flaw


From: Andries E. Brouwer
Subject: Re: uniq/sort documentation flaw
Date: Tue, 5 May 2009 17:02:06 +0200
User-agent: Mutt/1.5.18 (2008-05-17)

On Tue, May 05, 2009 at 12:13:04PM +0100, Pádraig Brady wrote:
> Andries E. Brouwer wrote:
> > uniq(1) says
> > 
> >        Discard all but one of successive identical lines from INPUT
> > 
> > However, this is very misleading. "Identical" does not mean identical
> > but "equal if one ignores differences that LC_COLLATE says should be 
> > ignored".
> 
> How about the attached?

Certainly an improvement - now LC_COLLATE is mentioned.

> > (Sorting is an operation done on all kinds of data, not only lines of text.
> > I would not mind an option that tells sort to ignore the locale rules for
> > sorting because what is sorted is not text. That feels cleaner than
> > preceding each invocation with LC_COLLATE=C. And locale-free sort also
> > is much faster.)
> 
> Well it is a very common issue.
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
> I'm not sure there is a better solution than what we have though.

The place where I encountered this was a mathematical application
that worked fine on some machines and produced incorrect answers on others.
After some debugging it turned out that sort | uniq -D and sort | uniq -cd
failed, and the cause was an en_US.UTF-8 locale.

The interesting part was that the order was totally irrelevant for the
application; any order would have been OK. The sort was there only to enable
uniq to do its job.  However, it failed because sort did not do a total sort
but ignored certain differences, so that identical lines could be separated
by different lines in the sort output.
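
The failure mode can be sketched without depending on any particular
locale being installed. In the sketch below, "sort -f -s" (fold case,
no last-resort byte comparison) is only a stand-in, chosen for
illustration, for a collation that ignores some differences; it is not
what en_US.UTF-8 actually does, and it assumes GNU sort and uniq.

```shell
# Three lines, two of them byte-identical.
input=$(printf 'a\nA\na\n')

# Stand-in for a collation that ignores a difference (here: case).
# With -s there is no last-resort byte comparison, so the three lines
# compare equal and keep their input order: a, A, a.
# The two identical "a" lines are not adjacent, so uniq -d reports nothing.
folded=$(printf '%s\n' "$input" | LC_ALL=C sort -f -s | LC_ALL=C uniq -d)

# Plain byte-order sort is a total order: A, a, a.
# The identical lines end up adjacent and uniq -d reports them.
bytewise=$(printf '%s\n' "$input" | LC_ALL=C sort | LC_ALL=C uniq -d)

printf 'folded:   [%s]\n' "$folded"
printf 'bytewise: [%s]\n' "$bytewise"
```

The first pipeline misses the duplicate and the second finds it, although
the input is the same in both cases.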

So, you see "Sort does not sort in normal order" is not the problem.
"Sort does not sort" is the problem.

The underlying problem with sort and locale is that sort assumes that
it is sorting text. But it was sorting binary data (sort -z)
and using the locale in such cases is nonsense. That is why I vaguely
wondered whether an option to do a non-locale sort would be useful.
Of course LC_COLLATE=C sort works, but sort --ignore-locale would be
cleaner and have the additional advantage that describing this option
on the sort man page stresses that sort without it uses the locale
and therefore might not do a total sort. (That latter point is still
not mentioned anywhere in the docs, I think.)
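
For reference, a minimal sketch of the existing workaround applied to
NUL-terminated records. It assumes GNU sort and uniq (for -z), and it
uses LC_ALL=C rather than LC_COLLATE=C because a LC_ALL value already
set in the environment takes precedence over LC_COLLATE. The sample
records are made up for illustration.

```shell
# Three NUL-terminated records, two of them identical.
# Forcing the C locale makes sort compare raw bytes, so identical
# records are guaranteed to end up adjacent before uniq sees them.
dups=$(printf 'b\0a\0b\0' \
  | LC_ALL=C sort -z \
  | LC_ALL=C uniq -z -d \
  | tr '\0' '\n')   # NUL -> newline only so the result can be captured

printf 'duplicated records: [%s]\n' "$dups"
```

Setting the variable on each command keeps the rest of the script in
the user's locale.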

Andries




