bug-coreutils

bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted


From: Beth Andres-Beck
Subject: bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted
Date: Wed, 15 Jul 2020 13:12:24 -0700

If that is the intended behavior, the bug is that:
> printf '12,\n1,\n' | sort -t, -k1 -s
1,
12,

does _not_ take the remainder of the line into account, and only sorts on
the initial field, prioritizing length.

It is at the very least unexpected that adding an `a` to the end of both
lines would change the sort order of those lines:
> printf '12,a\n1,a\n' | sort -t, -k1 -s
12,a
1,a
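
For comparison, a minimal sketch of keying on the first field only. It
assumes GNU sort, where `-k1` means "from field 1 to the end of the line"
and `-k1,1` stops the key at field 1. The function name `sort_first_field`
is made up for illustration, and LC_ALL=C is forced so the result does not
depend on the system locale:

```shell
# Hypothetical helper: sort comma-separated lines by the first field only.
# -k1,1 ends the key at field 1 (plain -k1 would extend to end of line).
# LC_ALL=C forces byte-wise comparison, independent of the user's locale.
# -s keeps lines with equal keys in their input order.
sort_first_field() {
    LC_ALL=C sort -t, -k1,1 -s
}

# Both inputs now sort the same way, with or without the trailing 'a'.
printf '12,\n1,\n'   | sort_first_field
printf '12,a\n1,a\n' | sort_first_field
```

With the key limited to the first field and byte comparison in effect, "1"
sorts before "12" in both cases, because a string sorts before any longer
string it is a prefix of.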

On Sun, Jul 12, 2020 at 11:58 PM Assaf Gordon <assafgordon@gmail.com> wrote:

> tags 42340 notabug
> close 42340
> stop
>
> Hello,
>
> On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> > In trying to use `join` with `sort` I discovered odd behavior: even after
> > running a file through `sort` using the same delimiter, `join` would still
> > complain that it was out of order.
> [...]
> > Here is a way to reproduce the problem:
> >
> >> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> >> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> >> join -t, a.txt b.txt
> >   join: b.txt:2: is not sorted: 1.1.1,b
> >
> > The expected behavior would be that if a file has been sorted by "sort" it
> > will also be considered sorted by join.
> [...]
> > I traced this back to what I believe to be a bug in sort.c
>
> This is not a bug in sort or join, just a side-effect of the locale on
> your system on the sorting results.
>
> By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
> the files are ordered in the same way 'join' expected them to be:
>
>   $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
>   $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
>   $ join -t, a.txt b.txt
>   1.1.1,2,b
>   1.1.12,2,a
>
> ---
>
> More details:
> I'm going to assume your system uses some locale based on UTF-8.
> You can check it by running 'locale', e.g. on my system:
>    $ locale
>    LANG=en_CA.utf8
>    LANGUAGE=en_CA:en
>    LC_CTYPE="en_CA.utf8"
>    ..
>    ..
>
> Under most UTF-8 locales, punctuation characters are *ignored* in the
> compared input lines. This might be confusing and non-intuitive, but
> that's the way most systems have been working for many years (locale
> ordering is defined in the GNU C Library, and coreutils has no way to
> change it).
>
> Observe the following:
>
>    $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
>    12,a
>    1,b
>
>    $ printf '12,a\n1,b\n' | LC_ALL=C sort
>    1,b
>    12,a
>
> With a UTF-8 locale, the comma character is ignored, and then "12a"
> appears before "1b" (since the character '2' comes before the character
> 'b').
>
> With "C" locale, forcing ASCII or "byte comparison", punctuation
> characters are not ignored, and "1,b" appears before "12,a" (because
> the ASCII value of the comma ',' is 44, which is smaller than the ASCII
> value of the digit '2', which is 50).
>
> ---
>
> Somewhat related:
> Your sort command defines the delimiter ("-t,") but does not define
> which columns to sort by; sort then compares the entire input line, so
> the delimiter has no effect and there is no need to specify it at all.
>
> ---
>
> As such, I'm closing this as "not a bug", but discussion can continue by
> replying to this thread.
>
> regards,
>   - assaf
>
>
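
Putting the quoted explanation together, a minimal sketch that runs both
`sort` and `join` under the same C locale and sorts on the join field only.
The temporary directory and the variable names `comma`, `two`, `tmp` and
`joined` are assumptions for illustration; the input data is taken from the
original report:

```shell
# Use byte-wise comparison for every command in this script.
export LC_ALL=C

# In POSIX printf, a leading quote yields the character's numeric value.
comma=$(printf '%d' "',")   # byte value of ','
two=$(printf '%d' "'2")     # byte value of '2'
printf 'comma=%s two=%s\n' "$comma" "$two"

# Hypothetical scratch directory for the two sorted inputs.
tmp=$(mktemp -d)

# Sort on the first comma-separated field, the field join compares by default.
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n'   | sort -t, -k1,1 > "$tmp/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | sort -t, -k1,1 > "$tmp/b.txt"

# join runs under the same locale, so it agrees with sort about the order.
joined=$(join -t, "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

The point of the design is that `sort` and `join` must agree on both the
collation (here LC_ALL=C for each) and the key (here the first field), since
`join` checks ordering on the join field and not on the whole line.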

