bug-coreutils

bug#42340: "join" reports that "sort"ed input is not sorted


From: Assaf Gordon
Subject: bug#42340: "join" reports that "sort"ed input is not sorted
Date: Mon, 13 Jul 2020 00:58:32 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

tags 42340 notabug
close 42340
stop

Hello,

On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> In trying to use `join` with `sort` I discovered odd behavior: even after
> running a file through `sort` using the same delimiter, `join` would still
> complain that it was out of order.
[...]
> Here is a way to reproduce the problem:
>
> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> join -t, a.txt b.txt
>   join: b.txt:2: is not sorted: 1.1.1,b
>
> The expected behavior would be that if a file has been sorted by "sort" it
> will also be considered sorted by join.
[...]
> I traced this back to what I believe to be a bug in sort.c

This is not a bug in sort or join, just a side effect of your system's locale on the sort order.

By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
the files are ordered in the same way 'join' expected them to be:

 $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
 $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
 $ join -t, a.txt b.txt
 1.1.1,2,b
 1.1.12,2,a

---

More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:
  $ locale
  LANG=en_CA.utf8
  LANGUAGE=en_CA:en
  LC_CTYPE="en_CA.utf8"
  ..
  ..

Under most UTF-8 locales, punctuation characters are *ignored* in the
compared input lines. This might be confusing and non-intuitive, but
that's the way most systems have been working for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).

Observe the following:

  $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
  12,a
  1,b

  $ printf '12,a\n1,b\n' | LC_ALL=C sort
  1,b
  12,a

With a UTF-8 locale, the comma character is ignored, so the lines are compared as "12a" and "1b", and "12a" appears before "1b" (since the character '2' comes before the character 'b').

With the "C" locale, which forces ASCII or "byte comparison", punctuation characters are not ignored, and "1,b" appears before "12,a" (because the ASCII value of the comma ',' is 44, which is smaller than the ASCII value of the digit '2', which is 50).
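
If you want to check the byte values yourself, something like the following should work in a POSIX shell. It is only a sketch: the variable names (sorted, comma, two) are mine, and it relies on the printf convention that a leading quote in a numeric argument yields the character's code.

```shell
# Sort under the C locale, where lines are compared byte by byte.
sorted=$(printf '12,a\n1,b\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"

# Print the numeric codes of ',' and '2' to see why "1,b" sorts first:
# the second byte of "1,b" is ',' and the second byte of "12,a" is '2'.
comma=$(printf '%d' "',")
two=$(printf '%d' "'2")
echo "comma=$comma two=$two"
```

The first differing byte decides the order, so the line whose second byte has the smaller code comes first.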

---

Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define which columns to sort by; sort then compares entire input lines, so the delimiter has no effect and there's no need to specify it at all.
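
If the intent was to sort on the join field only, a key can be given with "-k1,1". The following is a minimal sketch, not the only way to do it: it assumes the first comma-separated field is the join field, uses a scratch directory created with mktemp (my choice, not from the original report), and runs both sort and join under the same C locale so the two tools agree on the ordering.

```shell
# Scratch directory for the demonstration files (hypothetical location).
tmpdir=$(mktemp -d)

# Sort on the first comma-separated field only, in the C locale.
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n' | LC_ALL=C sort -t, -k1,1 > "$tmpdir/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | LC_ALL=C sort -t, -k1,1 > "$tmpdir/b.txt"

# Join under the same locale that was used for sorting.
joined=$(LC_ALL=C join -t, "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$joined"
```

With these inputs the result should match the output shown earlier ("1.1.1,2,b" and "1.1.12,2,a").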

---

As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.

regards,
 - assaf





