bug#42340: "join" reports that "sort"ed input is not sorted


From: Beth Andres-Beck
Subject: bug#42340: "join" reports that "sort"ed input is not sorted
Date: Sun, 12 Jul 2020 16:57:41 -0700

In trying to use `join` with `sort` I discovered odd behavior: even after
running a file through `sort` using the same delimiter, `join` would still
complain that it was out of order.

The field I am sorting on contains IP addresses, so its length varies
depending on how many digits each component has, and the values contain
periods as well as alphanumeric characters.

Here is a way to reproduce the problem:

> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> join -t, a.txt b.txt
 join: b.txt:2: is not sorted: 1.1.1,b

The expected behavior would be that if a file has been sorted by "sort" it
will also be considered sorted by join.

---
I traced this back to what I believe to be a bug in sort.c when sorting on
a field other than the last field, where the original pointer is being
incremented one further than it ought to be.

On line 1675, sort.c always advances the pointer one position beyond the
delimiter unless the field is the last field. If both `eword` and `echar`
are 0, `eword` has already been incremented on line 1661.

Later when we use keylim (where the limfield value is stored) to calculate
the length of the field, it will include the delimiter in the comparison.
We can illustrate that the problem is including the delimiter because the
following case runs correctly without error:

> printf '1.1.1Z2\n1.1.12Z2\n1.1.2Z1' | sort -tZ > a.txt
> printf '1.1.12Za\n1.1.1Zb\n1.1.21Zc' | sort -tZ > b.txt
> join -tZ a.txt b.txt

By contrast, join.c compares the contents of the keys without the
delimiter: on join.c:283 we call extract_field with `ptr` pointing to the
start of the key and len defined as `sep - ptr`, where `sep` is the
position of the next field separator.

Cases illustrating the bug in sort:
> printf '12,\n1,\n' | sort -t, -k1
1,
12,

> printf '12,a\n1,a\n' | sort -t, -k1
12,a
1,a
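
For comparison, here is a sketch of the same input with the key explicitly limited to the first field. In the documented `-k` syntax, `-k1` with no end position gives a key that runs from field 1 to the end of the line, while `-k1,1` ends the key at field 1. `LC_ALL=C` is again an assumption, used so that the result does not depend on the locale's collation rules.

```shell
# Key restricted to field 1 only (-k1,1), compared bytewise.
# LC_ALL=C is an assumption; the report does not state a locale.
first=$(printf '12,a\n1,a\n' | LC_ALL=C sort -t, -k1,1)
printf '%s\n' "$first"
# prints:
# 1,a
# 12,a
```

With the key limited this way, sort compares the same bytes that join extracts as the join field, so this variant may help separate the effect of the key range from the effect of the delimiter.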

Thank you,
Beth Andres-Beck
