bug-coreutils

Apparently irrational behaviour in sort


From: The Wanderer
Subject: Apparently irrational behaviour in sort
Date: Sat, 03 Dec 2005 21:46:03 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922

I have a text file whose lines each contain two dates, of the format
"MM-DD-YYYY". I want to sort these lines into order from oldest to most
recent - that is, first by YYYY, then by MM, then by DD. After I parse
out the spaces using sed, the only whitespace remaining in the file is
the blocks of contiguous tabs which I use to divide the columns which
contain the actual data; the date I want to sort on is in the fifth such
column, which should at that point be field number 5.
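
For reference, the ordering I'm after can be sketched with a
decorate-sort-undecorate pipeline that does not rely on sort's
character offsets at all. The sample lines below are made up (they are
not from my actual file), and the awk step assumes that every line has
a non-empty fifth column and contains no spaces:

```shell
# Made-up sample lines: columns separated by runs of one or more tabs,
# with the MM-DD-YYYY date to sort on in the fifth column.
sample=$(printf 'alpha\tb\tc\td\t12-25-2004\nbeta\tb\tc\td\t\t01-15-2005\ngamma\tb\tc\td\t03-02-2003\ndelta\tb\tc\td\t03-01-2003\n')

# Decorate each line with a YYYYMMDD key built from the fifth column,
# sort on that key alone, then strip the key again.
sorted=$(printf '%s\n' "$sample" \
  | awk '{ d = $5; print substr(d, 7, 4) substr(d, 1, 2) substr(d, 4, 2) "\t" $0 }' \
  | LC_ALL=C sort -k 1,1 \
  | cut -f 2-)

printf '%s\n' "$sorted"
```

On these sample lines the result should run from 03-01-2003 through
01-15-2005, which is the oldest-to-most-recent order described above.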

I would expect that 'sort -k 5.7,5 -k 5,5' would sort first by the
seventh character in the fifth field (the first digit of the year), thus
putting the file in order by year, and then by the entirety of the fifth
field, thus putting the file in order by month and day. It does not.
Instead, it appears to sort the file by day of the month (DD) - that is,
by what I would expect from '-k 5.4,5.6' - except for two lines, one of
which ends up in the middle of the file and the other of which ends up
at the end of the file.

I have tried various other things, in attempting to sort the file, and
virtually nothing has behaved the way I would expect it to. A simple
'sort -k 5' is almost the only exception: it sorts the file first by MM
and then by DD, which gets two-thirds of the job done but does nothing
about the rest. 'sort -k 5.3,5' *also* sorts the file first by MM and
then by DD, with the sole exception of the line which '-k 5.7,5' puts in
the middle of the file - which, now, winds up in a different place in
the middle of the file. This is decidedly odd, because it should be
*ignoring* the first two digits (the MM) entirely, but it is
consistently the case.

'sort -k 5.8,5' sorts the file by, apparently, only DD - except for five
lines; one of them is that same line which wound up in the middle before
(it's there again), and the other four should all have been in various
places in the middle of the list but are together at the end.

The obvious answer is that sort is for some reason sorting on some point
in the line other than the one I'm expecting, at least for the few
lines which are not being treated correctly - but no other column in the
file appears to be ending up in any readily obvious order as a result of
any of these sorts, and when I've examined the lines which wind up
together at the end of the file, I have not been able to identify any
character offset at which they appear to be in order.

I first noticed this while using coreutils 5.2.1, but have since
upgraded to 5.93 with no apparent effect on the problem. Most of my
tests of this have been in my default (albeit incompletely set up)
locale, "en_US.UTF-8", but prior to the upgrade I also tried with the
three relevant environment variables set to "POSIX"; under those
conditions, sort produced *different* apparently-irrational behaviour,
but still did not do anything like what I expected it to. Since the base
behaviour did not change after the upgrade, I have not repeated the test
with the different locale.

I have not the faintest idea what characteristics of the file I've been
testing this on could be producing the effect, so the only demonstration
I could provide would be that file itself in both unsorted and
wrongly-sorted forms. It's currently less than 5K (I do still add to it
in its unsorted form from time to time), but if that would be too much
to attach and send by list standards then I can make a link to it
available (temporarily, as our ISP's TOS explicitly forbid us from
running anything like a Web server).

I'm well aware that the above is reasonably dense and impenetrable, and
not especially helpful for diagnosing the problem, but I've been trying
to express it better for quite some time now and I'd rather get
something out there and be able to expand on it later than never report
the problem at all.

Thank you for your time.

--
The Wanderer expects to be annoyed by having to manually re-address his replies to the list every time...

Warning: Simply because I argue an issue does not mean I agree with any
side of it.

Secrecy is the beginning of tyranny.



