bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#26422: historical feature or grand daddy bug?


From: Kyle Sallee
Subject: bug#26422: historical feature or grand daddy bug?
Date: Sun, 9 Apr 2017 15:53:26 -0700

Thanks for the fast response.
Right or wrong POSIX is POSIX,
Yet a LF as part of a line does seem worth counting.
A line must terminate with a line feed.
Yet a string does not require a line feed.

Paul Eggert's assumption seems correct.
During line indexing
the lines which consist of only a line feed can be counted
and excluded from the sort.
And then the correct amount of line feeds can be output
before the sorted lines, or after if the -r parameter is present.
For test data consecutive LF seems plausible,
but for actual sorting tasks; would consecutive LF be common?
It might be a potential optimization worthy of omission.

If the sort function's compare function was inlined
rather than called from a pointer
then a modest 5% performance boon could become.
To implement some creativity would be required.

If the input data was not copied
and string conversion was omitted
then another 5% performance boon could become.

The sort method used is not known.
However, a merge sort has some surprisingly frequent
uhm code paths like a 3 way comparison
which can be implemented for 2 or 3 comparisons
and 0 to 4 memory moves.

A 15% to 20% overall performance improvement
from the three suggestions is not implausible.
Thanks for making it faster.

On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <address@hidden> wrote:

> Historically, 'sort' ignored the \n at the end of each line, so that empty
> lines (i.e., lines consisting only of a single \n) collated before all
> other lines. An earlier version of the POSIX spec was (mis)written to
> require treating the \n as part of the data, and during development in 1999
> GNU sort was briefly changed to conform to that, but this was an error in
> the POSIX spec that was eventually fixed and GNU sort was changed back to
> the traditional behavior, before any release was made with the funky
> behavior.
>
> So, it's not a bug that \t\n collates after \n, since "\t" is
> lexicographically after "".
>
> As I understand it, the empty string should collate before all other
> strings in all POSIX locales, so empty lines should always sort first in
> 'sort' output. I'm by no means a collation expert, though, and if I'm wrong
> I'd like to see a counterexample.
>
> Come to think of it, 'sort' might be able to improve performance in the
> common case of sorting text files containing many empty lines, by merely
> counting the lines rather than storing them internally. I suppose this is a
> different topic, though.
>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]