bug-coreutils

Re: feature request: gzip/bzip support for sort


From: Jim Meyering
Subject: Re: feature request: gzip/bzip support for sort
Date: Thu, 25 Jan 2007 19:04:33 +0100

Paul Eggert <address@hidden> wrote:
> Jim Meyering <address@hidden> writes:
>> I'm probably going to change the documentation so that
>> people will be less likely to depend on being able to run
>> a separate program.  To be precise, I'd like to document
>> that the only valid values of GNUSORT_COMPRESSOR are the
>> empty string, "gzip" and "bzip2"[*].
>
> This sounds extreme, particularly since gzip and bzip2 are
> not the best algorithms for 'sort' compression, where you
> want a fast compressor.  Better choices right now would
> include lzop <http://www.lzop.org/> and maybe
> QuickLZ <http://www.quicklz.com/>.

I see that streaming support in QuickLZ is available only in the very
latest "beta".  That makes me think it's not quite ready for use in
GNU sort.  Maybe lzop is more mature.  I haven't looked yet.

> The fast-compressor field is moving fairly rapidly.
> (I've heard some rumors from some of my commercial friends.)
> QuickLZ, a new algorithm, is at the top of the
> maximumcompression list right now for fast compressors; see
> <http://www.maximumcompression.com/data/summary_mf3.php>.
> I would not be surprised to see a new champ next year.
>
>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but mainly,
>> more robust.
>
> It's not clear to me that it'll be more efficient for the
> soon-to-be common case of multicore chips, since 'sort' and
> the compressor can run in parallel.

No.  I proposed to remove only the *exec*, not the _fork_.  The idea
is to have each child run library code to (de)compress, rather than to
exec an external program.  Well, I do want it to revert to sequential
whenever fork fails, but otherwise it would still take advantage of
_some_ parallelism: 2-way when compressing, and up to 17-way (NMERGE + 1)
when merging.  Of course, this parallelism currently kicks in only
with a data set large enough to require temporary files.  Now that
dual-core systems are common, a nice project would be to make GNU sort
take advantage of that when performing day-to-day (in-memory) sorts.

> We'll have to measure.
> I agree about the robustness but that should be up to the user.

I want to keep the default code paths as simple and reliable as possible.
That means no exec, and a fall-back to running library code sequentially
upon fork failure -- both for compression and decompression.  While this
new feature is a very nice addition, we have to be careful to ensure
that it does not compromise sort's reliability, even under duress.

Here's my reasoning: when compressing with an arbitrary program, sort
may do a *lot* of work up front (there is no decompression at all in the
beginning), yet fail only near the end when it discovers that "prog"
doesn't accept -d or can't fork.  The result: the time spent sorting
that large data set is wasted.  With judicious support for at least one
built-in compression method, this error path is completely eliminated.
This is yet another way in which sort differs from tar.  If an invalid
compression program is specified to tar, tar fails right away -- it
doesn't first waste hours of compute time.  In fact, given an invalid
compressor program, sort may well work repeatedly, potentially over a span
of days or longer, before encountering an input large enough to trigger
the need to use temporary files, where it could hit a decompression
failure.  Avoiding this sort of delayed failure is my main motivation
for defining away the problem.  And as I'm sure you agree, I'd rather
not have sort "test" (fork/exec) the compression program to ensure that
it works with "-d".

Another reason to avoid the exec is memory.  When merging, there are
usually 16 separate gzip processes running in parallel.  Sure, each is
pretty small, with an RSS of under 500K on my system, but with 16 of
them, it does add up.

So we need support for at least one built-in compression
library.  Here are my selection criteria, in decreasing order
of importance:

    robust
    portable
    reasonable run-time efficiency
    reasonable compression performance

If some other compression library is a better fit, then we
should add support for it, and consider making it the default.

Similarly, if there are people interested in sorting huge data sets
for which general purpose compression isn't good enough, they can
provide profiling data showing how their domain-specific compressor
is essential.  If this ever happens, it would be a good reason to
allow sort to "exec" an arbitrary program again.

> Perhaps we could put in something that says, "If the
> compressor is named 'gzip' we may optimize that." and
> similarly for 'lzop' and/or a few other compressor names.
> Or, more generally, we could have the convention that if the
> compressor name starts with "-" we will strip the "-" and
> then try to optimize the result if we can.  Something like
> that, anyway.
>
>> [*] If gzip and bzip2 are good enough for tar, why should sort make any
>> compromise (exec'ing some other program) in order to be more flexible?
>
> For 'sort' the tradeoff is different than for 'tar'.  We
> don't particularly care if the format is stable, since it's
> throwaway.  And we want fast compression, whereas people
> generating tarballs often are willing to have way slower
> compression for a slightly higher compression ratio.  (Plus,
> new versions of 'tar' allow arbitrary compressors anyway.)
>
> I do have a suggestion: we shouldn't use an environment
> variable to select a compressor.  It should just be an
> option.  Environment variables are funny beasts and it's
> better to avoid them if we can.  I'll construct a patch
> along those lines if you like.

I agree wholeheartedly.
One should not be able to subvert sort by making the envvar
point to, e.g., a -d-accepting wrapper around cat+shuf :)
And allowing yet another envvar would have required adding to the
list of envvars we have to be careful of when running tests.
Thanks for volunteering.
