bug-coreutils

Re: feature request: gzip/bzip support for sort


From: Paul Eggert
Subject: Re: feature request: gzip/bzip support for sort
Date: Wed, 24 Jan 2007 15:48:11 -0800
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)

Jim Meyering <address@hidden> writes:

> I'm probably going to change the documentation so that
> people will be less likely to depend on being able to run
> a separate program.  To be precise, I'd like to document
> that the only valid values of GNUSORT_COMPRESSOR are the
> empty string, "gzip" and "bzip2"[*].

This sounds extreme, particularly since gzip and bzip2 are
not the best algorithms for 'sort' compression, where you
want a fast compressor.  Better choices right now would
include lzop <http://www.lzop.org/> and maybe
QuickLZ <http://www.quicklz.com/>.

The fast-compressor field is moving fairly rapidly.
(I've heard some rumors from some of my commercial friends.)
QuickLZ, a new algorithm, is at the top of the
maximumcompression list right now for fast compressors; see
<http://www.maximumcompression.com/data/summary_mf3.php>.
I would not be surprised to see a new champ next year.

> Then we will have the liberty to remove the exec calls and use library
> code instead, thus making the code a little more efficient -- but mainly,
> more robust.

It's not clear to me that it'll be more efficient for the
soon-to-be common case of multicore chips, since 'sort' and
the compressor can run in parallel.  We'll have to measure.
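
To make the parallelism point concrete, here is a minimal sketch (in
Python rather than sort's C, purely for illustration) of writing one
sorted run through a separate compressor process.  The function name
and its parameters are hypothetical, not anything in sort.c.  Because
the compressor is a separate process, it can compress earlier lines
on another core while the parent is still producing later ones.

```python
import subprocess

def write_run_compressed(lines, compressor_argv, out_path):
    """Write one sorted run to out_path through an external compressor.

    compressor_argv is an argv list for any filter that reads stdin and
    writes stdout (an assumption of this sketch).
    """
    with open(out_path, "wb") as out:
        # The child writes straight to the file, so the parent only
        # feeds its stdin and never has to read the child's output.
        child = subprocess.Popen(compressor_argv,
                                 stdin=subprocess.PIPE, stdout=out)
        for line in sorted(lines):
            child.stdin.write(line)   # child compresses concurrently
        child.stdin.close()           # signal end of input
        status = child.wait()
    if status != 0:
        raise RuntimeError("compressor exited with status %d" % status)
```

With gzip installed, a call might look like
write_run_compressed(lines, ["gzip", "-1"], "run0.gz").  Whether this
beats in-process library calls is exactly what would need measuring.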
I agree about the robustness, but that tradeoff should be up to the user.

Perhaps we could put in something that says, "If the
compressor is named 'gzip' we may optimize that," and
similarly for 'lzop' and/or a few other compressor names.
Or, more generally, we could have the convention that if the
compressor name starts with "-" we will strip the "-" and
then try to optimize the result if we can.  Something like
that, anyway.
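
A rough sketch of the leading-"-" convention, again in Python for
illustration only.  The BUILTIN set and the function name are
assumptions; which compressors (if any) would get an in-process fast
path is not settled in this thread.

```python
# Hypothetical set of compressors with an in-process implementation.
BUILTIN = {"gzip", "lzop"}

def choose_compressor(name):
    """Map a user-supplied compressor name to (mode, program).

    mode is "none" (no compression), "builtin" (use library code),
    or "exec" (run the named program as a separate process).
    """
    if name == "":
        return ("none", "")
    if name.startswith("-"):
        stripped = name[1:]
        # Leading "-" means the user permits optimization; fall back
        # to exec'ing the program if there is no built-in version.
        if stripped in BUILTIN:
            return ("builtin", stripped)
        return ("exec", stripped)
    # No leading "-": always run exactly the program the user named.
    return ("exec", name)
```

Under this sketch "-gzip" may be optimized, plain "gzip" always
execs gzip, and "-quicklz" still works by exec'ing quicklz.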

> [*] If gzip and bzip2 are good enough for tar, why should sort make any
> compromise (exec'ing some other program) in order to be more flexible?

For 'sort' the tradeoff is different than for 'tar'.  We
don't particularly care if the format is stable, since it's
throwaway.  And we want fast compression, whereas people
generating tarballs often are willing to have way slower
compression for a slightly higher compression ratio.  (Plus,
new versions of 'tar' allow arbitrary compressors anyway.)


I do have a suggestion: we shouldn't use an environment
variable to select a compressor.  It should just be an
option.  Environment variables are funny beasts and it's
better to avoid them if we can.  I'll construct a patch
along those lines if you like.
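
For illustration, selecting the compressor by option rather than by
environment variable might look like the sketch below.  It is Python
with argparse rather than sort's actual option parsing, and the
option name --compress-program is an assumption made for this
example; no name is proposed in this message.

```python
import argparse

def parse_sort_args(argv):
    """Parse a sort-like command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="sort-sketch")
    # Hypothetical option name; an empty default means "no compression".
    parser.add_argument("--compress-program", metavar="PROG", default="",
                        help="compress temporary files with PROG")
    parser.add_argument("files", nargs="*")
    return parser.parse_args(argv)
```

The choice is then visible on the command line of each invocation
and is not silently inherited from the caller's environment.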



