
Re: feature request: gzip/bzip support for sort


From: Philip Rowlands
Subject: Re: feature request: gzip/bzip support for sort
Date: Thu, 18 Jan 2007 22:38:49 +0000 (GMT)

On Thu, 18 Jan 2007, Jim Meyering wrote:

> I've done some more timings, but with two more sizes of input.
> Here's the summary, comparing straight sort with sort --comp=gzip:
>
>  2.7GB:   6.6% speed-up
>  10.0GB: 17.8% speed-up

It would be interesting to see the individual stats returned by wait4(2) from the child, to separate CPU seconds spent in sort itself, and in the compression/decompression forks.
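Short of patching in that instrumentation, a rough split can be had from the shell by wrapping the compressor so that every fork logs its own CPU time. This is only a sketch: it assumes the experimental --compress option accepts an arbitrary program name, and it relies on GNU time's -o/-a logging:

 $ cat > timed-gzip <<'EOF'
 #!/bin/sh
 # Hypothetical wrapper: append this invocation's times to a log, then
 # behave exactly like gzip (including gzip -d for decompression).
 exec /usr/bin/time -a -o /tmp/compress.times gzip "$@"
 EOF
 $ chmod +x timed-gzip
 $ /usr/bin/time ./sort -T. --compress=./timed-gzip < sort-in > out

Summing the user/system columns in /tmp/compress.times and subtracting them from the overall totals approximates the per-child numbers that wait4(2) would report directly.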

I think allowing an environment variable to define the compressor is a good idea, so long as there's a corresponding --nocompress override available from the command line.
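To sketch the interface (both the variable name and the option here are hypothetical):

 $ SORT_COMPRESS=gzip ./sort -T. < sort-in > out
 $ SORT_COMPRESS=gzip ./sort -T. --nocompress < sort-in > out

The first run would take the compressor from the environment; in the second, the command-line override wins and temporary files are written uncompressed.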

> $ seq 9999999 > k
> $ cat k k k k k k k k k > j
> $ cat j j j j > sort-in
> $ wc -c sort-in
> 2839999968 sort-in

I had to use "seq -f %.0f" to get this filesize.
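Presumably that's the old %g default format at work: a seq that formats with %g collapses seven-digit values into scientific notation, which changes the byte count, e.g.

 $ seq 9999998 9999999
 1e+07
 1e+07
 $ seq -f %.0f 9999998 9999999
 9999998
 9999999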

> With --compress=gzip:
>  $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out
>  814.07user 29.97system 14:50.16elapsed 94%CPU
>  (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (4major+2821589minor)pagefaults 0swaps

There's a big difference in the time spent on gzip compression depending on the -1/-9 option (the default is -6). For a seq-generated data set similar to the one above, I get:
gzip -1: User time (seconds): 48.63, output size is 6% of input
gzip -9: User time (seconds): 952.97, output size is 3% of input

Decompression time for both tests shows less variation (25s vs 21s).
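For anyone wanting to reproduce the comparison, a loop along these lines works (GNU time's -f format option assumed; the z.* files are just scratch names):

 $ for n in 1 6 9; do /usr/bin/time -f "gzip -$n: %U user" gzip -$n -c sort-in > z.$n; done
 $ for n in 1 6 9; do /usr/bin/time -f "gunzip z.$n: %U user" gzip -dc z.$n > /dev/null; done
 $ wc -c z.1 z.6 z.9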

This suggests that the elapsed time to sort can be improved by trading compression ratio for CPU time. Obviously, disk latency is a critical factor in where that trade-off pays off.


Cheers,
Phil



