bug#23113: parallel gzip processes trash hard disks, need larger buffers

bug-gzip

From:	Chevreux, Bastien
Subject:	bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date:	Tue, 12 Apr 2016 16:55:30 +0000

Mark,

I knew about pigz, but not about -b; thank you for that. Together with -p 1,
that would replicate gzip and buffer input well enough to be used in parallel
pipelines (where you do not want, e.g., 40 pipelines each running pigz with
40 threads).

Question: how stable and error-proof is pigz compared to gzip? I have always
shied away from it because gzip is so tried and tested that errors are
unlikely ... and the zlib.net homepage does not make an "official" statement
like "you should all now move to pigz, it's good and tested enough."
Additional question: is there a pigzlib planned? :-)

Jim, Paul: I'd say that this thread/bug can be closed if pigz proves to be as 
stable / error free as gzip. I suppose that while backporting -b to gzip could 
be done, it would not make much sense.

Best,
  Bastien

-- 
DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615

-----Original Message-----
From: Mark Adler [mailto:address@hidden]
Sent: Sunday, 10 April 2016 03:49
To: Chevreux, Bastien
Cc: Jim Meyering; address@hidden
Subject: Re: bug#23113: parallel gzip processes trash hard disks, need larger 
buffers

Bastien,

pigz (a parallel version of gzip) has a variable buffer size. The -b or 
--blocksize option allows up to 512 MB buffers, defaulting to 128K. See 
http://zlib.net/pigz/

Mark


> On Mar 29, 2016, at 4:03 PM, Chevreux, Bastien <address@hidden> wrote:
> 
>> From: address@hidden [mailto:address@hidden] On Behalf Of Jim Meyering
>> [...] However, I suggest that you consider using xz in place of gzip.
>> Not only can it compress better, it also works faster for comparable
>> compression ratios.
> 
> xz is not a viable alternative in this case: the use case is not archiving.
> There is a plethora of programs out there with zlib support compiled in, and
> these won't work on xz-packed data. Furthermore, gzip -1 is approximately 4
> times faster than xz -1 on FASTQ files (sequencing data), and the use case
> here is "temporary results, so ok-ish compression in a comparatively short
> amount of time". Gzip is ideal in that respect, as even at -1 it compresses
> down to ~25-35% ... and that already helps a lot when you need only ~350 GiB
> of hard disk instead of 1 TiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.
> 
>> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger 
>> values makes a significant difference, we'd like to hear about the results 
>> and how you measured.
> 
> Changing INBUFSIZ did not have the effect hoped for, as this is just the
> buffer size allocated by gzip ... in the end it uses only 64k at most, and
> the file system read() calls end up requesting only 32k per call.
> 
> I traced this down through multiple layers to the function fill_window() in
> deflate.c, where things get really intricate, using multiple pre-set
> variables, defines and memcpy()s. It became clear that the code is geared
> towards using a 64k buffer with a rolling window of 32k; that is, it is
> optimised for 16-bit machines.
> 
> There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via
> defines. However, code comments say that BIG_MEM would work on a complete
> file loaded in memory ... which is a no-go for files in the range of 15 to
> 30 GiB. I'm not even sure the code would be doing what the comments say.
> 
> Long story short: I do not feel expert enough to touch said functions and 
> change them to provide for larger input buffering. If I were forced to 
> implement something I'd try it with an outer buffering layer, but I'm not 
> sure it would be elegant or even efficient.
> 
> Best,
>  Bastien
> 
> PS: then again, I'm toying with the idea of writing a simple gzip-packer
> replacement which simply buffers data and passes it to zlib.
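
The idea in the PS above can be sketched in a few lines. This is only a
minimal illustration, not code from this thread: it uses Python's zlib module
as a stand-in for a C program linked against zlib, and the names gzip_stream
and gzip_bytes as well as the 64 MiB default buffer are assumptions made for
the example.

```python
import io
import zlib

# Hypothetical default: request up to 64 MiB per read() instead of 32k.
DEFAULT_BUFSIZE = 64 * 1024 * 1024


def gzip_stream(src, dst, level=1, bufsize=DEFAULT_BUFSIZE):
    """Copy src to dst as a gzip stream, reading up to bufsize bytes at a time."""
    # wbits = 16 + MAX_WBITS makes zlib write a gzip header and trailer,
    # so the output is readable by gzip and by zlib-based programs.
    comp = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    while True:
        chunk = src.read(bufsize)  # one large sequential read request
        if not chunk:              # an empty result means end of input
            break
        dst.write(comp.compress(chunk))
    dst.write(comp.flush())        # emit remaining data and the gzip trailer


def gzip_bytes(data, level=1, bufsize=DEFAULT_BUFSIZE):
    """Convenience wrapper for in-memory data, mainly useful for testing."""
    dst = io.BytesIO()
    gzip_stream(io.BytesIO(data), dst, level, bufsize)
    return dst.getvalue()
```

With an input file opened unbuffered, e.g. open(path, "rb", buffering=0),
each src.read(bufsize) call should translate into a single request of up to
bufsize bytes, so several such processes running in parallel would each read
long sequential runs rather than interleaved 32k pieces. Whether this actually
relieves the disks would still need to be measured.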
