From: Gijs van Tulder
Subject: Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
Date: Sun, 31 Mar 2013 00:46:00 +0100
> It appears wget may be creating slightly malformed GZIP skip-length
> fields

I think that's correct: Wget doesn't write the subfield length in the "extra field" section of the header. After the subfield ID "sl" it should write the length LEN (see RFC 1952 [1]), but it doesn't.

Luckily, it does write the correct total length of the extra field (XLEN in RFC 1952), so gzip implementations that simply ignore the extra field can skip over it without problems. This is the case for the GNU gzip utility.
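
To make the layout concrete, here is a minimal Python sketch of the extra field as RFC 1952 describes it: XLEN (2 bytes, little-endian), followed by subfields that each consist of a two-byte ID, a two-byte little-endian LEN, and LEN bytes of data. The function names and the 8-byte payload (two 32-bit sizes) are assumptions for illustration, not taken from the Wget source.

```python
import struct

def build_subfield(si, data):
    """One subfield: SI1 SI2, LEN (2 bytes, little-endian), then the data."""
    assert len(si) == 2
    return si + struct.pack("<H", len(data)) + data

def build_extra_field(subfields):
    """XLEN (2 bytes, little-endian) followed by all subfields."""
    body = b"".join(build_subfield(si, data) for si, data in subfields)
    return struct.pack("<H", len(body)) + body

def parse_subfields(body):
    """Walk the bytes that follow XLEN; raise ValueError if malformed."""
    out, pos = [], 0
    while pos < len(body):
        if pos + 4 > len(body):
            raise ValueError("truncated subfield header")
        si = body[pos:pos + 2]
        (length,) = struct.unpack_from("<H", body, pos + 2)
        pos += 4
        if pos + length > len(body):
            raise ValueError("subfield length exceeds XLEN")
        out.append((si, body[pos:pos + length]))
        pos += length
    return out

# Hypothetical payload: two 32-bit little-endian sizes.
payload = struct.pack("<II", 1234, 5678)
extra = build_extra_field([(b"sl", payload)])
```

A reader that only uses XLEN to skip the field never looks at LEN, which is why such readers are unaffected. A reader that walks the subfields, like `parse_subfields` above, will misread the first two payload bytes as LEN when it is missing.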

But it should be fixed. I've attached a patch.

> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.

When I wrote the code for the extra field I used the old Hanzo warc-tools [2] as an example. That implementation has the same problem: it doesn't write the field length [3]. This means there's at least one other tool that writes these off-spec warc.gz files, so it's probably useful to make the parser a bit more robust.
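
One possible way to make a parser more tolerant, sketched in Python under stated assumptions: try the strict RFC 1952 subfield walk first, and if that fails and the field starts with "sl", treat everything after the ID as the payload. The function names are hypothetical, and the fallback is a heuristic of my own for illustration, not something any existing parser is known to do.

```python
import struct

def parse_strict(body):
    """Walk subfields (ID, LEN, data) in the bytes that follow XLEN."""
    out, pos = [], 0
    while pos < len(body):
        if pos + 4 > len(body):
            raise ValueError("truncated subfield header")
        si = body[pos:pos + 2]
        (length,) = struct.unpack_from("<H", body, pos + 2)
        pos += 4
        if pos + length > len(body):
            raise ValueError("subfield length exceeds XLEN")
        out.append((si, body[pos:pos + length]))
        pos += length
    return out

def parse_tolerant(body):
    """Strict parse first; fall back to an 'sl' subfield written without LEN."""
    try:
        return parse_strict(body)
    except ValueError:
        if body[:2] == b"sl":
            # Off-spec writer: no LEN, so the rest of the field is the payload.
            return [(b"sl", body[2:])]
        raise
```

The fallback is not airtight: if the first two payload bytes happen to form a LEN that fits exactly, the strict parse succeeds with the wrong data. A real parser would probably also want to check the payload size it expects for "sl".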



[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[3] https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314

