Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files

From: Gijs van Tulder
Subject: Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
Date: Sun, 31 Mar 2013 00:46:00 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4


> It appears wget may be creating slightly malformed GZIP skip-length
> fields

I think that's correct: Wget doesn't write the subfield length in the "extra field" section of the gzip header. After the two-byte subfield ID "sl" it should write the two-byte subfield length LEN (see RFC 1952 [1]), followed by the subfield data, but it writes the data immediately after the ID.
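To make the expected layout concrete, here is a minimal Python sketch of how a subfield is framed according to RFC 1952: two ID bytes, a little-endian 16-bit LEN, then the data, all preceded by XLEN. The helper names are made up for illustration, and the 8-byte payload (two 32-bit little-endian sizes) is only an assumption about what the "sl" field carries; this is not the code from the patch.

```python
import struct

def build_subfield(subfield_id: bytes, data: bytes) -> bytes:
    """Frame one subfield as SI1 SI2, LEN (uint16, little-endian), data."""
    if len(subfield_id) != 2:
        raise ValueError("subfield ID must be exactly two bytes")
    return subfield_id + struct.pack("<H", len(data)) + data

def build_extra_field(subfields: bytes) -> bytes:
    """Prefix the concatenated subfields with XLEN (uint16, little-endian)."""
    return struct.pack("<H", len(subfields)) + subfields

# Assumed payload for illustration: two 32-bit little-endian sizes.
payload = struct.pack("<II", 1234, 5678)
extra = build_extra_field(build_subfield(b"sl", payload))
```

With an 8-byte payload the result is 14 bytes long: XLEN = 12, then "sl", LEN = 8, and the data. Leaving out LEN gives XLEN = 10 and the data directly after the ID, which is the off-spec layout discussed here.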

Luckily, it does write the correct total length of the extra field (XLEN in RFC 1952), so Gzip implementations that just ignore the extra field can skip it without problems. This is the case for the GNU Gzip utility.
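A small sketch of why such files still decompress: a decoder that only uses XLEN to jump over the extra field never looks at the subfield structure. Python's gzip module is used here purely as a stand-in for such a decoder. The function name, the zeroed sizes, and the sample payload are placeholders for illustration, not what Wget writes.

```python
import gzip
import struct
import zlib

def gzip_member_with_extra(payload: bytes, extra: bytes) -> bytes:
    """Hand-build one gzip member with FEXTRA set and the given extra bytes."""
    # ID1, ID2, CM=8 (deflate), FLG=0x04 (FEXTRA), MTIME=0, XFL=0, OS=255
    header = b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
    header += struct.pack("<H", len(extra)) + extra  # XLEN, then the extra bytes
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = comp.compress(payload) + comp.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer

# Off-spec extra field: "sl" followed directly by 8 data bytes, no LEN.
off_spec_extra = b"sl" + struct.pack("<II", 0, 0)
member = gzip_member_with_extra(b"WARC/1.0\r\n", off_spec_extra)

# XLEN is correct, so the decoder skips the extra field and reads the data.
decoded = gzip.decompress(member)
```

Only a reader that actually walks the subfields inside the extra field would notice the missing LEN.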

But it should be fixed. I've attached a patch.

> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.

When I wrote the code for the extra field I used the old Hanzo warc-tools [2] as an example. That implementation has the same problem: it doesn't write the field length [3]. This means there's at least one other tool that writes these off-spec warc.gz files, so it's probably useful to make the parser a bit more robust.
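One way a parser could tolerate both variants is sketched below. It rests on assumptions that should be checked against real files: that "sl" is the only subfield in the extra field, and that its data is always 8 bytes (two 32-bit little-endian integers). The function name is hypothetical, and the sketch does not say which of the two values is the compressed size.

```python
import struct

def parse_skip_length(extra: bytes):
    """Return the two 32-bit values of an "sl" extra field, or None.

    Accepts the RFC 1952 layout ("sl", LEN=8, 8 data bytes) and the
    off-spec layout without LEN ("sl", 8 data bytes).
    """
    if extra[:2] != b"sl":
        return None
    if len(extra) == 12 and extra[2:4] == b"\x08\x00":
        data = extra[4:]   # on-spec: LEN is present
    elif len(extra) == 10:
        data = extra[2:]   # off-spec: data follows the ID directly
    else:
        return None        # unknown layout, leave it to the caller
    return struct.unpack("<II", data)
```

Telling the two layouts apart by total length only works because the data size is assumed to be fixed. A general subfield parser would have to fall back to such a heuristic only after a strict RFC 1952 parse fails.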



[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[3] https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314

Attachment: warc-gzip-write-length-of-extra-field.patch
Description: Text Data
