Re: [Bug-wget] WARC output
From: Gijs van Tulder
Subject: Re: [Bug-wget] WARC output
Date: Sat, 08 Oct 2011 21:51:31 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0) Gecko/20110923 Thunderbird/7.0
Hi Giuseppe and Ángel,
Thanks for looking at the patch. Yes, it's quite big. (I should mention
that it was never my intention to have this complete patch added
to the wget repository; it is a first patch to show the differences.)
Ángel González writes:
> I don't think all those files are even remotely needed.
> I am seeing for instance, python files for creating warc interacting
> with curl.
True. The patch I have sent you contains the complete warc tools
library, with lots of things that aren't really needed for this task.
I have looked at the C files and headers that are needed for wget. I
think there are approximately 110 files, with a total size of 1.3 MB,
that are actually used by the wget extension.
> Also, the patch seems to duplicate code (compare lines 337731-337810
> with 337944-338013 in the patch file). Surely that could be
> refactored?
That is also true, and there is a reason for it: I tried to add the WARC
bits with as few changes to the current wget code as possible. However,
the structure of http.c and its gethttp function made it necessary to
have bits of very similar (but not exactly duplicate) code.
It's certainly possible to refactor, but I think that to do that you'd
also have to refactor large parts of gethttp and related functions.
Gijs