[Bug-wget] WARC output


From: Gijs van Tulder
Subject: [Bug-wget] WARC output
Date: Tue, 09 Aug 2011 11:49:56 +0200

Hi,

I'd like to propose a new feature that allows Wget to make WARC files.

Perhaps you're already familiar with it, but in short: WARC is a file format for web archives. In a single WARC file, you can store every file of the website, plus the HTTP request and response headers and other metadata. This makes it a very useful format for web archivists: you keep everything together, in the most detailed and original form.

The WARC format (an ISO standard, ISO 28500) has been developed by the International Internet Preservation Consortium, which includes the Internet Archive and many national libraries. It is supposed to become *the* standard file format for web archives. For example, it is used in the Internet Archive's Wayback Machine and its Heritrix crawler. There are several projects building tools to work with WARC files.


It would be cool if Wget could become one of these tools. Wget is already the Swiss army knife for mirroring websites; the one thing it is missing is a good way to store these mirrors. The current output of --mirror is not sufficient for archival purposes:

 - it throws away the HTTP headers (of the request and response);
 - it doesn't keep 404 pages and redirects;
 - it doesn't store the original URLs but mangles the filenames;
 - and, if you're not careful, it even rewrites the links inside
   the documents that it has downloaded.

The WARC format supports these things.
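
To make this concrete, here is a rough Python sketch of what a single WARC "response" record looks like on disk. It is only an illustration of the record layout as I understand it from the WARC 1.0 standard, not the code from the Wget patch or from the WARC Tools library. The function name build_warc_response_record and the example.com URL are made up for the example.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_response_record(target_uri, http_response, when=None):
    """Wrap a raw HTTP response (status line, headers, body) in a WARC record.

    Hypothetical helper for illustration only. A naive `when` value is
    interpreted as local time by astimezone().
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        # The original URL is stored as-is, so no filename mangling is needed.
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Length of the record block: the complete, unmodified HTTP response.
        ("Content-Length", str(len(http_response))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += ("%s: %s" % (name, value)).encode("utf-8") + CRLF
    # A blank line ends the WARC headers; two CRLFs end the record.
    return head + CRLF + http_response + CRLF + CRLF


# Example: a 404 response is kept just like any other response.
raw_404 = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"Not found"
)
record = build_warc_response_record("http://example.com/missing", raw_404)
```

Because the HTTP status line and headers are part of the record block, redirects and error pages are stored in exactly the same way as successful responses. The request that was sent can be kept in a similar record with WARC-Type "request".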


With some help from others, I've added WARC functions to Wget. With the --warc-file option you can specify that the mirror should also be written to a WARC archive. Wget will then keep everything, including the HTTP request and response headers, redirects and 404 pages.

Do you think this is something that could be included in the main Wget version? If that's the case, what should be the next step?

Description and links to more information about WARC:
 http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

Code:
 https://github.com/alard/wget-warc/
 https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2

The implementation makes use of the open source WARC Tools library
(Apache License 2.0):
 http://code.google.com/p/warc-tools/


I look forward to your response.

Kind regards,

Gijs van Tulder