[Bug-wget] WARC output
From: Gijs van Tulder
Subject: [Bug-wget] WARC output
Date: Tue, 09 Aug 2011 11:49:56 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11
Hi,
I'd like to propose a new feature that allows Wget to make WARC files.
Perhaps you're already familiar with it, but in short: WARC is a file
format for web archives. In a single WARC file, you can store every file
of the website, plus the HTTP request and response headers and other
metadata. This makes it a very useful format for web archivists: you
keep everything together, in the most detailed and original form.
The WARC format (an ISO standard, ISO 28500) has been developed by the
International Internet Preservation Consortium, which includes the
Internet Archive and many national libraries. It is supposed to become
*the* standard file format for web archives. For example, it is used in
the Internet Archive's Wayback Machine and its Heritrix crawler. There
are several projects building tools to work with WARC files.
It would be cool if Wget could become one of these tools. Wget is
already the Swiss army knife for mirroring websites; the one thing it is
missing is a good way to store these mirrors. The current output of
--mirror is not sufficient for archival purposes:
- it throws away the HTTP headers (of the request and response);
- it doesn't keep 404 pages and redirects;
- it doesn't store the original URLs but mangles the filenames;
- and, if you're not careful, it even rewrites the links inside
the documents that it has downloaded.
The WARC format supports these things.
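To illustrate the points above, here is a rough Python sketch of what a single WARC "response" record looks like on disk. It is not taken from the Wget patch or from the WARC Tools library; the helper name warc_record and the example URL are made up for illustration, and only a handful of the header fields defined by the standard are shown. The point is that the original URL, the HTTP status line and headers, and even a 404 body are stored verbatim.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, block, content_type):
    """Serialize one WARC/1.0 record (hypothetical helper, for illustration).

    Layout: version line, named header fields, a blank line, the content
    block, then two CRLFs that terminate the record.
    """
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),  # the original URL, unmangled
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),  # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


# A 404 response exactly as the server sent it: status line, headers, body.
http_response = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"Not found"
)

record = warc_record(
    "response",
    "http://example.com/missing?page=1",  # example URL, an assumption
    http_response,
    "application/http; msgtype=response",
)
```

A matching "request" record would be built the same way from the bytes of the HTTP request, and a WARC file is simply a sequence of such records.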
With some help from others, I've added WARC functions to Wget. With the
--warc-file option you can specify that the mirror should also be
written to a WARC archive. Wget will then keep everything, including the
HTTP request and response headers, redirects and 404 pages.
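As a usage sketch, the small Python helper below builds the kind of command line this describes. Only the --mirror and --warc-file option names come from this message; passing the archive name as --warc-file=NAME, the helper name wget_warc_command, and the example URL and archive name are assumptions.

```python
import shlex


def wget_warc_command(url, warc_name):
    """Build an argument list for a mirror that also writes a WARC archive.

    Assumption: the archive name is passed as --warc-file=NAME.
    """
    return ["wget", "--mirror", "--warc-file=" + warc_name, url]


argv = wget_warc_command("http://example.com/", "example")
# Print the command as it would be typed in a shell; nothing is executed.
print(shlex.join(argv))
```

To actually run it, the list could be handed to subprocess.run(argv, check=True) on a machine where the patched Wget is installed.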
Do you think this is something that could be included in the main Wget
version? If that's the case, what should be the next step?
Description and links to more information about WARC:
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
Code:
https://github.com/alard/wget-warc/
https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2
The implementation makes use of the open source WARC Tools library
(Apache License 2.0):
http://code.google.com/p/warc-tools/
I look forward to your response.
Kind regards,
Gijs van Tulder