bug-wget

Re: [Bug-wget] WARC output


From: Patrick Steil
Subject: Re: [Bug-wget] WARC output
Date: Tue, 9 Aug 2011 17:42:10 -0500

That sounds awesome!  You have my vote... :)



On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder <address@hidden> wrote:

> Hi,
>
> I'd like to propose a new feature that allows Wget to make WARC files.
>
> Perhaps you're already familiar with it, but in short: WARC is a file
> format for web archives. In a single WARC file, you can store every file of
> the website, plus the HTTP request and response headers and other metadata.
> This makes it a very useful format for web archivists: you keep everything
> together, in the most detailed and original form.
>
> The WARC format (an ISO standard, ISO 28500) has been developed by the
> International Internet Preservation Consortium, which includes the Internet
> Archive and many national libraries. It is supposed to become *the* standard
> file format for web archives. For example, it is used in the Internet
> Archive's Wayback Machine and its Heritrix crawler. There are several
> projects building tools to work with WARC files.
>
>
> It would be cool if Wget could become one of these tools. Wget is already
> the Swiss army knife for mirroring websites; the one thing it is missing is
> a good way to store these mirrors. The current output of --mirror is not
> sufficient for archival purposes:
>
>  - it throws away the HTTP headers (of the request and response);
>  - it doesn't keep 404 pages and redirects;
>  - it doesn't store the original URLs but mangles the filenames;
>  - and, if you're not careful, it even rewrites the links inside
>   the documents that it has downloaded.
>
> The WARC format supports these things.
>
>
> With some help from others, I've added WARC functions to Wget. With the
> --warc-file option you can specify that the mirror should also be written to
> a WARC archive. Wget will then keep everything, including the HTTP request
> and response headers, redirects and 404 pages.
>
> Do you think this is something that could be included in the main Wget
> version? If that's the case, what should be the next step?
>
> Description, links to more information about WARC:
>  
>  http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
>
> Code:
>  https://github.com/alard/wget-warc/
>  https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2
>
> The implementation makes use of the open source WARC Tools library
> (Apache License 2.0):
>  http://code.google.com/p/warc-tools/
>
>
> I look forward to your response.
>
> Kind regards,
>
> Gijs van Tulder
>
>


-- 


*Patrick Steil  |  ChurchBuzz.org*

Church Website Optimization <http://www.churchbuzz.org/>
Like us on Facebook <http://facebook.com/churchbuzz>!

Mobile: 940-391-9250

