Re: [Bug-wget] WARC, new version
From: David H. Lipman
Subject: Re: [Bug-wget] WARC, new version
Date: Sun, 30 Oct 2011 17:42:57 -0400
From: "Gijs van Tulder" <address@hidden>
> Hi David,
>
> David H. Lipman wrote:
>> I have seen WARC mentioned but have not seen a definition.
>
> WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
> resources. It is used for making archives of web sites. The Internet
> Archive, for example, uses it as the file format for their Wayback Machine
> and Heritrix crawler.
>
> The nice thing about WARC is that it lets you store all information about
> your web crawl: the files you download, of course, but also things like the
> HTTP request and response headers and information about redirects and error
> pages. WARC also provides a place to keep the related metadata. It is, in
> short, a way to store everything in a standardized file format.
>
> Adding WARC to wget means that you'll be able to do things like
>
> wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
>
> which will produce (next to the normal wget download) a file named
> 'gnu.warc.gz' that contains every HTTP request that wget made and every
> HTTP response it received. This is an 'archival grade' copy of the mirrored
> site.
>
> Once you have the WARC file, you could store it in your archive, extract
> files from it, or run your own local Wayback Machine [2, 3].
>
> wget is already a very useful tool for making a quick copy of a website.
> Adding WARC support helps to make that copy as complete as possible.
>
> Maybe that answers some of your questions?
>
> Regards,
>
> Gijs
>
>
> [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> [2] http://archive-access.sourceforge.net/projects/wayback/
> [3] http://netpreserve.org/software/downloads.php
>
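For anyone who wants to look inside such a file, here is a rough sketch (not part of Gijs's patch) that lists the records in a WARC file using only the Python standard library. It assumes the usual WARC layout: a "WARC/x.y" version line, named header fields, a blank line, then a block of Content-Length bytes. It also assumes that a .warc.gz file decompresses to a plain sequence of such records. The function names iter_warc_records and summarize are made up for this illustration.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream.

    Header names are lower-cased so lookups do not depend on how the
    writer capitalised them.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines separating two records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The block is exactly Content-Length bytes long.
        length = int(headers.get("content-length", "0"))
        body = stream.read(length)
        yield headers, body


def summarize(path):
    """Print the type, target URI and block size of every record in a .warc.gz file."""
    with gzip.open(path, "rb") as f:
        for headers, body in iter_warc_records(f):
            print(headers.get("warc-type"), headers.get("warc-target-uri"), len(body))


# Example, assuming the wget command quoted above has been run:
# summarize("gnu.warc.gz")
```

This only lists what is in the file. Extracting the downloaded documents would additionally mean splitting each 'response' block into its HTTP headers and payload.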
It answers all my questions, and now I understand.
*Thank You Gijs !*
--
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp