bug-wget

Re: [Bug-wget] Difficulty downloading a site from archive.org


From: Micah Cowan
Subject: Re: [Bug-wget] Difficulty downloading a site from archive.org
Date: Sat, 13 Aug 2011 09:39:01 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Thunderbird/3.1.11

On 08/12/2011 11:56 AM, phil curb wrote:
> I've been looking at downloading a site that's on archive.org

Archive.org's TOS on their website expressly forbids the use of "downloading agents", and names wget explicitly.

Links in pages served by archive.org point at the _original_ locations (which today are either the modern site or nonexistent) that they pointed to when they were archived. These links are pretty much never the ones you want. Archive.org then embeds some JavaScript that goes through and rewrites all these URLs to point at archive.org. This means that in a browser, you'll see the "correct" URLs when you hover, and when you click to follow.

The problem of course is that tools like wget won't run the script, so the original (useless) URLs remain, and wget tries to follow them. Not really a lot you can do about it without rolling up your sleeves and hacking around the problem. But as I say, their TOS forbids you from accessing their site with wget anyway... they want you to always use their site directly.
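
To make the rewriting step concrete, here is a rough Python sketch of the kind of transformation the embedded script performs. This is not archive.org's actual code. The prefix shape, the snapshot timestamp, and the names ARCHIVE_PREFIX and rewrite_links are all assumptions made up for illustration, and it only touches href and src attributes in a string you already have in hand.

```python
import re
from urllib.parse import urljoin

# Assumption: archived copies live under a prefix of this general shape.
# The timestamp here is a made-up placeholder, not a real snapshot.
ARCHIVE_PREFIX = "http://web.archive.org/web/20110813000000/"

# Matches href="..." or src='...' and captures the attribute, quote, and URL.
ATTR_RE = re.compile(r'''(\b(?:href|src)\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)

def rewrite_links(html, page_url, prefix=ARCHIVE_PREFIX):
    """Return html with href/src values rewritten to sit under prefix."""
    def fix(match):
        attr, quote, url = match.groups()
        # Leave fragments, non-HTTP schemes, and already-rewritten URLs alone.
        if url.startswith(prefix) or url.startswith(("#", "mailto:", "javascript:")):
            return match.group(0)
        # Resolve relative links against the page's original location first.
        absolute = urljoin(page_url, url)
        return f"{attr}{quote}{prefix}{absolute}{quote}"
    return ATTR_RE.sub(fix, html)

if __name__ == "__main__":
    sample = '<a href="/about.html">About</a>'
    print(rewrite_links(sample, "http://example.com/index.html"))
```

A regex is a blunt instrument for HTML, so treat this as an illustration of the idea rather than something robust; a real attempt would want a proper parser.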

(I'd be interested in knowing whether folks actually have legal obligations to respect TOS to an unrestricted-access site like that... I imagine it might even vary by location)

--
Micah J. Cowan
http://micah.cowan.name/


