From: Micah Cowan
Subject: Re: [Bug-wget] Difficulty downloading a site from archive.org
Date: Sat, 13 Aug 2011 09:39:01 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Thunderbird/3.1.11
On 08/12/2011 11:56 AM, phil curb wrote:
> I've been looking at downloading a site that's on archive.org
Archive.org's TOS on their website expressly forbids the use of "downloading agents", and names wget explicitly.
Links in pages archived on archive.org still point at the _original_ locations they pointed to when they were archived, which by now either serve a modern version of the page or no longer exist. These links are pretty much never the ones you want. Archive.org then embeds some JavaScript that goes through and rewrites all these URLs to point at archive.org. This means that in a browser, you'll see the "correct" URLs when you hover over a link, and when you click to follow it.
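To make the rewriting step concrete, here is a rough Python sketch of the kind of transformation such a script performs on a saved page. It is not archive.org's actual script, and it touches no network at all. The prefix `http://archive.example/web/20110813/`, the function name `rewrite_links`, and the regex-based matching are all assumptions made up for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical snapshot prefix; the real archive's URL layout may differ.
ARCHIVE_PREFIX = "http://archive.example/web/20110813/"

# Matches href="..." or src='...' attributes (a simplification; a real
# tool would use a proper HTML parser).
LINK_RE = re.compile(r"""\b(href|src)=(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_links(html, page_url, prefix=ARCHIVE_PREFIX):
    """Point every href/src at the archived copy rather than the original site."""

    def repl(match):
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        # Leave in-page anchors and non-HTTP schemes alone.
        if target.startswith(("#", "mailto:", "javascript:")):
            return match.group(0)
        # Resolve relative links against the page's original address.
        absolute = urljoin(page_url, target)
        # Skip links that already point into the archive.
        if absolute.startswith(prefix):
            return match.group(0)
        return f"{attr}={quote}{prefix}{absolute}{quote}"

    return LINK_RE.sub(repl, html)
```

For instance, `rewrite_links('<a href="/about.html">About</a>', "http://example.com/index.html")` turns the link target into the hypothetical prefix followed by `http://example.com/about.html`.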
The problem, of course, is that tools like wget won't run the script, so the original (useless) URLs remain, and wget tries to follow these. There's not really a lot you can do about it without rolling up your sleeves and hacking around the problem. But as I say, their TOS forbids you from accessing their site with wget anyway... they want you to always use their site directly.
(I'd be interested in knowing whether folks actually have legal obligations to respect TOS to an unrestricted-access site like that... I imagine it might even vary by location)
-- Micah J. Cowan http://micah.cowan.name/