Re: [Bug-wget] Recursive download and `trivial' redirects


From: Ángel González
Subject: Re: [Bug-wget] Recursive download and `trivial' redirects
Date: Mon, 25 Nov 2013 21:02:26 +0100
User-agent: Thunderbird

On 25/11/13 02:58, Maxim Kuznetsov wrote:
Hello there,

When a directory (or some `clean' URL) is requested without a slash at
the end of the URL -- e.g. example.com/foo -- web servers often add the
end-slash via a redirect example.com/foo -> example.com/foo/.  I'll
hereafter call such redirects `trivial'.

The problem is that some websites (e.g. ocw.mit.edu) use links without
the end-slash.  This means that when Wget (with -r) retrieves
example.com/foo, it saves the content to the file `foo' regardless
of the redirect.  Then, when Wget reads `foo' and sees a link to
example.com/foo/file.bar, it deletes the regular file `foo' and creates
a directory with the same name (in the function mkalldirs(), see
url.c:1220).  Therefore we lose the entire page.

Example of reproducer (GNU Wget 1.14.97-1221):
$ wget -d -r --no-parent \
    http://ocw.mit.edu/courses/mathematics/18-100b-analysis-i-fall-2010/ \
    2>&1 | grep "directory danger"
Removing ocw.mit.edu/courses/<skipped>/assignments because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/readings-notes because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/study-materials because of directory danger!

--trust-server-names solves this problem, but it is not obvious to a
user that it has to be passed every time together with -r, to say
nothing of the security implications.

Does it sound reasonable to handle such `trivial' redirects (that
simply add an end-slash) as a special case regardless of
`trust-server-names'?
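
To make the proposal concrete, here is a minimal sketch of what "trivial"
could mean in practice.  It is written in Python rather than Wget's C, and
is_trivial_redirect() is a hypothetical helper for illustration, not an
existing Wget function: it accepts a redirect only if scheme, host and
query are unchanged and the new path is the old path plus one end-slash.

```python
from urllib.parse import urlsplit


def is_trivial_redirect(original: str, location: str) -> bool:
    """Return True if `location` differs from `original` only by an added end-slash.

    Hypothetical helper for illustration; not part of Wget.
    """
    old, new = urlsplit(original), urlsplit(location)
    # Scheme, host (with port) and query string must be identical,
    # otherwise the redirect is more than a cosmetic one.
    if (old.scheme, old.netloc, old.query) != (new.scheme, new.netloc, new.query):
        return False
    # The only permitted change: exactly one slash appended to the path.
    return not old.path.endswith("/") and new.path == old.path + "/"
```

Under these assumptions a redirect to another host or another path is
still treated as non-trivial, so it would stay subject to
`trust-server-names'.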

Thanks

Probably, instead of being removed, the file should have been renamed to index.html inside the newly created directory.
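
A rough sketch of that idea, in Python rather than Wget's C.  The function
promote_file_to_directory(), the temporary suffix, and the default name
index.html are assumptions made for illustration; this is not how
mkalldirs() is currently written.

```python
import os


def promote_file_to_directory(path: str, index_name: str = "index.html") -> str:
    """Turn the regular file `path` into a directory that keeps the old content.

    Hypothetical illustration of renaming instead of deleting.
    """
    if not os.path.isfile(path):
        # Nothing to preserve: just make sure the directory exists.
        os.makedirs(path, exist_ok=True)
        return path
    tmp = path + ".tmp-promote"  # assumed temporary name, must not collide
    os.rename(path, tmp)         # move the file out of the way
    os.mkdir(path)               # create the directory under the old name
    os.rename(tmp, os.path.join(path, index_name))  # keep the page as the index
    return path
```

With something like this, the page saved as `foo' would end up as
foo/index.html instead of being lost when foo/file.bar is downloaded.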