[Bug-wget] Recursive download and `trivial' redirects

From: Maxim Kuznetsov
Subject: [Bug-wget] Recursive download and `trivial' redirects
Date: Mon, 25 Nov 2013 05:58:47 +0400

Hello there,

When a directory (or some `clean' URL) is requested without a slash at
the end of the URL -- e.g. example.com/foo -- web servers often add an
end-slash via a redirect: example.com/foo -> example.com/foo/.  I'll
hereafter call such redirects `trivial'.
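To make the definition concrete, here is a minimal sketch of a check for
such a redirect.  It is not Wget code: the function name
is_trivial_redirect is hypothetical, and it assumes the Location value
has already been resolved to an absolute URL.

```python
from urllib.parse import urlsplit


def is_trivial_redirect(original: str, location: str) -> bool:
    """Return True if `location` is `original` with exactly one '/' appended to the path.

    Hypothetical helper for illustration; assumes both arguments are absolute URLs.
    """
    old, new = urlsplit(original), urlsplit(location)
    # Scheme and host (including any port) must be unchanged.
    same_origin = (old.scheme, old.netloc) == (new.scheme, new.netloc)
    # Query string and fragment must be unchanged as well.
    same_tail = (old.query, old.fragment) == (new.query, new.fragment)
    # The only permitted difference: one trailing slash added to the path.
    return (
        same_origin
        and same_tail
        and not old.path.endswith("/")
        and new.path == old.path + "/"
    )
```

With this check, example.com/foo -> example.com/foo/ counts as trivial,
while a redirect to another host or another path does not.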

The problem is that some websites (e.g. ocw.mit.edu) use links without
an end-slash.  This means that when Wget (with -r) retrieves
example.com/foo, it'll save the content to the file `foo' regardless
of the redirect.  Then when Wget reads `foo' and sees a link to
example.com/foo/file.bar, it'll delete the regular file `foo' and
create a directory with the same name (in the function mkalldirs(), see
url.c:1220).  Therefore we lose the entire page.

Example reproducer (GNU Wget 1.14.97-1221):
$ wget -d -r --no-parent
2>&1 | grep "directory danger"
Removing ocw.mit.edu/courses/<skipped>/assignments because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/readings-notes because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/study-materials because of directory danger!

--trust-server-names solves this problem, but it is not obvious to a
user that it has to be passed every time together with -r, to say
nothing of the security concerns.

Does it sound reasonable to handle such `trivial' redirects (that
simply add an end-slash) as a special case regardless of


Maxim Kuznetsov
