Hello there,
When a directory (or some `clean' URL) is requested without a trailing
slash, e.g. example.com/foo, web servers often add the slash with a
redirect: example.com/foo -> example.com/foo/. I'll hereafter call
such redirects `trivial'.
The problem is that some websites (e.g. ocw.mit.edu) use links without
the trailing slash. When Wget (with -r) retrieves example.com/foo, it
saves the content to the file `foo' regardless of the redirect. Then,
when Wget parses `foo' and sees a link to example.com/foo/file.bar, it
deletes the regular file `foo' and creates a directory with the same
name (in the function mkalldirs(), see url.c:1220). As a result, the
entire page is lost.
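The file-versus-directory clash can be reproduced outside Wget. The
following is a minimal sketch, not Wget code: the function name
demo_directory_danger() is made up for illustration, and it only mirrors
the sequence described above (save a page as a regular file, fail to
create a directory of the same name, remove the file, create the
directory).

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustration only, not Wget code.  Saves a "page" as the regular
   file PATH, then turns PATH into a directory the way a recursive
   download would need to.  Returns 0 if the sequence behaves as
   described, -1 otherwise.  */
int
demo_directory_danger (const char *path)
{
  FILE *fp = fopen (path, "w");
  if (!fp)
    return -1;
  fputs ("<html>page saved as a regular file</html>\n", fp);
  if (fclose (fp) != 0)
    return -1;

  /* A directory cannot be created while a file of that name exists.  */
  if (mkdir (path, 0755) == 0 || errno != EEXIST)
    return -1;

  /* The only way forward is to remove the file, losing the page.  */
  if (unlink (path) != 0)
    return -1;

  return mkdir (path, 0755);
}
```

After a successful call, PATH is an empty directory and the content
written in the first step is gone.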
Reproducer (GNU Wget 1.14.97-1221):
$ wget -d -r --no-parent \
    http://ocw.mit.edu/courses/mathematics/18-100b-analysis-i-fall-2010/ \
    2>&1 | grep "directory danger"
Removing ocw.mit.edu/courses/<skipped>/assignments because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/readings-notes because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/study-materials because of directory danger!
--trust-server-names works around the problem, but it is not obvious
to a user that it has to be passed every time together with -r, to say
nothing of the security implications.
Would it be reasonable to handle such `trivial' redirects (those that
simply add a trailing slash) as a special case, regardless of
`--trust-server-names'?
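One possible shape for such a check, as a minimal sketch only: the
helper name is_trivial_redirect() is hypothetical and does not exist in
Wget. It compares raw URL strings; a real patch would presumably work
on parsed URL components and would have to decide what to do about
query strings and fragments.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Wget.  Returns true when REDIRECTED
   is exactly ORIGINAL with a single '/' appended, i.e. the redirect
   only adds a trailing slash.  */
bool
is_trivial_redirect (const char *original, const char *redirected)
{
  size_t orig_len, redir_len;

  if (!original || !redirected)
    return false;

  orig_len = strlen (original);
  redir_len = strlen (redirected);

  /* The target must be exactly one character longer.  */
  if (redir_len != orig_len + 1)
    return false;

  /* The original must be non-empty and must not already end in '/'.  */
  if (orig_len == 0 || original[orig_len - 1] == '/')
    return false;

  /* Same prefix, and the extra character is the slash.  */
  return redirected[redir_len - 1] == '/'
         && strncmp (original, redirected, orig_len) == 0;
}
```

With this check, example.com/foo -> example.com/foo/ counts as
trivial, while a redirect to another host or another path does not and
would still be subject to `--trust-server-names'.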
Thanks