
Re: [Bug-wget] URL normalisation: consecutive forward slashes


From: Giuseppe Scrivano
Subject: Re: [Bug-wget] URL normalisation: consecutive forward slashes
Date: Thu, 03 Jun 2010 14:32:45 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)

Hello,

Thanks for your report.  I am not sure that URL normalisation should
collapse multiple consecutive forward slashes; I don't see anything
about it in RFC 1808.  We can't assume that "foo//bar" is the same as
"foo/bar": the server could handle the two differently, for example the
empty segment may be part of PATH_INFO.
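
To make the PATH_INFO point concrete, here is a minimal Python sketch.
The helper name path_info_fields and the idea of a script reading
positional fields from PATH_INFO are assumptions for illustration only;
they are not taken from any particular server.

```python
def path_info_fields(path_info):
    """Split a PATH_INFO string into positional fields.

    Hypothetical helper: empty fields are kept on purpose, because a
    script may give them meaning (for example "field left blank").
    """
    # "/foo//bar".split("/") yields ["", "foo", "", "bar"]; drop the
    # leading empty string produced by the initial slash.
    return path_info.split("/")[1:]


if __name__ == "__main__":
    print(path_info_fields("/foo//bar"))
    print(path_info_fields("/foo/bar"))
```

A script written this way sees three fields for "/foo//bar" and two for
"/foo/bar", so a client that collapsed the slashes would change the
request.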

AFAICS, Firefox and Chromium don't normalize consecutive forward slashes
either.

Cheers,
Giuseppe



Cillian Sharkey <address@hidden> writes:

> Hi,
>
> I've found wget does not always correctly normalise URLs by collapsing
> multiple consecutive forward slashes into a single slash.
>
> This is a problem when recursively mirroring a site, as certain kinds of
> links with multiple consecutive slashes will cause wget to go into an
> infinite loop, limited only by the maximum depth level.
>
> Without complete normalisation, a link with extra slashes is seen as a
> new URL that has not been visited, even if it has already been visited.
> With each traversal an extra slash is added to the path, causing the
> loop.
>
> Example:
>
> /index.html has href to "foo/loop.html"
> /foo/loop.html has href to "..//index.html"
>
> Results in the following link traversal:
>
> /index.html
> /foo/loop.html
> //index.html
> //foo/loop.html
> ///index.html
> ///foo/loop.html
> [..]
>
> I've tried a combination of URLs with and without consecutive slashes,
> to test wget's behaviour. Results as follows:
>
> /index.html links to:
>
> HREF:                  wget requests:      should be:
>
> /a//../b/10.html       /a/b/10.html        /b/10.html
> /a/../b/11.html        /b/11.html
>                        
> /a/b/..//../c/20.html  /a/c/20.html        /c/20.html
> /a/b/../../c/21.html   /c/21.html
>                        
> ..//30.html            //30.html           /30.html
> ../31.html             /31.html
>                        
> .//40.html             //40.html           /40.html
> ./41.html              /41.html
>                        
> //50.html              Skipped, not downloaded!
> /51.html               /51.html
>
>
> wget --version
>
> GNU Wget 1.12 built on linux-gnu.
>
> +digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl 
> -iri 
>
> Wgetrc: 
>     /etc/wgetrc (system)
> Locale: /usr/share/locale 
> Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" 
>     -DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2 
>     -D_FILE_OFFSET_BITS=64 -O2 -g -Wall 
> Link: gcc -g -O2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall /usr/lib/libssl.so 
>     /usr/lib/libcrypto.so -ldl -lrt ftp-opie.o openssl.o http-ntlm.o 
>     gen-md5.o ../lib/libgnu.a
>
> Regards,
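
The accumulating slashes described in the report can be reproduced with
a small Python sketch. It assumes the reference resolution steps of
RFC 3986 (merge in section 5.2.3, dot-segment removal in section 5.2.4),
which is not necessarily what wget 1.12 implements. The function names
and the two-page link table are hypothetical. Hrefs starting with "//"
are deliberately not handled, since RFC 3986 treats them as
network-path references rather than paths.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments, following RFC 3986 section 5.2.4.

    Empty segments (consecutive slashes) are ordinary segments here and
    are never collapsed.
    """
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/", to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def resolve(base_path, href):
    """Resolve a path-only href against an absolute base path."""
    if href.startswith("/"):
        merged = href
    else:
        # Keep everything up to and including the last "/" of the base.
        merged = base_path[: base_path.rfind("/") + 1] + href
    return remove_dot_segments(merged)


# Hypothetical two-page site from the report: page name -> href it contains.
LINKS = {"index.html": "foo/loop.html", "loop.html": "..//index.html"}


def crawl(start, steps):
    """Follow the single link on each page for a fixed number of steps."""
    path = start
    visited = [path]
    for _ in range(steps):
        page = path.rsplit("/", 1)[-1]
        path = resolve(path, LINKS[page])
        visited.append(path)
    return visited


if __name__ == "__main__":
    for p in crawl("/index.html", 5):
        print(p)
    print(resolve("/index.html", "/a//../b/10.html"))
```

Under these assumptions each round trip adds one leading slash, and
"/a//../b/10.html" resolves to "/a/b/10.html", because ".." removes the
empty segment rather than "a". A crawler that wants to avoid the loop
without changing what it requests could compare collapsed paths only
when checking its visited set; that is a design choice, not something
either RFC requires.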


