Re: [Bug-wget] Enqueue logic problems


From: Tim Ruehsen
Subject: Re: [Bug-wget] Enqueue logic problems
Date: Thu, 2 May 2013 17:30:23 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-4-amd64; KDE/4.8.4; x86_64; ; )

Darshit, I guess you are talking about redirection.

That is, 'wget -r gnu.org' is redirected to www.gnu.org (via the Location 
header). Wget follows the redirection, but downloads only index.html, because 
all URLs included in index.html refer to www.gnu.org, while we requested stuff 
from gnu.org.

That's why only one file (index.html) is downloaded.
But that is not what the user expects...
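
To make the failure concrete, here is a minimal Python sketch of a strict 
same-host check. It is not wget's actual code: the function name same_host 
and the two link URLs are made up for illustration, and the only assumption 
is that links are compared against the host the user originally typed.

```python
from urllib.parse import urlsplit

def same_host(start_url, link_url):
    """Strict hostname comparison (urlsplit lowercases the hostname)."""
    return urlsplit(start_url).hostname == urlsplit(link_url).hostname

start = "http://gnu.org/"  # what the user typed on the command line

# Hypothetical links found in index.html after the redirect.
links = [
    "http://www.gnu.org/licenses/",
    "http://www.gnu.org/software/wget/",
]

# Every link points at www.gnu.org, so none passes the check.
enqueued = [u for u in links if same_host(start, u)]
print(enqueued)  # prints []
```

With nothing enqueued, the recursion ends after the first file.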

The user could work around it using the -D and/or -H options, but then they 
have to know about the redirection before starting wget. Not everyone has the 
understanding to find that out.
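
As a rough illustration of what '-H -D gnu.org' would admit, the Python 
sketch below filters hosts by domain suffix. It assumes matching happens at a 
label boundary; wget's real matching may differ in detail, and the name 
in_domain_list is hypothetical.

```python
def in_domain_list(host, domains):
    """Return True if host equals, or is a subdomain of, any listed domain."""
    host = host.lower().rstrip(".")
    for domain in domains:
        domain = domain.lower().lstrip(".")
        if host == domain or host.endswith("." + domain):
            return True
    return False

allowed = ["gnu.org"]  # the value passed to -D in the workaround

for host in ["gnu.org", "www.gnu.org", "lists.gnu.org", "example.com"]:
    print(host, in_domain_list(host, allowed))
```

Under this assumption lists.gnu.org is admitted along with www.gnu.org, so 
the workaround is broader than "the same site with or without www".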

Should wget's behaviour change (by default or via a new option), or should we 
leave it as is and print a verbose message that makes the situation clear to 
the user?
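
One possible shape for such a change, following the suggestion quoted below 
that a leading www be added or removed, is sketched here in Python. This is 
only an illustration of the idea, not a proposed patch; strip_www and 
same_site_ignoring_www are hypothetical names.

```python
def strip_www(host):
    """Drop a single leading 'www.' label, case-insensitively."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host

def same_site_ignoring_www(host_a, host_b):
    """Treat 'example.org' and 'www.example.org' as the same site."""
    return strip_www(host_a) == strip_www(host_b)

print(same_site_ignoring_www("gnu.org", "www.gnu.org"))    # True
print(same_site_ignoring_www("gnu.org", "lists.gnu.org"))  # False
```

Unlike a domain-wide filter, this keeps other hosts under the same domain 
out of the queue.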

Regards, Tim

On Thursday 02 May 2013, Micah Cowan wrote:
> I believe you want -H -D gnu.org. That's what it's for. Wget doesn't
> know which hostnames under a domain should be allowed and which should
> not be (do you want images.gnu.org? git.gnu.org? lists.gnu.org?), so
> turns 'em all off unless you ask for them explicitly.
> 
> HTH,
> -mjc
> 
> On Thu, May 2, 2013 at 4:52 AM, Darshit Shah <address@hidden> wrote:
> > I should have been more clear. --span-hosts will enqueue the other files,
> > but it will also enqueue files from other hosts. I wish to recursively
> > download a website but not other sites that it links to.
> > 
> > Of course I could add --accept-regex / --reject-regex options to prevent
> > wget from wandering onto other hosts. But shouldn't the default
> > --recursive option simply handle cases where a www is either added or
> > removed? Or is there any scenario that I am missing which would cause
> > undesirable effects here?
> > 
> > On Thu, May 2, 2013 at 5:22 PM, Giuseppe Scrivano <address@hidden> wrote:
> >> Darshit Shah <address@hidden> writes:
> >> > When using the --recursive option with wget, there seems to be a
> >> > small issue with the logic that decides whether to enqueue a file to
> >> > the downloads list or not.
> >> > 
> >> > By default wget downloads files only from the same host. However, this
> >> > causes a problem when the target hostname changes thus:
> >> > parent: gnu.org
> >> > target: www.gnu.org
> >> > 
> >> > This issue causes wget to stop after just one download on a lot of
> >> > sites. I'm not sure if this exists in older or release versions, since
> >> > I only have the development version installed.
> >> 
> >> does --span-hosts fix this scenario for you?
> >> 
> >> Cheers,
> >> Giuseppe
> > 
> > --
> > Thanking You,
> > Darshit Shah
> > Research Lead, Code Innovation
> > Kill Code Phobia.
> > B.E.(Hons.) Mechanical Engineering, '14. BITS-Pilani

Kind regards

     Tim Rühsen
