
Re: [Bug-wget] Enqueue logic problems


From: Darshit Shah
Subject: Re: [Bug-wget] Enqueue logic problems
Date: Thu, 2 May 2013 21:11:44 +0530

Tim,
Almost bang on, though I hadn't thought of the case where the domain name
itself changes.
That brings up a related thought. If the server responds with a 301/302
redirect, the user probably does expect wget to download the redirected
website.
Say I execute: $ wget -r www.example.com
and the server responds with a 302 Found and a Location header pointing to:
http://example.iana.org

In this case I had indeed intended to download the location pointed to by
the Location header. Maybe if the first response is a redirect, wget should
either print a verbose message or change the parent domain name.
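
To make the idea concrete, here is a rough sketch in C of what "change the
parent domain name" could look like. None of these names (recursion_ctx,
maybe_refresh_parent, parent_host) come from wget's sources; this is only
meant to illustrate the behaviour, not a patch:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch, not wget code: if the very first response of a
       recursive download is a 301/302, adopt the redirect target's host as
       the new "parent" domain, so that recursion continues on the site the
       server actually sent us to. */
    struct recursion_ctx
    {
      char *parent_host;   /* host that recursion is restricted to */
      int responses_seen;  /* number of responses handled so far */
    };

    static void
    maybe_refresh_parent (struct recursion_ctx *ctx, int http_status,
                          const char *redirect_host)
    {
      if (ctx->responses_seen == 1
          && (http_status == 301 || http_status == 302)
          && redirect_host != NULL)
        {
          /* The user asked for "this site"; the server says it now lives
             at redirect_host, so restrict recursion to that host instead.
             Printing a verbose note here would cover the other option I
             mentioned (telling the user what happened). */
          free (ctx->parent_host);
          ctx->parent_host = strdup (redirect_host);
        }
    }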

However, my question was much more specific: if the server redirects to a
domain that matches www.<old domain name>, shouldn't wget just accept it
and refresh the parent domain name that it holds?

We shouldn't ask the user to add extra -D/-H options for such a common
scenario (not everyone RTFMs).
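
Concretely, the narrower rule could be as small as a host comparison that
ignores a leading "www." (a sketch only; same_host_modulo_www is a made-up
name, not a function in wget's tree):

    #include <stdbool.h>
    #include <strings.h>  /* strcasecmp, strncasecmp */

    /* Illustrative helper: treat "www.<parent>" and "<parent>" as the same
       host when deciding whether a URL stays within the parent domain. */
    static bool
    same_host_modulo_www (const char *host, const char *parent)
    {
      if (strcasecmp (host, parent) == 0)
        return true;
      if (strncasecmp (host, "www.", 4) == 0
          && strcasecmp (host + 4, parent) == 0)
        return true;
      if (strncasecmp (parent, "www.", 4) == 0
          && strcasecmp (parent + 4, host) == 0)
        return true;
      return false;
    }

With a check like that, 'wget -r gnu.org' being bounced to www.gnu.org would
keep recursing, without the user having to know up front that -H -D gnu.org
is needed.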

On Thu, May 2, 2013 at 9:00 PM, Tim Ruehsen <address@hidden> wrote:

> Darshit, I guess you are talking about redirection.
>
> That is, 'wget -r gnu.org' gets redirected to www.gnu.org (via the Location
> header). Wget now follows the redirection, but only downloads index.html,
> since all the URLs included in index.html refer to www.gnu.org. But we
> requested stuff from gnu.org.
>
> That's why only one file (index.html) is downloaded.
> But that is not what the user expects...
>
> The user could work around it using the -D and/or -H options, but then he
> has to know about the redirection before he starts wget. Not everyone has
> the understanding to find that out.
>
> Should wget's behaviour change (by default or via a new option), or should
> we leave it as is and print a verbose message that makes it clear to the
> user?
>
> Regards, Tim
>
> On Thursday, 02 May 2013, Micah Cowan wrote:
> > I believe you want -H -D gnu.org. That's what it's for. Wget doesn't
> > know which hostnames under a domain should be allowed and which should
> > not be (do you want images.gnu.org? git.gnu.org? lists.gnu.org?), so
> > turns 'em all off unless you ask for them explicitly.
> >
> > HTH,
> > -mjc
> >
> > On Thu, May 2, 2013 at 4:52 AM, Darshit Shah <address@hidden> wrote:
> > > I should have been more clear. --span-hosts will enqueue the other
> > > files, but it will also enqueue files from other hosts. I wish to
> > > recursively download a website but not other sites that it links to.
> > >
> > > Of course I could add --accept-regex / --reject-regex options to
> > > prevent wget from wandering onto other hosts. But shouldn't the default
> > > --recursive option simply handle cases where a www is either added or
> > > removed? Or is there any scenario that I am missing which would cause
> > > undesirable effects here?
> > >
> > > On Thu, May 2, 2013 at 5:22 PM, Giuseppe Scrivano <address@hidden> wrote:
> > >> Darshit Shah <address@hidden> writes:
> > >> > When using the --recursive command with wget, there seems to be a
> > >> > small issue with the logic that decides whether to enqueue a file to
> > >> > the downloads list or not.
> > >> >
> > >> > By default wget downloads files only from the same host. However,
> > >> > this causes a problem when the target hostname changes thus:
> > >> > parent: gnu.org
> > >> > target: www.gnu.org
> > >> >
> > >> > This issue causes wget to stop after just one download on a lot of
> > >> > sites. I'm not sure if this exists in older releases as well, since
> > >> > I only have the development version installed.
> > >>
> > >> does --span-hosts fix this scenario for you?
> > >>
> > >> Cheers,
> > >> Giuseppe
> > >
> > > --
> > > Thanking You,
> > > Darshit Shah
> > > Research Lead, Code Innovation
> > > Kill Code Phobia.
> > > B.E.(Hons.) Mechanical Engineering, '14. BITS-Pilani
>
> With kind regards,
>
>      Tim Rühsen
>



-- 
Thanking You,
Darshit Shah
Research Lead, Code Innovation
Kill Code Phobia.
B.E.(Hons.) Mechanical Engineering, '14. BITS-Pilani

