
[Bug-wget] Wget redirection behavior


From: Dale R. Worley
Subject: [Bug-wget] Wget redirection behavior
Date: Mon, 10 Oct 2016 21:19:26 -0400

I've built and tested the 3d1d5b3 commit, which is the current head, or
near it.

In regard to my most important problem, that code will do what I need,
which is to download the IANA assignments files:

    $ wget -r --mirror --convert-links --page-requisites \
        --include-directories=/assignments \
        http://www.iana.org/assignments/index.html

In particular, the file http://www.iana.org/assignments/index.html
(which, notably, redirects to http://www.iana.org/assignments) will be
downloaded.

Looking at your code changes, redirections are exempted from the tests
which cause download_child to return WG_RR_LIST or WG_RR_REGEX, which
are the tests based on the options
       --include-directories=list
       --exclude-directories=list
       --accept-regex urlregex
       --reject-regex urlregex
and no others.  These are the tests implemented by section 5 of
download_child.  (Am I correct here?)  In particular, the --accept and
--reject tests *are* still applied to redirections.
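
A toy Python model of this reading may make the rule easier to check.
It is not wget's C code: the enum, the SUCCESS and OTHER members, and
the function name are assumptions made for illustration; only the
LIST, REGEX, and ROBOTS names correspond to the WG_RR_* values
mentioned in this message.

```python
from enum import Enum, auto


class RejectReason(Enum):
    """Toy stand-in for the reason codes returned by download_child."""
    SUCCESS = auto()  # assumed name for "the URL passed every test"
    LIST = auto()     # WG_RR_LIST: --include/--exclude-directories
    REGEX = auto()    # WG_RR_REGEX: --accept-regex / --reject-regex
    ROBOTS = auto()   # WG_RR_ROBOTS: excluded by the robots test
    OTHER = auto()    # hypothetical catch-all for every other test


# Reasons that, under the reading above, no longer block a redirection.
EXEMPT_FOR_REDIRECTS = {RejectReason.LIST, RejectReason.REGEX}


def redirect_is_followed(reason: RejectReason) -> bool:
    """Follow the redirect if the URL passed, or failed only an exempt test."""
    return reason is RejectReason.SUCCESS or reason in EXEMPT_FOR_REDIRECTS
```

In this model a redirect rejected for LIST or REGEX is still followed,
while one rejected for ROBOTS or any other reason is not.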

It would be helpful if the manual page documented that tests are applied
to redirections, and which ones.  One way would be to add this text at
the start of the section "Recursive Accept/Reject Options":

   Recursive Accept/Reject Options
       Note that if an HTTP request receives a redirection response, the
       redirect URL is subjected to the same tests as any
       recursively-fetched URL, with the exception that the
       --include-directories, --exclude-directories, --accept-regex, and
       --reject-regex tests are not applied.

       -A acclist --accept acclist
       -R rejlist --reject rejlist
           ...

I see no reason to try at this time to figure out what options might be
needed to adjust this behavior, since I have only the one use case.

But looking at the organization of the code, it seems that
download_child must return WG_RR_LIST or WG_RR_REGEX only if that is
the *only* reason it would reject the URL.  E.g., if
the URL fails both the --accept-regex and the robots test,
download_child must return WG_RR_ROBOTS, not WG_RR_REGEX.  Otherwise,
redirections that fail both the --accept-regex and robots tests will be
followed, while redirections that fail only the robots test will not be
followed!  And that requires that the tests of section 5 be at the end
of download_child, and they aren't now.
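
The ordering problem can be reproduced in a small Python sketch.  It
assumes, as the example above implies, that download_child reports the
first test that fails and that the robots test currently runs after
the section 5 tests; the function names and the two order lists are
hypothetical and do not come from the wget source.

```python
# Reasons that a redirection is allowed to ignore (section 5 tests).
EXEMPT = {"WG_RR_LIST", "WG_RR_REGEX"}

# Assumed current order: section 5 tests run before the robots test.
CURRENT_ORDER = ["WG_RR_LIST", "WG_RR_REGEX", "WG_RR_ROBOTS"]
# Proposed order: section 5 tests moved to the end.
PROPOSED_ORDER = ["WG_RR_ROBOTS", "WG_RR_LIST", "WG_RR_REGEX"]


def first_reject_reason(failed, order):
    """Return the first reason in `order` that the URL fails, else None."""
    for reason in order:
        if reason in failed:
            return reason
    return None


def redirect_followed(failed, order):
    """A redirect is followed if nothing failed or only an exempt test did."""
    reason = first_reject_reason(failed, order)
    return reason is None or reason in EXEMPT


both = {"WG_RR_REGEX", "WG_RR_ROBOTS"}
robots_only = {"WG_RR_ROBOTS"}
for name, order in (("current", CURRENT_ORDER), ("proposed", PROPOSED_ORDER)):
    print(name, redirect_followed(both, order),
          redirect_followed(robots_only, order))
```

With the assumed current order, the URL that fails both tests is
followed and the one that fails only the robots test is not; with the
section 5 tests last, neither is followed.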

So it seems to me that download_child needs to be reordered, and its
interface needs to document that the tests that redirections are
exempted from must remain at the end.

Alternatively, download_child could be provided with an additional
boolean argument telling whether the section 5 tests should be applied.
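
A minimal sketch of that alternative, again as a toy Python model
rather than the real C function: the parameter name apply_section5,
the set-of-failed-tests input, and the WG_RR_SUCCESS return value are
assumptions made for illustration.

```python
SECTION5_REASONS = ("WG_RR_LIST", "WG_RR_REGEX")


def download_child(failed_tests, apply_section5=True):
    """Toy model: return the first failing test's reason code.

    `failed_tests` is the set of reason codes the URL would fail.
    When `apply_section5` is False (the redirection case), the
    section 5 tests are skipped, so their position no longer matters.
    """
    if apply_section5:
        for reason in SECTION5_REASONS:
            if reason in failed_tests:
                return reason
    if "WG_RR_ROBOTS" in failed_tests:
        return "WG_RR_ROBOTS"
    return "WG_RR_SUCCESS"  # assumed name for "no test failed"
```

A caller handling a redirection would pass apply_section5=False and
then follow the redirect only when the result is the success value,
which avoids depending on the order of the tests inside the function.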

An independent item:  I notice that the tarball comes with *no* build
instructions whatsoever.  I have some memory that I've tangled with this
before, and that the correct procedure is to run "./bootstrap".  In any
case, that worked for me.  IIRC, a file named "INSTALL" cannot be put
into the tarball because it would conflict with the INSTALL link that
will be added by ./bootstrap.  But it would be useful to the newcomer if
a file named "INSTALL-tarball" contained simple contents like:

    In the tarball distribution, the INSTALL file is absent because it
    is created by the bootstrap process.

    If the INSTALL file is absent, first run the bootstrap script to
    create it:

    $ ./bootstrap

    Then follow the build instructions in the INSTALL file.

Thanks for all your help with this problem!

Dale


