[Bug-wget] Filtering for requisites and redirections


From: Dale R. Worley
Subject: [Bug-wget] Filtering for requisites and redirections
Date: Thu, 13 Oct 2016 18:27:56 -0400

If --page-requisites is specified along with --no-parent, then requisite
files will be downloaded even if their URLs would normally be suppressed
by --no-parent.  This is implemented by a test in section 4 of
download_child in recur.c, and a flag in struct urlpos, link_inline_p,
which says that the *context* of that URL is as a page requisite.

This suggests that the exceptional processing we want to implement for
redirections might be more systematically implemented by using the above
processing as a model, and not by testing the value returned by
download_child.  This involves adding a flag link_redirect_p to struct
urlpos; this flag functions as an alternative to the additional argument
to download_child that I previously suggested.

In addition, this approach avoids the problem of ensuring that
download_child returns the correct value if a URL fails more than one
test, e.g., --accept-regex and robots, because any tests that are to be
ignored in the context are not executed and do not affect the return
value.

It also suggests that we may want to define that --no-parent does not
apply to redirections, in the same way that it does not apply to page
requisites when --page-requisites is set.
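
As a rough illustration of the flag-based approach, here is a minimal
sketch in C.  It is not the code in recur.c or in the attached diff: the
types options_sketch and urlpos_sketch, the helpers below_start_dir and
passes_no_parent, and the prefix-based directory comparison are all
hypothetical simplifications.  Only the flag names link_inline_p and
link_redirect_p come from the discussion above, and the exemption for
redirections is the proposed behavior, not the existing one.  The sketch
assumes directory strings end in "/".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for wget's option structure. */
struct options_sketch {
  bool no_parent;        /* --no-parent was given */
  bool page_requisites;  /* --page-requisites was given */
};

/* Hypothetical, simplified stand-in for struct urlpos. */
struct urlpos_sketch {
  const char *dir;                   /* directory part of the candidate URL */
  unsigned int link_inline_p : 1;    /* context: URL is a page requisite */
  unsigned int link_redirect_p : 1;  /* proposed: URL is a redirection target */
};

/* Assumption: a plain prefix comparison stands in for the real
   directory test.  Both strings are expected to end in "/". */
static bool
below_start_dir (const char *start_dir, const char *dir)
{
  return strncmp (start_dir, dir, strlen (start_dir)) == 0;
}

/* Returns true if the URL may be fetched as far as --no-parent is
   concerned.  When the context flags exempt the URL, the directory
   test is not executed at all, so it cannot affect the result. */
static bool
passes_no_parent (const struct options_sketch *opt,
                  const struct urlpos_sketch *upos,
                  const char *start_dir)
{
  if (!opt->no_parent)
    return true;
  if (opt->page_requisites && upos->link_inline_p)
    return true;   /* requisite context, as described above */
  if (upos->link_redirect_p)
    return true;   /* proposed exemption for redirections */
  return below_start_dir (start_dir, upos->dir);
}
```

The point of the design is that the caller records the context in the
urlpos flags, and each test decides for itself whether it applies in
that context, so no extra argument or special return value is needed.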

I've also updated the TEXI file to describe the functional changes, as
well as the previously undocumented behavior of --page-requisites
overriding --no-parent.  The changes are in the attached diff.

However, looking at the documentation for --no-parent:

       -np
       --no-parent
           Do not ever ascend to the parent directory when retrieving
           recursively.  This is a useful option, since it guarantees that
           only the files below a certain hierarchy will be downloaded.

           Note that the effect of --no-parent is suppressed for fetching
           redirected URLs and for fetching page requisite URLs if
           --page-requisites is specified.

Perhaps we do not want to have --no-parent suppressed by
--page-requisites.  It seems that --no-parent is intended as a security
measure, and the existing code (as well as this proposal) violates its
fundamental premise.

Dale

Attachment: wget.diff
Description: Text Data

