Re: [Bug-wget] Filtering for requisites and redirections


From: Tim Ruehsen
Subject: Re: [Bug-wget] Filtering for requisites and redirections
Date: Fri, 14 Oct 2016 13:10:46 +0200
User-agent: KMail/5.2.3 (Linux/4.7.0-1-amd64; KDE/5.26.0; x86_64; ; )

On Thursday, October 13, 2016 6:27:56 PM CEST Dale R. Worley wrote:
> If --page-requisites is specified along with --no-parent, then requisite
> files will be downloaded even if their URLs would normally be suppressed
> by --no-parent.  This is implemented by a test in section 4 of
> download_child in recur.c, and a flag in struct urlpos, link_inline_p,
> which says that the *context* of that URL is as a page requisite.
> 
> This suggests that the exceptional processing we want to implement for
> redirections might be more systematically implemented by using the above
> processing as a model, and not by testing the value returned by
> download_child.  This involves adding a flag link_redirect_p to struct
> urlpos; this flag functions as an alternative to the additional argument
> to download_child that I previously suggested.
> 
> In addition, this approach avoids the problem of ensuring that
> download_child returns the correct value if a URL fails more than one
> test, e.g., --accept-regex and robots, because any tests that are to be
> ignored in the context are not executed and do not affect the return
> value.
> 
> It also suggests that we may want to define that --no-parent does not
> apply to redirections, in the same way that it does not apply to page
> requisites when --page-requisites is set.
> 
> I've also updated the TEXI file to describe the functional changes, and
> also the previously-undocumented behavior of --page-requisites
> overriding --no-parent.  The changes are in the attached diff.
> 
> However, looking at the documentation for --no-parent:
> 
>        -np
>        --no-parent
>            Do not ever ascend to the parent directory when retrieving
>            recursively.  This is a useful option, since it guarantees that
>            only the files below a certain hierarchy will be downloaded.
> 
>            Note that the effect of --no-parent is suppressed for fetching
>            redirected URLs and for fetching page requisite URLs if
>            --page-requisites is specified.
> 
> Perhaps we do not want to have --no-parent suppressed by
> --page-requisites.  It seems that --no-parent is intended as a security
> measure, and the existing code (as well as this proposal) violates its
> fundamental premise.
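
To make the context-flag idea quoted above concrete, here is a rough sketch
of how a --no-parent test could be skipped based on the link's context. It is
not the actual recur.c code: struct urlpos_sketch, struct opts_sketch,
under_parent() and passes_no_parent() are hypothetical names, and
link_redirect_p is the proposed flag, not an existing one. Only
link_inline_p comes from the existing struct urlpos.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for Wget's struct urlpos (hypothetical). */
struct urlpos_sketch {
    const char *path;      /* path component of the candidate URL */
    bool link_inline_p;    /* URL was found as a page requisite */
    bool link_redirect_p;  /* proposed: URL was reached via a redirect */
};

/* Hypothetical subset of the command-line options. */
struct opts_sketch {
    bool no_parent;        /* --no-parent */
    bool page_requisites;  /* --page-requisites */
};

/* True if `path` lies below `start_dir` (plain prefix match). */
static bool under_parent(const char *start_dir, const char *path)
{
    return strncmp(start_dir, path, strlen(start_dir)) == 0;
}

/* The --no-parent test is not executed at all for contexts that are
   exempt, so it cannot influence the result for those URLs. */
static bool passes_no_parent(const struct opts_sketch *o,
                             const struct urlpos_sketch *u,
                             const char *start_dir)
{
    assert(o != NULL && u != NULL && start_dir != NULL);
    if (!o->no_parent)
        return true;
    if (u->link_redirect_p)
        return true;                       /* redirect context */
    if (u->link_inline_p && o->page_requisites)
        return true;                       /* page-requisite context */
    return under_parent(start_dir, u->path);
}
```

With both options set and a start directory of "/docs/", an ordinary link to
"/other/page.html" is rejected, while the same path marked as a requisite or
as a redirect target is accepted.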

--no-parent seems to be intended as a bandwidth limiter, used together with
-r. When talking about security, what realistic scenario do you have in mind?

Anyway, we definitely don't want to change the default behavior.

If someone *really* needs a different precedence, has good arguments, and
finds someone to implement it (including tests), we'll add such a feature.

Regarding redirections, we have --max-redirect, and --max-redirect=0 can be
used to disallow redirections entirely. *But* there are at least two different
kinds of redirection: 1. those that stay on the same host/domain, and
2. those that span hosts.
Unless -H/--span-hosts is given or -D/--domains matches, we should not
span hosts for redirections.
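
As an illustration only, the rule could look roughly like the sketch below.
struct redirect_policy, host_in_domain() and redirect_allowed() are
hypothetical names and do not exist in Wget. The sketch assumes that either
-H or a matching -D entry is enough to allow a host-spanning redirect, and
that a -D entry matches the domain itself and its subdomains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical policy derived from the command line. */
struct redirect_policy {
    bool span_hosts;             /* -H / --span-hosts was given */
    const char *const *domains;  /* -D list, NULL-terminated, or NULL */
    int max_redirect;            /* --max-redirect value */
};

/* True if `host` equals `domain` or is a subdomain of it. */
static bool host_in_domain(const char *host, const char *domain)
{
    size_t hl = strlen(host), dl = strlen(domain);
    if (dl == 0 || dl > hl)
        return false;
    if (strcasecmp(host + (hl - dl), domain) != 0)
        return false;
    /* Require a label boundary so "evilexample.org" != "example.org". */
    return hl == dl || host[hl - dl - 1] == '.';
}

/* Decide whether a redirect from `from_host` to `to_host` may be followed. */
static bool redirect_allowed(const struct redirect_policy *p,
                             const char *from_host, const char *to_host,
                             int redirects_so_far)
{
    assert(p != NULL && from_host != NULL && to_host != NULL);
    if (redirects_so_far >= p->max_redirect)
        return false;                 /* includes --max-redirect=0 */
    if (strcasecmp(from_host, to_host) == 0)
        return true;                  /* kind 1: same host */
    if (p->span_hosts)
        return true;                  /* kind 2, permitted by -H */
    if (p->domains != NULL)
        for (const char *const *d = p->domains; *d != NULL; d++)
            if (host_in_domain(to_host, *d))
                return true;          /* kind 2, permitted by -D */
    return false;
}
```

How -H and -D should combine when both are given is left open here; the
sketch simply treats each one as sufficient on its own.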

> 
> Dale
