bug-wget
Re: [Bug-wget] What ought to be a simple use of wget


From: Tim Ruehsen
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 11:38:33 +0200
User-agent: KMail/5.2.3 (Linux/4.6.0-1-amd64; KDE/5.23.0; x86_64; ; )

On Wednesday, August 3, 2016 11:55:55 AM CEST Dale R. Worley wrote:
> Tim Rühsen <address@hidden> writes:
> > If you have a look at 'man wget'/--page-requisites, the stuff is explained
> > quite well. To me it looks like you are missing --level 2.
> > 
> > If --level 2 is not what you want, you could make your point clear by
> > making up a small document tree as an example.
> 
> I definitely don't want --level 2, because that limits how many links
> the recursion can traverse.  If all the links are within the
> /assignments/ directory, wget should follow an unlimited number.
> 
> Here's an outline of what I want retrieved, based on Matthew White's
> listing:
> 
>     www.iana.org/
> Some or all of these files are OK, since they're likely page requisites:
>     www.iana.org/_css/
>     www.iana.org/_css/2015.1/
>     www.iana.org/_css/2015.1/print.css
>     www.iana.org/_css/2015.1/screen.css
>     www.iana.org/_img/
>     www.iana.org/_img/2011.1/
>     www.iana.org/_img/2011.1/icons/
>     ...
>     www.iana.org/_js/
>     www.iana.org/_js/2013.1/
>     www.iana.org/_js/2013.1/iana.js
>     www.iana.org/_js/2013.1/jquery.js
> Nothing in these directories:
>     www.iana.org/about/
>     www.iana.org/abuse/
> Lots and lots of files in this directory:
>     www.iana.org/assignments/
>     www.iana.org/assignments/_6lowpan-parameters/
>     www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml.html
>     www.iana.org/assignments/_support/
>     www.iana.org/assignments/_support/iana-registry.css
>     www.iana.org/assignments/_support/jquery.js
>     www.iana.org/assignments/_support/sort.js
>     www.iana.org/assignments/aaa-parameters/
>     www.iana.org/assignments/aaa-parameters/aaa-parameters-1.csv
>     www.iana.org/assignments/aaa-parameters/aaa-parameters.txt
>     www.iana.org/assignments/aaa-parameters/aaa-parameters.xhtml.html
>     www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
>     www.iana.org/assignments/abfab-parameters/
>     www.iana.org/assignments/abfab-parameters/abfab-parameters.txt
>     www.iana.org/assignments/abfab-parameters/abfab-parameters.xhtml.html
>     www.iana.org/assignments/abfab-parameters/abfab-parameters.xml
>     www.iana.org/assignments/abfab-parameters/urn-parameters.csv
>     ...
> Nothing in these directories:
>     www.iana.org/dnssec/
>     www.iana.org/domains/
>     www.iana.org/go/
>     www.iana.org/help/
>     www.iana.org/numbers/
>     www.iana.org/procedures/
>     www.iana.org/protocols/
>     www.iana.org/reports/

That sounds like "download everything from www.iana.org/assignments/ plus all
page requisites on www.iana.org". Page requisites from other domains shouldn't
be pulled in, right?

Then your first try was very close. It was basically:
wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html

With -d you can see that this page is redirected to /protocols, so no further
downloading takes place: /protocols would escape the /assignments/ directory,
which --no-parent does not allow.
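
To illustrate why the redirect stops the crawl, here is a rough sketch of the
kind of path-prefix test that --no-parent implies. It is not wget's actual
code, and the function name is_below_parent is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): succeeds only if the URL path $2
# lies below the start directory $1.
is_below_parent() {
    case "$2" in
        "$1"*) return 0 ;;   # path begins with the start directory
        *)     return 1 ;;   # path escapes the start directory
    esac
}

# The start directory and the redirect target from this thread.
if is_below_parent /assignments/ /protocols; then
    echo "followed"
else
    echo "rejected"
fi
```

A path such as /assignments/aaa-parameters/aaa-parameters.xml passes the same
test, which is why everything below /assignments/ would be fetched if the
start page were not redirected.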

[It is debatable whether this behavior regarding redirections should be
changed, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]

You are currently left with what Matthew White already suggested.

A similar approach would be to extract all links from 'protocols', build a
list of all referenced links, and filter it with e.g. (e)grep:

wget -d --convert-links -r --no-parent --page-requisites http://www.iana.org/assignments/index.html 2>&1 | grep ^TO_COMPLETE | cut -d' ' -f 4 > list.txt

After editing and filtering list.txt, download all the URLs, including their
page requisites (--page-requisites):
wget --convert-links --page-requisites -x -i list.txt
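
The filtering step could look roughly like the sketch below. It assumes
list.txt holds one URL per line; the sample URLs and the output name
filtered.txt are made up for illustration, and the pattern should be adjusted
to whatever directories you actually want.

```shell
#!/bin/sh
# Sample input standing in for the list.txt produced above (made-up subset).
cat > list.txt <<'EOF'
http://www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
http://www.iana.org/_css/2015.1/screen.css
http://www.iana.org/about/
http://www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
http://www.iana.org/protocols/
EOF

# Keep /assignments/ plus the requisite directories _css, _img and _js,
# and drop duplicate lines.
grep -E '^https?://www\.iana\.org/(assignments|_css|_img|_js)/' list.txt |
    sort -u > filtered.txt

cat filtered.txt
```

The result can then be passed to wget with -i filtered.txt in place of
list.txt.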

Tim
 
