
[Bug-wget] [bug #20808] -R should reject files _before_ downloading them


From: Oleksandr Gavenko
Subject: [Bug-wget] [bug #20808] -R should reject files _before_ downloading them
Date: Thu, 20 Aug 2015 20:56:38 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.1.0

Follow-up Comment #12, bug #20808 (project wget):

I am trying to retrieve specific replays from the saved-game storage at
http://replays.wesnoth.org/1.12/

The site is just an ordinary directory/file listing.

As the data is grouped per day over a 2-year period, there are a lot of
subdirectories.

I tried to get the interesting replays with (see
http://forums.wesnoth.org/viewtopic.php?p=588686#p588686 ):

wget -e 'robots=off' -nc -c -np -A 'Scrolling_Survival_Turn_1??_*.bz2' \
  -A index.html -r http://replays.wesnoth.org/1.12/

but each subdirectory page has links for sorting the table (query strings),
and for every such page (2 years * 365 days of them) wget tries to download
things that are then rejected.

It takes too long to wait (even though wget reuses connections) while wget
does useless work.

I quickly solved the task by scanning the index.html files manually. First I
fetched them with wget (--level=1 limits the processing time):

$ wget -r -np -A index.html --level=1 http://replays.wesnoth.org/1.12/

and then retrieved the files I was interested in:

$ find . -type f -name index.html | while read -r f; do
    # ./host/dir/index.html -> http://host/dir/
    p=${f#./}; p=http://${p%index.html}
    command grep -o 'href="Scrolling_Survival_Turn_[5-9]._[^"]*\.bz2' "$f" |
    while read -r s; do s=${s#href='"'}; wget "$p$s"; done
  done
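The two steps inside that loop (mapping a local path back to a URL, and
pulling the matching file names out of one index.html) can be tried without
touching the network. This is only a rough sketch: the function names
dir_url and replay_links are made up for illustration, and the date
directory in the usage comment is a hypothetical example.

```shell
# Map a local path such as ./host/dir/index.html back to http://host/dir/
# (assumes the tree was created by "wget -r" from the current directory).
dir_url() {
    p=${1#./}                           # drop the leading "./" added by find
    printf 'http://%s\n' "${p%index.html}"
}

# Print the file names of matching replays linked from one index.html.
# Same pattern as in the loop above, with the dot before "bz2" escaped.
replay_links() {
    grep -o 'href="Scrolling_Survival_Turn_[5-9]._[^"]*\.bz2' "$1" |
        sed 's/^href="//'               # strip the href=" prefix
}

# Usage (hypothetical directory name):
#   f=./replays.wesnoth.org/1.12/20150820/index.html
#   replay_links "$f" | while read -r s; do wget "$(dir_url "$f")$s"; done
```

Running replay_links on a saved index.html first shows what would be fetched
before any wget call is made.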

It would be nice to be able to specify which links to follow when HTML files
are processed.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?20808>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



