[Bug-wget] [bug #45801] Allowing to configure HTML engine which links to follow
From: Oleksandr Gavenko
Subject: [Bug-wget] [bug #45801] Allowing to configure HTML engine which links to follow
Date: Thu, 20 Aug 2015 21:29:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.1.0
URL:
<http://savannah.gnu.org/bugs/?45801>
Summary: Allowing to configure HTML engine which links to
follow
Project: GNU Wget
Submitted by: gavenkoa
Submitted on: Thu 20 Aug 2015 09:29:00 PM GMT
Category: Feature Request
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Discussion Lock: Any
Release: None
Operating System: None
Reproducibility: None
Fixed Release: None
Planned Release: None
Regression: No
Work Required: None
Patch Included: None
_______________________________________________________
Details:
From the info page (two paragraphs):
Note that these two options do not affect the downloading of HTML
files (as determined by a '.htm' or '.html' filename prefix). This
behavior may not be desirable for all users, and may be changed for
future versions of Wget.
Finally, it's worth noting that the accept/reject lists are matched
_twice_ against downloaded files: once against the URL's filename
portion, to determine if the file should be downloaded in the first
place; then, after it has been accepted and successfully downloaded, the
local file's name is also checked against the accept/reject lists to see
if it should be removed. The rationale was that, since '.htm' and
'.html' files are always downloaded regardless of accept/reject rules,
they should be removed _after_ being downloaded and scanned for links,
if they did match the accept/reject lists.
So every URL found in an href="..." attribute is retrieved even when it is
useless. As a result, recursive download time increases dramatically.
For example, I am trying to download specific game replays from
http://replays.wesnoth.org/1.12/
The site lists a file hierarchy, and I hoped this command would do the job I am
interested in:
wget -e 'robots=off' -nc -c -np -r -A 'Scrolling_Survival_Turn_1??_*.bz2' -A
index.html http://replays.wesnoth.org/1.12/
But because every link is checked, and each page carries service links for
sorting the table data (which are useless to me), it takes far too long to
wait while wget checks them all.
I solved my task with --level=1 and a custom scanner for the downloaded
index.html files:
$ wget -e 'robots=off' -nc -c -np -A index.html -r --level=1
http://replays.wesnoth.org/1.12/
$ find . -type f -name index.html | while read f; do
    p=${f#./}; p=http://${p%index.html}
    command grep -o 'href="Scrolling_Survival_Turn_[5][0-5]_[^"]*\.bz2' $f |
      while read s; do s=${s#href=\"}; wget $p$s; done
  done
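The grep step of the pipeline above can be sanity-checked offline. The sample
index.html below is invented for illustration; it mimics the sort links and
replay links of an Apache-style directory listing:

```shell
# Hypothetical sample of an Apache-style index page (invented for illustration).
cat > /tmp/index.html <<'EOF'
<a href="?C=N;O=D">Name</a>
<a href="Scrolling_Survival_Turn_52_abc.bz2">replay</a>
<a href="Other_Map_Turn_53_xyz.bz2">replay</a>
EOF

# Same extraction as the pipeline above: keep only the wanted hrefs,
# then strip the href=" prefix, leaving bare file names.
grep -o 'href="Scrolling_Survival_Turn_[5][0-5]_[^"]*\.bz2' /tmp/index.html \
  | sed 's/^href="//'
# Prints: Scrolling_Survival_Turn_52_abc.bz2
```

Note that the sort link (?C=N;O=D) and the non-matching replay are filtered
out before any wget call is made.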
If there were an option to limit which links to follow in an HTML page,
writing custom scripts would be unnecessary.
It seems that instead of the literal "Directory-Based Limits" I need
glob/regex matching against URLs (not just directory or page names).
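For what it's worth, wget 1.14 and later already ship --accept-regex and
--reject-regex, which are matched against the complete URL rather than the
filename. Assuming those options fit this case (the patterns below are only a
sketch, untested against the live site), the crawl might not need a custom
scanner at all:

```shell
# Sketch, assuming wget >= 1.14.  --accept-regex / --reject-regex match the
# complete URL; directory URLs (ending in /) must stay accepted, or recursion
# stops at the top page.  --reject-regex '[?]' drops the ?C=N;O=D sort links.
#
#   wget -e 'robots=off' -nc -np -r \
#        --accept-regex '(/|index\.html|Scrolling_Survival_Turn_1.._[^/]*\.bz2)$' \
#        --reject-regex '[?]' \
#        http://replays.wesnoth.org/1.12/

# The accept pattern can be checked locally before crawling:
re='(/|index\.html|Scrolling_Survival_Turn_1.._[^/]*\.bz2)$'
echo 'http://replays.wesnoth.org/1.12/Scrolling_Survival_Turn_101_foo.bz2' \
  | grep -qE "$re" && echo match
# Prints: match
```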
There is a lot of confusion around the -R/-A options, which are useless when
you know exactly which type of links to follow:
* bug #34855 ( https://savannah.gnu.org/bugs/?34855 )
*
http://unix.stackexchange.com/questions/179020/wget-and-preventing-files-from-downloading-on-a-recursive-wget
*
http://superuser.com/questions/130653/wget-recursively-download-from-pages-with-lots-of-links
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?45801>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/