
From: Oleksandr Gavenko
Subject: [Bug-wget] [bug #45801] Allowing to configure HTML engine which links to follow
Date: Thu, 20 Aug 2015 21:29:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.1.0


                 Summary: Allowing to configure HTML engine which links to follow
                 Project: GNU Wget
            Submitted by: gavenkoa
            Submitted on: Thu 20 Aug 2015 09:29:00 PM GMT
                Category: Feature Request
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: None
        Operating System: None
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: No
           Work Required: None
          Patch Included: None



From the wget info page (two paragraphs):

   Note that these two options do not affect the downloading of HTML
files (as determined by a '.htm' or '.html' filename suffix).  This
behavior may not be desirable for all users, and may be changed for
future versions of Wget.

   Finally, it's worth noting that the accept/reject lists are matched
_twice_ against downloaded files: once against the URL's filename
portion, to determine if the file should be downloaded in the first
place; then, after it has been accepted and successfully downloaded, the
local file's name is also checked against the accept/reject lists to see
if it should be removed.  The rationale was that, since '.htm' and
'.html' files are always downloaded regardless of accept/reject rules,
they should be removed _after_ being downloaded and scanned for links,
if they did match the accept/reject lists.

So any URL that appears in an href="..." attribute is retrieved, even when it
is useless.

As a result, the recursive download time increases dramatically.
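
The effect is easy to see with any accept pattern that excludes HTML (a
minimal sketch; example.org stands in for any recursively mirrored listing):

# Index pages are still fetched and parsed for links, then deleted again
# because they fail the -A check; only the matching archives are kept.
wget -r -np -A '*.bz2' http://example.org/some/listing/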

For example, I tried to download specific game replays from
http://replays.wesnoth.org/1.12/

The site lists a plain file hierarchy, and I hoped that this command would
fetch only the files that interest me:

wget -e 'robots=off' -nc -c -np -r -A 'Scrolling_Survival_Turn_1??_*.bz2' \
     -A index.html http://replays.wesnoth.org/1.12/

But because every link is checked, and each generated index page carries
service links for sorting the table columns (which are useless for me), it
takes far too long to wait while wget fetches and checks them all.
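
Such listings typically expose sort links like the following (illustrative
only; the exact query parameters on replays.wesnoth.org may differ):

http://replays.wesnoth.org/1.12/?C=N;O=D    # sort by name
http://replays.wesnoth.org/1.12/?C=M;O=A    # sort by modification time
http://replays.wesnoth.org/1.12/?C=S;O=A    # sort by size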

I solved my task with --level=1 and a custom scanner for the downloaded
index.html files:

$ wget -e 'robots=off' -nc -c -np -A index.html -r --level=1 http://replays.wesnoth.org/1.12/

$ find . -type f -name index.html | while read f; do
    p=${f#./}; p=http://${p%index.html}    # rebuild the URL of the directory
    command grep -o 'href="Scrolling_Survival_Turn_[5][0-5]_[^"]*\.bz2' "$f" |
      while read s; do s=${s#href='"'}; wget "$p$s"; done    # strip href=" and fetch
  done
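
The path-to-URL reconstruction (p=http://${p%index.html}) relies on wget's
default layout of saving files under a directory named after the host, so a
local path like replays.wesnoth.org/1.12/.../index.html maps straight back to
the original URL.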

If there were an option to limit which links to follow in an HTML page,
writing custom scripts would be unnecessary.

It seems that instead of the literal "Directory-Based Limits" I need
glob/regex matching for URLs (not just directory or page names).
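
Something along these lines is what I have in mind; --follow-glob is a
hypothetical option name, not an existing wget flag:

# Hypothetical: only follow links whose URL matches one of the given globs;
# all other links found during HTML parsing are ignored.
wget -r -np -e 'robots=off' \
     --follow-glob='*/index.html,*Scrolling_Survival_Turn_1??_*.bz2' \
     http://replays.wesnoth.org/1.12/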

There is a lot of confusion around the -R/-A options, which are useless when
you already know exactly which kind of links to follow:

* bug #34855 ( https://savannah.gnu.org/bugs/?34855 )


Reply to this item at:

  <https://savannah.gnu.org/bugs/?45801>

  Message sent via/by Savannah
