From: Oleksandr Gavenko
Subject: [Bug-wget] [bug #45801] Allowing to configure HTML engine which links to follow
Date: Thu, 20 Aug 2015 21:29:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.1.0

URL:
  <http://savannah.gnu.org/bugs/?45801>

                 Summary: Allowing to configure HTML engine which links to
follow
                 Project: GNU Wget
            Submitted by: gavenkoa
            Submitted on: Thu 20 Aug 2015 09:29:00 PM GMT
                Category: Feature Request
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: None
        Operating System: None
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: No
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

From the info page (2 paragraphs):

   Note that these two options do not affect the downloading of HTML
files (as determined by a '.htm' or '.html' filename prefix).  This
behavior may not be desirable for all users, and may be changed for
future versions of Wget.


   Finally, it's worth noting that the accept/reject lists are matched
_twice_ against downloaded files: once against the URL's filename
portion, to determine if the file should be downloaded in the first
place; then, after it has been accepted and successfully downloaded, the
local file's name is also checked against the accept/reject lists to see
if it should be removed.  The rationale was that, since '.htm' and
'.html' files are always downloaded regardless of accept/reject rules,
they should be removed _after_ being downloaded and scanned for links,
if they did match the accept/reject lists.
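
The behaviour described in these two paragraphs can be sketched roughly as
follows. This is only an illustration of the quoted text, not Wget's actual
code: the function names matches_accept and decide are made up for the
example, and only an accept list is modelled (no reject list).

```shell
# matches_accept NAME PATTERN...: succeed if NAME matches any shell glob PATTERN.
matches_accept() {
    name=$1; shift
    for pat in "$@"; do
        # $pat is left unquoted on purpose so that `case` treats it as a glob.
        case $name in $pat) return 0 ;; esac
    done
    return 1
}

# decide NAME PATTERN...: print what the quoted text says happens to NAME.
decide() {
    case $1 in
        *.htm|*.html)
            # HTML is always fetched so that its links can be scanned;
            # the accept list only decides whether the local copy is kept.
            if matches_accept "$@"; then echo "download, scan, keep"
            else echo "download, scan, delete"; fi ;;
        *)
            # Anything else is filtered before the download starts.
            if matches_accept "$@"; then echo "download, keep"
            else echo "skip"; fi ;;
    esac
}
```

For instance, `decide index.html '*.bz2'` reports that the page is still
downloaded and scanned, and only deleted afterwards.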

So every URL found in an href="..." attribute is retrieved, even when it is
useless.

As a result, recursive download time increases dramatically.

For example, I tried to download specific game replays from
http://replays.wesnoth.org/1.12/

This site lists a file hierarchy, and I hoped that this command would do the
job I was interested in:

$ wget -e 'robots=off' -nc -c -np -r -A 'Scrolling_Survival_Turn_1??_*.bz2' \
    -A index.html http://replays.wesnoth.org/1.12/

But because every link is checked, and each page has service links for sorting
the table data (which are useless to me), waiting for wget to check them all
takes too long.

I solved my task with --level=1 and a custom scanner for the downloaded
index.html files:

$ wget -e 'robots=off' -nc -c -np -A index.html -r --level=1 \
    http://replays.wesnoth.org/1.12/

$ find . -type f -name index.html | while read -r f; do
    p=${f#./}; p=http://${p%index.html}    # local path -> directory URL
    command grep -o 'href="Scrolling_Survival_Turn_[5][0-5]_[^"]*\.bz2' "$f" |
      while read -r s; do s=${s#href='"'}; wget "$p$s"; done
  done
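
The two string manipulations in the loop above are plain shell parameter
expansions. A small sketch that isolates them may make the scanner easier to
follow; the helper names index_url and link_target, and the sample directory
and file names in the usage note, are invented for illustration.

```shell
# index_url PATH: turn a local "./host/dir/index.html" path (as printed by
# find) into the directory URL "http://host/dir/".
index_url() {
    p=${1#./}            # drop the leading "./"
    p=${p%index.html}    # drop the trailing file name, keep the final "/"
    printf '%s\n' "http://$p"
}

# link_target MATCH: strip the leading href=" from a `grep -o` match,
# leaving the relative file name.
link_target() {
    s=${1#href='"'}
    printf '%s\n' "$s"
}
```

With a hypothetical path, `index_url ./replays.wesnoth.org/1.12/20150820/index.html`
prints http://replays.wesnoth.org/1.12/20150820/, and appending the result of
link_target to it gives the URL that is passed to wget.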

If there were options to limit which links to follow in an HTML page, writing
custom scripts would be unnecessary.

It seems that, instead of the literal "Directory-Based Limits", I need
glob/regex matching for URLs (not just directory or page names).
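
One possible route, assuming a Wget build that provides the --accept-regex and
--reject-regex options (which match against the complete URL): the sorting
links could be excluded by a regular expression. The pattern below assumes the
listing uses Apache-style sort links such as "?C=N;O=D", which is a guess about
this site; the helper keep_links is made up so that the pattern can be checked
offline before running wget.

```shell
# Assumed shape of the column-sort links: "?C=N;O=D", "?C=M;O=A", and so on.
reject_re='[?]C=[NMSD];O=[AD]'

# keep_links: read links on stdin, print only those the pattern would NOT reject.
keep_links() {
    grep -Ev "$reject_re"
}

# Offline check with made-up sample links; no network access involved.
printf '%s\n' '?C=N;O=D' '20150820/' 'Scrolling_Survival_Turn_52_x.bz2' | keep_links

# The same expression could then be handed to wget (not run here):
#   wget -e robots=off -nc -np -r --reject-regex "$reject_re" \
#        -A 'Scrolling_Survival_Turn_1??_*.bz2' -A index.html \
#        http://replays.wesnoth.org/1.12/
```

Whether this removes all of the wasted requests depends on how the site
actually forms its sort links, so the pattern would need to be adjusted
after looking at one downloaded index.html.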

There is a lot of confusion around the -R/-A options, which are useless when
you know exactly which type of links to follow:

* bug #34855 ( https://savannah.gnu.org/bugs/?34855 )
* http://unix.stackexchange.com/questions/179020/wget-and-preventing-files-from-downloading-on-a-recursive-wget
* http://superuser.com/questions/130653/wget-recursively-download-from-pages-with-lots-of-links




    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?45801>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



