
Re: [Bug-wget] HTML Parsing


From: Tim Ruehsen
Subject: Re: [Bug-wget] HTML Parsing
Date: Thu, 04 Sep 2014 15:23:18 +0200
User-agent: KMail/4.14 (Linux/3.14-2-amd64; KDE/4.14.0; x86_64; ; )

On Wednesday 03 September 2014 16:03:59 The PowerTool wrote:
> In researching a specific wget question I came across "Note that wget is
> only parsing certain html markup (href/src) and css uris (url()) to
> determine what page requisites to get."(1)
> 
> I did quite a bit of troubleshooting a problem I was experiencing and
> believe this quote to be correct.
> 
> I have run into the following scenario:
> 
> Local index.html excerpt:
> <img src="../DomainA/Path-1-for-File1/File1-lowres.jpg"
> data-500px="http://DomainC/Path-1-for-File1/File1-medres.jpg"
> data-highres="http://DomainB/Path-1-for-File1/File1-hires.jpg" alt="">
>
> <img src="../DomainB/Path-2-for-File2/File2-lowres.jpg"
> data-500px="http://DomainC/Path-2-for-File2/File2-medres.jpg"
> data-highres="http://DomainA/Path-2-for-File2/File2-hires.jpg" alt="">
>
> <img src="../DomainC/Path-3-for-File3/File3-lowres.jpg"
> data-500px="http://DomainB/Path-3-for-File3/File3-medres.jpg"
> data-highres="http://DomainA/Path-3-for-File3/File3-hires.jpg" alt="">
> 
> The above example was obtained using wget with -mpNHk -D(covering DomainABC)
> 
> You will note the img src URLs were appropriately replaced with pointers to
> local links which were successfully downloaded.
> 
> The files linked from data-500px and data-highres were not downloaded,
> and those URLs were left untouched.
> 
> After manually downloading the files and changing the links everything
> worked, as expected.
> 
> I tried -A jpg, but the links in data-500px and data-highres were still
> ignored.
> 
> Am I right that wget is ignoring tags like data-500px that have URLs
> pointing to files?
> 
> Is there a way to resolve this with the current wget?
> 
> If not what about:
> wget --include-custom-tags data-500px,data-highres
> where this would tell wget to treat data-500px,data-highres tags just like
> src tags.

Wget has the --follow-tags option, which allows for custom tags (if I am
reading the source code correctly; I did not test it). It does not allow
for custom attributes, though (the docs are a bit unclear here).

Since Wget is strict about which tag/attribute combinations it interprets
as links, data-* attributes are not supported.

But as you suggest, --follow-tags could be extended to accept e.g.
        --follow-tags="tag1/attribute,tag2/attribute,..."

Right now it only accepts
        --follow-tags="tag1,tag2,..."
while the attributes are taken from a hard-coded list.
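
To make the proposed syntax concrete, here is a minimal Python sketch (not
Wget code, which is written in C) of how such a list could be parsed. The
function name `parse_follow_tags` is hypothetical, as is the convention that
an empty set means "use the built-in attributes for this tag". The sketch
accepts either commas or whitespace between entries.

```python
import re


def parse_follow_tags(spec):
    """Parse a 'tag' or 'tag/attribute' list into {tag: set_of_attributes}.

    Hypothetical helper: the tag/attribute form is only a proposal and is
    not something Wget accepts today. An empty set stands for a bare tag,
    i.e. "fall back to the built-in attribute list for this tag".
    """
    rules = {}
    for item in re.split(r"[,\s]+", spec):
        if not item:
            continue  # skip empty pieces from leading/trailing separators
        tag, _, attr = item.partition("/")
        if not tag:
            continue  # ignore malformed entries such as "/attr"
        attrs = rules.setdefault(tag.lower(), set())
        if attr:
            attrs.add(attr.lower())
    return rules
```

For the images in the report, `parse_follow_tags("img/data-500px,img/data-highres")`
would map `img` to the two data-* attribute names, which the link extractor
could then consult in addition to its hard-coded list.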

Maybe you are willing to send a patch?
It would be appreciated, and of course we would be glad to help if you
have questions.
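
Until such a patch exists, the URLs in the data-* attributes can be collected
outside of Wget. The following is a rough Python sketch of that workaround,
using only the standard library. The names `DataAttrCollector` and
`collect_urls` are made up for illustration, and the default attribute names
are simply the two from the report.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataAttrCollector(HTMLParser):
    """Collect URL values from a chosen set of attributes on any tag."""

    def __init__(self, attrs, base_url=""):
        super().__init__()
        self.wanted = {a.lower() for a in attrs}
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases attribute names.
        for name, value in attrs:
            if name in self.wanted and value:
                # Resolve relative URLs against base_url when one is given.
                self.urls.append(urljoin(self.base_url, value))


def collect_urls(html, attrs=("data-500px", "data-highres"), base_url=""):
    """Return the URLs found in the given attributes, in document order."""
    collector = DataAttrCollector(attrs, base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

The resulting list can be written to a file, one URL per line, and fetched
with `wget -i FILE` using the same -H/-D options as before. The links inside
the data-* attributes would still have to be rewritten by hand, since -k does
not touch them.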

> I'm writing a bash script to handle this for now. Since this is
> something wget already does, and the only requirement is to allow for new
> tags (via existing code in wget) that current web developers use, it
> makes sense to add this to keep wget current.
> 
> (1) Source:
> http://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget
> 
> Thank you!
> 
> ThePowerTool
> bigger, Faster, MORE POWER!!!! --Tim "the toolman" Taylor

Tim



