
[Bug-wget] HTML Parsing


From: The PowerTool
Subject: [Bug-wget] HTML Parsing
Date: Wed, 3 Sep 2014 16:03:59 -0400

In researching a specific wget question I came across "Note that wget is only 
parsing certain html markup (href/src) and css uris (url()) to determine what 
page requisites to get."(1)

I did quite a bit of troubleshooting a problem I was experiencing and believe 
this quote to be correct.

I have run into the following scenario:

Local index.html excerpt:
<img src="../DomainA/Path-1-for-File1/File1-lowres.jpg" 
data-500px="http://DomainC/Path-1-for-File1/File1-medres.jpg" 
data-highres="http://DomainB/Path-1-for-File1/File1-hires.jpg" alt="">

<img src="../DomainB/Path-2-for-File2/File2-lowres.jpg" 
data-500px="http://DomainC/Path-2-for-File2/File2-medres.jpg" 
data-highres="http://DomainA/Path-2-for-File2/File2-hires.jpg" alt="">

<img src="../DomainC/Path-3-for-File3/File3-lowres.jpg" 
data-500px="http://DomainB/Path-3-for-File3/File3-medres.jpg" 
data-highres="http://DomainA/Path-3-for-File3/File3-hires.jpg" alt="">

The above example was obtained using wget with -mpNHk and -D listing DomainA, 
DomainB and DomainC.

Note that the img src URLs were correctly rewritten to point to local copies, 
and those files were downloaded successfully.

The files referenced by data-500px and data-highres were not downloaded, and 
those URLs were left untouched.

After manually downloading the files and changing the links, everything worked 
as expected.
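
For illustration, the manual link change described above could be sketched 
with sed. This is only a rough sketch: the function name rewrite_data_links is 
hypothetical, and the "../host/path" layout is an assumption based on how the 
src attributes in the excerpt ended up after -k. A regex is not an HTML 
parser, so it only handles attributes written exactly as in the excerpt.

```shell
#!/usr/bin/env bash
# Sketch only: rewrite_data_links is a hypothetical helper, not part of wget.
# Assumes each mirrored host lives in a sibling directory (../DomainX/...),
# as the rewritten src attributes in the excerpt suggest.

rewrite_data_links() {
    # Turn attr="http://host/path" into attr="../host/path" for the two
    # custom attributes; reads the named files, or stdin if none are given.
    sed -E 's#(data-500px|data-highres)="https?://([^"]+)"#\1="../\2"#g' "$@"
}

# Usage (not run here): write a converted copy next to the original
# rewrite_data_links index.html > index.local.html
```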

I also tried -A jpg, but the URLs in data-500px and data-highres were still 
ignored.

Am I right that wget ignores attributes like data-500px even when they hold 
URLs pointing to files?

Is there a way to resolve this with the current wget?

If not, what about something like:
wget --include-custom-tags data-500px,data-highres
where this would tell wget to treat the data-500px and data-highres attributes 
just like src attributes.

I'm writing a bash script to handle this for now.  Since wget already does 
this for src, and the only requirement is to let the existing code accept 
additional attribute names, supporting attributes that current web developers 
use seems like a sensible way to keep wget current.
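
A minimal sketch of what such a workaround script might look like follows. 
The function name extract_data_urls is hypothetical and the attribute list is 
taken from the excerpt above; it relies on grep and sed pattern matching 
rather than real HTML parsing, so it assumes each attribute and its quoted URL 
sit on one line.

```shell
#!/usr/bin/env bash
# Sketch only: extract_data_urls is a hypothetical helper, not part of wget.
# It lists the URLs stored in the custom attributes so wget can fetch them.

extract_data_urls() {
    # $1: path to a local HTML file
    # Match attr="URL" for each attribute of interest, keep only the URL,
    # and drop duplicates.
    grep -oE '(data-500px|data-highres)="[^"]+"' "$1" \
        | sed -E 's/^[^=]+="([^"]+)"$/\1/' \
        | sort -u
}

# Usage (not run here): feed the list to wget on stdin, creating
# host/path directories (-x) and skipping files that are up to date (-N)
# extract_data_urls index.html | wget -x -N -i -
```

The links inside the HTML would still need to be changed by hand or by a 
separate step, since wget's -k conversion does not touch these attributes.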

(1) Source: 
http://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget

Thank you!

ThePowerTool
bigger, Faster, MORE POWER!!!! --Tim "the toolman" Taylor
                                          
