[Bug-wget] HTML Parsing
From: The PowerTool
Subject: [Bug-wget] HTML Parsing
Date: Wed, 3 Sep 2014 16:03:59 -0400
In researching a specific wget question I came across "Note that wget is only
parsing certain html markup (href/src) and css uris (url()) to determine what
page requisites to get."(1)
After quite a bit of troubleshooting on a problem I was experiencing, I believe
this quote to be correct.
I have run into the following scenario:
Local index.html excerpt:
<img src="../DomainA/Path-1-for-File1/File1-lowres.jpg"
data-500px="http://DomainC/Path-1-for-File1/File1-medres.jpg"
data-highres="http://DomainB/Path-1-for-File1/File1-hires.jpg" alt="">
<img src="../DomainB/Path-2-for-File2/File2-lowres.jpg"
data-500px="http://DomainC/Path-2-for-File2/File2-medres.jpg"
data-highres="http://DomainA/Path-2-for-File2/File2-hires.jpg" alt="">
<img src="../DomainC/Path-3-for-File3/File3-lowres.jpg"
data-500px="http://DomainB/Path-3-for-File3/File3-medres.jpg"
data-highres="http://DomainA/Path-3-for-File3/File3-hires.jpg" alt="">
The above example was obtained using wget with -mpNHk and -D (covering DomainA,
DomainB and DomainC).
Note that the img src URLs were correctly replaced with links to local copies,
which were successfully downloaded.
The files referenced by data-500px and data-highres were not downloaded, and
those URLs were left untouched.
After manually downloading the files and changing the links everything worked,
as expected.
I tried -A jpg, but the links in data-500px and data-highres were still ignored.
Am I right that wget ignores attributes like data-500px that hold URLs pointing
to files?
Is there a way to resolve this with the current wget?
If not what about:
wget --include-custom-tags data-500px,data-highres
where this would tell wget to treat the data-500px and data-highres attributes
just like src.
I'm writing a bash script to handle this for now. Since wget already does this
for src, and the only requirement is to let its existing code recognize
additional attributes that current web developers use, adding this seems a
sensible way to keep wget current.
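For reference, here is a rough sketch of the kind of bash workaround mentioned above. It is not anything wget provides: the function names extract_data_urls and localize_data_urls are made up for illustration, the attribute list is taken from the excerpt, and it assumes double-quoted http:// URLs with each attribute and its value on a single line.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the attribute list are assumptions,
# not part of wget.

# Attribute names to treat like src (taken from the index.html excerpt).
ATTRS='data-500px|data-highres'

# Print each unique URL found in those attributes of the given HTML file.
extract_data_urls() {
  grep -oE "(${ATTRS})=\"[^\"]+\"" "$1" \
    | sed -E 's/^[^=]+="//; s/"$//' \
    | sort -u
}

# Print the file with absolute http:// URLs in those attributes rewritten
# to the ../host/path form that the src links have in the excerpt.
localize_data_urls() {
  sed -E "s#(${ATTRS})=\"http://#\\1=\"../#g" "$1"
}

# Possible usage (not run here; paths are assumptions):
#   extract_data_urls DomainA/index.html | wget -x -N -i -
#   localize_data_urls DomainA/index.html > DomainA/index.local.html
```

The wget call in the usage comment would need to run from the top of the mirror so that -x creates the same per-host directories the -H mirror uses. A regex is a blunt tool for HTML, so this is only expected to cope with markup shaped like the excerpt.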
(1) Source:
http://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget
Thank you!
ThePowerTool
bigger, Faster, MORE POWER!!!! --Tim "the toolman" Taylor