Re: [Bug-wget] HTML Parsing
From: Tim Ruehsen
Subject: Re: [Bug-wget] HTML Parsing
Date: Thu, 04 Sep 2014 15:23:18 +0200
User-agent: KMail/4.14 (Linux/3.14-2-amd64; KDE/4.14.0; x86_64; ; )
On Wednesday 03 September 2014 16:03:59 The PowerTool wrote:
> In researching a specific wget question I came across "Note that wget is
> only parsing certain html markup (href/src) and css uris (url()) to
> determine what page requisites to get."(1)
>
> I did quite a bit of troubleshooting a problem I was experiencing and
> believe this quote to be correct.
>
> I have run into the following scenario:
>
> Local index.html excerpt:
> <img src="../DomainA/Path-1-for-File1/File1-lowres.jpg"
> data-500px="http://DomainC/Path-1-for-File1/File1-medres.jpg"
> data-highres="http://DomainB/Path-1-for-File1/File1-hires.jpg" alt="">
> <img src="../DomainB/Path-2-for-File2/File2-lowres.jpg"
> data-500px="http://DomainC/Path-2-for-File2/File2-medres.jpg"
> data-highres="http://DomainA/Path-2-for-File2/File2-hires.jpg" alt="">
> <img src="../DomainC/Path-3-for-File3/File3-lowres.jpg"
> data-500px="http://DomainB/Path-3-for-File3/File3-medres.jpg"
> data-highres="http://DomainA/Path-3-for-File3/File3-hires.jpg" alt="">
>
> The above example was obtained using wget with -mpNHk -D(covering DomainABC)
>
> You will note the img src URLs were appropriately replaced with pointers to
> local links which were successfully downloaded.
>
> The files linked from data-500px and data-highres were not downloaded, and
> their URLs were left untouched.
>
> After manually downloading the files and changing the links everything
> worked, as expected.
>
> I tried -A jpg, but the links in data-500px and data-highres were still ignored.
>
> Am I right and wget is ignoring tags like data-500px that have URLs pointing
> to files?
>
> Is there a way to resolve this with the current wget?
>
> If not what about:
> wget --include-custom-tags data-500px,data-highres
> where this would tell wget to treat data-500px,data-highres tags just like
> src tags.
Wget has the --follow-tags option, which allows for custom tags (if I am
reading the source code correctly; I did not test it). But it does not allow
for custom attributes (the docs are a bit unclear here).
Since Wget only follows the specific tag/attribute combinations it knows
about, data-* attributes are not supported.
But as you suggest, --follow-tags could be expanded to e.g.
--follow-tags="tag1/attribute tag2/attribute ..."
Right now it is
--follow-tags="tag1 tag2 ..."
while the attributes are taken from a hard coded list.
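To make the proposed syntax concrete, here is a minimal sketch in Python of how such a "tag/attribute" list could be turned into a lookup table. This is only an illustration of the idea, not Wget code (Wget is written in C), and the function name parse_follow_spec is made up for this example.

```python
from collections import defaultdict


def parse_follow_spec(spec):
    """Parse a hypothetical 'tag1/attr tag2/attr ...' string into {tag: {attrs}}.

    The syntax is the one proposed in this thread; it is not an existing
    Wget option format.
    """
    wanted = defaultdict(set)
    for item in spec.split():
        tag, sep, attr = item.partition("/")
        # Reject entries such as "img", "/src" or "img/" that lack one part.
        if not sep or not tag or not attr:
            raise ValueError(f"expected tag/attribute, got {item!r}")
        # HTML tag and attribute names are case-insensitive.
        wanted[tag.lower()].add(attr.lower())
    return dict(wanted)


if __name__ == "__main__":
    print(parse_follow_spec("img/data-500px img/data-highres a/href"))
```

A parser would then only need to check each (tag, attribute) pair it encounters against this table instead of against the hard coded list.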
Would you be willing to send a patch?
It would be appreciated, and of course we are happy to help if you have
questions.
> I'm writing a bash script to handle this for now. Since wget already does
> this kind of parsing, and the only requirement is to allow for new tags (via
> existing code in wget) that current web developers use, supporting them
> just makes sense to keep wget current.
>
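Until something like that exists in Wget, a small helper script can collect the extra URLs. Below is a rough sketch in Python (rather than bash) using only the standard library. The names DataAttrURLs and extract_urls are made up for this example, and it assumes the attribute values are plain URLs, as in your index.html excerpt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataAttrURLs(HTMLParser):
    """Collect URLs found in the given attributes, on any tag."""

    def __init__(self, attrs, base_url=""):
        super().__init__()
        self.attrs = {a.lower() for a in attrs}
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases attribute names.
        for name, value in attrs:
            if name in self.attrs and value:
                # Resolve relative URLs against base_url (assumed known).
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, attrs, base_url=""):
    parser = DataAttrURLs(attrs, base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    sample = (
        '<img src="../DomainA/Path-1-for-File1/File1-lowres.jpg" '
        'data-500px="http://DomainC/Path-1-for-File1/File1-medres.jpg" '
        'data-highres="http://DomainB/Path-1-for-File1/File1-hires.jpg" alt="">'
    )
    for url in extract_urls(sample, ["data-500px", "data-highres"]):
        print(url)
```

The printed list can be saved to a file and fetched with wget -i. Rewriting the data-* values to local paths would still have to be done separately, since -k only converts the links Wget itself followed.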
> (1) Source:
> http://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget
>
> Thank you!
>
> ThePowerTool
> bigger, Faster, MORE POWER!!!! --Tim "the toolman" Taylor
Tim