Re: srcset lists are corrupted when converting links
From: Dan Ellis
Subject: Re: srcset lists are corrupted when converting links
Date: Tue, 29 Dec 2020 15:36:02 -0500

I did some digging.
The problem is that the link->size written for the srcset elements is
calculated from the srcset content string passed to
html-url.c:tag_handle_image(), in which HTML entities such as "&amp;" have
already been decoded (to plain "&"). But convert.c:convert_links() uses
this size to skip over the original URL in the source file, so after
writing the rewritten URL the pointer does not move forward far enough.
(In my example it lags by 8 chars: the 4-char difference between "&amp;"
and "&", for the two occurrences in each URL.) It then copies the
intervening text from the wrong place in the original file.
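
To make the arithmetic concrete, here is a small standalone C sketch. It is
not wget code: decoded_entity_shortfall is a hypothetical helper, and it
assumes every "&" in the decoded text was written as "&amp;" in the source
file. It just counts how many bytes short a skip based on the decoded
length would be.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of wget.  Given URL text whose HTML
   entities have already been decoded, return how many bytes shorter it
   is than the original source text, ASSUMING every '&' was written as
   "&amp;" in the source.  */
size_t
decoded_entity_shortfall (const char *decoded)
{
  size_t shortfall = 0;

  assert (decoded != NULL);
  for (const char *p = decoded; *p; p++)
    if (*p == '&')
      shortfall += strlen ("&amp;") - 1;  /* 5 bytes in source, 1 decoded */
  return shortfall;
}
```

For a URL tail like "realistfrontcover_small.jpg?w=350&h=248&crop=1" (two
"&" characters) this gives 8, matching the lag I see.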
My "fix" is, in tag_handle_image(), to set the link->size based on a
re-escaped version of the URL extracted from the srcset. (We also have to
fiddle with the base_ind to make sure we point to the correct start point
for the remaining URLs in the value). This patch fixes my problem:
*** html-url.c~ 2019-02-19 17:23:46.000000000 -0500
--- html-url.c 2020-12-29 15:21:59.524993035 -0500
***************
*** 726,733 ****
            {
              char *url_text = strdupdelim (srcset + url_start,
                                            srcset + url_end);
              struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                             url_end - url_start, ctx);
              if (up)
                {
                  up->link_inline_p = 1;
--- 726,748 ----
            {
              char *url_text = strdupdelim (srcset + url_start,
                                            srcset + url_end);
+             /* The SIZE passed to append_url is stored with the URL and used
+                to skip over the original URL in the source file when rewriting
+                in convert_file.  Because it has to skip over the pre-decoded
+                text, it needs to be increased to reflect the length of the
+                URL before decode_entity was applied.  We don't have that
+                information (the entire srcset value was decoded at once, not
+                one URL at a time), so we guess here by re-encoding and using
+                the length of that.  Will not work if the original escaping
+                was non-canonical.  */
+             char *quoted_url_text = html_quote_string(url_text);
+             int url_undecoded_size = strlen(quoted_url_text);
+             xfree(quoted_url_text);
              struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                             url_undecoded_size, ctx);
!             /* We also have to update base_ind to account for the unescaped
!                characters.  */
!             base_ind += url_undecoded_size - (url_end - url_start);
              if (up)
                {
                  up->link_inline_p = 1;
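
For anyone who wants to check the size guess outside of wget, the following
is a minimal standalone sketch of the same idea. The names requoted_length
and base_shift are hypothetical, and requoted_length is only a simplified
stand-in for the re-quoting step (it handles just & < > and "); it is not
wget's html_quote_string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the re-quoting step; NOT wget's
   html_quote_string.  Return the length DECODED would have after
   escaping '&', '<', '>' and '"' with their canonical named entities.  */
size_t
requoted_length (const char *decoded)
{
  size_t len = 0;

  assert (decoded != NULL);
  for (const char *p = decoded; *p; p++)
    switch (*p)
      {
      case '&': len += 5; break;   /* "&amp;"  */
      case '<':
      case '>': len += 4; break;   /* "&lt;" or "&gt;"  */
      case '"': len += 6; break;   /* "&quot;"  */
      default:  len += 1; break;
      }
  return len;
}

/* How much further the source offset must advance than the decoded span
   for this URL; mirrors the "base_ind += ..." adjustment in the patch.  */
size_t
base_shift (const char *decoded)
{
  return requoted_length (decoded) - strlen (decoded);
}
```

With the decoded text "a.jpg?w=350&h=248&crop=1" (24 chars) this yields a
re-quoted length of 32 and a shift of 8. The caveat in the patch comment
still applies: if the source had written each "&" as "&#038;" (6 chars),
the true length would be 34 and the guess would fall 2 short.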
Hope this helps.
DAn.
On Tue, Dec 29, 2020 at 11:43 AM Dan Ellis <dan.ellis@gmail.com> wrote:
> I'm using wget to make a frozen, offline mirror of a wordpress.com site.
> The original HTML makes extensive use of <img srcset=...> (responsive
> design for different browser resolutions). wget is corrupting the
> comma-separated lists of images.
>
> e.g.
>
>
> wget --page-requisites --span-hosts https://theliteratelens.com/
>
> downloads a set of files including theliteratelens.com/index.html which
> includes the following element as the first instance of srcset (line breaks
> inserted by me and irrelevant fields omitted):
>
> <img width="350" height="248"
> src="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
> "
> class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
> alt=""
> loading="lazy"
> srcset="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
> 350w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&h=106&crop=1
> 150w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&h=212&crop=1
> 300w"
> sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> Note the srcset field with 3 versions of the image referenced, whose
> decoded URL tails look like "realistfrontcover_small.jpg?w=350&h=248&crop=1"
>
> However, if I add --convert-links, e.g.
>
> wget --page-requisites --span-hosts --convert-links
> https://theliteratelens.com/
>
> the same element in theliteratelens.com/index.html becomes:
>
> <img width="350" height="248"
> src="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
> "
> class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
> alt=""
> loading="lazy"
> srcset="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&h=106&crop=1h=106&a../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&h=212&crop=1300&h=212&crop=1
> 300w"
> sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> i.e. the comma-separated list in the srcset has been badly corrupted. For
> instance, the end of the first path, which was originally
>
> ...h=248&crop=1 350w, https://
> theliteratelens.files.wordpress.com/2017/12...
>
> becomes
>
> ...h=248&crop=1p;crop=../
> theliteratelens.files.wordpress.com/2017/12...
>
> and the second boundary between elements starts as
>
> ...h=106&crop=1 150w, https://theliteratelens.files...
>
> but ends up as
>
> ...h=106&crop=1h=106&a../theliteratelens.files...
>
> What seems to be happening is that the convert-links logic is finding the
> absolute URLs to the second host (
> https://theliteratelens.files.wordpress.com) and correctly maps them to
> relative paths (../theliteratelens.files.wordpress.com/), but at the same
> time it reaches back one space-delimiter too far, and replaces those
> characters with a spurious sample from the preceding string.
>
> I hope this helps identify the problem.
>
> DAn.
>
>
>