bug-wget

[bug #60287] Windows recursive download escapes utf8 URLs twice


From: Cameron Tacklind
Subject: [bug #60287] Windows recursive download escapes utf8 URLs twice
Date: Fri, 26 Mar 2021 16:08:06 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36

Follow-up Comment #7, bug #60287 (project wget):

> Not the local one.
Is this because wget first downloads the html file and then reads the contents
off disk to parse and find links before initiating subsequent http requests?

> And not every page you download has these headers, so the remote one isn't
> always known, either.
On the server I control, I've set nginx to always add the full "Content-Type:
text/html; charset=utf8" header.

> The browser just shows the page, it doesn't save it to a disk file.  So
> encoding of the page's name isn't an issue for the browser, as it is for
> Wget.
This confirms my assumption that wget reads the file off disk to parse it for
links. For instance, if wget downloaded to memory and parsed the HTML from
memory, there couldn't possibly be encoding issues, because the fs isn't used?

If the bytes were downloaded with the correct encoding, and written to the
file system with the correct encoding, I would expect it to be able to parse
the file with the correct encoding.
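
To make that expectation concrete, here is a minimal Python sketch (not
wget's code; the helper name `roundtrip` and the sample markup are made up
for illustration). It shows that bytes written to disk in binary mode come
back unchanged, so an encoding problem would have to arise where bytes are
interpreted, not where they are stored:

```python
import os
import tempfile


def roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file in binary mode and read it back."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Hypothetical page body with a non-ASCII link, encoded as UTF-8.
page = '<a href="/caf\u00e9.html">link</a>'.encode("utf-8")
assert roundtrip(page) == page
```

This only covers file contents. File names go through a separate,
platform-dependent encoding step that the sketch does not exercise.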

But thinking about this more: the file `wget-test.html` has no non-ASCII
characters in it:


$ file -bi wget-test.html
text/html; charset=us-ascii
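
A rough equivalent of that check, as a sketch in Python. `is_pure_ascii` is
a hypothetical helper, the sample strings are invented, and `file` itself
uses more elaborate heuristics than this:

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0x00-0x7F)."""
    return all(b < 0x80 for b in data)


# A link containing a raw UTF-8 character is not pure ASCII ...
raw_link = '<a href="/caf\u00e9.html">'.encode("utf-8")
assert not is_pure_ascii(raw_link)

# ... but the same link in percent-encoded form is.
escaped_link = b'<a href="/caf%C3%A9.html">'
assert is_pure_ascii(escaped_link)
```

To run it against a real file, read the file in binary mode first, e.g.
`is_pure_ascii(open("wget-test.html", "rb").read())`.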


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



