bug-wget

[bug #60287] Windows recursive download escapes utf8 URLs twice


From: Eli Zaretskii
Subject: [bug #60287] Windows recursive download escapes utf8 URLs twice
Date: Sat, 27 Mar 2021 02:43:56 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0

Follow-up Comment #8, bug #60287 (project wget):

> Is this because wget first downloads the html file and then reads the
> contents off disk

No.  It's because Wget downloads the pages you told it to, and saves them as
disk files.  Any links in the downloaded pages that lead to other pages
produce additional disk files (e.g., if you told Wget to download
recursively).

IOW, the file-name encoding issue happens when a Web page needs to be saved to
a file for some reason.

> If the bytes were downloaded with the correct encoding, and written to the
> file system with the correct encoding, I would expect it to be able to parse
> the file with the correct encoding.

What is the "correct encoding", though?
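
To illustrate why that question has no single answer, here is a minimal
sketch in Python (not Wget's own code). The byte string and the two
encodings are assumptions picked for the example: the same bytes give
different file names depending on which encoding is used to interpret them.

```python
# Hypothetical file-name bytes, e.g. after decoding percent-escapes in a URL.
raw = b"caf\xc3\xa9.html"

# Interpreted as UTF-8, the two bytes C3 A9 form a single character.
as_utf8 = raw.decode("utf-8")

# Interpreted as Windows-1252 (a common Windows codepage, assumed here),
# the same two bytes become two separate characters.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

Neither result is wrong as far as the bytes are concerned; which one is
"correct" depends on what the server, the page, and the local system each
assume.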

> the file `wget-test.html` has no non-ascii characters in it

Of course it doesn't: the non-ASCII characters appear only when we decode the
hex-encoded (percent-escaped) bytes.
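
A minimal sketch of that point, using Python's urllib rather than anything
from Wget itself. The link text is a made-up example, not taken from
`wget-test.html`. It shows an ASCII-only link whose escapes decode to a
non-ASCII character, and what escaping the already-escaped text a second
time would look like.

```python
from urllib.parse import quote, unquote

# Hypothetical link as it might appear in an HTML file: pure ASCII on disk.
href = "caf%C3%A9.html"
assert href.isascii()

# Decoding the percent-escapes as UTF-8 is what reveals the non-ASCII character.
decoded = unquote(href, encoding="utf-8")

# Escaping the already-escaped string again also escapes the "%" signs,
# which is the kind of double escaping the bug title describes.
escaped_twice = quote(href)

print(decoded)
print(escaped_twice)
```

Decoding the doubly escaped string once only gets back to the singly
escaped form, so a server receiving it would look for a different resource
than the one the page linked to.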



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



