bug-wget

[bug #60287] Windows recursive download escapes utf8 URLs twice


From: Eli Zaretskii
Subject: [bug #60287] Windows recursive download escapes utf8 URLs twice
Date: Sun, 28 Mar 2021 02:57:02 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0

Follow-up Comment #10, bug #60287 (project wget):

Without converting charsets, it would be difficult to rely on certain library
functions and support certain features.

For example, locale-dependent C library functions work only with the locale's
encoding, and will produce wrong results if presented with strings encoded
differently.  The IRI support needs to work in UTF-8 internally.  And when
writing Web pages to disk, Wget needs to encode the page name so that the
local filesystem accepts it as a file name.
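
To illustrate the first point, here is a minimal sketch in Python (not Wget's
code; Wget is written in C).  It assumes a UTF-8 page name and a Latin-1
locale, and uses Python's codecs as a stand-in for the locale-dependent C
library functions.  The file name is made up for the example.

```python
# Bytes of a page name as they would appear in a UTF-8 encoded page.
raw = "é.html".encode("utf-8")

# Interpreted with the encoding the bytes were actually written in.
as_utf8 = raw.decode("utf-8")

# Interpreted the way code running in a Latin-1 locale would see them:
# every byte is taken as one character.
as_latin1 = raw.decode("latin-1")

# Character counts differ, because the two UTF-8 bytes of "é" are
# treated as two separate Latin-1 characters.
print(len(as_utf8), len(as_latin1))

# Case conversion differs too, since it operates on the wrong characters.
print(as_utf8.upper())
print(as_latin1.upper())
```

The same bytes give different lengths and different case-conversion results
depending on which encoding the code assumes, which is the kind of wrong
result described above.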

That is why conversion to the locale's charset is all but necessary.  Using
the original bytes might work for some operations but not for others, so
keeping them would require logic to decide where they can and cannot be used,
which is a complication.  It is better to convert once, and then forget about
it.

The 404 error is most probably because Wget does attempt to convert the
encoding, but does so incorrectly when you don't tell it the actual
encodings.  So the re-encoded URL is garbled.
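
A rough sketch of how such garbling can arise, again in Python rather than
Wget's own C code.  The helper escape_path and the example path are
hypothetical; the sketch assumes the link is really UTF-8 but is converted as
if it were Latin-1 before being percent-escaped.

```python
from urllib.parse import quote

def escape_path(raw: bytes, assumed_charset: str) -> str:
    """Hypothetical helper: decode raw bytes using the assumed charset,
    convert the result to UTF-8, and percent-escape it for a request."""
    text = raw.decode(assumed_charset)
    return quote(text.encode("utf-8"))

# A link as found in a UTF-8 encoded page.
raw = "/café".encode("utf-8")

# Correct assumption about the source encoding.
correct = escape_path(raw, "utf-8")

# Wrong assumption: each UTF-8 byte of "é" is taken as a separate
# Latin-1 character and converted to UTF-8 a second time.
garbled = escape_path(raw, "latin-1")

print(correct)  # /caf%C3%A9
print(garbled)  # /caf%C3%83%C2%A9
```

The server has no resource under the second path, so a request for it would
plausibly be answered with 404.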


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



