bug-wget

[bug #60287] Windows recursive download escapes utf8 URLs twice


From: Cameron Tacklind
Subject: [bug #60287] Windows recursive download escapes utf8 URLs twice
Date: Sun, 28 Mar 2021 23:28:21 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36

Follow-up Comment #11, bug #60287 (project wget):

Except a URI is always in a restricted character set, by design, to make all
the encoding issues go away.

I hear the point about writing the file to disk and making sure the path used
on disk can be reliably generated from an arbitrary encoding scheme. But that
should happen independently from concatenating the relative URI with the base
URI, both of which are always in a restricted set of octets that is a
subset of printable ASCII characters.

So, while I agree that a conversion to the local charset needs to happen, that
should *only* happen with regard to the file system file name, which is
independent from the request line sent to the HTTP server.
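
To illustrate the separation I mean, here is a minimal Python sketch (not
wget's code). The function names, the example.com URL, and the assumption
that the percent-encoded octets are UTF-8 are all mine, for illustration only.
The request URL is built purely from the ASCII `href`; the local charset only
comes into play when deriving the file name.

```python
from urllib.parse import urljoin, unquote


def request_url(base: str, href: str) -> str:
    """Resolve href against base. No charset conversion is applied:
    both inputs are already percent-encoded ASCII."""
    return urljoin(base, href)


def local_filename(href: str, fs_encoding: str = "utf-8") -> bytes:
    """Derive an on-disk name from the last path segment of href.
    Assumption: the percent-encoded octets are UTF-8 (unquote's default).
    Only this step depends on the local file system charset."""
    name = unquote(href.rsplit("/", 1)[-1])
    return name.encode(fs_encoding, errors="replace")


# Hypothetical page and link, for illustration only.
base = "http://example.com/docs/index.html"
href = "caf%C3%A9.html"

print(request_url(base, href))           # what goes on the request line
print(local_filename(href, "cp1252"))    # what is written to disk
```

The point of keeping these as two functions is that changing `fs_encoding`
never changes what is sent to the server.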

The 404 is *exactly* the problem I think is a bug. The downloaded HTML file
has embedded <a> tags with `href` attributes that are *never* outside of the
printable ascii range.

This 404 happens, as far as I can tell, because wget *assumes* the local
character set matters when building the request, instead of doing what the
HTML/HTTP standards specify, as far as I understand them: no character
encoding translation at all.
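
For what it is worth, here is a small Python sketch of how a second escaping
pass over an already-escaped `href` would yield a URL the server does not
have. This is only my guess at the shape of the problem, not something traced
through wget's source; the URL and variable names are made up for the example.

```python
from urllib.parse import quote, urljoin

# Hypothetical base page and link; the href is already percent-encoded ASCII.
BASE = "http://example.com/docs/index.html"
HREF = "caf%C3%A9.html"

# Expected behaviour: resolve the href as-is.
correct = urljoin(BASE, HREF)

# Suspected behaviour: the href is escaped again before resolving,
# so every '%' becomes '%25'.
double_escaped = urljoin(BASE, quote(HREF))

print(correct)         # http://example.com/docs/caf%C3%A9.html
print(double_escaped)  # http://example.com/docs/caf%25C3%25A9.html
```

The second URL names a file literally called `caf%C3%A9.html` on the server,
which would explain a 404.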

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



