[bug #60287] Windows recursive download escapes utf8 URLs twice
From: Cameron Tacklind
Subject: [bug #60287] Windows recursive download escapes utf8 URLs twice
Date: Fri, 26 Mar 2021 16:08:06 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
Follow-up Comment #7, bug #60287 (project wget):
> Not the local one.
Is this because wget first downloads the html file and then reads the contents
off disk to parse and find links before initiating subsequent http requests?
> And not every page you download has these headers, so the remote one isn't
> always known, either.
On the server I control, I've set nginx to always add the full "Content-Type:
text/html; charset=utf8" header.
> The browser just shows the page, it doesn't save it to a disk file. So
> encoding of the page's name isn't an issue for the browser, as it is for
> Wget.
This confirms my assumption that wget reads the file off disk to parse it for
links. For instance, if wget downloaded to memory, and parsed the html from
memory, there couldn't possibly be encoding issues because the fs isn't used?
If the bytes were downloaded with the correct encoding, and written to the
file system with the correct encoding, I would expect it to be able to parse
the file with the correct encoding.
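The expectation above can be sketched in a few lines of Python. This is only an illustration, not wget's code: the page body, the link name and the `page.html` file name are made up. It shows that writing bytes to disk and reading them back is byte-exact, so any mismatch would have to come from the encoding assumed when the file is parsed.

```python
import os
import tempfile

# Hypothetical page body containing one non-ASCII link, encoded as UTF-8.
html_text = '<a href="/caf\u00e9.html">caf\u00e9</a>'
payload = html_text.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")  # ASCII-only local file name
    with open(path, "wb") as f:
        f.write(payload)                   # bytes go to disk unchanged
    with open(path, "rb") as f:
        from_disk = f.read()               # and come back unchanged

same_bytes = from_disk == payload          # True: the fs did not alter content

# Decoding with the encoding the bytes were written in recovers the text.
decoded = from_disk.decode("utf-8")

# Decoding the same bytes with a different (assumed local) code page does not.
wrong = from_disk.decode("cp1252")         # 'é' turns into 'Ã©'
```

In this sketch the file system plays no part in the result; only the choice of decoder does.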
But this makes me think about it more: the file `wget-test.html` has no
non-ascii characters in it:
$ file -bi wget-test.html
text/html; charset=us-ascii
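One way a file can be pure ASCII and still point at non-ASCII names is if its links are already percent-encoded. Whether that is the case for `wget-test.html` is an assumption here, and the name `café.html` below is a made-up example. A minimal Python sketch of what "escaped twice" would then look like, using the standard `urllib.parse.quote`:

```python
from urllib.parse import quote

# Hypothetical link target with one non-ASCII character.
name = "caf\u00e9.html"

# First escape: the UTF-8 bytes of 'é' become %C3%A9.
once = quote(name)           # 'caf%C3%A9.html'

# The escaped form is plain ASCII, so a file holding it reports us-ascii.
is_ascii = once.isascii()    # True

# Second escape: the '%' characters themselves get escaped to %25.
twice = quote(once)          # 'caf%25C3%25A9.html'
```

A request for the `twice` form names a different resource than the `once` form, so a server would typically answer with a 404.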
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?60287>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/