Re: [Bug-wget] bad filenames (again)
From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 10:29:40 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )
On Monday 17 August 2015 22:51:12 Andries E. Brouwer wrote:
> On Mon, Aug 17, 2015 at 10:31:13PM +0300, Eli Zaretskii wrote:
> > what do we want to achieve here, and why is what wget did
> > before your patch the wrong thing?
>
> Wget modified filenames, and users are unhappy.
> See
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=387745
> http://savannah.gnu.org/bugs/?37564
> http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
> http://stackoverflow.com/questions/27054765/wget-japanese-characters
> http://www.win.tue.nl/~aeb/linux/misc/wget.html
> etc.
>
> It is debatable what precisely would be the right thing,
> but my patch greatly increases the number of happy users.
> Further improvement is possible.
> For example, nothing was changed yet for Windows, but also
> Windows users complain about this wget escaping.
I agree with Eli that we should use iconv.
We know the remote encoding and the local encoding, so I don't see a problem
here. There are a few cases (when using --input-file) where we have to tell
wget the encoding via --remote-encoding.
On Windows the default locale is very often Windows-1252 (aka CP1252),
which is a superset of the printable part of ISO-8859-1, while web servers
more and more often deliver content encoded as UTF-8. A UTF-8 filename of
'ö.html' (\xC3\xB6.html) should be saved as CP1252 ö.html (\xF6.html).
If conversion is not possible because of characters not included in CP1252,
we should fall back to escaping (as an improvement we could first try to
convert codepoint by codepoint and escape only the ones that are not
convertible).
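
To make the codepoint-by-codepoint idea concrete, here is a rough sketch in
Python (not the wget2 code, which is C and uses iconv). The function names,
the default encodings and the %XX escape style are assumptions chosen for
illustration only.

```python
def percent_escape(data: bytes) -> bytes:
    """Escape every byte as %XX (assumed escape style, for illustration)."""
    return "".join("%{:02X}".format(b) for b in data).encode("ascii")


def to_local_filename(raw: bytes,
                      remote_enc: str = "utf-8",
                      local_enc: str = "cp1252") -> bytes:
    """Convert a remote filename to the local encoding.

    Codepoints that exist in the local encoding are converted;
    the others are escaped from their original remote bytes.
    """
    try:
        text = raw.decode(remote_enc)
    except UnicodeDecodeError:
        # The name is not valid in the claimed remote encoding:
        # keep ASCII bytes and escape everything else.
        return b"".join(
            bytes([b]) if b < 0x80 else percent_escape(bytes([b]))
            for b in raw
        )

    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(local_enc)
        except UnicodeEncodeError:
            # Not representable locally: escape the remote bytes.
            out += percent_escape(ch.encode(remote_enc))
    return bytes(out)
```

With these defaults, the UTF-8 name \xC3\xB6.html comes out as \xF6.html,
while a character that CP1252 lacks (a Japanese one, say) is escaped and the
rest of the name is still converted.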
This is already done in the 'wget2' branch, where it can be tested (using src2/wget2).
We just have to backport it to Wget 'master' branch. For me, this is just a
matter of available time.
Tim