Re: [Bug-wget] bad filenames (again)
From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Tue, 25 Aug 2015 14:59:58 +0200
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:
> Just implemented (or let's say fixed) Content-Disposition in wget2. It now
> saves the file as
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
Good!
> Content-Disposition (filename, filename*) is standardized, but browsers seem
> to behave/parse very differently, ignoring standards.
Yes. On the web a general phenomenon is that non-specialists create websites.
They know nothing about standards, but fiddle until it works (say, with IE6).
Also Microsoft does/did not respect standards.
A consequence is that practice is more important than theory.
One has to try for robust solutions.
> > I prefer to base the decision about what to do on the form
> > of the filename (ASCII / UTF-8 / other), not on the
> > headers encountered on the way to this file.
>
> I guess we can find an easy agreement.
>
> 1. Wget has to obey the defaults. If it fails or we find a well-known
> misbehavior (server/document fault), handle it automatically.
> That's how we try to do it now.
>
> 2. If still a problem arises, the user should be able to intercept. Using
> special command line options for fine-tuning Wget's behavior.
Yes, whatever the user says, we do. The case where options have been given
is unproblematic.
That leaves your point 1. I am not sure what you think the defaults are.
My basic example is the %-encoded pure ASCII url, referring to a non-text
object. How should wget save the object? There is zero charset information.
My answer today (after conversation with Eli) is:
"Decode the %-encoded string. The last part is the suggested filename.
If it is ASCII, use that ASCII name (where valid for the OS).
If it is UTF-8 (but not ASCII), use it when the locale is UTF-8,
otherwise convert (if possible) or escape. If it is not UTF-8, escape."
[That is, I would recognize only what is easy to recognize,
and prefer not to rely on any headers. Prefer not to convert
except possibly in the UTF-8 case.]
How does your answer differ?
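
The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the proposal, not wget code: the names classify, escape, suggested_filename and the locale_charset parameter are made up for the example, the example.org URLs are placeholders, and the "where valid for the OS" check is left out.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def classify(name: bytes) -> str:
    """Return 'ascii', 'utf8' or 'other', looking only at the bytes."""
    if all(b < 0x80 for b in name):
        return "ascii"
    try:
        name.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "other"


def escape(name: bytes) -> str:
    """Write every non-ASCII byte as %XX so that no information is lost."""
    return "".join(chr(b) if b < 0x80 else "%%%02X" % b for b in name)


def suggested_filename(url: str, locale_charset: str = "UTF-8") -> str:
    # Decode the %-encoded path; the last part is the suggested filename.
    last = unquote_to_bytes(urlsplit(url).path).rsplit(b"/", 1)[-1]
    kind = classify(last)
    if kind == "ascii":
        return last.decode("ascii")
    if kind == "utf8":
        text = last.decode("utf-8")
        if locale_charset.upper().replace("-", "").replace("_", "") == "UTF8":
            return text
        try:
            # "Convert if possible": keep the name if the locale can express it.
            text.encode(locale_charset)
            return text
        except (UnicodeEncodeError, LookupError):
            return escape(last)
    # Not ASCII and not UTF-8: no guessing, just escape.
    return escape(last)


print(suggested_filename("http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg"))
print(suggested_filename("http://example.org/kn%C3%A4ckebr%C3%B6d.jpg"))
print(suggested_filename("http://example.org/kn%C3%A4ckebr%C3%B6d.jpg", "ascii"))
```

Note that no header and no charset declaration enters the decision; only the form of the name and the local charset do.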
Some ancient docs say that ISO-8859-1 is a default. Even if such docs
have not yet been replaced, I feel that they no longer reflect current
practice. ISO-8859-x is dying. All the web should converge to Unicode,
whatever that may be.
The relevant example might be this one:
http://www.win.tue.nl/~aeb/linux/lk/kn%e4ckebr%f6d.jpg
I have the impression that you are happy with "knäckebröd.jpg"
(that is, the UTF-8 bytes kn=C3=A4ckebr=C3=B6d.jpg),
but I would be unhappy with that (although it happens to be correct),
since guessing and conversion are involved.
Guessing may not be so bad, but guessing and converting is terrible:
it can be really complicated to retrieve the original filename
after an incorrect conversion.
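
To illustrate why guessing plus converting is worse than guessing alone, here is a small Python sketch. The choice of ISO-8859-7 as the "wrong guess" and the helper name recover are assumptions made for the example only.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("kn%e4ckebr%f6d.jpg")  # the bytes the server named

# The same bytes read under two different guessed charsets.
as_latin1 = raw.decode("iso-8859-1")  # 'knäckebröd.jpg'
as_greek = raw.decode("iso-8859-7")   # Greek letters in place of ä and ö

# Suppose the Greek guess was made and the name was then converted to UTF-8.
stored = as_greek.encode("utf-8")


def recover(stored_utf8: bytes, assumed_charset: str):
    """Try to get the original bytes back; None if the assumption does not fit."""
    try:
        return stored_utf8.decode("utf-8").encode(assumed_charset)
    except UnicodeEncodeError:
        return None


# Only someone who knows which charset was guessed gets the original back.
print(recover(stored, "iso-8859-7") == raw)
print(recover(stored, "iso-8859-1"))
```

An escaped name such as kn%E4ckebr%F6d.jpg needs no such knowledge: the original bytes can be read off directly.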
Andries
Another URL:
http://hongaarskinderplezier.eu/index.php?pagina=96&naam=Gy%25F5r-Moson-Sopron
This is about holidays near the beautiful city of Győr in Hungary.
But what happened to the city? Its name was written in ISO-8859-2,
using 0xf5, and that was %-escaped to %f5, and that was again
%-escaped to %25f5.
I would undo the %-escape and see pure ASCII, and save as
index.php?pagina=96&naam=Gy%F5r-Moson-Sopron.
What would you do?
The page has <meta charset="ISO-8859-2" />
The headers have Content-Type: text/html without charset information.
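
The layers in this URL can be peeled apart with Python's urllib.parse, shown here purely as an illustration. The second decoding step uses the ISO-8859-2 hint from the page's meta tag and is only there to show what the name was meant to be; it is not something I propose wget should do.

```python
from urllib.parse import unquote, unquote_to_bytes

query = "pagina=96&naam=Gy%25F5r-Moson-Sopron"

# Undo one level of %-escaping: %25 becomes a literal percent sign.
once = unquote(query)
print(once)            # pagina=96&naam=Gy%F5r-Moson-Sopron
print(once.isascii())  # True: pure ASCII, so it can be saved as is

# Undoing a second level yields the byte 0xf5, which is only meaningful
# once a charset is assumed (here ISO-8859-2, as the page declares).
twice = unquote_to_bytes(once)
print(twice.decode("iso-8859-2"))
```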
---
Similarly http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l
has the %-encoded version of "Lchf kn%e4ckebr%f6d mandelmj%f6l"
which in turn is the %-encoded ISO-8859-1 version of lchf knäckebröd mandelmjöl.
Such double encodings are not uncommon.
But as a first approximation I think wget should not try to recognize them.
---
http://www.eet-china.com/SEARCH/ART/%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM
ends in %C0%B6的%D1的%C0.HTM - this is an %-encoding using fat %-signs (U+ff05).
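
A quick Python check of that last path component, again only as an illustration (the variable names are made up for the example), shows that the percent signs left after decoding are U+FF05 and not ASCII "%":

```python
from urllib.parse import unquote

last = "%EF%BC%85C0%EF%BC%85B6%E7%9A%84%EF%BC%85D1%E7%9A%84%EF%BC%85C0.HTM"
FULLWIDTH_PERCENT = "\uff05"

# One level of %-decoding, interpreting the result as UTF-8.
decoded = unquote(last)
print(decoded)
print(decoded.count(FULLWIDTH_PERCENT))  # number of fat %-signs
print("%" in decoded)                    # no ASCII percent sign remains
```

So the decoded name is valid UTF-8 but not ASCII, and an ordinary %-decoder will not look any further into it.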
You see that one can encounter all levels of messiness.