Re: [Bug-wget] bad filenames (again)


From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Wed, 12 Aug 2015 17:54:25 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )

On Wednesday 12 August 2015 14:38:15 Andries E. Brouwer wrote:
> Hi Tim,
> 
> > Just a few questions.
> > 
> > 1.
> > Why don't you use 'opt.locale' to check if the local encoding is UTF-8 ?
> 
> I thought that was usable only if ENABLE_IRI was defined.

I see. ENABLE_IRI, libiconv (for conversion) and libidn (used for setting 
opt.locale) are tightly coupled. Understandable that you won't go into that 
swamp.

> > 2.
> > I don't understand how you distinguish between illegal and legal UTF-8
> > sequences. I guess only legal sequences should be unescaped.
> > Or to make it easy: if the string is valid UTF-8, do not escape.
> > If it is not valid UTF-8, escape it.
> > You could:
> > Add unistr/u8-check to bootstrap.conf (./bootstrap thereafter),
> > add #include "unistr.h", and use
> > if (u8_check (s, strlen(s)) == 0) to test for validity.
> 
> Yes, I expected you to say something like this.
> 
> My reason: I consider this escaping a very doubtful activity.
> In my eyes the correct code is not: always escape except when UTF-8,
> but rather: never escape except perhaps when someone asks for it.
> So the precise check for UTF-8 is in my eyes just bloat.

Of course, only when someone asks (in this special case).
But the user should *really* know what he is doing; otherwise the requested
'not-escaping' becomes an epic fail.
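
For illustration, the validity test from point 2 could look roughly like the
sketch below. It is a self-contained stand-in so that it compiles without
gnulib; in wget itself, u8_check from "unistr.h" would do this job. The names
is_valid_utf8 and filename_needs_escaping are hypothetical and not part of
wget.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not wget code): true if the n bytes at s are
   well-formed UTF-8, i.e. no overlong forms, no surrogates and nothing
   above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len;

        if (c < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            len = 2; cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) {
            len = 3; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            len = 4; cp = c & 0x07;
        } else {
            return false;               /* stray continuation, 0xC0/0xC1, > 0xF4 */
        }

        if (n - i < len)
            return false;               /* sequence is cut off */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (len == 3 && cp < 0x800)
            return false;               /* overlong 3-byte form */
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
            return false;               /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               /* UTF-16 surrogate */

        i += len;
    }
    return true;
}

/* Hypothetical policy from the discussion: valid UTF-8 is left alone,
   anything else is a candidate for escaping. */
static bool filename_needs_escaping(const char *name)
{
    return !is_valid_utf8((const unsigned char *) name, strlen(name));
}
```

With this, a name such as "caf\xc3\xa9.html" would be kept as-is, while the
Latin-1 bytes "caf\xe9.html" would be flagged for escaping.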

> Moreover: what to do if the name is not valid UTF-8?
> The current escaping produces something that is not valid UTF-8.
> So doing the current escaping is certainly a mistake, not better
> than using the name as-is. Invent a new type of escaping?

The procedure should be (simplified):
When extracting a URL from a document, we know its encoding. When we 
generate a filename from this URL we should (and can) convert to the local 
encoding first, then generate the filename. If this fails (likely an iconv() 
problem), we fall back to escaping according to the user's wishes (unless the 
user has explicitly asked for no escaping).
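
A rough sketch of that procedure, using plain POSIX iconv(). All function
names (convert_charset, percent_escape, local_filename) and the allow_escaping
flag are hypothetical, and the %XX escaping is only a placeholder, not wget's
actual escaping rules. The output buffer size is an assumption that holds for
UTF-8 and single-byte target encodings.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert src from charset `from` to charset `to`.
   Returns a malloc'ed string, or NULL if the conversion fails or is lossy. */
static char *convert_charset(const char *src, const char *from, const char *to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return NULL;

    size_t inleft = strlen(src);
    size_t outsize = inleft * 4 + 16;   /* assumption: enough for the target */
    size_t outleft = outsize - 1;
    char *out = malloc(outsize);
    if (!out) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *) src;           /* iconv() does not modify the input */
    char *outp = out;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc == 0)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc != 0) {                      /* error or non-reversible conversion */
        free(out);
        return NULL;
    }
    *outp = '\0';
    return out;
}

/* Placeholder escaping: control bytes, non-ASCII bytes and '%' become %XX. */
static char *percent_escape(const char *src)
{
    static const char hex[] = "0123456789ABCDEF";
    char *out = malloc(strlen(src) * 3 + 1);
    if (!out)
        return NULL;

    char *p = out;
    for (const unsigned char *s = (const unsigned char *) src; *s; s++) {
        if (*s < 0x20 || *s >= 0x7F || *s == '%') {
            *p++ = '%';
            *p++ = hex[*s >> 4];
            *p++ = hex[*s & 0x0F];
        } else {
            *p++ = (char) *s;
        }
    }
    *p = '\0';
    return out;
}

/* Convert first; only if that fails, escape (or keep the raw bytes when the
   user has asked for no escaping). The caller frees the result. */
static char *local_filename(const char *url_part, const char *doc_charset,
                            const char *local_charset, int allow_escaping)
{
    char *name = convert_charset(url_part, doc_charset, local_charset);
    if (name)
        return name;
    if (allow_escaping)
        return percent_escape(url_part);

    size_t len = strlen(url_part) + 1;
    char *copy = malloc(len);
    if (copy)
        memcpy(copy, url_part, len);
    return copy;
}
```

The point of the ordering is that escaping is never applied to a name that can
be represented in the local encoding; it is only the fallback for names that
cannot be converted.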

> So, for the time being, my previous patch avoided the old mistake,
> without introducing new mistakes :-).

OK. Let's set up a test where we define input and expected output.
If that works, I am fine.

Regards, Tim
