Re: [Bug-wget] bad filenames (again)

From:	Eli Zaretskii
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Mon, 17 Aug 2015 22:31:13 +0300

> Date: Mon, 17 Aug 2015 19:58:31 +0200
> From: "Andries E. Brouwer" <address@hidden>
> Cc: "Andries E. Brouwer" <address@hidden>, address@hidden,
>         address@hidden
> 
> On Mon, Aug 17, 2015 at 06:27:05PM +0300, Eli Zaretskii wrote:
> 
> >> (ii) [about possibly using iconv]
> >> 
> >>>> How do you guess the original character set?
> >
> > The answer is to call "nl_langinfo (CODESET)".
> 
> I think we are not communicating.
> 
> wget fetches a file from a remote machine.
> We know the filename (as a sequence of bytes).
> As far as I can see, there is no information on what character set
> (if any) that sequence of bytes might be in.

Then please explain why you started this thread by saying that the
byte sequence should end up unaltered in the filesystem (and wrote the
patch to do the same, AFAIU) if the target's locale uses UTF-8 as its
encoding.  What do you expect the file names to look like in 'ls' or
anything similar, after doing that?

> In order to call iconv, I need a from-charset and a to-charset.
> I think your answer tells me how to find a reasonable to-charset.
> But the problem is how to find a from-charset.

I thought the from-charset was UTF-8, or at least you assumed that.
If it isn't, I see even less sense in the idea of your patch, which is
basically writing the bytes unaltered.  Don't we want to try to have
on the target the same file names as on the source?  If not, what do
we want to achieve here, and why is what wget did before your patch
the wrong thing?
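
For reference, the to-charset half of the question (the "nl_langinfo (CODESET)" answer quoted above) could look roughly like the sketch below. The helper name target_codeset is hypothetical and not part of wget; the sketch assumes a POSIX system that provides <langinfo.h>.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper, not wget's code: return the name of the
   character set used by the current locale (the "to-charset").
   The returned string is owned by the C library and may be
   overwritten by later setlocale/nl_langinfo calls.  */
const char *
target_codeset (void)
{
  /* Adopt the user's locale from the environment.  If this fails,
     the program stays in the "C" locale and nl_langinfo reports
     that locale's codeset instead.  */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}
```

The exact string returned is system-dependent (for example, the spelling used for UTF-8 or ASCII differs between C libraries), so it is best passed straight to iconv_open rather than compared against a fixed name.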

> [Even when from-charset and to-charset are known there is
> a can of worms involved in conversion.

No can of worms that I could see.  Either the conversion succeeds, or
it fails.  You get a clear indication from iconv about that.
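
A minimal sketch of such a conversion, assuming both charsets are already known, is shown below. The function name convert_name and its buffer interface are hypothetical, chosen for illustration only; the iconv_open, iconv and iconv_close calls are the standard POSIX ones.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not wget's code: convert the NUL-terminated
   string IN from charset FROM to charset TO, storing the
   NUL-terminated result in OUT (OUTSIZE bytes).
   Returns 0 on success, -1 on any failure.  */
int
convert_name (const char *from, const char *to,
              const char *in, char *out, size_t outsize)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv takes char **, it does not modify IN */
  char *outp = out;
  size_t inleft = strlen (in);
  size_t outleft;
  int ok;

  if (outsize == 0)
    return -1;
  outleft = outsize - 1;        /* reserve room for the trailing NUL */

  cd = iconv_open (to, from);   /* note the argument order: to, from */
  if (cd == (iconv_t) -1)
    return -1;                  /* unsupported charset pair */

  /* Convert the input, then flush any pending shift state.  A return
     value of (size_t) -1 signals failure; errno then says why
     (EILSEQ, EINVAL or E2BIG).  */
  ok = iconv (cd, &inp, &inleft, &outp, &outleft) != (size_t) -1
       && iconv (cd, NULL, NULL, &outp, &outleft) != (size_t) -1;
  iconv_close (cd);

  if (!ok)
    return -1;
  *outp = '\0';
  return 0;
}
```

With this sketch, converting the Latin-1 bytes "caf\xE9" from ISO-8859-1 to UTF-8 succeeds, while feeding it a byte sequence that is not valid in the claimed from-charset makes it return -1. How iconv treats characters that are valid in the source but have no representation in the target differs between implementations, so a caller may want to check that case separately.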

> > > Unix filenames are not necessarily in any particular character set.
> > > They are sequences of bytes different from NUL and '/'.
> > > A different sequence of bytes is a different filename.
> > 
> > As long as you treat them as UTF-8 encoded strings, ...
> 
> I don't understand how one can treat sequences of bytes
> that are not valid UTF-8 as UTF-8 encoded strings.
> If all the world is UTF-8 then fine. But the remote machine
> is an unknown system. We just have a byte sequence, that is all.

If we know nothing about the source encoding, then the only sane thing
is to always hex-encode characters with 8th bit set.  But that's not
what your patch does.  It writes the byte stream verbatim to the
filesystem if the target locale uses UTF-8 as its codeset.  Please
explain the logic behind this, because I don't see it.
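
As an illustration of the hex-encoding fallback mentioned above, here is a minimal sketch. The "%XX" output format and the helper name hex_escape_high_bytes are assumptions made for the example; the thread does not specify an exact format, and this is not wget's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not wget's code: copy NAME into OUT, replacing
   every byte that has the 8th bit set with "%XX" (uppercase hex).
   Bytes below 0x80 are copied unchanged.
   Returns 0 on success, -1 if OUT (OUTSIZE bytes) is too small.  */
int
hex_escape_high_bytes (const char *name, char *out, size_t outsize)
{
  static const char hexdigits[] = "0123456789ABCDEF";
  size_t len = strlen (name);
  size_t o = 0;

  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) name[i];
      size_t need = (c & 0x80) ? 3 : 1;

      if (o + need >= outsize)  /* keep one byte for the trailing NUL */
        return -1;

      if (c & 0x80)
        {
          out[o++] = '%';
          out[o++] = hexdigits[c >> 4];
          out[o++] = hexdigits[c & 0x0F];
        }
      else
        out[o++] = (char) c;
    }

  if (o >= outsize)             /* only possible when OUTSIZE is 0 */
    return -1;
  out[o] = '\0';
  return 0;
}
```

Because only bytes with the high bit set are touched, the result is pure ASCII and makes no assumption about the encoding of the source name.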
