
Re: [Bug-wget] bad filenames (again)


From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 11:58:54 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )

On Tuesday 18 August 2015 10:55:46 Andries E. Brouwer wrote:
> On Tue, Aug 18, 2015 at 10:29:40AM +0200, Tim Ruehsen wrote:
> > I am going with Eli that we should use iconv.
> > We know the remote encoding and the local encoding
> 
> Do we?
> How do you guess the remote encoding?
> Is there any particular encoding?

Yes we do.
Starting with 'wget URL', URL has the local encoding (can be overridden by
--local-encoding).
Using wget -r will download documents (HTML and CSS right now) and parse them
for more URLs. These documents have a known encoding (either the default, or
one set explicitly via the HTTP header or within the document itself). For
broken servers, we still have --remote-encoding.
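
To illustrate the idea of converting between a known remote encoding and a known local encoding, here is a minimal Python sketch. It is not wget's code and does not call iconv; the function name to_local_bytes and the sample bytes are assumptions made up for this example.

```python
def to_local_bytes(raw: bytes, remote_encoding: str, local_encoding: str) -> bytes:
    """Re-encode a name from the remote encoding to the local one.

    Hypothetical helper for illustration only; it mirrors what an
    iconv-style conversion does, but it is not wget's implementation.
    """
    text = raw.decode(remote_encoding)   # bytes -> characters
    return text.encode(local_encoding)   # characters -> bytes


# 'äöü' as sent by a server that uses iso-8859-1 (one byte per character).
remote_name = b"\xe4\xf6\xfc"

# The same three characters in a UTF-8 locale (two bytes per character).
local_name = to_local_bytes(remote_name, "iso-8859-1", "utf-8")
print(local_name)
```

The conversion only works when both encodings are actually known, which is the point under discussion here.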

> Unix filenames are sequences of bytes, they do not have a character set.

The character encoding determines which symbols these bytes (or byte sequences,
aka multibyte characters / code points) are displayed as. I gave an example in
my last email.

Change your locale to iso-8859-1 and run 'touch äöü'. 'ls' will show it
correctly. Then change your locale to UTF-8 and 'ls' will now show garbage,
though your file name did not change.
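
The same effect can be reproduced without touching the locale. The Python sketch below is only an illustration (the variable names are invented here): it takes the bytes that 'touch äöü' would store under iso-8859-1 and interprets them under both encodings.

```python
# The bytes a file name 'äöü' consists of when created in an iso-8859-1 locale.
name_bytes = "äöü".encode("iso-8859-1")

# Interpreted as iso-8859-1, the bytes map back to the intended characters.
as_latin1 = name_bytes.decode("iso-8859-1")

# Interpreted as UTF-8, the same bytes are not a valid sequence; invalid
# bytes are shown as U+FFFD replacement characters instead of raising.
as_utf8 = name_bytes.decode("utf-8", errors="replace")

print(name_bytes)
print(as_latin1)
print(as_utf8)
```

The bytes never change between the two decode calls; only the interpretation does.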

Tim
