Re: [Bug-wget] Unexpected character on a downloaded page


From: Ángel González
Subject: Re: [Bug-wget] Unexpected character on a downloaded page
Date: Sun, 15 Jun 2014 22:28:14 +0200
User-agent: Thunderbird

On 14/06/14 20:31, Angel Tsankov wrote:
Why does wget 1.15 (and 1.12) insert Â in several places in the copy it makes of the following page:

http://www.helloquizzy.com/results/helen-fisher-personality-type-test/?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1

Short answer: because that is what the page contains.

Long answer: That page contains several non-breaking spaces (U+00A0, code point 160, which is outside ASCII). Encoded as UTF-8, each one becomes the two bytes C2 A0. If you read the page as if it were ISO-8859-1, the byte C2 is displayed as the glyph Â (and A0 as a non-breaking space).
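
To see the effect in isolation, here is a minimal Python sketch (Python is my choice for illustration; nothing here is specific to wget). The variable names are made up for the example.

```python
# U+00A0 is NO-BREAK SPACE; encoded as UTF-8 it takes two bytes.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex())  # c2a0

# Decoding the same two bytes as ISO-8859-1 yields two characters:
# U+00C2 (the glyph Â) followed by U+00A0 (a no-break space again).
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in misread])

# Decoding as UTF-8 recovers the single original character.
assert utf8_bytes.decode("utf-8") == nbsp
```

The stray Â is therefore introduced by whatever program displays the file with the wrong charset; the bytes on disk are the same either way.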

The page correctly states it's in utf-8:
Content-Type: text/html; charset=utf-8
so it should be read in utf-8 mode.
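
If you post-process the saved file yourself, a rough sketch of honouring that declaration could look like the following. It assumes you have the Content-Type value as a string; the helper name charset_from_content_type, the fallback charset and the sample bytes are all hypothetical.

```python
from email.message import Message

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type value.

    The default is only an assumption for this sketch; pick whatever
    fallback suits your data.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

# Sample bytes standing in for part of the downloaded file.
raw = b"one\xc2\xa0two"

charset = charset_from_content_type("text/html; charset=utf-8")
text = raw.decode(charset)  # one no-break space between the words, no Â
```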

(wget is not altering anything here; it just receives the bytes and stores them in the file as-is.)


Also, why does it download the page to 'index.html?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1' (i.e. with index.html prepended)?

That's because the URL path ends in a directory, with no file name. If you had downloaded just http://www.helloquizzy.com/results/helen-fisher-personality-type-test/ (i.e. no file name), wget would save it as index.html; with a query string present, the query is appended to that default name.
I think you could argue both ways for .
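
The naming rule described above can be approximated like this. This is only an illustration of the behaviour seen in this thread, not wget's actual code, which handles more cases; the function default_local_name is hypothetical.

```python
from urllib.parse import urlsplit

def default_local_name(url, index_name="index.html"):
    """Approximate the local file name for a URL (illustration only)."""
    parts = urlsplit(url)
    # Last path segment; empty when the path ends in "/" or is missing.
    last_segment = parts.path.rsplit("/", 1)[-1]
    name = last_segment or index_name
    if parts.query:
        # The query string is kept as part of the local name.
        name += "?" + parts.query
    return name

url = ("http://www.helloquizzy.com/results/helen-fisher-personality-type-test/"
       "?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1")
print(default_local_name(url))
```

With the URL from the question, this prints the same name that was reported: index.html followed by the query string.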


Regards,

Angel Tsankov
Best regards

Ángel González






