Re: [Bug-wget] Unexpected character on a downloaded page
From: Ángel González
Subject: Re: [Bug-wget] Unexpected character on a downloaded page
Date: Sun, 15 Jun 2014 22:28:14 +0200
User-agent: Thunderbird
On 14/06/14 20:31, Angel Tsankov wrote:
Why does wget 1.15 (and 1.12) insert Â in several places in the copy
it makes of the following page:
http://www.helloquizzy.com/results/helen-fisher-personality-type-test/?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1
Short answer: because that's what is at that page.
Long answer: That page contains several non-breaking spaces (U+00A0,
code 160 in ISO-8859-1), which when encoded as UTF-8 become the bytes
C2 A0. If you read the page as if it were ISO-8859-1, the byte C2 is
shown as the glyph Â (and the byte A0 as a non-breaking space).
The page correctly states it's in utf-8:
Content-Type: text/html; charset=utf-8
so it should be read in utf-8 mode.
(wget is doing nothing here; it just receives the bytes and stores them
in the file as-is)
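To see the effect at the byte level, here is a minimal Python sketch (not anything wget itself runs; the variable names are mine, purely for illustration). It encodes a non-breaking space as UTF-8 and then decodes the same bytes under both charsets:

```python
# U+00A0 is NO-BREAK SPACE; encoded as UTF-8 it is the two bytes C2 A0.
nbsp = "\u00a0"
raw = nbsp.encode("utf-8")
print(raw.hex())  # c2a0

# Read as ISO-8859-1, every byte becomes one character:
# C2 -> U+00C2 (the glyph Â), A0 -> U+00A0 (a non-breaking space).
misread = raw.decode("iso-8859-1")
print(repr(misread))  # 'Â\xa0'

# Read as UTF-8, as the Content-Type header says, the two bytes
# turn back into the single original character.
correct = raw.decode("utf-8")
print(repr(correct))  # '\xa0'
```

So the file on disk is fine; it is the program displaying it that has to be told (or has to detect) that the content is utf-8.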
Also, why does it download the page to
'index.html?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1'
(i.e. with index.html prepended)?
That's because if you just downloaded
http://www.helloquizzy.com/results/helen-fisher-personality-type-test/
(i.e. with no filename), wget would save it as index.html; the query
string is then appended to that name.
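As a rough illustration of that naming rule, here is a small Python sketch. It only approximates the behaviour described above and is not wget's actual code; the function name local_name and the default_page parameter are hypothetical:

```python
from urllib.parse import urlsplit


def local_name(url, default_page="index.html"):
    """Approximate the local file name chosen for a URL (illustrative only)."""
    parts = urlsplit(url)
    # Last component of the path; empty when the path ends in "/".
    last = parts.path.rsplit("/", 1)[-1]
    # No file name in the URL: fall back to the default page name.
    name = last or default_page
    # The query string, if any, is kept as part of the name.
    if parts.query:
        name += "?" + parts.query
    return name


base = "http://www.helloquizzy.com/results/helen-fisher-personality-type-test/"
print(local_name(base))
print(local_name(base + "?var_Explorer=1&fromCGI=1"))
```

With the first URL this gives index.html, and with the second index.html?var_Explorer=1&fromCGI=1, which matches the pattern you are seeing.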
I think you could argue both ways for .
Regards,
Angel Tsankov
Best regards
Ángel González