bug-wget

Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale


From: Ander Juaristi
Subject: Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
Date: Mon, 13 Apr 2015 17:03:23 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0

On 04/03/2015 02:16 PM, Tim Rühsen wrote:
Hi Ander,

On Friday, 3 April 2015, 12:26:09, Ander Juaristi wrote:
On 03/13/2015 11:48 PM, Adam Sampson wrote:
Hi,

I've just found a case where wget 1.16.3 responds to a 302 redirect
differently depending on whether it's in an ASCII or UTF-8 locale.

This works:
    LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

This doesn't work:
    LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

I've attached logs with -d showing what's actually going on. The
initial request gives a 302 response with a Location: that contains:
    ....tar.bz2?Signature=up6%2BtTpSF...

In the UTF-8 locale, wget correctly redirects to that location.

In the ASCII locale, wget -d prints a "converted: '...' -> '...'" line
(from iri.c's do_conversion), then redirects to:
    ....tar.bz2?Signature=up6+tTpSF...

(If you try it yourself you'll get a slightly different URL, but at
least for me it usually contains %2B somewhere.)

This appears to be because do_conversion calls url_unescape on the
input string it's given -- even though that input string is a _const_
char * in the code that calls it (main -> retrieve_url -> url_parse ->
remote_to_utf8 -> do_conversion). It's not immediately obvious to me
whether that's intentional or not; at the very least, it's a surprising
bit of behaviour.

That call to url_unescape() is necessary because iconv() needs the raw
multibyte characters, without percent-encoding. My first approach, by the
way, was to remove that call, but that caused Test-iri-percent.px to fail,
which makes the need for it pretty clear.

The issue seems to be at the call to reencode_escapes(), just after
remote_to_utf8() returns. By then %2B has already been decoded to a literal
"+", which is indistinguishable from the reserved character "+", so
reencode_escapes() treats it as a reserved character and leaves it as-is.
The same happens with other characters, such as "=" (%3D).
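
To make the round trip concrete, here is a minimal sketch of the failure mode. The helpers sketch_unescape() and sketch_reencode() are hypothetical stand-ins written for illustration; they are not wget's url_unescape() or reencode_escapes(), and they only assume the behaviour described above (decode every escape, then leave reserved characters untouched when re-encoding).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* RFC 3986 "reserved" set: gen-delims plus sub-delims. */
static int sketch_is_reserved(unsigned char c)
{
    return c != '\0' && strchr(":/?#[]@!$&'()*+,;=", c) != NULL;
}

/* RFC 3986 "unreserved" set: ASCII letters, digits, and "-._~". */
static int sketch_is_unreserved(unsigned char c)
{
    return (c < 0x80 && isalnum(c)) || (c != '\0' && strchr("-._~", c) != NULL);
}

/* Hypothetical stand-in: decode every %XX escape in place. */
static void sketch_unescape(char *s)
{
    char *out = s;
    while (*s) {
        if (s[0] == '%' && isxdigit((unsigned char)s[1])
            && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *out++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}

/* Hypothetical stand-in: escape only bytes that are neither reserved,
   unreserved, nor '%'. dst must hold 3 * strlen(src) + 1 bytes. */
static void sketch_reencode(const char *src, char *dst)
{
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (sketch_is_reserved(c) || sketch_is_unreserved(c) || c == '%')
            *dst++ = (char)c;   /* a decoded '+' lands here and stays literal */
        else
            dst += sprintf(dst, "%%%02X", (unsigned)c);
    }
    *dst = '\0';
}
```

Feeding a shortened fragment such as "Signature=up6%2BtTpSF" through sketch_unescape() and then sketch_reencode() yields "Signature=up6+tTpSF": once the escape is decoded, nothing tells the re-encoder that this "+" used to be %2B.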

What I propose is to tag the characters that have been decoded, in
url_unescape(), and then in reencode_escapes(), verify if they coincide
with reserved characters as well.
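
One way the tagging idea could look, as a rough sketch only: the decoder fills a parallel array marking which output bytes came from an escape, and the re-encoder escapes a reserved character again when its tag is set. The names tagged_unescape() and tagged_reencode(), and the parallel-array layout, are assumptions made for illustration; nothing like them exists in wget.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int tag_is_reserved(unsigned char c)
{
    return c != '\0' && strchr(":/?#[]@!$&'()*+,;=", c) != NULL;
}

static int tag_is_unreserved(unsigned char c)
{
    return (c < 0x80 && isalnum(c)) || (c != '\0' && strchr("-._~", c) != NULL);
}

/* Decode %XX in place. tag[i] is set to 1 when decoded byte i came from an
   escape, 0 otherwise. tag needs at least strlen(s) entries. %00 is left
   escaped so the result stays a valid C string. */
static void tagged_unescape(char *s, unsigned char *tag)
{
    const char *in = s;
    size_t n = 0;
    while (*in) {
        if (in[0] == '%' && isxdigit((unsigned char)in[1])
            && isxdigit((unsigned char)in[2])
            && !(in[1] == '0' && in[2] == '0')) {
            char hex[3] = { in[1], in[2], '\0' };
            s[n] = (char)strtol(hex, NULL, 16);
            tag[n] = 1;
            in += 3;
        } else {
            s[n] = *in++;
            tag[n] = 0;
        }
        n++;
    }
    s[n] = '\0';
}

/* Re-encode, escaping a reserved character or '%' again when it is tagged
   as having been decoded. dst must hold 3 * strlen(src) + 1 bytes. */
static void tagged_reencode(const char *src, const unsigned char *tag, char *dst)
{
    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        int keep_literal = tag_is_unreserved(c)
            || ((tag_is_reserved(c) || c == '%') && !tag[i]);
        if (keep_literal)
            *dst++ = (char)c;
        else
            dst += sprintf(dst, "%%%02X", (unsigned)c);
    }
    *dst = '\0';
}
```

With this sketch, "a%2Bb+c%3Dd" decodes to "a+b+c=d" and re-encodes back to "a%2Bb+c%3Dd": the literal "+" stays literal and the two decoded characters get their escapes back. The cost is that the tag array has to travel alongside the string through every step in between, including the charset conversion.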

What do you guys think?

Without looking at the code right now and from what you describe above, your
proposal sounds like a good idea. This problem pops up again and again. If you
solve the issue, some people will love you :-)

Regards, Tim

As promised, here it goes.

This works for me, although I expect to send a test case in the coming
days.

I read RFC 3987, on which iri.c is based, and it proposes a better approach
than mine for this specific case, in section 3.2, "Converting URIs to IRIs".
Thus, I decided to implement that approach, which basically says that
characters in "reserved" should *not* be unescaped prior to converting to
UTF-8.
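
As an illustration of that rule (a sketch, not the attached patch): step 1 of RFC 3987 section 3.2 decodes percent-escapes except those for "%", for characters in "reserved", and for US-ASCII characters not allowed in URIs. Since every ASCII character allowed in a URI is either reserved, unreserved, or "%", this amounts to decoding an escape only when it yields an unreserved ASCII character or a non-ASCII octet. The function name unescape_keep_reserved() is hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* RFC 3986 "unreserved" set: ASCII letters, digits, and "-._~". */
static int is_unreserved_ascii(unsigned char c)
{
    return c < 0x80 && (isalnum(c) || (c != '\0' && strchr("-._~", c) != NULL));
}

/* Hypothetical helper: decode %XX in place only when the result is an
   unreserved ASCII character or a non-ASCII octet (>= 0x80). Escapes for
   '%', reserved characters, and disallowed ASCII are copied unchanged. */
static void unescape_keep_reserved(char *s)
{
    char *out = s;
    while (*s) {
        if (s[0] == '%' && isxdigit((unsigned char)s[1])
            && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            unsigned char c = (unsigned char)strtol(hex, NULL, 16);
            if (c >= 0x80 || is_unreserved_ascii(c)) {
                *out++ = (char)c;   /* safe to decode */
                s += 3;
                continue;
            }
            /* Keep the three-character escape as it was. */
            *out++ = *s++;
            *out++ = *s++;
            *out++ = *s++;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

Under this rule the %2B in the redirect's Signature parameter is never turned into "+" in the first place, while escaped multibyte sequences such as %C3%A9 are still decoded to raw octets for the charset conversion.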

--
Regards,
- AJ

Attachment: 0001-Fixed-incorrect-handling-of-reserved-characters-when.patch
Description: Text Data

