Re: [PATCH] no_proxy domain matching
From: Tim Rühsen
Subject: Re: [PATCH] no_proxy domain matching
Date: Wed, 20 Nov 2019 18:47:03 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2
On 20.11.19 12:41, Tomas Hozza wrote:
> On 7. 11. 2019 21:30, Tim Rühsen wrote:
>> On 07.11.19 15:21, Tomas Hozza wrote:
>>> Hi.
>>>
>>> In RHEL-8, we ship a wget version that suffers from a bug fixed by [1]. The
>>> fix resolved an issue with matching subdomains when the no_proxy domain
>>> definition was prefixed with a dot, e.g. "no_proxy=.mit.edu". As part of
>>> backporting the fix to RHEL, I wanted to create an upstream test for the
>>> no_proxy functionality. However, I found that there is still one corner case
>>> which is not handled by the current upstream code, and honestly I'm not sure
>>> what the intended domain matching behavior in that case is. The man page is
>>> also not very specific in this regard.
>>>
>>> The corner case is as follows:
>>> - no_proxy=.mit.edu
>>> - download URL is e.g. "http://mit.edu/file1"
>>>
>>> In this case the proxy settings are used, because the domains don't match
>>> due to the leftmost dot in the no_proxy domain definition. This is either
>>> intended, or a corner case that was not considered. One could argue that if
>>> no_proxy is set to ".mit.edu", the leftmost dot means that only subdomains
>>> of "mit.edu" should be excluded from the proxy settings, while the proxy
>>> settings should still apply to the "mit.edu" domain itself. From my point of
>>> view, after reading the wget man page, I don't think that the leftmost dot
>>> in the no_proxy definition has any special meaning.
>>
>> Hello Tomas,
>>
>> hard to decide how to handle this. I personally would like to see a
>> match with curl's behavior (see https://github.com/curl/curl/issues/1208).
>>
>> Given the docs from GNU emacs, you are right. "no_proxy=.mit.edu" means
>> "mit.edu and subdomains" are excluded from proxy settings.
>> (see https://www.gnu.org/software/emacs/manual/html_node/url/Proxies.html)
>>
>> The caveat with emacs' behavior is that you cannot exclude just the
>> subdomains of mit.edu without mit.edu itself. Effectively, that creates
>> a corner case that can't be handled at all. (But if curl also does it
>> that way, let's go for it.)
>>
>> Maybe you can find out about the current no_proxy behavior (regarding the
>> leftmost dot) of typical and widespread tools!? Once we have that
>> information, we can make a confident decision.
>>
>> Regards, Tim
>
> Hi Tim.
>
> It took me some time to go through the current situation and, to be honest,
> it is kind of a mess. While each tool handles the no_proxy environment
> variable a little bit differently, there are some similarities. Nevertheless,
> I was not able to find any standard.
>
> curl's behavior:
> - "no_proxy=.mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - "no_proxy=mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - downside: cannot match only the host; cannot match only the domain and
> subdomains
>
> current wget's behavior:
> - "no_proxy=.mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will NOT match the host "mit.edu"
> - "no_proxy=mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - downside: cannot match only the host
>
> wget's behavior with proposed patch:
> - "no_proxy=.mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - "no_proxy=mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - downside: cannot match only the host; cannot match only the domain and
> subdomains
> - it would be consistent with curl's behavior
>
> emacs's behavior:
> - "no_proxy=.mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - "no_proxy=mit.edu"
> - will NOT match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - downside: cannot match only subdomains
>
> python httplib2's behavior:
> - "no_proxy=.mit.edu"
> - will match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - "no_proxy=mit.edu"
> - will NOT match the domain and subdomains e.g. "www.mit.edu" or
> "www.subdomain.mit.edu"
> - will match the host "mit.edu"
> - downside: cannot match only subdomains
>
> To sum it up: each approach has some downsides. With the change that I
> provided, wget's behavior would be consistent with curl's. However, it would
> have more downsides than it currently has; specifically, it would lose the
> ability to match only the domain and subdomains without matching the host
> itself. Emacs's behavior is similar to Python httplib2's behavior regarding
> the leftmost dot.
>
> Honestly, I have a soft preference for keeping wget's current behavior.
> But I admit that making the behavior consistent with curl's makes
> sense. Please let me know how you would like to proceed.
>
> To make the behavior consistent with curl, the previously attached changes
> should be OK. If you find the new conditions too complicated, I can try to
> rethink them, but I already tried to make them as simple as possible while
> not completely rewriting the function.
>
> If you decide to keep the current behavior, I'll modify the test that I
> added to cope with it.
Great work, Tomas!
Wow, I didn't think it was so messed up :-(
We should definitely document your results, e.g. in the wget manual.
If we keep the current behavior, we could adjust it via a new option or a
new env variable 'WGET_NO_PROXY_MODE', which could take well-defined
values like 'curl', 'emacs', 'wget' (the default), and maybe a new one
('strict') with none of the detected downsides.
It looks a bit over-engineered, but it means that wget can easily adapt to
existing environments. And the code seems pretty straightforward.
Let's see if some more opinions come in.
Regards, Tim