
Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?


From: Micah Cowan
Subject: Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?
Date: Mon, 07 Jun 2010 16:04:27 -0700
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4

On 06/07/2010 01:27 PM, Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> For some value of "quickly". This obviously necessitates extra
>> round-trips to the server. Can still be useful, but still perhaps not as
>> useful as doing URL-matching properly.
> 
> I would prefer an extra round trip to avoid downloading a 2GB file that will
> immediately be ignored and deleted.

Such files very rarely end in ".html" AFAICT, but okay, sure. As I said,
it's useful, but the extra round trip would need to be clearly
documented, and it would be particularly effective when paired with
something that prevents even the first round-trip when it's unnecessary.
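To make that concrete, here is a rough Python sketch of the HEAD probe
(the helper name and the decision logic are mine, purely for
illustration; wget itself does none of this today):

    # Probe the Content-Type via a HEAD request before committing to a
    # full GET. One extra round-trip, but a 2GB body is never fetched.
    from urllib.parse import urlsplit
    import http.client

    def head_content_type(url):
        """Return the Content-Type reported for a HEAD request, or None."""
        parts = urlsplit(url)
        cls = (http.client.HTTPSConnection if parts.scheme == "https"
               else http.client.HTTPConnection)
        conn = cls(parts.netloc, timeout=10)
        try:
            target = (parts.path or "/") + (
                "?" + parts.query if parts.query else "")
            conn.request("HEAD", target)
            return conn.getresponse().getheader("Content-Type")
        finally:
            conn.close()

    ctype = head_content_type("http://example.com/some/link") or ""
    if ctype.split(";")[0].strip() == "text/html":
        pass  # worth a GET: we'd parse it for links
    else:
        pass  # skip it without ever transferring the body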

> While I would love to see proper URL matching in wget, I don't think that
> solves the problem for this use case. I think we want to parse all
> text/html regardless of the URL.

I don't think the second sentence follows the first particularly well.
Just because one wants to control downloads based on content-type does
not imply that one doesn't also want to control by URL. I realize that
there are cases where one wants to trawl an entire website looking to
keep only specific types, and in that case content-type matching fits
the bill. But I've never personally been in that situation. The closest
I've come is wanting to trawl some _portion_ of an entire website, in
which case I want both content-type matching _and_ better URL matching,
which is why I said content-type matching works best when _combined_
with URL-matching.
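Sketching that combination in the same illustrative Python (the reject
pattern is an arbitrary example, and head_content_type() is the helper
from the sketch above; none of this is wget behavior):

    import re

    # Settled by the URL alone: zero round-trips to the server.
    REJECT_URL = re.compile(r"\.(wmv|iso|zip)(\?|$)", re.IGNORECASE)

    def should_fetch(url):
        if REJECT_URL.search(url):
            return False
        # Only URLs the pattern does not settle pay for the extra
        # HEAD round-trip (helper from the previous sketch).
        ctype = head_content_type(url) or ""
        return ctype.split(";")[0].strip() == "text/html"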

And again, this is _particularly_ the case because I rarely ever
encounter a site where you can effectively use content-types in such a
way that URL-matching could not have done the same thing better (without
extra round-trips). This is because, when sites don't advertise content
types via extensions, it's most often because they're hidden behind CGI
scripts or such (foo.php?filename=file-i-don't-want-to-download.wmv),
and those rarely ever respond correctly to HEAD requests. This type of
URL is a great example because it won't be solved by wget's current
leave-off-the-query-string matching behavior, nor by checking HEAD's
content-type, and the only way to avoid downloading it is by improved
URL matching. It _could_ be solved by terminating the connection when we
see that the body of a GET response has a "reject" content-type, but that
is not efficient behavior (especially when we're proxied or using NTLM),
and could contribute to loading the server unnecessarily if we're
repeatedly asking for things we don't intend to accept (though one might
argue they get what's coming to them for not supplying a proper HEAD
response :) ).
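For completeness, that terminate-the-connection fallback looks roughly
like this (illustrative Python again; http.client's getresponse()
returns once the headers arrive, so the body is never read):

    from urllib.parse import urlsplit
    import http.client

    def get_if_html(url):
        parts = urlsplit(url)
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            conn.request("GET", parts.path or "/")
            resp = conn.getresponse()  # headers only so far
            ctype = (resp.getheader("Content-Type") or "")
            if ctype.split(";")[0].strip() != "text/html":
                return None            # reject: abandon the body unread
            return resp.read()         # accept: now transfer the body
        finally:
            conn.close()

Note that the server has still done the work of generating the
response, and some of the body may already be in flight when we close,
which is exactly the inefficiency described above.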

In addition, failing to provide proper URL matching means that Wget
behaves completely inappropriately on CMS-style sites, wikis in
particular. Wget currently has no way to distinguish links like
page.php?action=logout, page.php?action=delete, or
page.php?perform-some-cpu-intensive-transformation-to-pdf-or-whatnot
from ordinary page links, which is a pretty major gap. Most of the
major wikis supply robots rules
that prevent this, but not all, and those robots rules may contain
other, less-appropriate bans, because after all they're intended for
robots, not user agents.
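A toy illustration of the gap (the reject rule here is hypothetical,
not an existing wget option):

    import re

    # Matching against the complete URL, query string included.
    REJECT_ACTIONS = re.compile(r"[?&]action=(logout|delete)\b")

    for url in ("page.php?action=view",
                "page.php?action=logout",
                "page.php?action=delete"):
        verdict = "reject" if REJECT_ACTIONS.search(url) else "fetch"
        # Query-stripping matching sees only "page.php" for all three,
        # so it cannot make this distinction at all.
        print(url, "->", verdict)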

-- 
Micah J. Cowan
http://micah.cowan.name/


