
Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?


From: Micah Cowan
Subject: Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?
Date: Mon, 07 Jun 2010 12:12:01 -0700
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4

On 06/07/2010 08:41 AM, Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Yeah, that was the original thinking. But I still hate it. For one
>> thing, there are no longer any guarantees that recurse-able HTML files
>> end in ".html"
> 
> There are a bunch of suffixes that are actively used for HTML plus there is
> no reason that one has to include a suffix at all. Furthermore, the
> existence of a .html suffix is no guarantee that the file really contains
> HTML.

Exactly.

>> It's better to let you explicitly specify what files to download
> 
> I think an option that says "spider the site and save any PDF files that you
> find" is useful. It's a matter of figuring out a meaningful way to implement
> "spider the site" for this scenario.

Of course it's useful. It just shouldn't be the only possible mode of
operating. That's exactly why I said we should split off the
accept/reject and "download this, but only to parse it" bits, because
right now the "download/parse" part is hardwired to always happen for
".htm/.html" files, and only for those files, which is nearing
uselessness, for exactly the reasons you state in the first quote-block
above.
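
To make the proposed split concrete, here is a minimal sketch in Python (not Wget code). It assumes two independent, user-supplied pattern lists, one for "save this" and one for "download this, but only to parse it". The class name, field names, and glob patterns are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    # Hypothetical names: glob patterns matched against the URL path only
    # (the query string, e.g. "?lang=ja", is ignored here).
    save_patterns: list = field(default_factory=list)
    parse_patterns: list = field(default_factory=list)

    def _matches(self, url, patterns):
        path = urlsplit(url).path
        return any(fnmatchcase(path, pattern) for pattern in patterns)

    def should_save(self, url):
        """Keep the downloaded file on disk."""
        return self._matches(url, self.save_patterns)

    def should_parse(self, url):
        """Download the file to extract links, whether or not it is kept."""
        return self._matches(url, self.parse_patterns)

    def should_fetch(self, url):
        """A URL is worth requesting if either rule set wants it."""
        return self.should_save(url) or self.should_parse(url)


# "Spider the site and save any PDF files that you find":
policy = CrawlPolicy(
    save_patterns=["*.pdf"],
    parse_patterns=["*.html", "*.htm", "*.php", "*/"],
)
```

Because the two lists are independent, the set of files that get parsed no longer has to be tied to ".htm/.html", and a file can be parsed without being saved or saved without being parsed.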

> I wonder if it would make more sense to look at the Content-Type header and
> only parse "text/html" files. By using HEAD, you can quickly ignore files
> that don't need to be parsed.

For some value of "quickly". This obviously necessitates extra
round-trips to the server. It can still be useful, though perhaps not as
useful as doing URL-matching properly. In particular, it would work best
when _combined_ with proper URL-matching, so that you could dictate
which files shouldn't even be bothered with a HEAD (why bother to see if
a *.pdf file has content-type text/html?).
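
A rough sketch of that combination, again in Python rather than anything Wget does today: URL patterns are consulted first, and a HEAD request is only sent for URLs the patterns leave undecided. The skip list, the function names, and the injectable `probe` parameter are assumptions made for the example.

```python
import urllib.request
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Illustrative skip list: URLs matching these are never HEAD-probed.
NEVER_PARSE = ["*.pdf", "*.png", "*.jpg", "*.zip"]


def head_content_type(url):
    """Send a HEAD request and return the raw Content-Type header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")


def needs_parsing(url, probe=head_content_type):
    """Decide whether a URL should be downloaded and parsed for links."""
    path = urlsplit(url).path
    # Step 1: URL matching, which costs no round-trip.
    if any(fnmatchcase(path, pattern) for pattern in NEVER_PARSE):
        return False
    # Step 2: only now pay for the extra round-trip.
    content_type = probe(url) or ""
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() == "text/html"
```

Passing the probe in as a parameter keeps the decision logic testable without a network, and leaves room to swap in a different strategy for servers that mishandle HEAD.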

It's made even less useful by the fact that so many servers botch HEAD
completely. Providing errors on HEAD is one problem, but the bigger
problem is servers that provide _erroneous_ responses to HEAD requests.
But there are enough servers that get it right to make this a worthwhile
feature, so long as we document the fact that it takes extra round-trips
(too bad there's no If-Content-Type header in HTTP/1.1 :) ).

-- 
Micah J. Cowan
http://micah.cowan.name/


