Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?

From:	Micah Cowan
Subject:	Re: [Bug-wget] How to ignore link like "index.html?lang=ja"?
Date:	Mon, 07 Jun 2010 12:12:01 -0700
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4

On 06/07/2010 08:41 AM, Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Yeah, that was the original thinking. But I still hate it. For one
>> thing, there are no longer any guarantees that recurse-able HTML files
>> end in ".html"
> 
> There are a bunch of suffixes that are actively used for HTML plus there is
> no reason that one has to include a suffix at all. Furthermore, the
> existence of a .html suffix is no guarantee that the file really contains
> HTML.

Exactly.

>> It's better to let you explicitly specify what files to download
> 
> I think an option that says "spider the site and save any PDF files that you
> find" is useful. It's a matter of figuring out a meaningful way to implement
> "spider the site" for this scenario.

Of course it's useful. It just shouldn't be the only possible mode of
operating. That's exactly why I said we should split off the
accept/reject and "download this, but only to parse it" bits, because
right now the "download/parse" part is hardwired to always happen for
".htm/.html" files, and only for those files, which is nearing
uselessness, for exactly the reasons you state in the first quote-block
above.
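
To make the split concrete, here is a rough sketch in Python (not wget code) of keeping the two decisions separate: one pattern list says what to save, another says what to fetch only so it can be parsed for links. The pattern lists and the names should_save, should_parse and plan are all made up for illustration; none of them corresponds to an existing wget option.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical pattern lists, invented for this sketch.
SAVE_PATTERNS = ["*.pdf"]                            # files the user wants kept
PARSE_PATTERNS = ["*.html", "*.htm", "*.php", "*/"]  # files worth parsing for links


def _path(url):
    # Match on the path only, so a query string such as "?lang=ja"
    # does not hide the suffix.
    return urlsplit(url).path or "/"


def should_save(url):
    return any(fnmatchcase(_path(url), p) for p in SAVE_PATTERNS)


def should_parse(url):
    return any(fnmatchcase(_path(url), p) for p in PARSE_PATTERNS)


def plan(url):
    """Combine the two independent decisions into one action."""
    save, parse = should_save(url), should_parse(url)
    if save and parse:
        return "save+parse"
    if save:
        return "save"
    if parse:
        return "parse-only"  # download, extract links, then discard
    return "skip"
```

With these lists, plan("http://example.com/index.html?lang=ja") gives "parse-only" and plan("http://example.com/docs/paper.pdf") gives "save", which is roughly the "spider the site and save any PDF files" case without tying the parse step to a fixed suffix.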

> I wonder if it would make more sense to look at the Content-Type header and
> only parse "text/html" files. By using HEAD, you can quickly ignore files
> that don't need to be parsed.

For some value of "quickly". This obviously necessitates extra
round-trips to the server. It can still be useful, though perhaps not as
useful as doing URL-matching properly. In particular, it would work best
when _combined_ with proper URL-matching, so that you could dictate
which files shouldn't even be bothered with a HEAD (why bother to see if
a *.pdf file has content-type text/html?).
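
As a rough illustration of the combination, the Python sketch below (again, not wget code) checks the URL against a list of patterns first and only spends a HEAD round-trip on URLs that the patterns cannot rule out. The pattern list and the function names are assumptions made up for the example; the probe is passed in as a parameter so the decision logic can be exercised without a network.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
import urllib.request

# Assumed list of suffixes that are never worth a HEAD request.
SKIP_HEAD_PATTERNS = ["*.pdf", "*.png", "*.jpg", "*.zip"]


def head_content_type(url, timeout=10):
    """Issue a HEAD request; return the Content-Type header, or None on failure."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.headers.get("Content-Type")
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return None


def is_html(content_type):
    """True if the header value names text/html, ignoring parameters such as charset."""
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def should_parse(url, probe=head_content_type):
    path = urlsplit(url).path.lower()
    if any(fnmatchcase(path, p) for p in SKIP_HEAD_PATTERNS):
        return False  # ruled out by URL alone, no round-trip spent
    return is_html(probe(url))
```

In this sketch a failed HEAD is treated the same as "not HTML", which is only one possible policy; a real implementation might prefer to fall back to a GET instead.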

It's made even less useful by the fact that so many servers botch HEAD
completely. Returning errors for HEAD is one problem, but the bigger
problem is servers that provide _erroneous_ responses to HEAD requests.
But there are enough servers that get it right to make this a worthwhile
feature, so long as we document the fact that it takes extra round-trips
(too bad there's no If-Content-Type header in HTTP/1.1 :) ).

-- 
Micah J. Cowan
http://micah.cowan.name/
