bug-wget

RE: [Bug-wget] How to ignore link like "index.html?lang=ja"?


From: Tony Lewis
Subject: RE: [Bug-wget] How to ignore link like "index.html?lang=ja"?
Date: Mon, 7 Jun 2010 08:41:10 -0700

Micah Cowan wrote:

> Yeah, that was the original thinking. But I still hate it. For one
> thing, there are no longer any guarantees that recurse-able HTML files
> end in ".html"

There are a bunch of suffixes that are actively used for HTML, and there is
no reason that a URL has to include a suffix at all. Furthermore, the
existence of a .html suffix is no guarantee that the file really contains
HTML.

> It's better to let you explicitly specify what files to download

I think an option that says "spider the site and save any PDF files that you
find" is useful. It's a matter of figuring out a meaningful way to implement
"spider the site" for this scenario.

I wonder if it would make more sense to look at the Content-Type header and
only parse "text/html" files. By using HEAD, you can quickly ignore files
that don't need to be parsed.
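
Roughly, the check I have in mind could look like the Python sketch below.
This is only an illustration of the idea, not how wget does or would
implement it; the names is_html and should_parse are made up for the
example, and it assumes the server answers HEAD with the same Content-Type
it would send for GET.

```python
import urllib.request


def is_html(content_type):
    """Return True if a Content-Type header value names text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def should_parse(url, timeout=10):
    """Send a HEAD request (hypothetical helper) and report whether the
    response claims to be HTML, without downloading the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_html(response.headers.get("Content-Type"))
```

A crawler built this way would follow links only in documents where
should_parse() is true, and could still save other files (PDFs, say)
without trying to parse them.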

Tony




