Re: [Bug-wget] just download HTML content
From: Micah Cowan
Subject: Re: [Bug-wget] just download HTML content
Date: Sun, 28 Jun 2009 15:08:31 -0700
User-agent: Thunderbird 2.0.0.22 (X11/20090608)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Richard Baron Penman wrote:
> hello,
>
> When mirroring a website how do I just download HTML content (whether
> static, PHP, ASP, etc) and ignore images, css, js, and everything else?
> At first I thought of creating an accept list, but I can't rely on the file
> extension because many HTML pages do not include an extension (eg
> http://en.wikipedia.org/wiki/Foo)
> Then I thought of a reject list, but there are so many different kinds of
> non-HTML content.
>
> Is there a way to do this with wget?
Not really... at some point we'd like to supply content-type-based
accept/reject options, but this will also tend to increase the amount of
traffic, as we'd have to send extra requests to determine the content
type. Perhaps a robust version of it would use a mixture of heuristics
(e.g., when a filename extension exists, make assumptions about the
content-type)...
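
To make the idea of mixing heuristics with content-type checks concrete, here is a minimal sketch in Python. It is not something wget does, and every name in it (looks_like_html, head_content_type, DYNAMIC_EXTS) is hypothetical. It trusts the filename extension when one is recognised, and only falls back to an extra HEAD request (the additional traffic mentioned above) when the URL gives no hint. The list of server-side extensions is an assumption for illustration.

```python
import mimetypes
import urllib.request
from urllib.parse import urlparse

HTML_TYPES = {"text/html", "application/xhtml+xml"}
# Assumption: these server-side extensions normally produce HTML pages.
DYNAMIC_EXTS = {".php", ".asp", ".aspx", ".jsp"}


def head_content_type(url):
    """Ask the server for the Content-Type using a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get_content_type()


def looks_like_html(url, probe=head_content_type):
    """Guess whether a URL serves HTML, probing the server only if needed."""
    path = urlparse(url).path
    dot, slash = path.rfind("."), path.rfind("/")
    # Only treat a dot in the last path segment as starting an extension.
    ext = path[dot:].lower() if dot > slash else ""
    if ext in DYNAMIC_EXTS:
        return True
    guessed, _ = mimetypes.guess_type(path)
    if guessed is not None:
        # Extension is known: make an assumption, send no extra request.
        return guessed in HTML_TYPES
    # No usable extension (e.g. /wiki/Foo): pay for one extra request.
    return probe(url) in HTML_TYPES
```

With this sketch, a URL such as http://en.wikipedia.org/wiki/Foo costs one HEAD request, while something ending in .png or .css is rejected without contacting the server. The extension guess can of course be wrong, which is why it is only a heuristic.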
- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkpH6d8ACgkQ7M8hyUobTrF+xwCeOAlZEyfV2ranXEYJRIYTlHnn
pBwAn3B4BURi0sUCW/gpdMrR5JMcgmv6
=lnUH
-----END PGP SIGNATURE-----