bug-wget

Re: [Bug-wget] just download HTML content


From: Micah Cowan
Subject: Re: [Bug-wget] just download HTML content
Date: Sun, 28 Jun 2009 15:08:31 -0700
User-agent: Thunderbird 2.0.0.22 (X11/20090608)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Richard Baron Penman wrote:
> hello,
> 
> When mirroring a website how do I just download HTML content (whether
> static, PHP, ASP, etc) and ignore images, css, js, and everything else?
> At first I thought of creating an accept list, but I can't rely on the file
> extension because many HTML pages do not include an extension (eg
> http://en.wikipedia.org/wiki/Foo)
> Then I thought of a reject list, but there are so many different kinds of
> non-HTML content.
> 
> Is there a way to do this with wget?

Not really. At some point we'd like to supply content-type-based
accept/reject options, but these would tend to increase the amount of
traffic, since we'd have to send extra requests to determine the content
type. A robust version would probably use a mixture of heuristics
(e.g., when a filename extension exists, make assumptions about the
content type from it) alongside those extra requests.
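
To make the mixed approach concrete, here is a rough sketch in Python. It
is not a wget feature and not how wget works internally; the names
looks_like_html and head_content_type, and the set of "dynamic page"
extensions, are assumptions made up for illustration. The idea is to trust
the filename extension when it is recognizable, and to pay for an extra
HEAD request only when it is not.

```python
import mimetypes
import posixpath
import urllib.request
from urllib.parse import urlsplit

# Media types treated as "HTML content".
HTML_TYPES = {"text/html", "application/xhtml+xml"}

# Assumption: server-side page extensions that normally produce HTML.
DYNAMIC_HTML_EXTS = {".php", ".asp", ".aspx", ".jsp", ".cgi"}


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the bare media type, e.g. 'text/html'."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get_content_type()


def looks_like_html(url, probe=head_content_type):
    """Guess whether url serves HTML, probing the server only if needed."""
    path = urlsplit(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()

    # Heuristic 1: known dynamic-page extension, assume HTML.
    if ext in DYNAMIC_HTML_EXTS:
        return True

    # Heuristic 2: extension maps to a known media type, no request needed.
    if ext:
        guessed, _encoding = mimetypes.guess_type(path)
        if guessed is not None:
            return guessed in HTML_TYPES

    # Fallback: no usable extension, so ask the server (the extra traffic).
    return probe(url) in HTML_TYPES
```

With this sketch, a URL such as http://en.wikipedia.org/wiki/Foo has no
extension and costs one HEAD request, while a URL ending in .png or .css
is decided locally. The probe parameter exists so that the network call
can be swapped out, for example when testing. A server may of course
answer HEAD incorrectly or not at all, which the sketch does not handle.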

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpH6d8ACgkQ7M8hyUobTrF+xwCeOAlZEyfV2ranXEYJRIYTlHnn
pBwAn3B4BURi0sUCW/gpdMrR5JMcgmv6
=lnUH
-----END PGP SIGNATURE-----



