Re: [Bug-wget] just download HTML content

From:	Micah Cowan
Subject:	Re: [Bug-wget] just download HTML content
Date:	Sun, 28 Jun 2009 15:08:31 -0700
User-agent:	Thunderbird 2.0.0.22 (X11/20090608)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Richard Baron Penman wrote:
> hello,
> 
> When mirroring a website how do I just download HTML content (whether
> static, PHP, ASP, etc) and ignore images, css, js, and everything else?
> At first I thought of creating an accept list, but I can't rely on the file
> extension because many HTML pages do not include an extension (eg
> http://en.wikipedia.org/wiki/Foo)
> Then I thought of a reject list, but there are so many different kinds of
> non-HTML content.
> 
> Is there a way to do this with wget?

Not really... at some point we'd like to supply content-type-based
accept/reject options, but this will also tend to increase the amount of
traffic, as we'd have to send extra requests to determine the content
type. Perhaps a robust version of it would use a mixture of heuristics
(e.g., when a filename extension exists, make assumptions about the
content-type)...

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpH6d8ACgkQ7M8hyUobTrF+xwCeOAlZEyfV2ranXEYJRIYTlHnn
pBwAn3B4BURi0sUCW/gpdMrR5JMcgmv6
=lnUH
-----END PGP SIGNATURE-----
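
The mixed heuristic described in the reply above (trust the filename
extension when there is one, and only spend an extra request when there is
not) could be sketched roughly as follows. This is an illustration in
Python, not wget code and not an existing wget option: the extension lists
and the names is_probably_html and head_content_type are assumptions made
up for this example.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Assumed extension lists, for illustration only; not taken from wget.
HTML_EXTENSIONS = {".html", ".htm", ".xhtml", ".php", ".asp", ".aspx", ".jsp"}
NON_HTML_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".css", ".js",
                       ".ico", ".pdf", ".zip"}


def head_content_type(url):
    """Ask the server for the Content-Type via a HEAD request (extra traffic)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get_content_type()


def is_probably_html(url, probe=head_content_type):
    """Guess from the extension when one exists; otherwise probe the server."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in HTML_EXTENSIONS:
        return True
    if extension in NON_HTML_EXTENSIONS:
        return False
    # No extension, or an unrecognized one: fall back to asking the server.
    return probe(url) in ("text/html", "application/xhtml+xml")
```

A URL such as http://en.wikipedia.org/wiki/Foo has no extension, so it is
the case that costs the extra HEAD request; URLs ending in .php or .png are
decided locally. The probe parameter exists only so the network call can be
swapped out, for example when testing.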

