bug-wget

Re: [Bug-wget] text/html


From: Tim Ruehsen
Subject: Re: [Bug-wget] text/html
Date: Wed, 3 Jul 2013 10:40:10 +0200
User-agent: KMail/1.13.7 (Linux/3.9-1-amd64; KDE/4.8.4; x86_64; ; )

On Wednesday 03 July 2013, address@hidden wrote:
> Is there a means of saying that I want _only_ pages of MIME type "text/html",
> whatever the extension?

I guess you are talking about recursive retrieval.

You could use a two-pass method using some scripting.

1. Create a list of URLs together with their Content-Type information.
2. Retrieve only the wanted URLs.

Step 1 could look something like this (it is really naive!):
wget -d --spider -r www.example.com 2>&1 |
  grep -Ei '^Dequeuing|^Content-Type:' |
  grep -A1 '^Deq' | cut -d' ' -f2 |
  grep -B1 '^text/html' | grep -v '^text/html' >my_urls.txt

Step 2 would be wget -i my_urls.txt

I guess a little awk or Perl script would be more elegant.
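
As a rough idea of what such a script might look like, here is a minimal awk
sketch wrapped in a shell function. It is untested against real sites and
rests on the same assumptions as the pipeline above: that the debug output
contains lines of the form "Dequeuing <URL> ..." and "Content-Type: <type>",
and that the first Content-Type line after a Dequeuing line belongs to that
URL. The function name html_urls is made up for this example.

```shell
# Hypothetical helper: reads `wget -d --spider -r` output on stdin and prints
# every dequeued URL whose first following Content-Type header is text/html.
html_urls() {
  awk '
    # Remember the URL that wget is about to fetch (second field of the line).
    /^Dequeuing / { url = $2; next }

    # First Content-Type header after a Dequeuing line decides the URL.
    tolower($0) ~ /^content-type:/ && url != "" {
      if (tolower($2) ~ /^text\/html/) print url
      url = ""   # ignore further Content-Type lines for the same URL
    }
  '
}
```

It would replace the grep/cut chain in step 1:
wget -d --spider -r www.example.com 2>&1 | html_urls >my_urls.txt

Unlike the grep version, this should not leave the "--" group separators that
grep -A/-B can emit in the URL list.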

Regards, Tim


