Re: [Bug-wget] text/html
From: Tim Ruehsen
Subject: Re: [Bug-wget] text/html
Date: Wed, 3 Jul 2013 10:40:10 +0200
User-agent: KMail/1.13.7 (Linux/3.9-1-amd64; KDE/4.8.4; x86_64; ; )
On Wednesday, 3 July 2013, address@hidden wrote:
> Is there a means of saying that I want _only_ pages of MIME type "text/html",
> whatever the extension?
I guess you are talking about recursive retrieval.
You could use a two-pass method with some scripting:
1. Create a list of URLs together with their Content-Type information.
2. Retrieve only the wanted URLs.
Step 1 is something (this is really naive!) like this:
wget -d --spider -r www.example.com 2>&1 |
  egrep -i '^Dequeuing|^Content-Type:' |
  grep -A1 '^Deq' | cut -d' ' -f2 |
  grep -B1 '^text/html' | grep -v '^text/html' \
  > my_urls.txt
Step 2 would then be: wget -i my_urls.txt
I guess a little awk or Perl script would be more elegant.
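Such an awk filter might look like the sketch below. It replaces the grep/cut chain above: it remembers each "Dequeuing" URL and prints it only when the following Content-Type line says text/html. The exact field layout of wget's debug lines is an assumption here, matching what the pipeline above relies on; the function name filter_html_urls is made up for the example.

```shell
#!/bin/sh
# Sketch of the awk filter mentioned above (hypothetical; the field layout
# of wget's "Dequeuing" debug output is an assumption).
filter_html_urls() {
  awk '
    /^Dequeuing/ { url = $2; next }                        # remember the dequeued URL
    /^Content-Type: text\/html/ && url != "" { print url } # keep HTML pages only
    /^Content-Type:/ { url = "" }                          # reset after any type line
  '
}

# Usage: wget -d --spider -r www.example.com 2>&1 | filter_html_urls > my_urls.txt
# Demonstration on canned debug-style output (prints only the HTML URL):
printf '%s\n' \
  'Dequeuing http://www.example.com/ at depth 0' \
  'Content-Type: text/html; charset=UTF-8' \
  'Dequeuing http://www.example.com/logo.png at depth 1' \
  'Content-Type: image/png' | filter_html_urls
```

Unlike the grep -A1/-B1 version, this keeps no "--" group separators in the output, so the resulting file can be fed to wget -i directly.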
Regards, Tim