Re: [Bug-wget] text/html
From: Tim Ruehsen
Subject: Re: [Bug-wget] text/html
Date: Wed, 3 Jul 2013 10:40:10 +0200
User-agent: KMail/1.13.7 (Linux/3.9-1-amd64; KDE/4.8.4; x86_64; ; )
On Wednesday, 3 July 2013, address@hidden wrote:
> Is there a means of saying that I want _only_ pages of MIME type "text/html",
> whatever the extension?
I guess you are talking about recursive retrieval.
You could use a two-pass method with some scripting:
1. Create a list of URLs together with their Content-Type information.
2. Retrieve only the wanted URLs.
Step 1 is something (this is really naive!) like this:
wget -d --spider -r www.example.com 2>&1 |
  egrep -i '^Dequeuing|^Content-Type:' |
  grep -A1 '^Deq' | cut -d' ' -f2 |
  grep -B1 '^text/html' | grep -v '^text/html' \
  > my_urls.txt
Step 2 would then be: wget -i my_urls.txt
I guess a little awk or Perl script would be more elegant.
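Such an awk filter might look like the sketch below. It replaces the grep/cut chain above: it remembers each "Dequeuing" URL and prints it only when the following Content-Type line says text/html. The exact field layout of wget's debug lines is an assumption here, matching what the pipeline above relies on; the function name filter_html_urls is made up for the example.

```shell
#!/bin/sh
# Sketch of the awk filter mentioned above (hypothetical; the field layout
# of wget's "Dequeuing" debug output is an assumption).
filter_html_urls() {
  awk '
    /^Dequeuing/ { url = $2; next }                        # remember the dequeued URL
    /^Content-Type: text\/html/ && url != "" { print url } # keep HTML pages only
    /^Content-Type:/ { url = "" }                          # reset after any type line
  '
}

# Usage: wget -d --spider -r www.example.com 2>&1 | filter_html_urls > my_urls.txt
# Demonstration on canned debug-style output (prints only the HTML URL):
printf '%s\n' \
  'Dequeuing http://www.example.com/ at depth 0' \
  'Content-Type: text/html; charset=UTF-8' \
  'Dequeuing http://www.example.com/logo.png at depth 1' \
  'Content-Type: image/png' | filter_html_urls
```

Unlike the grep -A1/-B1 version, this keeps no "--" group separators in the output, so the resulting file can be fed to wget -i directly.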
Regards, Tim