[Bug-wget] fetching directory listing in spider mode
From: Marjorie
Subject: [Bug-wget] fetching directory listing in spider mode
Date: Wed, 21 Aug 2013 23:14:15 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
Hello everyone,
I am currently working on a little tool to produce sitemaps. I have been
using wget with the --spider option, which does the job *almost* perfectly:
it performs a HEAD request first, then a GET request only when the
Content-Type is HTML or similar.
I currently have a test site (say test.com) set up like this:
/index.html (default page)
/robots.txt
/images
/images/image1.jpg
/images/hidden-image.jpg
So this command line:
wget -r --spider http://test.com
produces the following result:
"HEAD / HTTP/1.0" 200 - "-" "Wget/1.12 (linux-gnu)"
"GET / HTTP/1.0" 200 402 "-" "Wget/1.12 (linux-gnu)"
"GET /robots.txt HTTP/1.0" 200 38 "-" "Wget/1.12 (linux-gnu)"
"HEAD /images/image1.jpg HTTP/1.0" 200 - "http://test.com/" "Wget/1.12
(linux-gnu)"
Wget has parsed the default page (index.html) and found the file image1.jpg.
However, I would also like wget to recursively read the directory
http://test.com/images, which is browsable, so that it also discovers
hidden-image.jpg...
Indeed, with this command line:
wget -r --spider http://test.com/images
it does list all the files contained in that images folder.
So my question is this: is there a way to force wget to try browsing
*every* directory found during the crawl, starting from the root URL
(http://test.com)?
The aim is, of course, to discover as many files as possible, including
those not linked from any page.
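For what it's worth, the workaround I have been considering is to post-process the crawl output: reduce every discovered URL to its enclosing directory, then feed those directories back to wget for a second spider pass. A rough sketch (urls.txt and dirs.txt are just placeholder names, and the URL list stands in for what the crawl log would yield):

```shell
# Sketch of a possible workaround (not a built-in wget feature):
# take the URLs wget discovered, reduce each one to its enclosing
# directory, and spider those directories in a second pass.
printf '%s\n' \
  'http://test.com/index.html' \
  'http://test.com/images/image1.jpg' > urls.txt

# Strip the last path component so each URL becomes its directory,
# then de-duplicate.
sed 's|/[^/]*$|/|' urls.txt | sort -u > dirs.txt

# The second pass would then be something like:
#   wget -r --spider -i dirs.txt
cat dirs.txt
```

This only reaches directories that already appear in some discovered URL, so it would find hidden-image.jpg via /images/ but not a directory that is never referenced at all.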
Thanks a lot for the insight.
Marj