
[Bug-wget] fetching directory listing in spider mode


From: Marjorie
Subject: [Bug-wget] fetching directory listing in spider mode
Date: Wed, 21 Aug 2013 23:14:15 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8

Hello everyone,

I am currently working on a little tool to produce sitemaps. I have been using wget with the --spider option which does the job *almost* perfectly: it performs HEAD requests first, then GET requests only when the content-type is HTML or similar.

I currently have a test site (say test.com) set up like this:

/index.html (default page)
/robots.txt
/images
    /images/image1.jpg
    /images/hidden-image.jpg

So this command line:
    wget  -r --spider http://test.com
produces the following result:

"HEAD / HTTP/1.0" 200 - "-" "Wget/1.12 (linux-gnu)"
"GET / HTTP/1.0" 200 402 "-" "Wget/1.12 (linux-gnu)"
"GET /robots.txt HTTP/1.0" 200 38 "-" "Wget/1.12 (linux-gnu)"
"HEAD /images/image1.jpg HTTP/1.0" 200 - "http://test.com/" "Wget/1.12 (linux-gnu)"

Wget has parsed the default page (index.html) and found the file image1.jpg.
However, I would like wget to also recursively read the directory http://test.com/images, which is browsable, so that it also discovers hidden-image.jpg.

If I point wget directly at that directory with this command line:
    wget  -r --spider http://test.com/images
it does list all the files contained in that images folder.

So my question is this: is there a way to force wget to try browsing *every* directory found during the crawl, starting from the root URL (http://test.com)?

The aim is of course to discover as many files as possible, including those not linked from any page.
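
One possible workaround, given here only as a minimal sketch and not as a wget feature: take the URLs found by the first crawl, derive every ancestor directory URL from them, and feed those directories back to wget as extra starting points. The function name directory_urls is hypothetical, and the sketch assumes the server returns a browsable listing when a directory URL is requested.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_urls(urls):
    """Return the sorted set of ancestor directory URLs of the given URLs.

    Hypothetical helper: it only manipulates URL strings and makes no
    network requests.
    """
    dirs = set()
    for url in urls:
        parts = urlsplit(url)
        # Path segments between the leading "/" and the final component
        # (the file name, or the empty string after a trailing "/").
        segments = parts.path.split("/")[1:-1]
        path = "/"
        # The site root is always a candidate directory.
        dirs.add(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        for segment in segments:
            path += segment + "/"
            dirs.add(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return sorted(dirs)


if __name__ == "__main__":
    # URL taken from the access log above.
    for directory in directory_urls(["http://test.com/images/image1.jpg"]):
        print(directory)
```

The printed URLs could be saved to a file (say dirs.txt, a made-up name) and passed to a second run such as wget -r --spider -i dirs.txt. Whether that run discovers hidden-image.jpg depends entirely on the server exposing a directory index for /images/.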

Thanks a lot for the insight.

Marj


