[Bug-wget] fetching directory listing in spider mode
From: Marjorie
Subject: [Bug-wget] fetching directory listing in spider mode
Date: Wed, 21 Aug 2013 23:14:15 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
Hello everyone,
I am currently working on a little tool to produce sitemaps. I have been
using wget with the --spider option, which does the job *almost* perfectly:
it performs a HEAD request first, then a GET request only when the
Content-Type is HTML or similar.
I currently have a test site (say test.com) set up like this:
/index.html (default page)
/robots.txt
/images
/images/image1.jpg
/images/hidden-image.jpg
So this command line:
wget -r --spider http://test.com
produces the following result:
"HEAD / HTTP/1.0" 200 - "-" "Wget/1.12 (linux-gnu)"
"GET / HTTP/1.0" 200 402 "-" "Wget/1.12 (linux-gnu)"
"GET /robots.txt HTTP/1.0" 200 38 "-" "Wget/1.12 (linux-gnu)"
"HEAD /images/image1.jpg HTTP/1.0" 200 - "http://test.com/" "Wget/1.12
(linux-gnu)"
Wget has parsed the default page (index.html) and found the file image1.jpg.
However, I would also like wget to recursively read the directory
http://test.com/images, which is browsable, so that it also discovers
hidden-image.jpg...
Indeed, with this command line:
wget -r --spider http://test.com/images
it does list all the files contained in that images folder.
So my question is this: is there a way to force wget to try browsing
*every* directory found during the crawl, starting from the root URL
(http://test.com)?
The aim is, of course, to discover as many files as possible, including
those not linked from any page.
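For what it's worth, the workaround I have been considering is to post-process the crawl output: reduce every discovered URL to its enclosing directory, then feed those directories back to wget for a second spider pass. A rough sketch (urls.txt and dirs.txt are just placeholder names, and the URL list stands in for what the crawl log would yield):

```shell
# Sketch of a possible workaround (not a built-in wget feature):
# take the URLs wget discovered, reduce each one to its enclosing
# directory, and spider those directories in a second pass.
printf '%s\n' \
  'http://test.com/index.html' \
  'http://test.com/images/image1.jpg' > urls.txt

# Strip the last path component so each URL becomes its directory,
# then de-duplicate.
sed 's|/[^/]*$|/|' urls.txt | sort -u > dirs.txt

# The second pass would then be something like:
#   wget -r --spider -i dirs.txt
cat dirs.txt
```

This only reaches directories that already appear in some discovered URL, so it would find hidden-image.jpg via /images/ but not a directory that is never referenced at all.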
Thanks a lot for the insight.
Marj