[Bug-wget] Filter question: Downloading only L2 and deeper?


From: wesley
Subject: [Bug-wget] Filter question: Downloading only L2 and deeper?
Date: Fri, 16 Dec 2011 03:46:47 +0000

I'm trying to figure out whether there is any way to set up directory includes/excludes or filters to recursively download files that are at level 2 or deeper from the base URL while dropping any non-HTML L1 links.

In other words, if I pass wget http://example.com/stuff/, I only want to download files for which --include-directories=/stuff/*/* holds true.
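
Concretely, the invocation I have in mind looks roughly like this (example.com and /stuff/ are of course stand-ins for the real site):

    wget --recursive \
         --include-directories='/stuff/*/*' \
         http://example.com/stuff/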

The first problem I run into with --include-directories=/stuff/*/* is that when wget fetches the index at example.com/stuff/ it drops it from the queue, and thus never recurses into the subdirectories I'm interested in. The second issue is that the index pages at '/' boundaries are all dynamically generated: there is no "index.html" or file extension I can add as a filter rule (unless there is a syntax for doing so that I'm not aware of).
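
The obvious patch of also listing the top directory doesn't seem to get me there either, since (as far as I can tell) a wildcard-free entry like '/stuff/' pulls everything beneath it back in:

    # Keeps the recursion going past /stuff/, but the plain '/stuff/' entry also
    # seems to re-admit every L1 file I was trying to skip in the first place.
    wget --recursive --include-directories='/stuff/,/stuff/*/*' http://example.com/stuff/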

And while I'm on the topic, just to be clear: --accept="/stuff/*.html" is not valid syntax, correct? As I understand it, accept filters don't take path components; they only operate on the filename.
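
That is, my reading of the manual is that -A/--accept takes comma-separated suffixes or shell-style patterns and tests them against the file name alone, so a pattern containing '/' can never do anything useful:

    # Matches on file names only, e.g. keep *.html and *.pdf anywhere in the crawl:
    wget --recursive --accept='html,pdf' http://example.com/stuff/

    # Presumably meaningless: the pattern is tested against the bare file name,
    # which never contains a '/'.
    wget --recursive --accept='/stuff/*.html' http://example.com/stuff/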

What I'm trying to accomplish could easily be solved if there were a way to combine path + filename filters into atomic groupings (or with full URL regex parsing :). In the meantime, if there is any hackish way to accomplish what I'm trying to do, I would appreciate any pointers in the right direction. This all came about because I have already done a very large crawl at L1 and would now like to continue the crawl from the L2 links and deeper; I don't want to wait on tens of thousands of HEAD requests for files I already know are up to date just to be able to reach the L2+ links.
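
For the sake of illustration, the sort of hack I would happily settle for is something like the sketch below: seed a fresh crawl from a list of the L2 directory URLs (built by hand, or scraped from the L1 pages already on disk) so the L1 files never enter the queue at all. The file name l2-dirs.txt is made up for the example.

    # l2-dirs.txt holds one starting URL per line, e.g.
    #   http://example.com/stuff/<subdir>/
    # --no-parent keeps each sub-crawl confined below its own starting directory;
    # --timestamping is optional and just skips anything already up to date locally.
    wget --recursive --no-parent --timestamping \
         --input-file=l2-dirs.txt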

Thanks :)


