Re: [Bug-wget] How do I tell wget not to follow links in a file?


From: David Skalinder
Subject: Re: [Bug-wget] How do I tell wget not to follow links in a file?
Date: Thu, 7 Apr 2011 23:46:11 +0100
User-agent: SquirrelMail/1.4.21

> On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
>> "David Skalinder" <address@hidden> writes:
>>
>>>> I want to mirror part of a website that contains two links pages,
>>>> each of which contains links to many root-level directories and also
>>>> to the other links page.  I want to download recursively all the
>>>> links from one links page, but not from the other: that is, I want
>>>> to tell wget "download links1 and follow all of its links, but do
>>>> not download or follow links from links2".
>>>>
>>>> I've put a demo of this problem up at http://fangjaw.com/wgettest --
>>>> there is a diagram there that might state the problem more clearly.
>>>>
>>>> This functionality seems so basic that I assume I must be overlooking
>>>> something.  Clearly wget has been designed to give users control over
>>>> which files they download; but all I can find is that -X controls
>>>> both saving and link-following at the directory level, while -R
>>>> controls saving at the file level but still follows links from
>>>> unsaved files.
>>
>> why doesn't -X work in the scenario you have described?  If all links
>> from `links2' are under /B, you can exclude them using something like:
>
> That scenario seems rather unlikely, unless we're talking about
> autogenerated folder index files...
>
> This issue would be resolved if wget had a way to avoid its current
> behavior of always downloading HTML files unconditionally, regardless of
> what the rejection rules say. Then you could just reject that single file
> (and, if need be, download it in a separate session).
>
> --
> Micah J. Cowan
> http://micah.cowan.name/
>

I think that's right.  As I mention on the demo page, links2 could easily
contain links to hundreds of different directories, in which case you're
out of luck.
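
For illustration only (the directory names below are invented), the -X
approach Giuseppe suggests would look something like

  wget -r -X /dir1,/dir2,/dir3 http://fangjaw.com/wgettest/links1.html

and that exclusion list has to name every directory that links2 points
into.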

As Micah notes, if -R did not download the rejected files at all (or even
just downloaded them but did not queue their links), that would fix the
problem.  Also, if a user could alter the site's robots.txt file, I think
she could make wget behave correctly by including something like

User-agent: *
Disallow: /wgettest/links2.html

But obviously, most wget users won't have access to the server side. 
Since (I assume) wget knows how to follow that robots instruction, it
seems like it should be able to follow a similar instruction from the
client side.
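
(To make the comparison concrete: with a Disallow rule like the one above
on the server, a plain recursive run over the demo's links1 page, something
like

  wget -r http://fangjaw.com/wgettest/links1.html

consults /robots.txt first and, since robot exclusion is honored by default
in recursive mode, never downloads or follows links2.html.  What's missing
is a way to hand wget the same kind of rule from the client side.)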

David



