
Re: [Bug-wget] How do I tell wget not to follow links in a file?


From: David Skalinder
Subject: Re: [Bug-wget] How do I tell wget not to follow links in a file?
Date: Mon, 11 Apr 2011 03:36:02 +0100
User-agent: SquirrelMail/1.4.21

Okay, I have filed bug #33044 for this issue at
https://savannah.gnu.org/bugs/index.php?33044.  I've also moved the demo
to http://davidskalinder.com/wgettest/ and added a bunch of directories to
the unwanted links page to make the problem clearer.

It strikes me that this issue must come up fairly frequently, especially
for sites with flat directory hierarchies.  For example, any site
which keeps a "recent updates" page that includes a link to a "previous
updates" page, both of which contain links to many root-level directories,
would be affected.  A user who wanted to maintain an up-to-date mirror of
such a site would have no option but to download the entire site every
week.
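
(For concreteness, the sort of weekly job I have in mind is just something
like the following sketch, using the demo layout at
http://davidskalinder.com/wgettest/ as the stand-in site:

    wget --mirror --no-parent http://davidskalinder.com/wgettest/

Since there is no way to tell wget not to follow the links on the "previous
updates" page, a job like this ends up walking everything that page points
to as well.)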

HTH

DS


>> On 04/07/2011 05:26 AM, Giuseppe Scrivano wrote:
>>> "David Skalinder" <address@hidden> writes:
>>>
>>>>> I want to mirror part of a website that contains two links pages,
>>>>> each
>>>>> of
>>>>> which contains links to many root-level directories and also to the
>>>>> other
>>>>> links page.  I want to download recursively all the links from one
>>>>> links
>>>>> page, but not from the other: that is, I want to tell wget "download
>>>>> links1 and follow all of its links, but do not download or follow
>>>>> links
>>>>> from links2".
>>>>>
>>>>> I've put a demo of this problem up at http://fangjaw.com/wgettest --
>>>>> there
>>>>> is a diagram there that might state the problem more clearly.
>>>>>
>>>>> This functionality seems so basic that I assume I must be overlooking
>>>>> something.  Clearly wget has been designed to give users control over
>>>>> which files they download; but all I can find is that -X controls
>>>>> both
>>>>> saving and link-following at the directory level, while -R controls
>>>>> saving
>>>>> at the file level but still follows links from unsaved files.
>>>
>>> why doesn't -X work in the scenario you have described?  If all links
>>> from `links2' are under /B, you can exclude them using something like:
>>
>> That scenario seems rather unlikely, unless we're talking about
>> autogenerated folder index files...
>>
>> This issue would be resolved if wget had a way to avoid its current
>> behavior of always unconditionally downloading HTML files regardless of
>> what rejection rules say. Then you can just reject that single file (and
>> if need be, download it as part of a separate session).
>>
>> --
>> Micah J. Cowan
>> http://micah.cowan.name/
>>
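
(To make Micah's point concrete, here is a sketch of the rejection approach
that fails today, again using the demo site as the example:

    wget -r --no-parent -R 'links2.html' http://davidskalinder.com/wgettest/

As I read the manual, an HTML file that matches a -R rule is still downloaded
so that it can be scanned for links and is only deleted afterwards, by which
point everything links2.html points to has already been queued.)
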
>
> I think that's right.  As I mention on the demo page, links2 could easily
> contain links to hundreds of different directories, in which case you're
> out of luck.
>
> As Micah notes, if -R did not download the files at all (or even just
> downloaded them but did not queue their links), that should fix the
> problem.  Also, if a user could alter the robots.txt file, I think she
> could make wget act correctly by including something like
>
> User-agent: *
> Disallow: /wgettest/links2.html
>
> But obviously, most wget users won't have access to the server side.
> Since (I assume) wget knows how to follow that robots instruction, it
> seems like it should be able to follow a similar instruction from the
> client side.
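
(For comparison, the closest existing client-side control is directory
exclusion along the lines Giuseppe suggested, something like

    wget -r --no-parent -X /wgettest/B http://davidskalinder.com/wgettest/

where /wgettest/B is just a hypothetical stand-in for wherever links2's
targets live.  That only helps when those targets sit under a few known
directories, which is exactly what a flat layout does not give you.)
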
>
> David
>
>
>