Re: [Bug-wget] Wget follows "button" links
From: Tim Rühsen
Subject: Re: [Bug-wget] Wget follows "button" links
Date: Tue, 5 Jun 2018 15:40:21 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Hi,
In this case you could try -X / --exclude-directories.
E.g. wget -X /delete,/remove
That wouldn't help with "xpage=watch..." though.
And I can't tell you whether, or how well, -X works with wget 1.12.
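
A rough sketch of how both cases could be covered with a single --reject-regex instead. This assumes a wget that has the option (1.12 does not; I believe it arrived in 1.14) and the default POSIX regex type. The pattern and the sample URLs are only guesses built from the examples in this thread, so check them against the links in the wiki's HTML first. The dry run with grep -E shows which URLs the pattern would let through, without touching the server:

```shell
# Hypothetical pattern: a /delete/ or /remove/ path segment, or the
# "xpage=watch" query parameter. Adjust it to the wiki's real links.
REJECT_RE='/(delete|remove)/|[?&]xpage=watch'

# Sample URLs, made up from the examples in this thread.
urls='http://www.wiki.com/WIKI-PAGE-NAME
http://www.wiki.com/delete/WIKI-PAGE-NAME
http://www.wiki.com/remove/WIKI-PAGE-NAME
http://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument'

# Dry run: keep only the URLs that the pattern does NOT match.
kept=$(printf '%s\n' "$urls" | grep -Ev "$REJECT_RE")
printf '%s\n' "$kept"

# The real call would then look roughly like this (not executed here):
# wget --recursive --no-parent --reject-regex "$REJECT_RE" http://www.wiki.com/
```

grep -E is only a stand-in for wget's own matching here; as far as I know wget applies the regex to the complete URL, so an unanchored pattern like the one above should behave the same way.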
Why (or since when) does --no-clobber plus --convert-links no longer
work?
Please feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
description, because it works for me :-)
Regards, Tim
On 06/05/2018 03:11 PM, CryHard wrote:
> Hey Tim,
>
> Thanks for the info. The wiki software we use (xwiki) appends something to
> wiki pages URLs to express a certain behavior. For example, to "watch" a
> page, the button once pressed redirects you to
> "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
>
> Where the only thing that changes is the "WIKI-PAGE-NAME" part.
>
> Also, for actions such as "deleting" or "reverting" a wiki page, the URL
> changes by adding /remove/ or /delete/ "sub-folders" to the URL. These are
> usually in the middle, before the actual page name. For example:
> www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is
> in the middle of the actual wiki page URL.
>
> What I need is to stop wget from visiting any www.wiki.com/delete/
> or www.wiki.com/remove/ pages. I'd also need to exclude links that end with
> "xpage=watch&do=adddocument", which causes me to watch that page.
>
> I am using v1.12 because the most recent versions have disabled --no-clobber
> and --convert-links from working together. I need --no-clobber because if the
> download stops, I need to be able to resume without re-downloading all the
> files. And I need --convert-links because this needs to work as a local copy.
>
> From my understanding the options you mention have been added after v1.12. Is
> there any way to achieve this?
>
> BTW, -N (timestamps) doesn't work, as the server on which the wiki is hosted
> doesn't seem to support this, hence wget keeps redownloading the same files.
>
> Thanks a lot!
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On June 5, 2018 1:57 PM, Tim Rühsen <address@hidden> wrote:
>
>> On 06/05/2018 11:53 AM, CryHard wrote:
>>
>>> Hey there,
>>>
>>> I've used the following:
>>>
>>> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)
>>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
>>> --user=myuser --ask-password --no-check-certificate --recursive
>>> --page-requisites --adjust-extension --span-hosts
>>> --restrict-file-names=windows --domains wiki.com --no-parent wiki.com
>>> --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
>>>
>>> To download a wiki. The problem is that this will follow "button" links,
>>> e.g. the links that allow a user to put a page on a watchlist for further
>>> modifications. This has led to me watching hundreds of pages. Not only
>>> that, but apparently it also follows the links that lead to reverting
>>> changes made by others on a page.
>>>
>>> Is there a way to avoid this behavior?
>>
>> Hi,
>>
>> that depends on how these "button links" are realized.
>>
>> A button may be part of an HTML FORM tag/structure where the URL is the
>> value of the 'action' attribute. Wget doesn't download such URLs because
>> of the problem you describe.
>>
>> A dynamic web page can realize "button links" by using simple links.
>> Wget doesn't know about hidden semantics and so downloads these URLs -
>> and maybe they trigger some changes in a database.
>>
>> If this is your issue, you have to look into the HTML files and exclude
>> those URLs from being downloaded. Or you create a whitelist. Look at
>> options -A/-R and --accept-regex and --reject-regex.
>>
>>> I'm using the following version:
>>>
>>>> wget --version
>>>>
>>>> GNU Wget 1.12 built on linux-gnu.
>>
>> Ok, you should update wget if possible. Latest version is 1.19.5.
>>
>> Regards, Tim
>
>