Re: [Bug-wget] Wget follows "button" links
From: CryHard
Subject: Re: [Bug-wget] Wget follows "button" links
Date: Tue, 05 Jun 2018 09:11:24 -0400
Hey Tim,
Thanks for the info. The wiki software we use (XWiki) appends parameters to
wiki page URLs to express a certain behavior. For example, to "watch" a page,
the button, once pressed, redirects you to
"www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
where the only thing that changes is the "WIKI-PAGE-NAME" part.
Also, for actions such as "deleting" or "reverting" a wiki page, the URL
changes by adding /remove/ or /delete/ "sub-folders" to the URL. These are
usually in the middle, before the actual page name. For example:
www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the offending part is in
the middle of the actual wiki page URL.
What I would need is to prevent wget from visiting any www.wiki.com/delete/
or www.wiki.com/remove/ pages. I'd also need to exclude links that end with
"xpage=watch&do=adddocument", which would add me as a watcher of that page.
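If upgrading wget is an option, a newer version's --reject-regex (added in
1.14, so not available in 1.12) could express both exclusions in a single
pattern. A sketch, with www.wiki.com standing in for the real host:

```shell
# POSIX ERE (wget's default regex type) matching /delete/ or /remove/
# path segments as well as the "watch" action query string.
REJECT='/(delete|remove)/|xpage=watch&do=adddocument'

# The crawl itself would look like this (shown via echo, since the
# host here is only a placeholder):
echo wget --recursive --convert-links --reject-regex "$REJECT" \
    "https://www.wiki.com/"
```

The same pattern can be checked locally with grep -E before letting a long
recursive crawl loose on it.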
I am using v1.12 because the most recent versions no longer allow --no-clobber
and --convert-links to be used together. I need --no-clobber because if the
download stops, I need to be able to resume without re-downloading all the
files. And I need --convert-links because this needs to work as a local copy.
From my understanding, the options you mention were added after v1.12. Is
there any way to achieve this?
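One partial workaround that should already exist in 1.12 is
-X/--exclude-directories, which skips whole path prefixes. It would cover the
/delete/ and /remove/ URLs, though not the ?xpage=watch&do=adddocument query
links. A sketch, again with www.wiki.com as a placeholder host:

```shell
# -X takes a comma-separated list of directory prefixes to skip during
# a recursive crawl; it has been in wget since long before 1.12.
# Shown via echo, since the host here is only a placeholder.
echo wget --recursive --no-parent \
    -X /delete,/remove \
    "https://www.wiki.com/"
```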
BTW, -N (timestamping) doesn't work, as the server hosting the wiki doesn't
seem to support it, hence wget keeps re-downloading the same files.
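The timestamping problem can be confirmed directly: wget -N needs the server
to send a Last-Modified header to compare against the local file's mtime. A
quick check with curl (www.wiki.com again being a placeholder):

```shell
# Fetch only the response headers and look for Last-Modified; if it is
# absent, -N has nothing to compare and wget re-downloads every file.
curl -sI --max-time 10 "https://www.wiki.com/WIKI-PAGE-NAME" \
    | grep -i '^last-modified:' \
    || echo "no Last-Modified header, so -N cannot skip unchanged files"
```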
Thanks a lot!
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On June 5, 2018 1:57 PM, Tim Rühsen <address@hidden> wrote:
> On 06/05/2018 11:53 AM, CryHard wrote:
>
> > Hey there,
> >
> > I've used the following:
> >
> > wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)
> > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
> > --user=myuser --ask-password --no-check-certificate --recursive
> > --page-requisites --adjust-extension --span-hosts
> > --restrict-file-names=windows --domains wiki.com --no-parent wiki.com
> > --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
> >
> > To download a wiki. The problem is that this will follow "button" links,
> > e.g the links that allow a user to put a page on a watchlist for further
> > modifications. This has led to me watching hundreds of pages. Not only
> > that, but apparently it also follows the links that lead to reverting
> > changes made by others on a page.
> >
> > Is there a way to avoid this behavior?
>
> Hi,
>
> that depends on how these "button links" are realized.
>
> A button may be part of an HTML FORM tag/structure where the URL is the
> value of the 'action' attribute. Wget doesn't download such URLs because
> of the problem you describe.
>
> A dynamic web page can realize "button links" by using simple links.
> Wget doesn't know about hidden semantics and so downloads these URLs -
> and maybe they trigger some changes in a database.
>
> If this is your issue, you have to look into the HTML files and exclude
> those URLs from being downloaded. Or you create a whitelist. Look at
> options -A/-R and --accept-regex and --reject-regex.
>
> > I'm using the following version:
> >
> > > wget --version
> > >
> > > GNU Wget 1.12 built on linux-gnu.
>
> Ok, you should update wget if possible. Latest version is 1.19.5.
>
> Regards, Tim