bug-wget
Re: [Bug-wget] Wget follows "button" links


From: CryHard
Subject: Re: [Bug-wget] Wget follows "button" links
Date: Tue, 05 Jun 2018 09:52:24 -0400

Hey Tim,

Please see http://savannah.gnu.org/bugs/?31781, where this was implemented. It 
has applied since version 1.12.1.

On my personal Mac I have 1.19.5, and when I run the command with both 
arguments I get this response: 

"Both --no-clobber and --convert-links were specified, only --convert-links 
will be used."

Anyway, I might make do without -nc if I can use the regex argument. Could you 
give an example of how that argument would work in my case? Can I just use 
www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.* ?

Thanks!
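
For reference, a minimal sketch of how such a pattern could be checked before 
starting a crawl. It assumes a wget new enough to offer --reject-regex (per the 
thread, newer than 1.12) and that the default POSIX regex type behaves like 
grep -E; the pattern and the is_rejected helper are illustrative, not something 
confirmed in this thread. Note that www.mywiki.com/delete/* is a shell-style 
glob rather than a regex, and that .*/xpage=watch.* cannot match the watch URL 
quoted below, because there "xpage" follows a "?" and not a "/".

```shell
#!/bin/sh
# Hypothetical pattern: reject /delete/ or /remove/ path segments and any
# query string that carries xpage=watch. It is wrapped in .* so it behaves
# the same whether the regex is searched for or matched against the full URL.
REJECT='.*(/(delete|remove)/|[?&]xpage=watch).*'

# is_rejected URL: succeeds when URL matches the pattern as a POSIX ERE.
is_rejected() {
    printf '%s\n' "$1" | grep -Eq -- "$REJECT"
}

# Dry run against sample URLs shaped like the ones described in the thread.
for url in \
    'http://www.mywiki.com/delete/WIKI-PAGE-NAME' \
    'http://www.mywiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument' \
    'http://www.mywiki.com/WIKI-PAGE-NAME'
do
    if is_rejected "$url"; then
        echo "reject  $url"
    else
        echo "keep    $url"
    fi
done

# Assumed usage once the pattern looks right (other options omitted):
#   wget --recursive --reject-regex "$REJECT" wiki.com
```

Testing the pattern offline first avoids triggering any more "watch" or 
"delete" actions on the live wiki while the expression is still being tuned.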


​Sent with ProtonMail Secure Email.​

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On June 5, 2018 2:40 PM, Tim Rühsen <address@hidden> wrote:

> Hi,
> 
> in this case you could try it with -X / --exclude-directories.
> 
> E.g. wget -X /delete,/remove
> 
> That wouldn't help with "xpage=watch..." though.
> 
> And I can't tell you whether or how well -X works with wget 1.12.
> 
> Why (or since when) doesn't --no-clobber plus --convert-links work any
> more?
> 
> Please feel free to open a bug report at
> https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
> description.
> 
> Because it works for me :-)
> 
> Regards, Tim
> 
> On 06/05/2018 03:11 PM, CryHard wrote:
> 
> > Hey Tim,
> > 
> > Thanks for the info. The wiki software we use (xwiki) appends something to 
> > wiki pages URLs to express a certain behavior. For example, to "watch" a 
> > page, the button once pressed redirects you to 
> > "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
> > 
> > Where the only thing that changes is the "WIKI-PAGE-NAME" part.
> > 
> > Also, for actions such as "deleting" or "reverting" a wiki page, the 
> > URL changes by adding /remove/ or /delete/ "sub-folders" to the URL. These 
> > are usually in the middle, before the actual page name. For example: 
> > www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is 
> > in the middle of the actual wiki page URL.
> > 
> > What I need to do is stop wget from visiting any 
> > www.wiki.com/delete/ or www.wiki.com/remove/ pages. I'd also need to exclude 
> > links that end with "xpage=watch&do=adddocument", which cause me to watch 
> > that page.
> > 
> > I am using v1.12 because the most recent versions have disabled 
> > --no-clobber and --convert-links from working together. I need --no-clobber 
> > because if the download stops, I need to be able to resume without 
> > re-downloading all the files. And I need --convert-links because this needs 
> > to work as a local copy.
> > 
> > From my understanding, the options you mention were added after v1.12. 
> > Is there any way to achieve this with v1.12?
> > 
> > BTW, -N (timestamps) doesn't work, as the server on which the wiki is 
> > hosted doesn't seem to support this, hence wget keeps redownloading the 
> > same files.
> > 
> > Thanks a lot!
> > 
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > 
> > On June 5, 2018 1:57 PM, Tim Rühsen address@hidden wrote:
> > 
> > > On 06/05/2018 11:53 AM, CryHard wrote:
> > > 
> > > > Hey there,
> > > > 
> > > > I've used the following:
> > > > 
> > > > wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) 
> > > > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 
> > > > Safari/537.36" --user=myuser --ask-password --no-check-certificate 
> > > > --recursive --page-requisites --adjust-extension --span-hosts 
> > > > --restrict-file-names=windows --domains wiki.com --no-parent wiki.com 
> > > > --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
> > > > 
> > > > to download a wiki. The problem is that this will follow "button" 
> > > > links, e.g. the links that allow a user to put a page on a watchlist for 
> > > > further modifications. This has led to me watching hundreds of pages. 
> > > > Not only that, but apparently it also follows the links that lead to 
> > > > reverting changes made by others on a page.
> > > > 
> > > > Is there a way to avoid this behavior?
> > > 
> > > Hi,
> > > 
> > > that depends on how these "button links" are implemented.
> > > 
> > > A button may be part of an HTML FORM tag/structure where the URL is the
> > > value of the 'action' attribute. Wget doesn't download such URLs because
> > > of the problem you describe.
> > > 
> > > A dynamic web page can implement "button links" as simple links.
> > > Wget doesn't know about hidden semantics and so downloads these URLs,
> > > and maybe they trigger some changes in a database.
> > > 
> > > If this is your issue, you have to look into the HTML files and exclude
> > > those URLs from being downloaded. Or you create a whitelist. Look at
> > > options -A/-R and --accept-regex and --reject-regex.
> > > 
> > > > I'm using the following version:
> > > > 
> > > > > wget --version
> > > > > 
> > > > > GNU Wget 1.12 built on linux-gnu.
> > > 
> > > Ok, you should update wget if possible. Latest version is 1.19.5.
> > > 
> > > Regards, Tim




