bug-wget

Re: [Bug-wget] Implementation suggestion for JavaScript execution


From: Tony Lewis
Subject: Re: [Bug-wget] Implementation suggestion for JavaScript execution
Date: Wed, 28 May 2014 16:13:50 -0700

Darshit Shah wrote:

> > how would you programmatically retrieve these links?  Triggering 
> > "onload" or other events?  I wonder how many of these occurrences we 
> > can cover by simply trying to parse cases like document.location='foo'
> > without involving any JS engine.
> >
> I think the only way, *if* we do want to implement this, is to package a
> complete javascript engine with Wget. Which in my opinion would be overkill.
> The problem with languages like javascript is twofold:
> 1. Various links aren't always encoded as simply as document.location='foo'.
> They could be very obfuscated which makes it very difficult for Wget to
> parse them without a full blown engine to do it.
> 2. Dynamic webpages using javascript could vary from session to session.
> Which means, Wget doesn't know which codepath to follow when downloading a
> webpage.

It seems to me that a full JavaScript engine is the only way to get this to
work in all cases.
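
To make the limitation concrete, here is a minimal sketch of the
engine-free approach: a pattern match for string literals assigned to
document.location. The helper name findLiteralRedirects and the two sample
scripts are made up for illustration and are not part of wget. The sketch
finds the plain case and misses the obfuscated one entirely.

```javascript
// Hypothetical helper: find URLs assigned as plain string literals to
// document.location or window.location (optionally via .href).
function findLiteralRedirects(scriptSource) {
  const pattern =
    /\b(?:document|window)\.location(?:\.href)?\s*=\s*(["'])(.*?)\1/g;
  const urls = [];
  for (const match of scriptSource.matchAll(pattern)) {
    urls.push(match[2]); // group 2 is the text between the quotes
  }
  return urls;
}

// The simple case from the quoted message.
const simple = "document.location='foo';";
// An invented obfuscated variant that performs the same redirect.
const obfuscated = "var p = 'f' + 'oo'; document['loc' + 'ation'] = p;";

const simpleHits = findLiteralRedirects(simple);
const obfuscatedHits = findLiteralRedirects(obfuscated);
console.log(simpleHits, obfuscatedHits);
```

Only the first script yields a URL; nothing short of evaluating the second
one reveals where it redirects.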

I think the right solution is to identify a small set of variables
(document.location, etc.) and events (onload, onclick, etc.) for which wget
would execute the JavaScript. In the case of events, the task would be to
determine if the event routine downloads content. Then the question becomes:
What do you do with the content? Consider this script:

function onload()
{
  var img = new Image();
  img.src = "/images/image.gif";
  document.getElementById("show").innerHTML = "<img src='" + img.src + "'>";
}

Assuming the function is the onload routine for <body> then the file
"/images/image.gif" is clearly a page requisite for the webpage and it would
make sense (to me at least) that the image file would be downloaded if -p
were specified.
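
One way this might work, sketched under several assumptions: run the handler
against stub objects that only record what the script asks for. The sketch
uses Node.js and its vm module purely for illustration; the helper name
collectImageRequisites and the stub Image and document objects are
hypothetical, and the vm module is not a security boundary, so a real
implementation would need proper isolation.

```javascript
const vm = require("vm");

// Hypothetical helper: run a page's onload handler against stubbed DOM
// objects and collect every URL assigned to an Image's src.
function collectImageRequisites(scriptSource) {
  const requested = [];

  // Stub for the browser's Image constructor; it only records src writes.
  class Image {
    set src(url) { this._src = url; requested.push(url); }
    get src() { return this._src; }
  }

  // Stub document: getElementById returns a throwaway object so that
  // innerHTML assignments succeed without a real DOM.
  const document = { getElementById: () => ({ innerHTML: "" }) };

  // Define the page's functions in a fresh context, then fire onload once.
  vm.runInNewContext(scriptSource + "\nonload();", { Image, document });
  return requested;
}

const pageScript = `
function onload()
{
  var img = new Image();
  img.src = "/images/image.gif";
  document.getElementById("show").innerHTML = "<img src='" + img.src + "'>";
}`;

const requisites = collectImageRequisites(pageScript);
console.log(requisites);
```

The recorded list contains "/images/image.gif", which is the file that -p
would then need to fetch. Unlike a browser, the stub does not resolve the
path against the page's URL.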

However, what do you do if the onload event code uses an XMLHttpRequest to
download content with an onreadystatechange handler that replaces the
innerHTML of an element? Do you dynamically update the HTML that will be
stored on disk? Do you save the response in its own file?

What if the XMLHttpRequest is downloading JSON, which is then used by
JavaScript to dynamically update content on the webpage?
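
A sketch of this scenario, again with invented names: the page script
below, the URL "/data/items.json", and the helper collectXhrUrls are all
hypothetical, and Node.js with its vm module stands in for whatever engine
would actually be used. A stubbed XMLHttpRequest can at least record which
URL the handler asks for, but it says nothing about what to do with the
response.

```javascript
const vm = require("vm");

// Hypothetical helper: run an onload handler with a stubbed XMLHttpRequest
// and report which URLs it asked for. No network traffic occurs.
function collectXhrUrls(scriptSource) {
  const urls = [];

  class XMLHttpRequest {
    open(method, url) { urls.push({ method, url }); }
    send() { /* never completes, so onreadystatechange is not invoked */ }
  }

  // Stub document so the handler could write innerHTML if it ever ran.
  const document = { getElementById: () => ({ innerHTML: "" }) };

  vm.runInNewContext(scriptSource + "\nonload();", { XMLHttpRequest, document });
  return urls;
}

const pageScript = `
function onload()
{
  var xhr = new XMLHttpRequest();
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      document.getElementById("show").innerHTML = xhr.responseText;
    }
  };
  xhr.open("GET", "/data/items.json");
  xhr.send();
}`;

const xhrUrls = collectXhrUrls(pageScript);
console.log(xhrUrls);
```

The stub discovers the request, but the questions above remain open: the
saved HTML still has an empty "show" element unless the handler is also run
with the real response.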

Yes, implementing document.location is a reasonable place to start, but once
JavaScript support has been added to wget, I expect the floodgates will
quickly open with bug reports of pages where wget didn't retrieve all the
requisite content.

Tony



