Re: [Bug-wget] Implementation suggestion for JavaScript execution


From: Darshit Shah
Subject: Re: [Bug-wget] Implementation suggestion for JavaScript execution
Date: Tue, 27 May 2014 10:41:38 +0530

On Mon, May 26, 2014 at 6:50 PM, Giuseppe Scrivano <address@hidden> wrote:
> Andrew Pennebaker <address@hidden> writes:
>
>> Tumblr and other websites delay loading some of their content (images)
>> through JavaScript events like *onload*. It would be nice if wget supported
>> a *-j* flag for executing this, in order to access these dynamically loaded
>> resources. Execution may add some time to downloads, but for users that
>> really want the content, having the option is better than not.
>>
>> Possible solutions:
>>
>> The HtmlUnit <http://htmlunit.sourceforge.net/> library can already do
>> this, but it's written in Java and I believe wget is written in C?
>
> correct, wget is written in C.
>
>
>> Another consideration for attaching JS execution to wget is
>> Node<http://nodejs.org/>, a
>> C++ implementation, though we probably only want the core, the
>> V8<https://code.google.com/p/v8/>JavaScript engine itself.
>>
>> Other possibilities include
>> SpiderMonkey<http://en.wikipedia.org/wiki/SpiderMonkey_(JavaScript_engine)>,
>> the JS engine for Firefox, or
>> JavaScriptCore<http://www.webkit.org/projects/javascript/>,
>> Safari's JS engine.
>
> how would you programmatically retrieve these links?  Triggering
> "onload" or other events?  I wonder how many of these occurrences we can
> cover by simply trying to parse cases like document.location='foo'
> without involving any JS engine.
>
I think the only way, *if* we do want to implement this, is to package
a complete JavaScript engine with Wget, which in my opinion would be
overkill. The problem with languages like JavaScript is twofold:
1. Links aren't always encoded as simply as document.location='foo'.
They can be heavily obfuscated, which makes it very difficult for Wget
to parse them without a full-blown engine.
2. Dynamic webpages using JavaScript can vary from session to session,
which means Wget doesn't know which code path to follow when
downloading a webpage.
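
To make point 1 concrete, here is a minimal Python sketch (not Wget
code; the pattern and the function name find_location_links are made up
for illustration). It assumes the link is assigned as a plain string
literal, which is exactly the case that breaks once the script builds
the name or the value at runtime:

```python
import re

# Hypothetical pattern: matches only a direct assignment of a quoted
# string literal to document.location / window.location(.href).
LOCATION_RE = re.compile(
    r"""(?:document|window)\.location(?:\.href)?\s*=\s*(['"])(.*?)\1"""
)


def find_location_links(js_source):
    """Return the string literals assigned directly to *.location."""
    return [m.group(2) for m in LOCATION_RE.finditer(js_source)]


# The simple case from the thread is found by plain text matching.
simple = "document.location='foo';"

# The same redirect, lightly obfuscated: the property name and the
# target are assembled at runtime, so no literal assignment exists.
obfuscated = "var p = 'f' + 'oo'; document['loc' + 'ation'] = p;"

print(find_location_links(simple))
print(find_location_links(obfuscated))
```

The first call finds the link and the second finds nothing, even though
a browser would end up at the same place in both cases.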

I am highly skeptical about bundling a full JavaScript engine with
Wget simply to let users download interactive webpages that use
JavaScript. However, as JavaScript becomes more and more ubiquitous,
we should at some point consider this as an option. Maybe we could
have something similar to our CSS parsing code that tries to identify
links to other webpages in JavaScript without executing it. It would
be a difficult task, but it seems like the only option we have.
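
As a rough idea of what such a scanner could look like, here is a
heuristic sketch in Python (again, not Wget code; guess_links, both
patterns, and the list of file extensions are assumptions chosen for
the example). It never runs the script. It only collects string
literals that look like URLs and resolves them against the page URL:

```python
import re
from urllib.parse import urljoin

# Single- or double-quoted string literal, allowing backslash escapes.
STRING_LITERAL_RE = re.compile(r"""(['"])((?:\\.|(?!\1).)*)\1""")

# Assumed heuristic for "looks like a link": an absolute http(s) URL,
# a path starting with / ./ or ../, or a bare name with a common
# web file extension. The extension list is illustrative only.
URL_LIKE_RE = re.compile(
    r"^(?:https?://|/|\.{1,2}/)\S+$"
    r"|^\S+\.(?:html?|php|jpe?g|png|gif|css|js)$",
    re.IGNORECASE,
)


def guess_links(js_source, base_url):
    """Return absolute URLs for string literals that look like links."""
    links = []
    for match in STRING_LITERAL_RE.finditer(js_source):
        candidate = match.group(2)
        if URL_LIKE_RE.match(candidate):
            links.append(urljoin(base_url, candidate))
    return links


script = (
    "var img = new Image(); img.src = '/media/photo.jpg'; "
    'var label = "hello world"; '
    "load('http://example.com/a.png');"
)
for link in guess_links(script, "http://example.com/blog/"):
    print(link)
```

This catches literal URLs such as lazily loaded image paths, but it
still misses anything assembled at runtime, and it can produce false
positives from strings that merely resemble paths.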



-- 
Thanking You,
Darshit Shah


