Re: [Bug-wget] Async webcrawling

From: Darshit Shah
Subject: Re: [Bug-wget] Async webcrawling
Date: Wed, 1 Aug 2018 12:28:37 +0200
User-agent: NeoMutt/20180716

Hi James,

Wget2 is built on top of the libwget library, which uses asynchronous network
calls. However, Wget2 itself is written so that each thread uses only one
connection at a time. This is a deliberate design decision to keep the codebase
simple. If you want a more complex crawler, you can use libwget to write your
own, as Tim suggested in his email.
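
To make the "one connection per thread" model concrete, here is a minimal
sketch in Python. It is not Wget2 or libwget code; the names fetch, crawl and
max_threads are made up for illustration, and the only assumption is that each
worker thread performs one blocking download at a time.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking download: the calling thread owns this one connection."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, fetch_fn=fetch, max_threads=5):
    """Download every URL with at most max_threads requests in flight.

    fetch_fn is injectable so the scheduling can be exercised without
    touching the network.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map() returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch_fn, urls)))
```

With this shape, concurrency is bounded by the number of threads, which is the
scaling limit James is asking about.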

Instead of this kind of async behaviour, we rely on HTTP/2 multiplexed streams,
which allow you to send multiple requests over the same connection in parallel.
So, when crawling a website that supports HTTP/2, Wget2 gets most of the
benefits of async access without needing the extra code paths that an
asynchronous design would require.
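
As a rough illustration of what multiplexing means, the toy sketch below
interleaves several payloads as (stream id, chunk) frames on a single
"connection" and reassembles them on the other side. It is only a conceptual
model in Python: the functions multiplex and demultiplex and the frame_size
parameter are hypothetical, and real HTTP/2 framing, flow control and header
compression are left out entirely.

```python
from itertools import zip_longest


def multiplex(streams, frame_size=4):
    """Interleave payloads round-robin into (stream_id, chunk) frames."""
    # Split each stream's payload into fixed-size chunks.
    chunked = {
        sid: [data[i:i + frame_size] for i in range(0, len(data), frame_size)]
        for sid, data in streams.items()
    }
    frames = []
    # Take one chunk from every stream per round; shorter streams yield None.
    for round_chunks in zip_longest(*chunked.values()):
        for sid, chunk in zip(chunked, round_chunks):
            if chunk is not None:
                frames.append((sid, chunk))
    return frames


def demultiplex(frames):
    """Reassemble each stream's payload from the interleaved frames."""
    out = {}
    for sid, chunk in frames:
        out[sid] = out.get(sid, b"") + chunk
    return out
```

The point is that several responses make progress over one connection at once,
so a single thread can have many requests outstanding.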

* James Read <address@hidden> [180731 20:28]:
> Thanks,
> as I understand it though there is only so much you can do with threading.
> For more scalable solutions you need to go with async programming
> techniques. See http://www.kegel.com/c10k.html for a summary of the
> problem. I want to do large scale webcrawling and am not sure if wget2 is
> up to the job.
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <address@hidden> wrote:
> > On 31.07.2018 18:39, James Read wrote:
> > > Hi,
> > >
> > > how much work would it take to convert wget into a fully fledged
> > > asynchronous webcrawler?
> > >
> > > I was thinking something like using select. Ideally, I want to be able to
> > > supply wget with a list of starting point URLs and then for wget to crawl
> > > the web from those starting points in an asynchronous fashion.
> > >
> > > James
> > >
> >
> > Just use wget2. It is already packaged in Debian sid.
> > To build from git source, see https://gitlab.com/gnuwget/wget2.
> >
> > To build from tarball (much easier), download from
> > https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> >
> > Regards, Tim
> >
> >

Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
