Re: [Bug-wget] Async webcrawling
From: Darshit Shah
Subject: Re: [Bug-wget] Async webcrawling
Date: Wed, 1 Aug 2018 12:28:37 +0200
User-agent: NeoMutt/20180716
Hi James,
Wget2 is built on top of the libwget library, which uses asynchronous network
calls. However, Wget2 is written so that each thread uses only one connection.
This is a deliberate design decision to simplify the codebase. If you want a
more complex crawler, you can use libwget to write your own, as Tim suggested
in his email.
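To make the "one connection per thread" model concrete, here is a minimal
sketch in Python. It is not Wget2 or libwget code: `FakeConnection`, `fetch`
and `crawl` are hypothetical names invented for illustration, and the network
call is simulated so the example runs without any network access.

```python
import queue
import threading


class FakeConnection:
    """Hypothetical stand-in for one blocking network connection."""

    def __init__(self, worker_id):
        self.worker_id = worker_id

    def fetch(self, url):
        # A real worker would block here on network I/O.
        return f"worker {self.worker_id} fetched {url}"


def crawl(urls, num_threads=3):
    """Download every URL using num_threads workers, one connection each."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    results = []
    lock = threading.Lock()

    def worker(worker_id):
        conn = FakeConnection(worker_id)  # exactly one connection per thread
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # no work left, the thread exits
            body = conn.fetch(url)
            with lock:
                results.append((url, body))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this shape, the number of requests in flight is bounded by the number of
threads, which keeps each worker's control flow simple at the cost of needing
more threads for more parallelism.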
Instead of this kind of async behaviour, we rely on HTTP/2 multiplexed streams,
which allow you to send multiple requests over the same connection in parallel.
So, when crawling a website that supports HTTP/2, Wget2 gets the benefits of
async access without requiring all those code paths.
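The idea behind multiplexed streams can be sketched with a toy model in
Python. This is not the real HTTP/2 framing format and not how Wget2
implements it: `interleave` and `reassemble` are hypothetical helpers, and
each "frame" is simply a (stream id, chunk) pair. The point is only that
several responses can share one connection because every chunk is tagged with
the stream it belongs to.

```python
from collections import defaultdict


def interleave(responses, chunk_size=4):
    """Cut each response into tagged chunks and emit them round-robin,
    as if they were all travelling over one shared connection."""
    frames = []
    offsets = {sid: 0 for sid in responses}
    pending = list(responses)
    while pending:
        for sid in list(pending):
            start = offsets[sid]
            chunk = responses[sid][start:start + chunk_size]
            offsets[sid] = start + chunk_size
            frames.append((sid, chunk))
            if offsets[sid] >= len(responses[sid]):
                pending.remove(sid)  # this stream is complete
    return frames


def reassemble(frames):
    """Rebuild each response on the receiving side from its stream id."""
    streams = defaultdict(bytes)
    for sid, chunk in frames:
        streams[sid] += chunk
    return dict(streams)


# Three responses in flight at once on a single (simulated) connection.
responses = {1: b"index.html body", 3: b"style.css", 5: b"x"}
frames = interleave(responses)
rebuilt = reassemble(frames)
```

The stream ids 1, 3 and 5 follow the HTTP/2 convention that client-initiated
streams use odd numbers; everything else here is simplified for illustration.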
* James Read <address@hidden> [180731 20:28]:
> Thanks,
>
> as I understand it though there is only so much you can do with threading.
> For more scalable solutions you need to go with async programming
> techniques. See http://www.kegel.com/c10k.html for a summary of the
> problem. I want to do large scale webcrawling and am not sure if wget2 is
> up to the job.
>
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <address@hidden> wrote:
>
> > On 31.07.2018 18:39, James Read wrote:
> > > Hi,
> > >
> > > how much work would it take to convert wget into a fully fledged
> > > asynchronous webcrawler?
> > >
> > > I was thinking something like using select. Ideally, I want to be able to
> > > supply wget with a list of starting point URLs and then for wget to crawl
> > > the web from those starting points in an asynchronous fashion.
> > >
> > > James
> > >
> >
> > Just use wget2. It is already packaged in Debian sid.
> > To build from git source, see https://gitlab.com/gnuwget/wget2.
> >
> > To build from tarball (much easier), download from
> > https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> >
> > Regards, Tim
> >
> >
>
--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6