
Re: [Bug-wget] Async webcrawling


From: Tim Rühsen
Subject: Re: [Bug-wget] Async webcrawling
Date: Tue, 31 Jul 2018 21:01:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 31.07.2018 20:17, James Read wrote:
> Thanks,
> 
> as I understand it, though, there is only so much you can do with
> threading. For more scalable solutions you need to go with async
> programming techniques. See http://www.kegel.com/c10k.html for a summary
> of the problem. I want to do large-scale web crawling and am not sure if
> wget2 is up to the job.

Well, you'll be surprised how fast wget2 is. Especially with HTTP/2
spreading more and more, you can easily fill larger bandwidths with just
a few threads. Of course it also heavily depends on the server's
capabilities and the ping/RTT values you have.

Since you can control host spanning, you could also split your workload
onto several processes (or even hosts).
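
Roughly, splitting a URL list across a handful of wget2 processes could
look like the sketch below. It assumes the list has already been split
into files urls.part0 .. urls.part3; the worker count and file names are
just placeholders for illustration:

/* Hedged sketch: fan a pre-split URL list out to several wget2 processes.
 * The worker count and the urls.partN file names are illustrative only. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int workers = 4;                  /* illustrative worker count */

    for (int i = 0; i < workers; i++) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {                     /* child: fetch its slice of the list */
            char listfile[64];
            snprintf(listfile, sizeof listfile, "urls.part%d", i);
            execlp("wget2", "wget2", "--input-file", listfile, (char *) NULL);
            perror("execlp");               /* only reached if wget2 is not found */
            _exit(127);
        }
    }

    while (wait(NULL) > 0)                  /* parent: wait for all workers */
        ;
    return 0;
}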

Are you going to crawl complete web sites or just a few files per site?
The speed heavily depends on those (and other) details.

If it turns out that you really need a highly specialized crawler, it
might be best to use libwget's API. I did so for scanning the top 1M
Alexa sites a while ago and it worked out pretty well (took ~2h on a
500/50 Mbps cable connection). The source is in the examples/ directory.
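
If you do roll your own instead, the select()-based approach from your
first mail boils down to an event loop along these lines. This is a rough
plain-POSIX sketch, not libwget's API; the host names are placeholders
and error handling is minimal:

/* Hedged sketch: one thread drives several HTTP connections with select().
 * Hosts are placeholders; a real crawler would parse responses for links. */
#include <fcntl.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define NCONN 3

int main(void)
{
    const char *hosts[NCONN] = { "example.com", "example.org", "example.net" };
    int fds[NCONN];
    int pending = 0;

    /* Start one non-blocking connection per host and queue a request. */
    for (int i = 0; i < NCONN; i++) {
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
        fds[i] = -1;
        if (getaddrinfo(hosts[i], "80", &hints, &res) != 0)
            continue;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        fcntl(fd, F_SETFL, O_NONBLOCK);
        connect(fd, res->ai_addr, res->ai_addrlen);   /* typically EINPROGRESS */
        freeaddrinfo(res);

        /* Wait until the socket becomes writable (connect finished), then send. */
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        select(fd + 1, NULL, &wfds, NULL, NULL);

        char req[256];
        snprintf(req, sizeof req,
                 "GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n",
                 hosts[i]);
        send(fd, req, strlen(req), 0);
        fds[i] = fd;
        pending++;
    }

    /* Event loop: read from whichever connection has data, no thread per host. */
    while (pending > 0) {
        fd_set rfds;
        int maxfd = -1;
        FD_ZERO(&rfds);
        for (int i = 0; i < NCONN; i++) {
            if (fds[i] < 0)
                continue;
            FD_SET(fds[i], &rfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
            break;
        for (int i = 0; i < NCONN; i++) {
            if (fds[i] < 0 || !FD_ISSET(fds[i], &rfds))
                continue;
            char buf[4096];
            ssize_t n = recv(fds[i], buf, sizeof buf, 0);
            if (n <= 0) {                     /* done (or failed): drop connection */
                close(fds[i]);
                fds[i] = -1;
                pending--;
            } else {
                fwrite(buf, 1, n, stdout);    /* a real crawler would extract URLs here */
            }
        }
    }
    return 0;
}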

Maybe just start with a test.

I am personally pretty interested in tuning bottlenecks (CPU, memory,
bandwidth, ...), so let me know if there is something and I'll go for it.

You can also PM me with more details if you'd rather not post them in
public.

Regards, Tim

> 
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen <address@hidden> wrote:
> 
>     On 31.07.2018 18:39, James Read wrote:
>     > Hi,
>     >
>     > how much work would it take to convert wget into a fully-fledged
>     > asynchronous web crawler?
>     >
>     > I was thinking something like using select. Ideally, I want to be
>     able to
>     > supply wget with a list of starting point URLs and then for wget
>     to crawl
>     > the web from those starting points in an asynchronous fashion.
>     >
>     > James
>     >
> 
>     Just use wget2. It is already packaged in Debian sid.
>     To build from git source, see https://gitlab.com/gnuwget/wget2.
> 
>     To build from tarball (much easier), download from
>     https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> 
>     Regards, Tim
> 
> 
