From: Miquel Llobet
Subject: Re: [Bug-wget] [bug #30999] wget should respect robots.txt directive crawl-delay
Date: Fri, 10 Apr 2015 01:10:34 +0200

> Crawl-delay is host/domain specific. Thus a wget -r 'domain1 domain2
> domain3' can't simply wait 'crawl-delay' seconds after a download. We
> need some specific logic when dequeuing the next file.

Okay, I understand now! That makes the problem much more complex: every
new domain needs its own robots.txt parsing and its own crawl-delay.
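
To make that concrete, here is a minimal sketch in C (wget's language) of
what per-host bookkeeping could look like. None of these names exist in
wget; host_state, lookup_host and host_ready are hypothetical, and a real
patch would hang this off wget's existing host data structures rather
than a fixed array:

  #include <stdbool.h>
  #include <string.h>
  #include <time.h>

  #define MAX_HOSTS 64

  struct host_state {
    char host[256];        /* hostname, e.g. "domain1" */
    double crawl_delay;    /* seconds, parsed from that host's robots.txt */
    time_t last_fetch;     /* when we last completed a download from it */
  };

  static struct host_state hosts[MAX_HOSTS];
  static int nhosts;

  static struct host_state *
  lookup_host (const char *host)
  {
    for (int i = 0; i < nhosts; i++)
      if (strcmp (hosts[i].host, host) == 0)
        return &hosts[i];
    return NULL;
  }

  /* True if HOST may be fetched now, i.e. its crawl-delay has elapsed.
     The dequeue loop would skip URLs whose host is not ready yet and
     revisit them later, instead of sleeping globally.  */
  bool
  host_ready (const char *host)
  {
    struct host_state *hs = lookup_host (host);
    if (!hs)
      return true;   /* no robots.txt recorded yet: nothing to wait for */
    return difftime (time (NULL), hs->last_fetch) >= hs->crawl_delay;
  }
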

> Also, how does --wait come into play?

By that I meant that the crawl-delay behaviour is the same as using --wait
with the given time. However, --waitretry defaults to 10s, so that would
need to be adjusted, as retrying a download could otherwise violate the
crawl-delay.
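
A hedged sketch of how the two pauses could be reconciled; effective_wait,
effective_retry_wait, opt_wait and opt_waitretry are illustrative names,
not wget's actual option struct. The idea is simply to never sleep less
than the host's crawl-delay, whether between ordinary requests or between
retries:

  static double
  effective_wait (double opt_wait, double crawl_delay)
  {
    /* Ordinary inter-request pause: the longer of --wait and
       crawl-delay.  */
    return opt_wait > crawl_delay ? opt_wait : crawl_delay;
  }

  static double
  effective_retry_wait (int attempt, double opt_waitretry,
                        double crawl_delay)
  {
    /* --waitretry backs off linearly: 1s after the first failure, 2s
       after the second, ... capped at opt_waitretry (10s by default).
       Clamp the result from below by crawl-delay so a quick retry
       cannot violate the robots.txt directive.  */
    double backoff = attempt < opt_waitretry ? attempt : opt_waitretry;
    return backoff > crawl_delay ? backoff : crawl_delay;
  }
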

> Today, web servers often allow for 50+ parallel connections from one
> client - I really don't see the point in implementing crawl-delay.
> I could change my mind if someone has a *real* good reason for it *and*
> comes up with a good algorithm / patch to handle all corner cases.

Fair point, but it's always nice to respect robots.txt :-) Given the
complexity of this fix, it might not be worth pursuing unless we have a
good reason to do so.

Thanks,
Miquel Llobet



On Thu, Apr 9, 2015 at 10:25 PM, Tim Ruehsen <address@hidden>
wrote:

> Follow-up Comment #6, bug #30999 (project wget):
>
> Crawl-delay is host/domain specific. Thus a wget -r 'domain1 domain2
> domain3' can't simply wait 'crawl-delay' seconds after a download. We
> need some specific logic when dequeuing the next file. Also, how does
> --wait come into play? The user might be able to override crawl-delay
> for domain1 but not for domain2 and domain3.
>
> Today, web servers often allow for 50+ parallel connections from one
> client - I really don't see the point in implementing crawl-delay.
>
> I could change my mind if someone has a *real* good reason for it *and*
> comes up with a good algorithm / patch to handle all corner cases.
>
>
>     _______________________________________________________
>
> Reply to this item at:
>
>   <http://savannah.gnu.org/bugs/?30999>
>
> _______________________________________________
>   Message sent via Savannah
>   http://savannah.gnu.org/
>
>

