From: Miquel Llobet
Subject: Re: [Bug-wget] [bug #30999] wget should respect robots.txt directive crawl-delay
Date: Fri, 10 Apr 2015 01:10:34 +0200
>
> Crawl-delay is host/domain specific. Thus a wget -r 'domain1 domain2
> domain3'
> can't simply wait 'crawl-delay' seconds after a download. We need some
> specific logic when dequeing the next file.
>
Okay, I understand now! That makes the problem much more complex: every
new domain needs its own crawl-delay and robots.txt parsing.

> Also how comes --wait into play?

By that I meant that the crawl-delay behaviour is the same as using --wait
with the given time. However, --waitretry defaults to 10s, so that would
need to be corrected, as it could otherwise violate the crawl-delay when
retrying a download.
> Today, web servers often allow for 50+ parallel connections from one client -
> I really don't see the point in implementing crawl-delay.
> I could change my mind if someone has a *real* good reason for it *and* comes
> up with a good algorithm / patch to handle all corner cases.
Fair point, but it's always nice to respect robots.txt :-) Given the
complexity of this fix, it might not be worth pursuing unless we have a
good reason to do so.
Thanks,
Miquel Llobet
On Thu, Apr 9, 2015 at 10:25 PM, Tim Ruehsen <address@hidden>
wrote:
> Follow-up Comment #6, bug #30999 (project wget):
>
> Crawl-delay is host/domain specific. Thus a wget -r 'domain1 domain2
> domain3'
> can't simply wait 'crawl-delay' seconds after a download. We need some
> specific logic when dequeing the next file. Also how comes --wait into
> play ?
> The user might be able to override crawl-delay for domain1 but not for
> domain2
> and domain3.
>
> Today, web servers often allow for 50+ parallel connections from one
> client -
> I really don't see the point in implementing crawl-delay.
>
> I could change my mind if someone has a *real* good reason for it *and*
> comes
> up with a good algorithm / patch to handle all corner cases.
>
>
> _______________________________________________________
>
> Reply to this item at:
>
> <http://savannah.gnu.org/bugs/?30999>
>
> _______________________________________________
> Message sent by/via Savannah
> http://savannah.gnu.org/