bug-wget


From: Ángel González
Subject: Re: [Bug-wget] Does wget check if specified user agent is allowed in robots.txt?
Date: Sun, 29 Jun 2014 22:20:00 +0200
User-agent: Thunderbird

On 21/06/14 21:31, Darshit Shah wrote:
> Hi,
>
> I responded to your original question on Stack Overflow. However for
> completeness and to document facts, I'll add a response here too.
>
> The answer to your question is: No. Sadly enough, Wget does NOT check
> for the user agent string it is using when parsing the robots file. It
> simply reads rules for `User-Agent: *` and `User-Agent: wget` giving
> preference to the rules specified for Wget alone.
>
> This also has another major implication. Wget seems to be reading and
> adhering to robots rules ONLY for * and wget. Which means that not
> only does Wget ignore the correct robots exclusion rules, it even
> follows the wrong set of rules if Wget is using a different User-Agent
> and the website provides a set of rules for Wget.

I'm not convinced this is wrong. You *are* using wget after all.

I don't think you should compare against the User-Agent header, as that is
different from the robots.txt identifier. For instance, Bing uses “bingbot”
for robots.txt but a User-Agent of

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
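
To make the distinction concrete, here is a rough Python sketch (not wget's
actual code, which is written in C). The robots.txt content, the helper names
parse_groups() and is_allowed(), and the simple prefix matching are all
assumptions made for illustration. It only shows that looking up a rule group
by the short robots.txt identifier and looking it up by the full User-Agent
header can give different answers.

```python
# Illustrative sketch only: a tiny robots.txt group selector.
# The robots.txt below and both helper functions are hypothetical.

ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /private/

User-agent: *
Disallow:
"""

# The full HTTP header value quoted in the message above.
FULL_HEADER = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


def parse_groups(text):
    """Return {lowercased identifier: [disallowed path prefixes]}."""
    groups = {}
    current = []       # identifiers of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:           # a User-agent line after rules starts a new group
                current = []
                in_rules = False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:              # an empty Disallow forbids nothing
                for ident in current:
                    groups[ident].append(value)
    return groups


def is_allowed(groups, identifier, path):
    """Use the group named exactly `identifier`, else fall back to `*`."""
    rules = groups.get(identifier.lower(), groups.get("*", []))
    return not any(path.startswith(prefix) for prefix in rules)


groups = parse_groups(ROBOTS_TXT)
print(is_allowed(groups, "bingbot", "/private/page"))    # identifier lookup
print(is_allowed(groups, FULL_HEADER, "/private/page"))  # header lookup
```

With this made-up file, the lookup by "bingbot" finds the restrictive group,
while the lookup by the full header string matches no group and falls back to
the permissive `*` rules.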

If we want to make it configurable, it should be a new setting (preferably
a wgetrc-only one).


Best regards