bug-wget


From: Darshit Shah
Subject: Re: [Bug-wget] Does wget check if specified user agent is allowed in robots.txt?
Date: Sun, 22 Jun 2014 01:01:06 +0530

Hi,

I responded to your original question on Stack Overflow. However, for
completeness and to document the facts, I'll add a response here too.

The answer to your question is: No. Sadly enough, Wget does NOT take
the user agent string it is using into account when parsing the robots
file. It simply reads the rules for `User-Agent: *` and
`User-Agent: wget`, giving preference to the rules specified for Wget
alone.

This also has another major implication. Wget seems to read and
adhere to robots rules ONLY for * and wget. This means that not only
does Wget ignore the robots exclusion rules that actually apply to the
User-Agent it sends, it even follows the wrong set of rules if Wget is
using a different User-Agent and the website provides a set of rules
for Wget.
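
To make the difference concrete, here is a minimal Python sketch. It is
not Wget's actual C code; it only mimics the behaviour described above.
All names (parse_groups, rules_as_described, rules_for_agent) and the
sample robots.txt are assumptions made for illustration, and the parser
handles only User-agent and Disallow lines.

```python
def parse_groups(robots_txt):
    """Map each lower-cased User-agent token to its list of Disallow paths."""
    groups = {}
    current = []      # agents named by the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:                         # empty Disallow means "allow all"
                for agent in current:
                    groups[agent].append(value)
    return groups


def rules_as_described(groups):
    """Behaviour reported in this thread: only 'wget' and '*' are considered."""
    if "wget" in groups:
        return groups["wget"]
    return groups.get("*", [])


def rules_for_agent(groups, user_agent):
    """Hypothetical alternative: match against the User-Agent actually sent."""
    ua = user_agent.lower()
    for token, rules in groups.items():
        if token != "*" and token in ua:
            return rules
    return groups.get("*", [])


# Made-up robots.txt with separate groups for *, wget and MyBot.
ROBOTS = """\
User-agent: *
Disallow:

User-agent: wget
Disallow: /private/

User-agent: MyBot
Disallow: /
"""

groups = parse_groups(ROBOTS)
print(rules_as_described(groups))            # rules picked regardless of the UA sent
print(rules_for_agent(groups, "MyBot 1.0"))  # rules a UA-aware lookup would pick
```

With this sample file, the first lookup returns the wget group even
though the client identifies itself as MyBot, while the second returns
the MyBot group, which bans the bot entirely.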

This bug can be seen in action with the test case I created. Apply the
attached patch and run the Test--UA.py test. The patch is made against
the new Python-based test suite, which exists in the parallel-wget
branch.

On Fri, Jun 20, 2014 at 2:47 AM, György Chityil
<address@hidden> wrote:
> If I specify a custom user agent for wget, eg "MyBot 1.0 (address@hidden)"
> Will wget check this in robots.txt as well, if the bot was banned, or only
> the general robot exclusions? Does wget check if "MyBot" is allowed to
> crawl?
> If not, this would be a nice feature.  If yes, it would be great to include
> this info in the robots overview here https://www.gnu.org/software/wget
>
> I originally posted this question here , but then I found this list
> http://stackoverflow.com/questions/24316018/does-wget-check-if-specified-user-agent-is-allowed-in-robots-txt
>
> --
> Gyuri
> 274 44 98
> 06 30 5888 744



-- 
Thanking You,
Darshit Shah

Attachment: 0001-Test-case-showing-User-agent-bug.patch
Description: Text Data

