From: Darshit Shah
Subject: Re: [Bug-wget] Does wget check if specified user agent is allowed in robots.txt?
Date: Sun, 22 Jun 2014 01:01:06 +0530
Hi,
I responded to your original question on Stack Overflow. However, for
completeness and to document the facts, I'll add a response here too.
The answer to your question is: No. Sadly, Wget does NOT check the
user agent string it is using when parsing the robots file. It
simply reads the rules for `User-Agent: *` and `User-Agent: wget`,
giving preference to the rules specified specifically for Wget.
This has another major implication. Wget appears to read and adhere
to robots rules ONLY for `*` and `wget`. This means that when Wget is
run with a different User-Agent, it not only ignores the exclusion
rules that actually apply to that agent, it also follows the wrong
set of rules if the website provides rules for Wget.
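To make the difference concrete, here is a minimal Python sketch of the two selection strategies. It is a toy model of the behaviour described above, not Wget's actual code: the function names, the sample robots file, and the simplified parsing (only `User-agent` and `Disallow` lines, exact token match) are all assumptions made for illustration.

```python
def parse_groups(robots_txt):
    """Map each lowercased User-agent token to its list of Disallow paths."""
    groups = {}
    current = []      # agents named by the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow allows everything
                for agent in current:
                    groups[agent].append(value)
    return groups


def disallowed_described(groups, user_agent):
    """Behaviour described in this message: user_agent is never consulted."""
    return groups["wget"] if "wget" in groups else groups.get("*", [])


def disallowed_expected(groups, user_agent):
    """Behaviour the reporter asked about: match the UA's product token."""
    token = user_agent.split("/")[0].split()[0].lower()
    return groups[token] if token in groups else groups.get("*", [])


# Hypothetical robots file with separate groups for *, wget and MyBot.
ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: wget
Disallow: /no-wget/

User-agent: MyBot
Disallow: /
"""

UA = "MyBot 1.0 (address@hidden)"
groups = parse_groups(ROBOTS)
print(disallowed_described(groups, UA))  # ['/no-wget/']
print(disallowed_expected(groups, UA))   # ['/']
```

With this sample file, a client identifying itself as MyBot is banned from the whole site, yet the described behaviour would only keep it out of /no-wget/, which is a rule that was never meant for it.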
This bug can be seen in action with the test case I created. Apply the
attached patch and run the Test--UA.py test. The patch is made against
the new Python-based test suite, which exists in the parallel-wget
branch.
On Fri, Jun 20, 2014 at 2:47 AM, György Chityil
<address@hidden> wrote:
> If I specify a custom user agent for wget, eg "MyBot 1.0 (address@hidden)"
> Will wget check this in robots.txt as well, if the bot was banned, or only
> the general robot exclusions? Does wget check if "MyBot" is allowed to
> crawl?
> If not, this would be a nice feature. If yes, it would be great to include
> this info in the robots overview here https://www.gnu.org/software/wget
>
> I originally posted this question here , but then I found this list
> http://stackoverflow.com/questions/24316018/does-wget-check-if-specified-user-agent-is-allowed-in-robots-txt
>
> --
> Gyuri
> 274 44 98
> 06 30 5888 744
--
Thanking You,
Darshit Shah
0001-Test-case-showing-User-agent-bug.patch
Description: Text Data