Re: [Bug-wget] wget produces erroneous robots.txt


From: Darshit Shah
Subject: Re: [Bug-wget] wget produces erroneous robots.txt
Date: Wed, 18 Feb 2015 19:19:39 +0530

Hi Leoh,

What you're seeing is entirely possible. Some misconfigured servers
send an HTTP 200 response with a "404 Not Found" page as the body,
instead of an actual 404 status code.

If the website you were trying to mirror is configured this way, then
it's possible that when Wget requested robots.txt, the server
responded with a 200 status code, causing Wget to save the error page
you saw as robots.txt.
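
To check a saved file for this, here is a rough Python sketch of a
heuristic that guesses whether a body served as robots.txt is really an
HTML error page. This is not something Wget itself does; the function
name looks_like_robots_txt and the list of field names are my own
assumptions for illustration.

```python
# Field names commonly seen in robots.txt files.
# Illustrative list only (an assumption), not an exhaustive one.
KNOWN_FIELDS = ("user-agent", "disallow", "allow", "sitemap", "crawl-delay")


def looks_like_robots_txt(body: str) -> bool:
    """Heuristic: True if body resembles robots.txt rather than an HTML page."""
    head = body.lstrip().lower()
    # An HTML error page usually opens with a doctype or an <html> tag.
    if head.startswith("<!doctype") or head.startswith("<html"):
        return False
    for line in body.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        field, sep, _ = line.partition(":")
        # One recognizable "field: value" line is taken as enough evidence.
        if sep and field.strip().lower() in KNOWN_FIELDS:
            return True
    return False
```

Note that an empty or comment-only file is reported as False by this
sketch even though it can be a legitimate robots.txt, so treat the
result as a hint only.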

On Wed, Feb 18, 2015 at 7:10 PM, leoh Jones <address@hidden> wrote:
> Thanks for the reply.
> I am using debian8 (jessie) if that matters. Though I did have the same
> issue on a new version of ubuntu.
> I did not use the option --content-on-error  I just used "-m"
> I have no ~./wgetrc and no /etc/wget
> Hey, where is the official github repo?

Wget's development happens on the Savannah servers, not GitHub. You
can find the sources here:
http://git.savannah.gnu.org/cgit/wget.git

> I will try again on the mailing list. Here is the wget version on my debian
> machine
>
> $ wget --version
> GNU Wget 1.16 built on linux-gnu.
>
> +digest +https +ipv6 +iri +large-file +nls +ntlm +opie +psl +ssl/gnutls
>
> Wgetrc:
>     /etc/wgetrc (system)
> Locale:
>     /usr/share/locale
> Compile:
>     gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
>     -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib
>     -D_FORTIFY_SOURCE=2 -I/usr/include -g -O2 -fstack-protector-strong
>     -Wformat -Werror=format-security -DNO_SSLv2 -D_FILE_OFFSET_BITS=64
>     -g -Wall
> Link:
>     gcc -g -O2 -fstack-protector-strong -Wformat
>     -Werror=format-security -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall
>     -Wl,-z,relro -L/usr/lib -lnettle -lgnutls -lz -lpsl -lidn -luuid
>     ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
>
> Copyright (C) 2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://www.gnu.org/licenses/gpl.html>.
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
>
> Originally written by Hrvoje Niksic <address@hidden>.
> Please send bug reports and questions to <address@hidden>.
>
>
> On Wed, Feb 18, 2015 at 8:22 AM, Tim Ruehsen <address@hidden> wrote:
>>
>> On Wednesday 18 February 2015 07:45:53 leoh Jones wrote:
>> > Pardon me, if this email reaches you in error.
>> > email addresses taken from wget source.
>> > I was mirroring a webserver with wget -m <address>
>> > when it was done I went in to look at the files, and noticed that
>> > there is a robots.txt file. This was interesting, because the site
>> > mirrored doesn't have a robots.txt file.
>> > so then, I looked at the robots.txt file contents, which was that of
>> > the site 404 page.
>>
>> First of all, I can't reproduce it here with the latest version from git.
>>
>> Looks like the new feature --content-on-error is enabled. Did you use it ?
>> What do /etc/wgetrc and ~./wgetrc look like ? And very important: what
>> is the output of 'wget --version' ?
>>
>> > Is this a bug? I signed up for the mailing list, for wget bug
>> > reports but never heard back. Or is this expected behavior?
>>
>> When you sign up for the mailing list, you should get an email very
>> soon with further instructions. Just try it again.
>>
>> Tim
>
>



-- 
Thanking You,
Darshit Shah
