Re: [Bug-wget] wget produces erroneous robots.txt
From: Darshit Shah
Subject: Re: [Bug-wget] wget produces erroneous robots.txt
Date: Wed, 18 Feb 2015 19:19:39 +0530
Hi Leoh,
What you're seeing is entirely possible. Some misconfigured servers
send an HTTP 200 response along with their 404 Not Found page.
If the website you were trying to mirror is configured that way,
then it's possible that when Wget requested robots.txt, the server
responded with a 200 status code, causing Wget to save the error
page you saw as robots.txt.
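
As a rough way to check a mirrored robots.txt for this, here is a
minimal Python sketch. It is not part of Wget; the function name
looks_like_robots_txt and the KNOWN_FIELDS list are assumptions made
for illustration. It only guesses whether a saved body reads like
robots.txt directives or like an HTML error page.

```python
# Hypothetical helper, not part of Wget: guess whether a saved "robots.txt"
# is a real robots file or an HTML error page served with a 200 status.

# Directive names commonly seen in robots.txt files (assumed list).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def looks_like_robots_txt(body: str) -> bool:
    """Return True if at least one line is a recognised 'field: value' directive.

    An empty or comments-only file is valid robots.txt, but this heuristic
    reports it as unrecognised (False), so inspect such files by hand.
    """
    start = body.lstrip().lower()
    if start.startswith("<!doctype") or start.startswith("<html"):
        return False  # an HTML document, most likely an error page
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, _value = line.partition(":")
        if sep and field.strip().lower() in KNOWN_FIELDS:
            return True
    return False
```

For example, calling it on the contents of the robots.txt in your
mirror should return False if the file is really the site's 404 page.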
On Wed, Feb 18, 2015 at 7:10 PM, leoh Jones <address@hidden> wrote:
> Thanks for the reply.
> I am using debian8 (jessie) if that matters. Though I did have the same
> issue on a new version of ubuntu.
> I did not use the option --content-on-error I just used "-m"
> I have no ~/.wgetrc and no /etc/wget
> Hey, where is the official github repo?
Wget's development happens on the Savannah servers, not GitHub. You
can find the sources here:
http://git.savannah.gnu.org/cgit/wget.git
> I will try again on the mailing list. Here is the wget version on my debian
> machine
>
> $ wget --version
> GNU Wget 1.16 built on linux-gnu.
>
> +digest +https +ipv6 +iri +large-file +nls +ntlm +opie +psl +ssl/gnutls
>
> Wgetrc:
> /etc/wgetrc (system)
> Locale:
> /usr/share/locale
> Compile:
> gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
> -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib
> -D_FORTIFY_SOURCE=2 -I/usr/include -g -O2 -fstack-protector-strong
> -Wformat -Werror=format-security -DNO_SSLv2 -D_FILE_OFFSET_BITS=64
> -g -Wall
> Link:
> gcc -g -O2 -fstack-protector-strong -Wformat
> -Werror=format-security -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall
> -Wl,-z,relro -L/usr/lib -lnettle -lgnutls -lz -lpsl -lidn -luuid
> ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
>
> Copyright (C) 2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://www.gnu.org/licenses/gpl.html>.
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
>
> Originally written by Hrvoje Niksic <address@hidden>.
> Please send bug reports and questions to <address@hidden>.
>
>
> On Wed, Feb 18, 2015 at 8:22 AM, Tim Ruehsen <address@hidden> wrote:
>>
>> On Wednesday 18 February 2015 07:45:53 leoh Jones wrote:
>> > Pardon me, if this email reaches you in error.
>> > email addresses taken from wget source.
>> > I was mirroring a webserver with wget -m <address>
>> > when it was done I went in to look at the files, and noticed that there
>> > is a robots.txt file. This was interesting, because the site mirrored
>> > doesn't have a robots.txt file.
>> > so then, I looked at the robots.txt file contents, which was that of the
>> > site 404 page.
>>
>> First of all, I can't reproduce it here with the latest version from git.
>>
>> Looks like the new feature --content-on-error is enabled. Did you use it ?
>> What do /etc/wgetrc and ~/.wgetrc look like ? And very important: what is
>> the output of 'wget --version' ?
>>
>> > Is this a bug? I signed up for the mailing list, for wget bug reports
>> > but never heard back. Or is this expected behavior?
>>
>> When you sign up for the mailing list, you should get an email very soon
>> with further instructions. Just try it again.
>>
>> Tim
>
>
--
Thanking You,
Darshit Shah