From: Tim Ruehsen
Subject: Re: [Bug-wget] Bug-wget Digest, Vol 99, Issue 10: regarding wget not converting links correctly
Date: Tue, 31 Jan 2017 16:16:10 +0100
User-agent: KMail/5.2.3 (Linux/4.9.0-1-amd64; KDE/5.28.0; x86_64; ; )

On Tuesday, January 31, 2017 2:28:46 AM CET Kun Zhou wrote:
> I am replying to this mailing list regarding the second issue: wget not
> converting links correctly. I installed the alpha release of wget,
> 1.18.109-4734, on Arch Linux. When I run
> `wget -H -r -k -l 1 econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/TheoryF16.html`,
> an excerpt of the output from wget to the terminal is
> 
> 
> 
> --2017-01-30 21:19:03--
> http://econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf
> 
> Reusing existing connection to www.econ.ucsb.edu:80.
> HTTP request sent, awaiting response... 200 OK
> Cookie coming from econ.ucsb.edu attempted to set domain to
> faculty.econ.ucsb.edu

Just a side note: the server is not configured correctly. One site tries to 
set a cookie for a different site.

> Length: 479295 (468K) [application/pdf]
> www.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB: Not a
> directorywww.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf:
> Not a directory
> 
> Cannot write to
> ‘www.econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/Bernoulli.pdf’ (Not a
> directory).

Your page references 'www.econ.ucsb.edu/~tedb' as a link. It is downloaded 
(HTML content) and saved under the file name 'www.econ.ucsb.edu/~tedb'. Since 
that name is now a regular file and not a directory, any further attempt to 
create files as 'www.econ.ucsb.edu/~tedb/*' will show this error.

You can circumvent this in some cases using the -E option. This will save the 
file as 'www.econ.ucsb.edu/~tedb.html', which doesn't block further downloads.
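
The name clash can be reproduced locally without wget or any network access. The following is only a rough sketch: the host name www.example.org, the paths, and the helper save_page are made up for illustration, and save_page merely imitates "create the parent directories, then write the file".

```shell
#!/bin/sh
# Local illustration only: no network access, hypothetical host and path names.

# save_page ROOT RELPATH: store a dummy page at ROOT/RELPATH, creating the
# parent directories first. Hypothetical helper, not part of wget.
save_page() {
    mkdir -p "$(dirname "$1/$2")" 2>/dev/null || return 1
    printf '<html></html>\n' > "$1/$2" 2>/dev/null || return 1
}

demo_root=$(mktemp -d)

# Without -E style naming: the page is stored as a regular file named '~user',
# so nothing can be stored below '~user/' afterwards.
save_page "$demo_root/plain" 'www.example.org/~user'
if save_page "$demo_root/plain" 'www.example.org/~user/docs/paper.pdf'; then
    plain_result=saved
else
    plain_result=blocked
fi

# With -E style naming: the page gets an '.html' suffix, which leaves the
# name '~user' free to become a directory.
save_page "$demo_root/adjusted" 'www.example.org/~user.html'
if save_page "$demo_root/adjusted" 'www.example.org/~user/docs/paper.pdf'; then
    adjusted_result=saved
else
    adjusted_result=blocked
fi

echo "without -E naming: $plain_result"
echo "with -E naming:    $adjusted_result"

rm -rf "$demo_root"
```

On a typical Linux file system the first case should report "blocked" and the second "saved", which matches the "Not a directory" error above and the effect of -E.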

> I can confirm that `Bernoulli.pdf` is still not downloaded and a number of
> links not converted.

Works with -E.

> Another relevant issue is that the host names `econ.ucsb.edu` and
> `www.econ.ucsb.edu` resolve to the same IP address, verified by the `dig`
> command on Linux. However, wget fails to detect this fact and lists the
> two host names separately. Maybe this is a bug, or maybe just a feature. I
> have attached the complete wget output as a text file in case it is useful.

Wget (like any other web client I know of) will not make assumptions about 
site relationships by using dig or DNS. Such assumptions would often be wrong 
and turn out to be a huge security issue. There are no rules in the DNS about 
trust relationships between two sites, and two sites sharing the same IP 
address (as unrelated sites on shared hosting do) doesn't tell you anything.

This gave me good results:
wget -d -olog -H -E -r -k -l 1 -D 'www.econ.ucsb.edu,econ.ucsb.edu' http://econ.ucsb.edu/~tedb/Courses/GraduateTheoryUCSB/TheoryF16.html

The -D option restricts the host spanning enabled by -H to the sites/domains 
given.
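
To illustrate the idea of such a domain list, here is a simplified model of the check. It is an assumption-laden sketch, not wget's actual code: host_allowed is a hypothetical helper that accepts a host if it equals, or is a subdomain of, one of the comma-separated domains. Wget's real matching may differ in details.

```shell
#!/bin/sh
# Simplified model of a -D style domain list (hypothetical, not wget's code).

# host_allowed HOST DOMAIN_LIST: succeed if HOST equals, or is a subdomain of,
# one of the comma-separated domains in DOMAIN_LIST.
host_allowed() {
    host=$1
    old_ifs=$IFS
    IFS=,
    for domain in $2; do
        case $host in
            "$domain" | *."$domain")
                IFS=$old_ifs
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

domains='www.econ.ucsb.edu,econ.ucsb.edu'
for h in econ.ucsb.edu www.econ.ucsb.edu www.example.org; do
    if host_allowed "$h" "$domains"; then
        echo "$h: followed"
    else
        echo "$h: skipped"
    fi
done
```

In this model both econ.ucsb.edu and www.econ.ucsb.edu are followed because they are listed explicitly, while an unrelated host such as www.example.org is skipped even though -H is in effect.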

Tim

Attachment: signature.asc
Description: This is a digitally signed message part.

