Re: [Bug-wget] Behaviour of spanning to accepted domains


From: Tim Ruehsen
Subject: Re: [Bug-wget] Behaviour of spanning to accepted domains
Date: Wed, 03 Jun 2015 09:18:41 +0200
User-agent: KMail/4.14.2 (Linux/4.0.0-1-amd64; KDE/4.14.2; x86_64; ; )

Hi Evan,

wget -rH -D $(cat trash/text.txt) williamstallings.com

is not what you want. Leave out the -H; otherwise host-spanning is ON and -D
will be ignored.
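
That is, something like

  wget -r -D $(cat trash/text.txt) williamstallings.com

(your command from below, just without -H).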

> I bring this up for one of my two questions. Can someone recommend a better
> method of performance testing?

What you want to know is how many CPU cycles wget needs to perform a
defined task (if you compare, make sure exactly the same files are
downloaded). Measuring the real (wall-clock) time used depends on many
time-variant side effects, so two runs of wget are hardly comparable.

Use valgrind --tool=callgrind wget ...
You can then use kcachegrind to display/analyse how many CPU cycles each
part of wget took.
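
For example (wget path and arguments just taken from your mail; callgrind
writes its profile to callgrind.out.<pid> by default):

  valgrind --tool=callgrind ./src/wget -r -D $(cat trash/text.txt) williamstallings.com
  kcachegrind callgrind.out.<pid>
  # or, without a GUI:
  callgrind_annotate callgrind.out.<pid>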

Regards, Tim

On Tuesday 02 June 2015 17:21:48 ekrell wrote:
> Greetings,
> 
> I recently used wget in such a way that the result disagreed with my
> understanding of what should have happened. This came about during a
> small programming exercise I am currently working on; I am attempting to
> see if a large number of domains (from the '-D' option) would be processed
> more quickly by using the hashtable included in hash.c. While comparing
> the speed of my hashed implementation of host checking against an
> unmodified version of wget, the standard wget did not seem to respect my
> list of accepted domains.
> 
> For the hash table version, I did the following:
> In recur.c, I init a hashtable with all of the accepted domains from
> opt.domain. Ignoring (for the moment) the increased memory usage, I assumed
> that this would surely be faster than the current method of checking the
> URL's host.
> However, when performing the check inside host.c's accept_domain
> function, I realized that I would need to parse u->host to get just the
> domain component. That involves some overhead that may make hashing not
> worth it. Also, throughout this exercise I have been assuming that if it
> provided any significant improvement, it would most likely have been done
> long before I decided to try it out. Nonetheless, I've enjoyed playing
> around with it.
> 
> My first couple of tests were against my own website, using a list of over
> 5000 domains. Both wget and wget-modified downloaded the same files, and
> at roughly the same speed. My website is so small that I wanted
> something larger, but not so large that it would take more than a few
> minutes. I know that going around and mirroring random sites is perhaps not
> recommended behaviour (without a delay), but it worked.
> 
> I bring this up for one of my two questions. Can someone recommend a better
> method of performance testing?
> 
> Having found my target website, I went ahead and ran the two wget
> versions, one after the other. When mine came out almost twice as
> fast, I knew something was amiss. Sure enough, the standard wget had
> downloaded much more content... and spanned to many more domains.
> 
> This is the command I ran for each:
> 
> <pathToVersion>/src/wget -rH -D $(cat trash/text.txt)
> williamstallings.com
> 
> Excusing the useless use of cat, text.txt contains the massive
> comma-separated list of domains.
> Each of those domains is a randomly generated numeric value, except for
> the final one: williamstallings.com
> 
> Previously, whenever I ran this test against a (smaller) website, both
> versions of wget would only recursively download from the single "real"
> domain in the list. However, this time (and I did it twice to make sure)
> the original wget went on to download from over 20 other domains.
> 
> I would appreciate it if someone could explain what is going on here.
> Seeing as this behaviour exists with the version I obtained from
> git://git.savannah.gnu.org/wget.git as well as with wget from the package
> manager, I am not proclaiming "found a bug!". I imagine that I just
> misunderstand what should have taken place, since I expected to end up
> with only the single directory from williamstallings.com.
> 
> Thanks,
> Evan Krell
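
P.S. Regarding the hashing experiment itself: just as a rough, untested
sketch (assuming hash.c's make_nocase_string_hash_table / hash_table_put /
hash_table_contains and the NULL-terminated opt.domains vector from current
git sources), the lookup you describe could look roughly like this in
recur.c, which already includes wget.h and hash.h:

  static struct hash_table *accept_set;

  /* Build the lookup table once, e.g. near the start of retrieve_tree().  */
  static void
  init_accept_set (void)
  {
    int i;

    accept_set = make_nocase_string_hash_table (0);
    for (i = 0; opt.domains && opt.domains[i]; i++)
      hash_table_put (accept_set, opt.domains[i], opt.domains[i]);
  }

  /* O(1) membership test; the caller still has to reduce u->host to the
     domain component first, which is exactly the overhead you mention.  */
  static bool
  domain_in_accept_set (const char *domain)
  {
    return accept_set && hash_table_contains (accept_set, domain);
  }

Whether that beats the existing suffix match probably depends more on the
cost of that host parsing than on the lookup itself.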


