[Bug-wget] Behaviour of spanning to accepted domains


From: ekrell
Subject: [Bug-wget] Behaviour of spanning to accepted domains
Date: Tue, 02 Jun 2015 17:21:48 -0500
User-agent: Roundcube Webmail/1.1.1

Greetings,

I recently used wget in such a way that the result disagreed with my understanding of what should have happened. This came about during a small programming exercise I am currently working on: I am attempting to see whether a large number of domains (from the '-D' option) could be processed more quickly using the hash table included in hash.c. While comparing the speed of my hashed implementation of host checking against an unmodified version of wget, the standard wget did not seem to respect my list of accepted domains.

For the hash table version, I did the following:
In recur.c, I initialize a hash table with all of the accepted domains from opt.domains. Ignoring (for the moment) the increased memory usage, I assumed that this would surely be faster than the current method of checking the URL's host. However, when performing the check inside host.c's accept_domain function, I realized that I would need to parse u->host to get just the domain component (see the sketch below). That involves some overhead that may make hashing not worth it. I am also assuming that if this offered any significant improvement, it would most likely have been done before I decided to try it out. Nonetheless, I've enjoyed playing around with it.
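
To get around that parsing problem, the approach I have been toying with looks like the self-contained sketch below. It is an illustration, not the actual wget code: the names (domain_set, host_in_domain_set) are invented for the example, and the real version would use the hash_table functions from hash.c instead of this toy set. Rather than deciding where the "domain component" of u->host begins, it simply probes every dot-delimited suffix of the host, so the cost is one hash lookup per label instead of one string comparison per accepted domain:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NBUCKETS 8192

  /* Toy chained hash set of strings; stands in for wget's hash.c.  */
  struct entry { const char *key; struct entry *next; };
  struct domain_set { struct entry *buckets[NBUCKETS]; };

  static unsigned long
  hash_str (const char *s)
  {
    unsigned long h = 5381;              /* djb2 */
    while (*s)
      h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
  }

  static void
  domain_set_add (struct domain_set *set, const char *domain)
  {
    unsigned long b = hash_str (domain);
    struct entry *e = malloc (sizeof *e);
    e->key = domain;
    e->next = set->buckets[b];
    set->buckets[b] = e;
  }

  static bool
  domain_set_contains (const struct domain_set *set, const char *domain)
  {
    const struct entry *e;
    for (e = set->buckets[hash_str (domain)]; e; e = e->next)
      if (!strcmp (e->key, domain))
        return true;
    return false;
  }

  /* Accept a host if any dot-delimited suffix of it is an accepted
     domain; one probe per label instead of one comparison per domain.  */
  static bool
  host_in_domain_set (const struct domain_set *set, const char *host)
  {
    const char *p = host;
    while (p)
      {
        if (domain_set_contains (set, p))
          return true;
        p = strchr (p, '.');
        if (p)
          p++;                           /* skip past the dot */
      }
    return false;
  }

  int
  main (void)
  {
    static struct domain_set set;
    domain_set_add (&set, "williamstallings.com");

    printf ("%d\n", host_in_domain_set (&set, "www.williamstallings.com")); /* 1 */
    printf ("%d\n", host_in_domain_set (&set, "example.org"));              /* 0 */
    return 0;
  }

With L labels in a host and N accepted domains, that is O(L) probes rather than O(N) suffix comparisons, which is where I hoped the win from a 5000-entry list would come. Note that this sketch only matches at label boundaries; whether accept_domain's suffix comparison insists on a dot boundary is exactly the kind of detail I may be misreading.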

My first couple of tests were against my own website, using a list of over 5000 domains. Both wget and wget-modified downloaded the same files, at roughly the same speed. My website is so small that I wanted something larger, but not so large that it would take more than a few minutes. I know that going around mirroring random sites is perhaps not recommended behaviour (without a delay), but it worked.

I bring this up for one of my two questions: can someone recommend a better method of performance testing?
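
To be concrete about what I am asking: ideally something that isolates the host check from the network, since a recursive crawl is presumably dominated by I/O rather than by accept_domain. The best I have come up with is a throwaway harness along these lines (hypothetical names throughout; check_host would be replaced by whichever implementation is being measured):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  /* Stand-in for the check under test (suffix scan vs. hash lookup);
     replace with the real implementation being compared.  */
  static bool
  check_host (const char *host)
  {
    return strcmp (host, "williamstallings.com") == 0;
  }

  int
  main (void)
  {
    static const char *hosts[] = {
      "www.williamstallings.com", "example.org", "123456.com",
    };
    const long reps = 10 * 1000 * 1000;
    long hits = 0;

    clock_t start = clock ();
    for (long i = 0; i < reps; i++)
      hits += check_host (hosts[i % 3]);
    clock_t end = clock ();

    printf ("%ld hits, %.2f s CPU\n", hits,
            (double) (end - start) / CLOCKS_PER_SEC);
    return 0;
  }

But measuring the check in isolation obviously says nothing about how much it matters during a real crawl, hence the question.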

Having found my target website, I went ahead and ran the two wget versions, one after the other. When mine came out almost twice as fast, I knew something was amiss. Sure enough, the standard wget had downloaded much more content... and spanned to many more domains.

This is the command I ran for each:

<pathToVersion>/src/wget -rH -D $(cat trash/text.txt) williamstallings.com

Excusing the useless use of cat: text.txt contains the massive comma-separated list of domains. Each of those domains is a randomly generated numeric value, except for the final one: williamstallings.com.
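
For the curious, the list was produced by something roughly like this throwaway generator (a sketch rather than my exact script; the .com suffix in particular is an invented detail):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Emit 5000 random numeric domains plus the one real domain,
     comma-separated, in the form expected by wget's -D option.  */
  int
  main (void)
  {
    srand ((unsigned) time (NULL));
    for (int i = 0; i < 5000; i++)
      printf ("%d.com,", rand ());
    printf ("williamstallings.com\n");
    return 0;
  }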

Previously, whenever I ran this test against a (smaller) website, both versions of wget would recursively download only from the single "real" domain in the list. This time, however (and I ran it twice to make sure), the original wget went on to download from over 20 other domains.

I would appreciate it if someone could explain what is going on here. Seeing as this behaviour exists in both the version I obtained from git://git.savannah.gnu.org/wget.git and the wget from my package manager, I am not proclaiming "found a bug!". I imagine that I simply misunderstand what should have taken place, since I expected to end up with only the single directory from williamstallings.com.

Thanks,
Evan Krell


