[Bug-wget] Behaviour of spanning to accepted domains


From: ekrell
Subject: [Bug-wget] Behaviour of spanning to accepted domains
Date: Tue, 02 Jun 2015 17:21:48 -0500
User-agent: Roundcube Webmail/1.1.1

Greetings,

I recently used wget in such a way that the result disagreed with my understanding of what should have happened. This came about during a small programming exercise I am currently working on: I am attempting to see whether a large number of domains (from the '-D' option) could be processed more quickly using the hash table included in hash.c. While comparing the speed of my hashed implementation of host checking against an unmodified version of wget, the standard wget did not seem to respect my list of accepted domains.

For the hash table version, I did the following:
In recur.c, I initialize a hash table with all of the accepted domains from opt.domains. Ignoring (for the moment) the increased memory usage, I assumed that this would surely be faster than the current method of checking the URL's host. However, when performing the check inside host.c's accept_domain function, I realized that I would need to parse u->host to get just the domain component (see the sketch below). That involves some overhead that may make hashing not worth it. I am also assuming that if this offered any significant improvement, it would most likely have been done before I decided to try it out. Nonetheless, I've enjoyed playing around with it.
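
To get around that parsing problem, the approach I have been toying with looks like the self-contained sketch below. It is an illustration, not the actual wget code: the names (domain_set, host_in_domain_set) are invented for the example, and the real version would use the hash_table functions from hash.c instead of this toy set. Rather than deciding where the "domain component" of u->host begins, it simply probes every dot-delimited suffix of the host, so the cost is one hash lookup per label instead of one string comparison per accepted domain:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NBUCKETS 8192

  /* Toy chained hash set of strings; stands in for wget's hash.c.  */
  struct entry { const char *key; struct entry *next; };
  struct domain_set { struct entry *buckets[NBUCKETS]; };

  static unsigned long
  hash_str (const char *s)
  {
    unsigned long h = 5381;              /* djb2 */
    while (*s)
      h = h * 33 + (unsigned char) *s++;
    return h % NBUCKETS;
  }

  static void
  domain_set_add (struct domain_set *set, const char *domain)
  {
    unsigned long b = hash_str (domain);
    struct entry *e = malloc (sizeof *e);
    e->key = domain;
    e->next = set->buckets[b];
    set->buckets[b] = e;
  }

  static bool
  domain_set_contains (const struct domain_set *set, const char *domain)
  {
    const struct entry *e;
    for (e = set->buckets[hash_str (domain)]; e; e = e->next)
      if (!strcmp (e->key, domain))
        return true;
    return false;
  }

  /* Accept a host if any dot-delimited suffix of it is an accepted
     domain; one probe per label instead of one comparison per domain.  */
  static bool
  host_in_domain_set (const struct domain_set *set, const char *host)
  {
    const char *p = host;
    while (p)
      {
        if (domain_set_contains (set, p))
          return true;
        p = strchr (p, '.');
        if (p)
          p++;                           /* skip past the dot */
      }
    return false;
  }

  int
  main (void)
  {
    static struct domain_set set;
    domain_set_add (&set, "williamstallings.com");

    printf ("%d\n", host_in_domain_set (&set, "www.williamstallings.com")); /* 1 */
    printf ("%d\n", host_in_domain_set (&set, "example.org"));              /* 0 */
    return 0;
  }

With L labels in a host and N accepted domains, that is O(L) probes rather than O(N) suffix comparisons, which is where I hoped the win from a 5000-entry list would come. Note that this sketch only matches at label boundaries; whether accept_domain's suffix comparison insists on a dot boundary is exactly the kind of detail I may be misreading.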

My first couple of tests were against my own website, using a list of over 5000 domains. Both wget and wget-modified downloaded the same files, at roughly the same speed. My website is so small that I wanted something larger, but not so large that it would take more than a few minutes. I know that going around mirroring random sites is perhaps not recommended behaviour (without a delay), but it worked.

I bring this up for one of my two questions: can someone recommend a better method of performance testing?
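
To be concrete about what I am asking: ideally something that isolates the host check from the network, since a recursive crawl is presumably dominated by I/O rather than by accept_domain. The best I have come up with is a throwaway harness along these lines (hypothetical names throughout; check_host would be replaced by whichever implementation is being measured):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  /* Stand-in for the check under test (suffix scan vs. hash lookup);
     replace with the real implementation being compared.  */
  static bool
  check_host (const char *host)
  {
    return strcmp (host, "williamstallings.com") == 0;
  }

  int
  main (void)
  {
    static const char *hosts[] = {
      "www.williamstallings.com", "example.org", "123456.com",
    };
    const long reps = 10 * 1000 * 1000;
    long hits = 0;

    clock_t start = clock ();
    for (long i = 0; i < reps; i++)
      hits += check_host (hosts[i % 3]);
    clock_t end = clock ();

    printf ("%ld hits, %.2f s CPU\n", hits,
            (double) (end - start) / CLOCKS_PER_SEC);
    return 0;
  }

But measuring the check in isolation obviously says nothing about how much it matters during a real crawl, hence the question.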

Having found my target website, I went ahead and ran the two wget versions, one after the other. When mine came out almost twice as fast, I knew something was amiss. Sure enough, the standard wget had downloaded much more content... and spanned to many more domains.

This is the command I ran for each:

<pathToVersion>/src/wget -rH -D $(cat trash/text.txt) williamstallings.com

Excusing the useless use of cat: text.txt contains the massive comma-separated list of domains. Each of those domains is a randomly generated numeric value, except for the final one: williamstallings.com.
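
For the curious, the list was produced by something roughly like this throwaway generator (a sketch rather than my exact script; the .com suffix in particular is an invented detail):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Emit 5000 random numeric domains plus the one real domain,
     comma-separated, in the form expected by wget's -D option.  */
  int
  main (void)
  {
    srand ((unsigned) time (NULL));
    for (int i = 0; i < 5000; i++)
      printf ("%d.com,", rand ());
    printf ("williamstallings.com\n");
    return 0;
  }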

Previously, whenever I ran this test against a (smaller) website, both versions of wget would recursively download only from the single "real" domain in the list. This time, however (and I ran it twice to make sure), the original wget went on to download from over 20 other domains.

I would appreciate it if someone could explain what is going on here. Seeing as this behaviour exists in both the version I obtained from git://git.savannah.gnu.org/wget.git and the wget from my package manager, I am not proclaiming "found a bug!". I imagine that I simply misunderstand what should have taken place, since I expected to end up with only the single directory from williamstallings.com.

Thanks,
Evan Krell


