[Bug-wget] [bug #20398] Save a list of the links that were not followed


From: Jookia
Subject: [Bug-wget] [bug #20398] Save a list of the links that were not followed
Date: Thu, 07 May 2015 15:58:53 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0

Follow-up Comment #5, bug #20398 (project wget):

I've found myself in need of this feature. I'm trying to download a website
recursively without pulling in every single ad and its HTML. I'd like to be
able to find out which URLs were rejected, why they were rejected, and details
about their domains (host, port, etc.).

I've patched my copy of Wget to dump all of this into a CSV file, which I can
then process with standard tools to get the results I want:



% grep "DOMAIN" rejected.csv | head -1
DOMAIN,http://c0059637.cdn1.cloudfiles.rackspacecloud.com/flowplayer-3.2.6.min.js,SCHEME_HTTP,c0059637.cdn1.cloudfiles.rackspacecloud.com,80,flowplayer-3.2.6.min.js,(null),(null),(null),http://redated/,SCHEME_HTTP,redacted,80,,(null),(null),(null)
% grep "DOMAIN" rejected.csv | cut -d"," -f4 | sort | uniq   
0.gravatar.com
1.gravatar.com
c0059637.cdn1.cloudfiles.rackspacecloud.com
lh3.googleusercontent.com
lh4.googleusercontent.com
lh5.googleusercontent.com
lh6.googleusercontent.com
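
The same extraction could also be sketched in Python, which makes it easy to
count how often each host was rejected. This is only an illustration and not
part of the patch: the column layout (rejection reason in field 1, host of the
rejected URL in field 4) is assumed from the sample line above, and the
function name rejected_hosts is hypothetical. Unlike the grep above, it matches
the reason against the first field only, not anywhere in the line.

```python
import csv
from collections import Counter


def rejected_hosts(lines, reason="DOMAIN"):
    """Count rejected hosts for one rejection reason.

    Assumed layout (taken from the sample line, not from the patch):
    column 1 is the rejection reason, column 4 is the host of the
    rejected URL. `lines` is any iterable of CSV text lines, such as
    an open file.
    """
    counts = Counter()
    for row in csv.reader(lines):
        # Skip blank or short rows, and rows with a different reason.
        if len(row) >= 4 and row[0] == reason:
            counts[row[3]] += 1
    return counts
```

Passing an open file, for example open("rejected.csv", newline=""), and
printing the sorted keys of the result should give the same host list as the
cut/sort/uniq pipeline, with a count per host as a bonus.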


I've attached a patch, written in a few hours, that does this.

(file #33955)
    _______________________________________________________

Additional Item Attachment:

File name: 0001-rejected-log-Add-option-to-dump-URL-rejections-to-a-.patch
Size: 14 KB


    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?20398>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



