bug-wget

Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls hel


From: Ben Smith
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help]
Date: Thu, 13 Nov 2008 20:36:28 -0800 (PST)

The -l1 option is not two ones.  It is a lowercase L and a one.  Is that why it is not working?

Without that option, wget will not be recursive, so it will not follow the link to the cached file.
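
One way to avoid the lowercase-L versus digit-one confusion is to spell the options out in their long forms. The sketch below only prints the command so it can be checked before running; it assumes the filelist.txt file described further down in this thread and leaves out the --exclude-domains list for brevity.

```shell
#!/bin/sh
# Long-option spelling of:
#   wget -r -l1 -UFirefox -H -erobots=off --wait 1 --input-file=filelist.txt
# -r  = --recursive, -l1 = --level=1 (lowercase L, digit one),
# -U  = --user-agent, -H = --span-hosts, -e = --execute.
cmd='wget --recursive --level=1 --user-agent=Firefox --span-hosts --execute robots=off --wait=1 --input-file=filelist.txt'

# Print the command for review instead of running it.
echo "$cmd"
```

Copy the printed line into the shell once it looks right, adding the --exclude-domains list if needed.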


From: Yan Grossman <address@hidden>
To: Ben Smith <address@hidden>
Sent: Thursday, November 13, 2008 2:54:53 PM
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help]

Sorry for all the emails; I am just trying to update you on what happened.
I let it run for a while and noticed it was retrieving mostly pages from Google, such as subfolders on google.com, and also copying image files and so on.
I didn't really see it going for my website's cache files on Google.
Also, I only really need the HTML pages, nothing more.
Also, when I stopped the process and looked at my server, I didn't see any new folder where those files would have been copied to. Would it create a new folder by itself?

I really really appreciate your effort to help me. Thanks.
Below I am pasting some of the status output from the process so you can see what happened.

Saving to: `www.google.com/accounts/ServiceLogin?hl=en&continue=http:%2F%2Fwww.google.com%2Fhistory%2F?hl=en&nui=1&service=hist'

100%[==========================================================================>] 10,776      --.-K/s   in 0.04s

14:50:53 (261 KB/s) - `www.google.com/accounts/ServiceLogin?hl=en&continue=http:%2F%2Fwww.google.com%2Fhistory%2F?hl=en&nui=1&service=hist' saved [10776/10776]

--14:50:54--  https://www.google.com/accounts/ig.gif
Reusing existing connection to www.google.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 1647 (1.6K) [image/gif]
Saving to: `www.google.com/accounts/ig.gif'

100%[==========================================================================>] 1,647       --.-K/s   in 0s

14:50:54 (19.4 MB/s) - `www.google.com/accounts/ig.gif' saved [1647/1647]

--14:50:55--  http://www.google.com/ig?source=gapg&hl=en
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/ig?source=gapg&hl=en'

    [ <=>                                                                       ] 25,578      --.-K/s   in 0.04s

14:50:55 (606 KB/s) - `www.google.com/ig?source=gapg&hl=en' saved [25578]

--14:50:56--  https://www.google.com/accounts/sierra.gif
Reusing existing connection to www.google.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 1525 (1.5K) [image/gif]
Saving to: `www.google.com/accounts/sierra.gif'

100%[==========================================================================>] 1,525       --.-K/s   in 0s

14:50:56 (19.1 MB/s) - `www.google.com/accounts/sierra.gif' saved [1525/1525]

--14:50:57--  https://checkout.google.com/?utm_campaign=gaia_em&utm_source=us-en-et-my_accounts&utm_medium=link&hl=en
Connecting to checkout.google.com|74.125.67.115|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.google.com/accounts/ServiceLogin?service=sierra&continue=https%3A%2F%2Fcheckout.google.com%2F%3Futm_campaign%3Dgaia_em%26utm_source%3Dus-en-et-my_accounts%26utm_medium%3Dlink%26hl%3Den%26upgrade%3Dtrue&hl=en_US&nui=1&ltmpl=default [following]
--14:50:58--  https://www.google.com/accounts/ServiceLogin?service=sierra&continue=https%3A%2F%2Fcheckout.google.com%2F%3Futm_campaign%3Dgaia_em%26utm_source%3Dus-en-et-my_accounts%26utm_medium%3Dlink%26hl%3Den%26upgrade%3Dtrue&hl=en_US&nui=1&ltmpl=default
Reusing existing connection to www.google.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 11720 (11K) [text/html]
Saving to: `www.google.com/accounts/ServiceLogin?service=sierra&continue=https:%2F%2Fcheckout.google.com%2F?utm_campaign=gaia_em&utm_source=us-en-et-my_accounts&utm_medium=link&hl=en&upgrade=true&hl=en_US&nui=1&ltmpl=default'

100%[==========================================================================>] 11,720      --.-K/s   in 0.02s

14:50:58 (565 KB/s) - `www.google.com/accounts/ServiceLogin?service=sierra&continue=https:%2F%2Fcheckout.google.com%2F?utm_campaign=gaia_em&utm_source=us-en-et-my_accounts&utm_medium=link&hl=en&upgrade=true&hl=en_US&nui=1&ltmpl=default' saved [11720/11720]

--14:50:59--  http://www.google.com/support/accounts/bin/answer.py?answer=48598&hl=en&fpUrl=https%3A%2F%2Fwww.google.com%2Faccounts%2FForgotPasswd%3FfpOnly%3D1%26continue%3Dhttp%253A%252F%252Fwww.google.com%252Fsearch%253Fq%253Dsite%25253Awww.snowbrasil.com%25252Ffotos%2526ie%253Dutf-8%2526oe%253Dutf-8%2526aq%253Dt%2526rls%253Dorg.mozilla%253Aen-US%253Aofficial%2526client%253Dfirefox-a%26hl%3Den
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
www.google.com/support/accounts/bin/answer.py?answer=48598&hl=en&fpUrl=https:%2F%2Fwww.google.com%2Faccounts%2FForgotPasswd?fpOnly=1&continue=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dsite%253Awww.snowbrasil.com%252Ffotos%26ie%3Dutf-8%26oe%3Dutf-8%26aq%3Dt%26rls%3Dorg.mozilla%3Aen-US%3Aofficial%26client%3Dfirefox-a&hl=en: File name too long

Cannot write to `www.google.com/support/accounts/bin/answer.py?answer=48598&hl=en&fpUrl=https:%2F%2Fwww.google.com%2Faccounts%2FForgotPasswd?fpOnly=1&continue=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dsite%253Awww.snowbrasil.com%252Ffotos%26ie%3Dutf-8%26oe%3Dutf-8%26aq%3Dt%26rls%3Dorg.mozilla%3Aen-US%3Aofficial%26client%3Dfirefox-a&hl=en' (File name too long).
--14:51:01--  https://www.google.com/accounts/TOS?hl=en
Reusing existing connection to www.google.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/accounts/TOS?hl=en'

    [ <=>                                                                       ] 45,225      --.-K/s   in 0.02s

14:51:01 (1.87 MB/s) - `www.google.com/accounts/TOS?hl=en' saved [45225]

--14:51:02--  http://www.google.com/support/accounts?hl=en
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: /support/accounts/?hl=en [following]
--14:51:03--  http://www.google.com/support/accounts/?hl=en
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 16183 (16K) [text/html]
Saving to: `www.google.com/support/accounts/index.html?hl=en'

100%[==========================================================================>] 16,183      --.-K/s   in 0.04s

14:51:03 (388 KB/s) - `www.google.com/support/accounts/index.html?hl=en' saved [16183/16183]

--14:51:04--  http://www.google.com/prdhp?hl=en&tab=wf
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/prdhp?hl=en&tab=wf'

    [ <=>                                                                       ] 11,591      --.-K/s   in 0.001s

14:51:04 (19.0 MB/s) - `www.google.com/prdhp?hl=en&tab=wf' saved [11591]

--14:51:05--  http://groups.google.com/grphp?hl=en&tab=wg
Connecting to groups.google.com|209.85.133.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `groups.google.com/grphp?hl=en&tab=wg'

    [ <=>                                                                       ] 28,659      --.-K/s   in 0.08s

14:51:05 (360 KB/s) - `groups.google.com/grphp?hl=en&tab=wg' saved [28659]

--14:51:06--  http://www.google.com/calendar/render?hl=en&tab=wc
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den%26tab%3Dwc&followup=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den%26tab%3Dwc&hl=en [following]
--14:51:08--  https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den%26tab%3Dwc&followup=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den%26tab%3Dwc&hl=en
Connecting to www.google.com|72.14.205.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14490 (14K) [text/html]
Saving to: `www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http:%2F%2Fwww.google.com%2Fcalendar%2Frender?hl=en&tab=wc&followup=http:%2F%2Fwww.google.com%2Fcalendar%2Frender?hl=en&tab=wc&hl=en'

100%[==========================================================================>] 14,490      --.-K/s   in 0.04s

14:51:08 (356 KB/s) - `www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http:%2F%2Fwww.google.com%2Fcalendar%2Frender?hl=en&tab=wc&followup=http:%2F%2Fwww.google.com%2Fcalendar%2Frender?hl=en&tab=wc&hl=en' saved [14490/14490]

--14:51:09--  http://www.google.com/reader/view/?hl=en&tab=wy
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.google.com/accounts/ServiceLogin?hl=en&nui=1&service=reader&continue=http%3A%2F%2Fwww.google.com%2Freader%2Fview%2F%3Fhl%3Den%26tab%3Dwy [following]
--14:51:10--  https://www.google.com/accounts/ServiceLogin?hl=en&nui=1&service=reader&continue=http%3A%2F%2Fwww.google.com%2Freader%2Fview%2F%3Fhl%3Den%26tab%3Dwy
Connecting to www.google.com|72.14.205.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11866 (12K) [text/html]
Saving to: `www.google.com/accounts/ServiceLogin?hl=en&nui=1&service=reader&continue=http:%2F%2Fwww.google.com%2Freader%2Fview%2F?hl=en&tab=wy'

100%[==========================================================================>] 11,866      --.-K/s   in 0.04s

14:51:10 (291 KB/s) - `www.google.com/accounts/ServiceLogin?hl=en&nui=1&service=reader&continue=http:%2F%2Fwww.google.com%2Freader%2Fview%2F?hl=en&tab=wy' saved [11866/11866]

--14:51:11--  http://www.google.com/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.google.com/ig?hl=en&source=iglk [following]
--14:51:12--  http://www.google.com/ig?hl=en&source=iglk
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/ig?hl=en&source=iglk'

    [ <=>                                                                       ] 25,578      --.-K/s   in 0.04s

14:51:12 (614 KB/s) - `www.google.com/ig?hl=en&source=iglk' saved [25578]

--14:51:13--  https://www.google.com/accounts/Login?continue=http://www.google.com/webhp%3Fhl%3Den&hl=en
Connecting to www.google.com|72.14.205.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10833 (11K) [text/html]
Saving to: `www.google.com/accounts/Login?continue=http:%2F%2Fwww.google.com%2Fwebhp?hl=en&hl=en'

100%[==========================================================================>] 10,833      --.-K/s   in 0.04s

14:51:13 (266 KB/s) - `www.google.com/accounts/Login?continue=http:%2F%2Fwww.google.com%2Fwebhp?hl=en&hl=en' saved [10833/10833]

--14:51:14--  http://www.google.com/intl/en_ALL/images/logo.gif
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8558 (8.4K) [image/gif]
Saving to: `www.google.com/intl/en_ALL/images/logo.gif'

100%[==========================================================================>] 8,558       --.-K/s   in 0.02s

14:51:14 (419 KB/s) - `www.google.com/intl/en_ALL/images/logo.gif' saved [8558/8558]

--14:51:15--  http://www.google.com/advanced_search?hl=en
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/advanced_search?hl=en'

    [ <=>                                                                       ] 36,833      --.-K/s   in 0.04s

14:51:15 (878 KB/s) - `www.google.com/advanced_search?hl=en' saved [36833]

--14:51:16--  http://www.google.com/preferences?hl=en
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/preferences?hl=en'

    [ <=>                                                                       ] 18,495      --.-K/s   in 0.04s

14:51:17 (449 KB/s) - `www.google.com/preferences?hl=en' saved [18495]

--14:51:18--  http://www.google.com/language_tools?hl=en
Connecting to www.google.com|72.14.205.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.google.com/language_tools?hl=en'


On Thu, Nov 13, 2008 at 7:37 AM, Ben Smith <address@hidden> wrote:
Actually, I realized there's an easier way.
Make a text file (filelist.txt) with all the addresses of the results pages. Then use this command (all on one line, no spaces after --exclude-domains until the space before --input-file):
wget -r -l1 -UFirefox -H -erobots=off --wait 1 --exclude-domains=images.google.com,maps.google.com,news.google.com,mail.google.com,video.google.com,groups.google.com,books.google.com,scholar.google.com,finance.google.com,blogsearch.google.com,www.youtube.com,picasaweb.google.com,docs.google.com,sites.google.com,www.snowbrasil.com,translate.google.com --input-file=filelist.txt

All of your cache files will end up in a single subdirectory named after the IP address that hosted the cached files.  When I tested it, it was 74.125.45.104, but that may vary.  They are easy to identify since they have cache in the filename and look similar to this:
address@hidden3Awww.snowbrasil.com%2Ffotos%2Fv%2Fcbdn2007%2FDSC01076_resize.JPG.html+site%3Awww.snowbrasil.com%2Ffotos&hl=en&ct=clnk&cd=20&gl=us&ie=UTF-8&client=firefox-a
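
To pick the cache files out of everything wget saved, something like the following could be used. It is a rough sketch: list_cache_files is a made-up helper name, and it relies only on the observation above that the files have "cache" in the filename.

```shell
#!/bin/sh
# Hypothetical helper: list every saved file whose name contains "cache",
# searching the given directory (default: the current one) and its subdirectories.
list_cache_files() {
    find "${1:-.}" -type f -name '*cache*'
}
```

Running list_cache_files from the directory where wget was started should print the cached pages, whichever IP-named subdirectory they landed in.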

From: Yan Grossman <address@hidden>
To: Ben Smith <address@hidden>
Sent: Thursday, November 13, 2008 2:34:56 AM
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help]

Thanks so much for responding. Do I need to write a script with these commands or do I run one at a time on the command line on my server?
Would you please just tell me what the syntax is so I only download the cache files?
Thanks so much

On Wed, Nov 12, 2008 at 9:30 PM, Ben Smith <address@hidden> wrote:
grep is a command-line program that finds the lines in a text file that contain a given target string.
more info/usage: http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?grep
sed is a command-line program that replaces text.
more info/usage: http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?sed

Any Linux distro should have these, or if you're running Windows you can get them at:
http://gnuwin32.sourceforge.net/packages/grep.htm
http://gnuwin32.sourceforge.net/packages/sed.htm
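
As a small illustration of the two tools, here is a sketch that works on a throwaway file; example.txt and its contents are made up for the demonstration.

```shell
#!/bin/sh
# Create a three-line sample file (contents are invented for this demo).
printf 'first line\nlink to cache page\nlast line\n' > example.txt

# grep prints only the lines that contain the target text.
grep 'cache' example.txt

# sed replaces text; here every "cache" becomes "cached".
# The file itself is not changed; the result goes to the screen.
sed 's/cache/cached/g' example.txt
```

Both commands are typed directly at the command line; no script is required, although several commands can be saved in a file and run together.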



From: Yan Grossman <address@hidden>
To: address@hidden
Sent: Wednesday, November 12, 2008 2:03:58 PM
Subject: Fwd: [Fwd: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help]



---------- Forwarded message ----------
From: Yan Grossman <address@hidden>
Date: Wed, Nov 12, 2008 at 10:49 AM
Subject: Re: [Fwd: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help]
To: Micah Cowan <address@hidden>


Thanks so much. But what does it mean: "Then grep each of the results files to find the line with links to all the cached pages.  You can pipe that output into sed"?
I am not familiar with "grep" and "sed".

Could you please elaborate?

Thanks

On Wed, Nov 12, 2008 at 10:32 AM, Micah Cowan <address@hidden> wrote:
-------- Original Message --------
Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's
Cache. Pls help
Date: Wed, 12 Nov 2008 10:00:34 -0800 (PST)
From: Ben Smith <address@hidden>
To: Micah Cowan <address@hidden>
References: <address@hidden>
<address@hidden>


Adding -UFirefox allows the download.  So you should first wget
-UFirefox all the listed results pages from Google:
http://www.google.com/search?q=site%3Awww.snowbrasil.com%2Ffotos&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=zle&q=site:www.snowbrasil.com/fotos&start=10&sa=N
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=o6J&q=site:www.snowbrasil.com/fotos&start=20&sa=N

etc., up to start=570 (since there are 577 results).
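
Rather than typing all 58 addresses by hand, the list could be generated with a short loop. This is a sketch under an assumption: it uses a simplified form of the search URL (only the q and start parameters), which may or may not be accepted the same way as the full URLs above.

```shell
#!/bin/sh
# Build filelist.txt with one results-page URL per line,
# for start=0, 10, 20, ... 570.
# Assumption: a simplified query string is enough for the results page.
query='site:www.snowbrasil.com/fotos'
: > filelist.txt
start=0
while [ "$start" -le 570 ]; do
    echo "http://www.google.com/search?q=${query}&start=${start}" >> filelist.txt
    start=$((start + 10))
done
```

The resulting file can then be handed to wget with --input-file=filelist.txt.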

Then grep each of the results files to find the line with links to
all the cached pages.  You can pipe that output into sed, which you can use
to remove everything but the links to the cached pages (replace the info
before, after, and between the cache links with a space).  Then simply
pipe that to wget -UFirefox, and you should get all your files.
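
The extraction step might look roughly like the sketch below. Note the assumptions: extract_cache_links is a made-up helper name, the pattern assumes each cache link is an http:// URL containing "q=cache:" inside a double-quoted href, and it uses grep -o to pull the links out directly instead of the sed replacement described above (sed is only used to turn &amp; back into &).

```shell
#!/bin/sh
# Hypothetical helper: print each distinct cache link found in the
# saved results pages given as arguments.
extract_cache_links() {
    # -o prints only the matching part of a line, -h omits file names.
    grep -oh 'http://[^"]*q=cache:[^"]*' "$@" |
        sed 's/&amp;/\&/g' |   # undo HTML escaping of "&"
        sort -u                # drop duplicate links
}

# Possible usage (needs network access, so it is left commented out):
#   extract_cache_links results-*.html | wget -UFirefox --wait 1 --input-file=-
```

The real markup of the results pages should be checked first, and the pattern adjusted if the cache links look different.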



----- Original Message ----
> From: Micah Cowan <address@hidden>
> To: Ben Smith <address@hidden>
> Cc: address@hidden
> Sent: Tuesday, November 11, 2008 3:27:05 PM
> Subject: Re: [Bug-wget] Fwd: Trying to download HTML from Google's Cache. Pls help
>
> Ben Smith wrote:
>
>> Subject: Re: [Bug-wget] Re: Bug-wget Digest, Vol 1, Issue 10
>
>>> When replying, please edit your Subject line so it is more specific
>>>  than "Re: Contents of Bug-wget digest..."
>
> It's helpful if you adhere to this guideline; otherwise it's hard to
> follow threads. (I've fixed the subject in my reply.)
>
>> It would be theoretically possible by using grep and sed to strip out
>> the links to the cached files and piping that to wget.  However,
>> Google appears to block access to results pages and cached pages via
>> wget.  I tried to download several using wget and got a 403 Forbidden
>> response.
>
> http://wget.addictivecode.org/FrequentlyAskedQuestions#not-downloading
> should be helpful for such problems (using -U is the most applicable
> suggestion, but you may also run into the others). Please also consider
> adding --limit-rate or --wait.
>

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/









