
Re: [Bug-wget] download page-requisites with spanning hosts


From: Petr Pisar
Subject: Re: [Bug-wget] download page-requisites with spanning hosts
Date: Thu, 30 Apr 2009 10:14:10 +0200
User-agent: Mutt/1.5.16 (2007-06-09)

On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> 
> The wGet command I am using:
> wget.exe -p -k -w 15
> "http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330";
> 
> It has 2 problems:
> 
> 1) Rename file:
> 
> Instead of creating something like "912.html" or "index.html", it
> becomes: "address@hidden&postdays=0&postorder=asc&start=27330"
>
That's normal, because the server doesn't provide any useful alternative name
via HTTP headers (a Content-Disposition header), which wget could otherwise
use through its "--content-disposition" option.

If you want the page number of the gallery, you need to parse the HTML code
yourself to obtain it (e.g. using grep).

However, I guess a better naming convention is the value of the "start" URL
parameter (in your example, the number 27330).
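
A rough sketch (untested) of that idea; the "page-$START.html" output name is
just an illustration, not something wget does for you:

URL='http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330'
# pull the value of the "start" parameter out of the URL
START=$(printf '%s\n' "$URL" | grep -o 'start=[0-9]*' | cut -d= -f2)
wget -O "page-$START.html" "$URL"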

> 2) images that span hosts are failing.
> 
> I have page-requisites on, but, since some pages are on tinypic, or
> imageshack, etc.... it is not downloading them. Meaning it looks like
> this:
> 
> sijun/page912.php
>               imageshack.com/1.png
>               tinypic.com/2.png
>               randomguyshost.com/3.png
> 
> 
> Because of this, I cannot simply list all domains to span. I don't
> know all the domains, since people have personal servers.
> 
> How do I make wget download all images on the page? I don't want to
> recurse other hosts, or even sijun, just download this page, and all
> images needed to display it.
> 
That's not an easy task, especially because all the big desktop images are
stored on other servers. I think wget is not powerful enough to do it all on
its own.

I propose using other tools to extract the image URLs and then downloading
them with wget, e.g.:

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' \
    | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' | wget -i -

This command downloads the HTML code, uses grep to find all image files stored
on other servers (deciding by file name extension and absolute address), and
finally downloads those images.

There is one little problem: not all of the images still exist, and some
servers return a dummy page instead of a proper error code, so you can
sometimes get non-image files.
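
If that bothers you, here is a rough sketch (untested) of one way to weed
those out afterwards, checking the actual content with file(1) instead of
trusting the extension:

# delete downloaded files whose content is not really an image
for f in *.jpg *.jpeg *.png; do
    [ -e "$f" ] || continue                        # skip unmatched globs
    file --mime-type -b "$f" | grep -q '^image/' || rm -- "$f"
done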

> [ This one is a lower priority, but someone might already know how to
> solve this ]
> 3) After this is done, I want to loop to download multiple pages. It
> would be cool if I downloaded pages 900 to 912, and each page's next
> link worked correctly to link to the local versions.
> 
[…]
> Either way, I have a simple script that can convert 900 to 912 into
> the correct URLs, and pause in between each request.
> 
Wrap your script inside a counted for-loop:

for N in $(seq 900 912); do
    # variable N contains here the right number
    echo "$N"
done
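
Putting it together with the pipeline above, a rough sketch (untested); the
30-posts-per-page step is only an assumption read off your example URL
(page 912 -> start=27330 = 911*30), so double-check it:

for N in $(seq 900 912); do
    START=$(( (N - 1) * 30 ))                      # assumed: 30 posts per page
    URL="http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=$START"
    wget -O "page-$N.html" "$URL"
    grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' "page-$N.html" | wget -w 15 -i -
    sleep 15                                       # pause between pages, like your -w 15
done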

Actually, I assume you are using some Unix environment, where you have a
powerful collection of external tools (grep, seq) available and amazing shell
scripting abilities (like pipes and loops).

-- Petr

Attachment: pgpmNjPejmoqg.pgp
Description: PGP signature

