
Re: [Bug-wget] download page-requisites with spanning hosts


From: Jake b
Subject: Re: [Bug-wget] download page-requisites with spanning hosts
Date: Thu, 30 Apr 2009 03:31:21 -0500

On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar <address@hidden> wrote:
>
> On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> > Instead of creating something like: "912.html" or "index.html" it instead
> > becomes: "address@hidden&postdays=0&postorder=asc&start=27330"
> >
> That's normal, because the server doesn't provide any useful alternative name
> via HTTP headers, which could otherwise be obtained using wget's
> "--content-disposition" option.

I already know how to get the page number (my python script converts
27330 to 912 and back), but I'm not sure how to tell wget what the
output HTML file should be named.

> > How do I make wget download all images on the page? I don't want to
> > recurse other hosts, or even sijun, just download this page, and all
> > images needed to display it.
> >
> That's not an easy task, especially because all the big desktop images are
> stored on other servers. I think wget is not powerful enough to do it all on
> its own.

Are you saying that because some services show a thumbnail that you
click to get the full image? I'm not worried about that, since the
majority are full size in the thread.

Would it be simpler to say something like: download page 912 with
recursion level 1 (or 2?), excluding non-image links? (So it only
allows recursion on images, i.e. downloading "randomguyshost.com/3.png".)

But isn't the problem that it doesn't span any hosts? Is there a way to
achieve this if I do the same, except allow spanning all hosts, with
recursion level 1, recursing only into images?

> I propose using other tools to extract the image URLs and then to download
> them using wget. E.g.:

I guess I could use wget to get the HTML and parse it for image tags
manually, but then I don't get the forum thread comments, which aren't
required but would be nice.

> wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27330' | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -
>
OK, I will have to try it out. (I'm in Windows ATM, so I can't pipe.)
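The grep stage of the quoted pipeline can be tried offline by feeding it a scrap of HTML instead of live wget output (the image URLs here are made up for the test):

```shell
# Run sample HTML (hypothetical URLs) through the pipeline's grep stage;
# only the two image links should survive, the .html link is dropped.
printf '%s\n' \
  '<img src="http://randomguyshost.com/3.png">' \
  '<a href="http://example.com/page.html">link</a>' \
  '<img src="http://img.example.net/pic.jpg">' |
  grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)'
# prints:
# http://randomguyshost.com/3.png
# http://img.example.net/pic.jpg
```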

> Actually, I supposed you use some Unix environment, where you have a
> powerful collection of external tools (grep, seq) available and amazing
> shell scripting abilities (like pipes and loops).
>
> -- Petr

Using python, and I have dual boot if needed.
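The seq-and-loop approach Petr alludes to would walk the thread page by page. A sketch that only echoes the page URLs (assuming 30 posts per page and, for brevity, just the first four pages); each echoed URL could then be fed through the extraction pipeline above:

```shell
# Echo the first four page URLs of the thread, stepping the start
# offset by 30 (assumed posts-per-page).
base='http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc'
for start in $(seq 0 30 90); do
  echo "${base}&start=${start}"
done
```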

--
Jake



