
Re: [Bug-wget] really no "wget --list http://..." ?


From: Ben Smith
Subject: Re: [Bug-wget] really no "wget --list http://..." ?
Date: Sun, 22 Mar 2009 13:46:34 -0700 (PDT)

You can run the downloaded file through the following command to pull out the link targets (replacing index.html with the appropriate name if necessary).

sed 's/<a href="/\n<a href="/g' index.html | sed '/^<a href/!d' | sed 's/<a href="//' | sed 's/".*//'

All on one line (it relies on GNU sed treating \n in the replacement as a newline). It works for www.google.com.
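If you would rather not keep the intermediate file, the same idea works directly on wget's output. A rough one-liner along the same lines (again assuming GNU sed, and the same caveat that href must be the first attribute in each anchor tag):

wget -q -O - http://www.google.com/ | sed 's/<a href="/\n<a href="/g' | sed '/^<a href/!d; s/<a href="//; s/".*//'

That is about as close to a "wget --list" as you can get without patching wget itself.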



----- Original Message ----
> From: Micah Cowan <address@hidden>
> To: Denis <address@hidden>
> Cc: address@hidden
> Sent: Friday, March 20, 2009 1:14:44 PM
> Subject: Re: [Bug-wget] really no "wget --list http://..." ?
> 
> Denis wrote:
> > Micah,
> >   not to be dense, but is there really no way to "wget --list http://..."
> > a directory without downloading all its files ?
> > To browse any file system, local or remote, I want to be able to LIST it 
> first.
> > I gather that there's no www variant of a Unix-like file system
> > (tree structure independent of file contents => very fast ls -R)
> > but a WFS, web file system, would sure simplify life
> 
> HTTP has no concept of a directory, and provides no way to list it, so
> no. The WebDAV extensions _do_ provide such a thing, but they're not
> commonly implemented on web servers (especially without authentication),
> so there'd be little point in making Wget use that.
> 
> It _could_ be useful for wget to download a given URL, parse out its
> links, and spit them out as a list, but wget doesn't currently do that
> either. Even if it did, there could be no way to guarantee that that
> list represents the complete contents of the "directory", as all wget
> will see is whatever links happen to be on that one single page, so if
> it's not an automatically-generated index page, it's unlikely to be a
> very good representation of directory contents. But implementing that
> would not be a high priority for me at this time (patch, anyone?).
> 
> In the meantime, the usual suggestion is to have wget download the
> single HTML page, and then parse out the links yourself with a suitable
> perl/awk/sed script.
> 
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> Maintainer of GNU Wget and GNU Teseq
> http://micah.cowan.name/
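As a footnote to the WebDAV point above: on the rare servers that do have DAV enabled (and allow it unauthenticated), a PROPFIND request is the closest thing to a real directory listing. A rough sketch with curl against a hypothetical http://example.com/dir/ (the response is XML listing each entry in a <D:href> element; the namespace prefix may differ by server, so adjust the grep accordingly):

curl -s -X PROPFIND -H "Depth: 1" http://example.com/dir/ | grep -o '<[Dd]:href>[^<]*' | sed 's/<[^>]*>//'

curl is used here only because it makes it easy to send an arbitrary request method; as Micah says, it is not something wget exposes.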
