Re: [Bug-wget] mirroring a Blogger blog without the comments


From: Gisle Vanem
Subject: Re: [Bug-wget] mirroring a Blogger blog without the comments
Date: Fri, 25 Apr 2014 10:55:43 +0200

<address@hidden> wrote:

> Even more general would be something like --next-urls-cmd=<CMD>, where
> you could supply a command that accepts an HTTP response on stdin and
> writes to stdout the set of URLs that should be crawled next.
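
That --next-urls-cmd option doesn't exist in wget today, but the command it
would invoke could be as small as a shell filter. A rough sketch only: the
href extraction is deliberately crude and the comment-URL patterns are a
guess at what Blogger uses:

    #!/bin/sh
    # Hypothetical filter for the proposed --next-urls-cmd option: read an
    # HTTP response body on stdin, write the URLs worth crawling next to
    # stdout, one per line, skipping Blogger comment pages.  The sed picks
    # at most one href per input line (crude, but enough for a sketch).
    sed -n 's/.*href="\([^"]*\)".*/\1/p' \
      | grep -v -e 'showComment=' -e '/comments/' \
      | sort -u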

You could use Lynx to extract all the links with: lynx -dump --listonly URL > urls.file

Edit / grep the 'urls.file' and use the 'wget -i' option to download what
you want (a sketch of that pipeline follows the man page excerpt below).
From 'man wget':

-i file
--input-file=file
    Read URLs from a local or external file.  If - is specified as
    file, URLs are read from the standard input.  (Use ./- to read from
    a file literally named -.)

    If this function is used, no URLs need be present on the command
    line.  If there are URLs both on the command line and in an input
    file, those on the command lines will be the first ones to be
    retrieved.  If --force-html is not specified, then file should
    consist of a series of URLs, one per line.

    However, if you specify --force-html, the document will be regarded
    as html.  In that case you may have problems with relative links,
    which you can solve either by adding "<base href="url">" to the
    documents or by specifying --base=url on the command line.

    If the file is an external one, the document will be automatically
    treated as html if the Content-Type matches text/html.
    Furthermore, the file's location will be implicitly used as base href
    if none was specified.
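
To spell out the edit / grep step for the Blogger case, the whole thing is
a short pipeline. A sketch only: the blog address is a placeholder, the
comment patterns are a guess, and lynx's numbered output is stripped before
handing the list to wget:

    # 1. Let Lynx dump every link on the page into a file.
    lynx -dump --listonly http://example.blogspot.com/ > urls.file

    # 2. Strip lynx's " 1. " numbering, keep only URL lines, and drop
    #    comment-related URLs; adjust the patterns to what the blog uses.
    sed 's/^ *[0-9]*\. *//' urls.file \
      | grep '^http' \
      | grep -v -e 'showComment=' -e 'comment' > urls.keep

    # 3. Feed the remaining URLs back to wget.
    wget --input-file=urls.keep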

--gv


