From: Alexander Tobias Heinrich
Subject: [Bug-wget] Help request: Limit recursion, but unconditionally include all media files
Date: Mon, 21 Oct 2013 12:33:10 +0200

Hello wget users,

This is not a bug report, but I understand that this mailing list may also
be used for user questions.

I want to archive parts of a website (www.pokerstrategy.com) and make them
available locally, including images, videos, PDFs etc. The site requires me
to log in to access the content, but I have already figured out how to do
that. The website exists in several languages, with a separate sub-domain
for each language, e.g. de.pokerstrategy.com, fr.pokerstrategy.com etc. I'm
only interested in one language. The website is very big, so I don't want to
download everything. Fortunately, the HTML documents of all pages I'm
interested in are in one folder (or its subfolders). A small portion of the
website that can be used to demonstrate my problem is
www.pokerstrategy.com/strategy/live-poker. Unfortunately, media is
distributed across a limited number of other domains
(static.pokerstrategycdn.com, peacock.pokerstrategy.com etc.) as well as a
different folder on the same server (www.pokerstrategy.com/downloads).

So what I need to do is:
* from the start URL, descend into sub-folders (e.g. /strategy/live-poker ->
/strategy/live-poker/1022), but do not ascend to parent or sibling folders
* download CSS styles too
* download any media (jpg, jpeg, png, gif, flv, wmf, avi, mpg, mpeg, pdf
etc.), even if located on different domains
* do not follow any cross-domain links/references EXCEPT for media files
* make everything available offline, including styles and media. Only links
to files/documents that were not downloaded should still point to the
original URL.
* adjust extensions if necessary
* use cookies.txt from local folder
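
As a rough illustration, the requirements above map onto existing wget
options approximately as follows. This is only a sketch of the mapping, not
a working solution: it leaves out the cross-domain media requirement
entirely, and the variable names OPTS and START_URL are purely illustrative.
Note also that --no-parent depends on the trailing slash: for a start URL
without one, wget treats /strategy/ as the start directory, so sibling
folders would still be allowed.

```shell
#!/bin/sh
# Sketch only: collect one option per requirement and print the resulting
# command instead of running it, so it can be inspected first.
START_URL="http://www.pokerstrategy.com/strategy/live-poker"

OPTS="--recursive"                       # descend from the start URL
OPTS="$OPTS --no-parent"                 # do not ascend above the start directory
OPTS="$OPTS --page-requisites"           # CSS, images and other inline media
OPTS="$OPTS --convert-links"             # rewrite links to downloaded files for offline use
OPTS="$OPTS --adjust-extension"          # append .html/.css where needed
OPTS="$OPTS --load-cookies=cookies.txt"  # cookies.txt from the local folder

echo "wget $OPTS $START_URL"
```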

I tried different options for wget, but now I'm stuck.

For example, I tried:
wget --tries=3 --retry-connrefused --no-clobber --load-cookies=cookies.txt \
  --convert-links --page-requisites --adjust-extension --recursive \
  --include-directories /strategy/live-poker,/download \
  http://www.pokerstrategy.com/strategy/live-poker

This correctly downloads only the HTML documents I want and also gets the
media files from the /download folder, but:
- it does not modify the HTML so that <img> tags point to the downloaded
files (however, it does modify <a> tags that link to local HTML documents)
- it does not get media files from other domains.

If for example I add --span-hosts, it simply gets too much (all documents
from different language versions of the website that I don't need).
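
One direction that might narrow this, shown as an untested sketch: combine
--span-hosts with --domains so that only explicitly listed hosts are
followed. The host list below contains only the hosts named in this message
and is therefore incomplete; ALLOWED_HOSTS and CMD are illustrative names.
The sketch does not by itself restrict the extra hosts to media files, and
how it interacts with --include-directories on those hosts is not verified
here.

```shell
#!/bin/sh
# Sketch only: print a wget command that spans hosts but limits them to a list.
# --domains matches on the host-name suffix, so de.pokerstrategy.com and
# fr.pokerstrategy.com do not match any entry in this list.
ALLOWED_HOSTS="www.pokerstrategy.com,static.pokerstrategycdn.com,peacock.pokerstrategy.com"
START_URL="http://www.pokerstrategy.com/strategy/live-poker"

CMD="wget --recursive --page-requisites --convert-links --adjust-extension"
CMD="$CMD --span-hosts --domains=$ALLOWED_HOSTS"
CMD="$CMD $START_URL"

# Printed rather than executed, so nothing is downloaded by this snippet.
echo "$CMD"
```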

Note: For the example URL I provided here you won't need to log in, so the
--load-cookies option can be omitted.

Any help would be greatly appreciated.

Kind regards,
Alexander
