bug-wget

From: Kulagin
Subject: How to download and archive big website with a wiki and a forum thoroughly and efficiently?
Date: Mon, 7 Dec 2020 02:49:39 +0200

I'm on Windows. I would like to download http://www.tekkenzaibatsu.com/, which
has several subdirectories such as http://www.tekkenzaibatsu.com/wiki/ and
http://www.tekkenzaibatsu.com/forums/. The website will be shut down later
this month, as the owner announced a few days ago.

Downloading http://www.tekkenzaibatsu.com/forums/ will require being logged
in. I already have a working cookies.txt file ready to feed into the programs.
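
To verify that the cookies work before starting a multi-day crawl, I fetch
one page that should require login (private.php, which also appears in the
reject list below) and check that the saved file shows my account instead of
the login form, roughly like this:

  wget --load-cookies tekkenzaibatsu.com_cookies.txt -O check.html http://www.tekkenzaibatsu.com/forums/private.php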

I googled this and there are two main programs that people suggest: httrack
and wget. I managed to log in and start downloading with both. The problem
with httrack is that it only downloads at 20 kb/s.

Then I tried wget. After a few hours of googling and reading the manual, I
finally managed to put together a long command that starts downloading what
I need at an adequate speed:

> wget -4 --recursive --page-requisites --adjust-extension --no-clobber
> --convert-links --random-wait -e robots=off --force-directories
> --load-cookies tekkenzaibatsu.com_cookies.txt --user-agent='Mozilla/5.0
> (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'
> http://www.tekkenzaibatsu.com/ -R
> "/editpost.php?action=editpost&postid=,/newreply.php?action=newreply&postid=,/showthread.php?postid=#post,/online.php/,/search.php?s=&action=showresults&searchid=,/search.php?s=,/private.php?action=newmessage&userid=,/member2.php?action=addlist&userlist=buddy&userid=,/newreply.php?action=newreply&threadid=,/newthread.php?action=newthread&forumid=*"


The problem with wget is that even though it will reject all the editpost,
new thread, new reply, reply, profile, search and all the other repeating
pages on the forum:

[image: image.png]

it will still follow and download them first, and only reject them at the
last stage, physically deleting the files afterwards (from what I have read,
and tested myself by trying to download pages matching the -R list). The
problem with this is that there are 36,095 topics on the forum at the
moment, with many posts in each, which will add up to around 500,000 pages
in total, if not many more.
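
This is roughly how I tested it: a shallow crawl of the forum with a single
rejected pattern, just to watch what happens to those pages (the --level=1
and the pattern are only for the test):

  wget --recursive --level=1 -e robots=off --load-cookies tekkenzaibatsu.com_cookies.txt -R "newreply.php*" http://www.tekkenzaibatsu.com/forums/

In the output the newreply.php pages are still fetched and only removed
after they have been downloaded.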

I would also like to preserve all the files attached to forum posts, wiki
pages, articles and so on, as well as all the images that are displayed on
the website and the forum, which aren't necessarily on the same domain as
the website (those [img]https://imguuur.com/nMi44[/img] and the like), so
that the whole website, wiki and forums stay pretty much fully functional.
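
My guess is that this needs --page-requisites combined with --span-hosts, so
that images embedded from other hosts are fetched too, but I am not sure how
to do that without wget then recursing into those other sites. Something
like this is what I would try:

  wget --recursive --page-requisites --span-hosts --domains=tekkenzaibatsu.com --convert-links --adjust-extension http://www.tekkenzaibatsu.com/

Does --domains limit the recursion to tekkenzaibatsu.com while still letting
the embedded off-site images through, or does it block those as well?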


I read how to mirror a phpBB forum on the ArchiveTeam wiki:
https://www.archiveteam.org/index.php?title=PhpBB

It suggests using wget, which has the problem I described above: it will
take me weeks or months to download those hundreds of thousands of unneeded
pages, and with this approach there could well be an infinite number of them.


Is there a way to stop wget from following all these unneeded links in the
first place? Or is there some other tool that would let me download the
website at an adequate speed (one page per second would be fine, which works
out to roughly 14 hours, assuming 50,000 pages in total) and skip all the
other pages that are not needed?
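
For example, I have seen --reject-regex mentioned, which as far as I
understand is matched against the complete URL. Would something like this
skip those URLs before they are downloaded, instead of deleting them
afterwards? (The pattern is only a first attempt and would need adjusting.)

  wget --recursive --page-requisites --adjust-extension --convert-links --random-wait -e robots=off --force-directories --load-cookies tekkenzaibatsu.com_cookies.txt --reject-regex "(editpost|newreply|newthread|private|online|search|member2)\.php|showthread\.php\?postid=" http://www.tekkenzaibatsu.com/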


Please help and thank you for your attention.
