From: Jeffrey Walton
Subject: Re: How to download and archive big website with a wiki and a forum thoroughly and efficiently?
Date: Sun, 6 Dec 2020 21:33:32 -0500

On Sun, Dec 6, 2020 at 9:01 PM Kulagin <sergeant.coolagin@gmail.com> wrote:
>
> I'm on Windows. I would like to download http://www.tekkenzaibatsu.com/, which
> has different subdirectories like http://www.tekkenzaibatsu.com/wiki/ and
> http://www.tekkenzaibatsu.com/forums/. The website will be closed later
> this month, as the owner announced a few days ago.
>
> Downloading http://www.tekkenzaibatsu.com/forums/ will require logging in; I
> already have a working cookies.txt ready to feed into the programs.
>
> I googled this issue and there are 2 main programs that people suggest:
> httrack and wget. I managed to log in and start downloading with both. The
> problem with httrack is that it only downloads at 20 kb/s.
>
> Then I tried wget. After a few hours of googling and reading the manual, I
> finally managed to put together a big console command that downloads what I
> need at an adequate speed:
>
> > wget -4 --recursive --page-requisites --adjust-extension --no-clobber
> > --convert-links --random-wait -e robots=off --force-directories
> > --load-cookies tekkenzaibatsu.com_cookies.txt --user-agent='Mozilla/5.0
> > (X11; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0'
> > http://www.tekkenzaibatsu.com/ -R
> > "/editpost.php?action=editpost&postid=,/newreply.php?action=newreply&postid=,/showthread.php?postid=#post,/online.php/,/search.php?s=&action=showresults&searchid=,/search.php?s=,/private.php?action=newmessage&userid=,/member2.php?action=addlist&userlist=buddy&userid=,/newreply.php?action=newreply&threadid=,/newthread.php?action=newthread&forumid=*"
>
>
> The problem with wget is that even though it will reject all the editpost,
> new-thread, new-reply, profile, search and other repeating pages on the
> forum, it will still follow and download them first and only reject them at
> the last stage, physically deleting these pages afterwards (from what I read
> and tested myself by trying to download pages in the -R list). The problem
> with this is that there are 36,095 topics on the forum at the moment, with
> many posts in each, which will result in around 500,000 pages total, if not
> many more.
>
> I would also like to preserve all the files attached to forum posts, wiki
> pages, articles and so on, as well as all the images displayed on the
> website and the forum, which aren't necessarily on the same domain as the
> website (those [img]https://imguuur.com/nMi44[/img] and the like), so that
> the whole website, wiki and forums stay pretty much fully functional.
>
>
> I read the ArchiveTeam guide on how to mirror a phpBB forum:
> https://www.archiveteam.org/index.php?title=PhpBB
>
>
> It suggests using wget, which has the problem I described above: it'll take
> me weeks or months to download these hundreds of thousands of unneeded
> pages, and there might well be an effectively infinite number of them with
> this approach.
>
>
> Is there a way to not follow all these unneeded links in wget? Or maybe there
> is some other tool which will let me download the website at an adequate
> speed (1 page a second would be fine, putting the whole website, assuming
> 50,000 pages total, at roughly 14 hours of downloading) and skip all the
> pages that are not needed?
>
>
> Please help and thank you for your attention.
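
A side note on the wget end, offered only as an untested sketch: -R is applied
to HTML pages after they have already been fetched and parsed for links, but
--reject-regex filters URLs before they are requested, so something along
these lines might keep wget from fetching the unwanted forum pages at all (the
regex here is a guess and would need tuning against the real forum URLs):

wget --recursive --page-requisites --adjust-extension --convert-links \
    --random-wait -e robots=off --load-cookies tekkenzaibatsu.com_cookies.txt \
    --reject-regex '(editpost|newreply|newthread|private|search|online|member2)\.php' \
    http://www.tekkenzaibatsu.com/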

I think the easiest way to grab a copy of the wiki in your case is to
ask the administrator to back it up and then make the download
available to users.

Here is the script we use to back up our wiki. The database is MySQL
and the backup software is Duplicity. The script performs a full
backup every 3 months and incremental backups otherwise, and sends the
result to a remote server over SFTP.

Once you have the backup you can restore select parts into a new
installation you create, like in a VM.

# cat /etc/cron.daily/daily-backup
#!/usr/bin/env bash

# Clean up the database before dumping it
mysqlcheck my_wiki --auto-repair --user=mwuser --password=XXXXX

# Dump to a flat file first, because duplicity can't back up a running
# MySQL database
mysqldump --single-transaction --routines --events --triggers \
    --add-drop-table --extended-insert -u mwuser -h 127.0.0.1 \
    -pfeswecewrukahach my_wiki > /backup/wiki.sql

export PASSPHRASE=YYYYY

# Full backup every 3 months, incremental otherwise; everything under /
# except the excluded paths goes to the remote server
duplicity --full-if-older-than 3M --allow-source-mismatch \
    --exclude /root/.cache --exclude /mnt --exclude /tmp --exclude /proc \
    --exclude /var/lib/mysql --exclude=/lost+found --exclude=/dev \
    --exclude=/sys --exclude=/boot/grub --exclude=/etc/fstab \
    --exclude=/etc/sysconfig/network-scripts/ \
    --exclude=/etc/udev/rules.d/70-persistent-net.rules \
    / sftp://<user@host:port>/wiki_backup
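
For the restore end of it, a rough sketch (hostnames and paths are
placeholders; since the backup covers / minus the excludes, the SQL dump
comes back under the restore target):

export PASSPHRASE=YYYYY

# Pull the most recent backup back from the remote server into /restore
duplicity restore sftp://<user@host:port>/wiki_backup /restore

# Recreate the wiki database and load the dump into the new installation
mysqladmin -u mwuser -p create my_wiki
mysql -u mwuser -p my_wiki < /restore/backup/wiki.sql

If you only need the database, duplicity's --file-to-restore option can pull
back just backup/wiki.sql instead of the whole tree.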

Jeff


