Re: [Sks-devel] more database corruption


From: Yaron M. Minsky
Subject: Re: [Sks-devel] more database corruption
Date: Sun, 02 Nov 2003 15:56:41 -0500

Two quick thoughts:

1) if nothing else pans out, it might be time to start thinking about
the possibility of hardware problems.  I don't know of anyone else who
has seen routine database corruption anything like what you've seen, or
really even close.  It makes me think it might be hardware related.

2) The last thing in your backup script looks like a run of "sks
dump".  I'm not sure what directory you're in when you do that (it looks
like ${TEST}/PTree, but that seems impossible, since "sks dump" should
simply fail when run from that directory), but if it's run from the main
database dump directory, then that's definitely a potential source of
corruption.  As I've mentioned before, running sks dump on a live
database could corrupt it or do god-knows-what.
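
One way to guard against that, as a minimal sketch: have the backup script refuse to run the dump while an sks process appears to be running. The pgrep-based check, the function names db_is_live and guarded_dump, and the SKS_PROC_NAME variable are all assumptions for illustration; none of them are part of sks.

```shell
#!/bin/bash
# Hypothetical guard, not part of sks: only run a command when no process
# with the given name is found. Assumes pgrep is installed; if it is not,
# the check finds nothing and the command runs anyway.
db_is_live() {
  pgrep -x "${SKS_PROC_NAME:-sks}" > /dev/null 2>&1
}

guarded_dump() {
  if db_is_live; then
    echo "sks appears to be running; refusing to dump" >&2
    return 1
  fi
  # No matching process found: run the command passed as arguments.
  "$@"
}
```

In the backup script this would be called as `guarded_dump sks dump 50000 "$WORK"`. It only narrows the window: it does not lock the database, so a server started in the middle of the dump would not be caught.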

y


On Sun, 2003-11-02 at 14:52, Dan Egli wrote:
> This is getting annoying. I looked over the server today and saw a lot of 
> messages (literally thousands) in the failed_messages dir. That made no 
> sense so I moved some of them into the messages folder. They came right 
> back. That is strange, so I looked at the log. Database is corrupted 
> AGAIN. It seems to have happened at 3:00am this morning. Observe:
> 
> 2003-11-02 02:48:14 1 keys found
> 2003-11-02 02:48:41 Adding list of 1 keys from file ./messages/msg-38760868.ready
> 2003-11-02 02:48:41 Applying 0 changes
> 2003-11-02 02:49:11 Adding list of 1 keys from file ./messages/msg-10140124.ready
> 2003-11-02 02:49:11 Applying 0 changes
> 2003-11-02 02:59:52 Adding list of 1 keys from file ./messages/msg-64306034.ready
> 2003-11-02 02:59:52 Applying 2 changes
> 2003-11-02 02:59:52 Adding hash 7B669B52ADB3D241956246551256B1F0
> 2003-11-02 02:59:52 Del'ng hash C9D6D9AC14E0AA17B32726433E2EEA32
> 2003-11-02 02:59:56 Sending LogResp size 2
> 2003-11-02 03:00:00 Calculating DB stats
> 2003-11-02 03:00:05 eventloop: Bdb.DBError("fatal region error detected; run recovery")
> 2003-11-02 03:00:05 <command handler> error in callback.: Bdb.DBError("fatal region error detected; run recovery")
> 2003-11-02 03:00:09 <mail transmit keys> error in callback.: Bdb.DBError("fatal region error detected; run recovery")
> 2003-11-02 03:00:10 <command handler> error in callback.: Bdb.DBError("fatal region error detected; run recovery")
> 2003-11-02 03:00:13 Error fetching key from hash 7B669B52ADB3D241956246551256B1F0: Bdb.DBError("fatal region error detected; run recovery")
> 2003-11-02 03:00:13 0 keys found
> 
> I tried to think what could be happening at 3:00 am that could corrupt the 
> database. The only thing I can come up with is my automatic backup and 
> keydump script. But I cannot see how it would affect the main database 
> because the only operations that occur in the main database are db_archive 
> and db_archive -s followed by some cp commands. 
> 
> If it's of any help, here's my backup script. 
> 
> #!/bin/bash
> 
> function errorabort {
> 
>   echo "NON-ZERO exit status! Aborting Keydump! Deleting failed backup! Alerting SysAdmin"
>   echo "The SKS Keyserver automated database backup and keyring dump sequence encountered a fatal" > msg
>   echo "error on "`date`". This should be investigated immediately. Until then, no further automated" >> msg
>   echo "backups or keydumps will take place. A file called BAD_DB was created in the sks home" >> msg
>   echo "directory. The automated script will not run while this file exists. When the database" >> msg
>   echo "problem has been corrected, remove this file to re-enable the automatic backup and dumps." >> msg
>   mail dan -s "SKS Backup routine failure!!" < msg
>   rm -f msg
>   # create the flag file that the message above says was created
>   touch ${HOME}/BAD_DB
>   rm -f ${TEST}/PTree/*
>   rm -f ${TEST}/KDB/*
>   exit;
> 
> }
> 
> 
> PATH=$PATH:/usr/local/bin
> # before we do anything, check to see if BAD_DB exists. If so consider the
> # database unusable. Abort.
> if [ -f ${HOME}/BAD_DB ] ; then
>   exit 2;
> fi
> 
> 
> # step 1 - define environment variables
> TEST=${HOME}/test_backup
> DB=${HOME}/backup_db
> NEW=${HOME}/newdump
> OLD=${HOME}/olddump
> WORK=${HOME}/workdump
> 
> # step 2 - backup existing databases
> 
> cd $HOME/KDB
> # step 2.1 - remove old KDB logs
> rm -f `db_archive`
> [ $? -ne 0 ] && errorabort
> # step 2.2 - copy database files across
> cp `db_archive -s` ${TEST}/KDB
> [ $? -ne 0 ] && errorabort
> # step 2.3 - copy KDB logs across
> cp log.* ${TEST}/KDB
> # step 2.4 - remove old PTree logs
> cd ../PTree
> rm -f `db_archive`
> # step 2.5 - copy PTree databases across
> cp `db_archive -s` ${TEST}/PTree
> [ $? -ne 0 ] && errorabort
> # step 2.6 - copy PTree logs across
> cp log.* ${TEST}/PTree
> 
> # step 3 - validate DB files
> cd ${TEST}/KDB
> # loop variable is DBFILE so the DB directory set in step 1 is not overwritten
> for DBFILE in `db_archive -s` ; do
>   db_verify $DBFILE
>   if [ $? -ne 0 ]; then
>     errorabort
>   fi;
> done
> 
> cd ${TEST}/PTree
> for DBFILE in `db_archive -s` ; do
>   db_verify $DBFILE
>   if [ $? -ne 0 ]; then
>     errorabort
>   fi;
> done
> 
> 
> 
> # step 4 - make keydump
> 
> rm -f ${WORK}/*
> sks dump 50000 $WORK
> [ $? -ne 0 ] && errorabort
> 
> cd ${WORK}
> for FILE in *.pgp; do
>   mv $FILE dungeon${FILE##sks-dump};
> done
> cd ..
> 
> rm -f ${OLD}/*
> mv ${NEW}/* ${OLD}
> mv ${WORK}/* ${NEW}
> 
> rm -f ${DB}/PTree/*
> rm -f ${DB}/KDB/*
> mv ${TEST}/KDB/* ${DB}/KDB
> mv ${TEST}/PTree/* ${DB}/PTree
-- 
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|

Open PGP --- KeyID B1FFD916
Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916





