From: Cyril Russo
Subject: Re: [Duplicity-talk] What is the process for creating signature files ? [DOC-2]
Date: Wed, 11 Mar 2009 11:39:55 +0100
User-agent: Thunderbird 2.0.0.19 (Windows/20081209)
Kenneth Loafman wrote:
Hi,

> Cyril Russo wrote:
>> Cyril Russo wrote:
>>> Hi,
>>>
>>> If you have a bit of time, can you explain in a few lines how (and where in the code) the signature files are created? I'm trying to split the signatures to a specified volume size, but I don't want to break anything, and a grep on the code for "signature" is very verbose.
>>>
>>> Sincerely,
>>> Cyril
>>
>> *Organization of a backup archive (TAR format)*
>>
>> The backup archives (currently) use the well-known GNU tar format. When the files are scanned on the filesystem for backing up (using the rsync algorithm to compute the smallest difference), they are cut into smaller parts, or blocks, which are then saved in the backup archive. The processing applied to each file (encrypting / diffing / comparing) will be explained in more detail in the next part.
>>
>> The blocks to be stored come either from a file (in that case we name them /fileblock/) or from a signature (in that case we name them /sigblock/).
>>
>> The work of reading the blocks from an existing tar archive is done by the file diffdir.py. This file declares the following objects:
>>
>> /DirSig/ (used in rdiffdir)
>> A simple class used to iterate over the sigblocks.
>>
>> /DirFull, DirFull_WriteSig/ (used in rdiffdir and duplicity main)
>> A simple class that stores the files' content in tar blocks. Because it is easier to share common code everywhere, the process computes the difference between each file found and a virtual empty file (producing a difference equal to the file itself). The same process is used when the file already exists; the virtual empty file is then replaced by the previous version of the file. The WriteSig version also computes the signature and writes it to the given output file pointer.
>>
>> /DirDelta/ (used in rdiffdir and duplicity main; it's the default implementation of DirFull)
>> This is the actual code computing the difference between the given path's files and the given reference (either nothing, or a previous backup archive).
>> The process computes both the difference in the file's content and the difference in the file's information (has the file been added, deleted, modified, or left unmodified?). The file's content goes to the backup archive, while the file's information goes to the signatures.
>>
>> /FileWithReadCounter/ (private)
>> The name says it all. It keeps track of the amount read.
>>
>> /FileWithSignature/ (private)
>> A read-only file class that computes the signature (using the rsync algorithm) while it is being read. The signature computed for each block produces a simple code depending on the block state (added, modified, deleted, etc.).
>>
>> /TarBlock/ (private)
>>
>> /TarBlockIter/ (abstract, private)
>> This class takes a given (file) iterator as input and yields the matching tar'ed blocks of the given size while iterating. The behaviour depends on the following child classes:
>>
>> /DummyTarBlockIter/
>> Doesn't read the file, but instead counts the files passed in.
>>
>> /SigTarBlockIter/
>> This one returns the tar blocks from a signature archive file.
>>
>> /DeltaTarBlockIter/
>> This one returns the tar blocks for the files archive.
>>
>> That's all for this email; again, please spot the errors. This one doesn't explain anything about splitting the signature files, but I hope it makes the backup process easier to understand. I'll continue by explaining the backup algorithm in the next email (if I understand it correctly).
>>
>> For now, from what I've understood, we could hack the Collection stuff to accept both "signature.gpg" and "sig000.gpg" as valid signature file names, and in the latter case, start returning the signature archive collection. I still haven't found how to split the signatures during creation, but I hope it'll appear in the next email.
>
> Cyril,
>
> Thanks for all the docs you're writing. This has been sorely needed.
>
> I'm starting the design of Checkpoint/Restart and we may need to collaborate with you more on this.
> It appears that if we can cleanly synchronize the creation of difftar and sigtar files in parallel, then Checkpoint is merely the last full volume of each. A crash during the creation of a volume means that Restart would clean that up, proceed from that point, and complete. A subgoal is that a crash during the Nth volume would leave a fully restorable, and restartable, set of N-1 volumes. That may mean I'll have to address the manifest file as well.
>
> Thoughts and suggestions from anyone are welcome.

I was just thinking about this, and in fact it would be even better if all tar-block-like output were serialized (à la Java): difftar, sigtar and partar would be created (in parallel or not), but as soon as one of them reaches a target limit, all of them are serialized to the backend. The limit would be the volume size for one, but we could set up a time limit too, so that duplicity could, maybe in the future, work 24/7 in the background and create an archive every hour. If the limit is the volume size, only one of the files will be at the volume size limit; the others might not be.

Upon failure, we can restart from where we left off by opening the last, most up-to-date signature file, re-iterating over the path, and continuing at the first different item (either a new file not in the signature, or the first modified file/dir). The code would see this as just another iterative step.

Couldn't we even merge all of those into a single random-access, tar-like file, so it's even easier? This would break the current backup code, but it's not that bad, since the current code still has to do a full backup from time to time, so the next full backup will catch up with the new format.

I'm preparing the DOC-3 part with the archive description, if I understand it.

Cheers,
Cyril
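
A minimal sketch of the "flush every stream as soon as one hits the limit" rule and of the restart scan discussed above, in Python since that is what duplicity is written in. All names here (VolumeSet, add, flush, first_unsaved, the backend callable) are hypothetical and are not duplicity's API; signatures are stood in for by plain comparable values rather than real rsync signatures, and only the volume-size limit is modelled, not the time limit.

```python
class VolumeSet:
    """Hypothetical helper: buffers several output streams, flushes them together."""

    def __init__(self, names, volume_size, backend):
        self.volume_size = volume_size   # size limit in bytes for any one stream
        self.backend = backend           # assumed callable(name, index, data)
        self.buffers = {name: bytearray() for name in names}
        self.index = 0                   # number of volume sets written so far

    def add(self, name, block):
        """Append a block to one stream; flush all streams if it reached the limit."""
        self.buffers[name] += block
        if len(self.buffers[name]) >= self.volume_size:
            self.flush()

    def flush(self):
        """Send the current content of every stream as one numbered set."""
        if not any(self.buffers.values()):
            return                       # nothing buffered, nothing to write
        self.index += 1
        for name, buf in self.buffers.items():
            self.backend(name, self.index, bytes(buf))
            buf.clear()


def first_unsaved(paths, saved_signatures, current_signature):
    """Index of the first path that is new or modified since the last flushed set.

    saved_signatures: dict path -> signature, read from the last signature volume.
    current_signature: callable(path) -> signature of the file as it is now.
    """
    for i, path in enumerate(paths):
        if saved_signatures.get(path) != current_signature(path):
            return i
    return len(paths)
```

With a 10-byte limit and the three streams "difftar", "sigtar" and "partar", adding 5 bytes to difftar and 2 to sigtar writes nothing; adding 5 more bytes to difftar writes all three as set number 1, and only difftar is at the limit. After a crash, first_unsaved gives the position in the path iteration from which to resume.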