Re: [Duplicity-talk] Version 0.6.0 Released - Checkpoint/Restart


From: Peter Schuller
Subject: Re: [Duplicity-talk] Version 0.6.0 Released - Checkpoint/Restart
Date: Thu, 18 Jun 2009 12:04:40 +0200
User-agent: Mutt/1.5.19 (2009-01-05)

> Bottom line is that I'm strongly thinking of rewriting or at least doing
> a major reorg on duplicity.  After going through all of this and getting
> a fairly good understanding of duplicity, I think it has out-grown it's
> original purpose and needs to either be pared down and simplified, or
> enhanced and functionally completed into something that will handle
> major backup tasks.
> 
> Some thoughts come to mind:
> - Named backups, i.e. named directory or keyed database entries.
> - Plugin style backends, i.e. we supply basic backends, others are able
>   to be plugged in by just putting them in a plugin directory.
> - Get away from tar for backend data structure.
> - Allow incrementals to be folded forward into a new complete set.
> - Differential backups, folded incrementals would be necessary for this.

Although I still haven't gotten off my butt, I still want to look into
creating a backup system which is designed to be fundamentally simpler
than duplicity, simply because I feel that a backup system should not
be as complex as duplicity is; even after spending a significant amount
of time on various bits of the code, I still do not feel I have a good
overall understanding of it.

The design I currently have in mind is one which is simple enough that
in a pinch you can pretty much write a shell script on the spot to
perform a restore (at least assuming you can get a local file copy of
the backup repository somehow).

I think I may have mentioned parts of this before, but I was looking
at using a content addressable store where a file is addressable by
its hash (using a selectable hash algorithm, presumably SHA-512 by
default). This means that all you need for actual file contents is a
place to save a set of files by some name.
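
A rough, untested sketch of the storage side, assuming SHA-512 and a
made-up repo_put(name, data) primitive standing in for whatever "save a
file by some name" means for a particular backend:

    import hashlib

    def store_file(path, repo_put, algo="sha512"):
        """Store a file's contents under the hex digest of its hash.

        repo_put(name, data) is a made-up backend primitive; a real
        implementation would stream rather than slurp the whole file.
        """
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.new(algo, data).hexdigest()
        repo_put(digest, data)   # idempotent: identical contents get the same name
        return digest

Restore is then just the reverse - read the metadata dump, fetch each
hash by name, write it out to the recorded path - which is what makes
the "shell script restore in a pinch" goal realistic.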

Each "backup" (the logical entity) would have associated with it a
complete meta data dump, which would be an (opionally) compressed
(optionally) par2:ed text file containing necessary meta data
information and path information.
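
Concretely I'm thinking of something as dumb as one line per path; the
fields below are just an example (paths containing whitespace or
newlines would obviously need escaping in a real format), and par2, if
wanted, would be generated over the resulting file:

    import gzip, os, stat

    def dump_metadata(entries, outfile):
        """Write one line per path: type, mode, mtime, size, content hash, path.

        entries is an iterable of (path, hexdigest-or-None) pairs.
        """
        with gzip.open(outfile, "wt") as out:
            for path, digest in entries:
                st = os.lstat(path)
                kind = "d" if stat.S_ISDIR(st.st_mode) else "f"
                out.write("%s %o %d %d %s %s\n" % (
                    kind, stat.S_IMODE(st.st_mode), int(st.st_mtime),
                    st.st_size, digest or "-", path))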

Removal of no-longer-needed content in the storage area would be a
matter of taking all backup metadata and discovering the complete set
of files that are actually referenced from an active backup. One can
then remove all non-referenced files.
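
In other words a plain mark-and-sweep; roughly, with repo_list() and
repo_delete(name) again being made-up backend primitives:

    def collect_garbage(active_backups, repo_list, repo_delete):
        """Delete stored objects not referenced by any active backup.

        active_backups is an iterable of parsed metadata dumps, each
        yielding the content hashes it references.
        """
        referenced = set()
        for backup in active_backups:
            referenced.update(backup)
        for name in repo_list():
            if name not in referenced:
                repo_delete(name)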

This scheme automatically supports, with no effort whatsoever,
incrementals, differentials, incremental "folding", etc., as a
side-effect of the content addressable storage.

There is one major complication that I really don't like, which is
that one probably does not want to impose "infinite file size"
requirements on the storage backends. This leaves two likely scenarios:

  - Have backends that don't support this handle it themselves. For example,
    with S3 (and its 4 gig limit) the backend can itself introduce a level of
    indirection. The problem here is that backends end up being more complex,
    which is bad, since they should be as simple as possible and a dime a
    dozen. In addition, even for systems that don't have limits per se, you
    may have very valid reasons *not* to try to upload 500 gig files in one
    transfer.

  - Eat the problem of implementing this generally in the backup system.

I'm leaning towards the latter. Now, instead of looking at this as
"splitting files", we can instead change the role of the content
addressable storage to be a "block storage" rather than a "file
storage". The really annoying part is that you may now have to
introduce a level of indirection between the metadata and the block
storage, unless you use a sufficiently large block size that you can
comfortably have a list of hashes in your metadata - but that feels a
bit iffy.
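
The per-file logic would then be something like the following
(untested; the block size, and whether the hash list lives inline in
the metadata or in a separate indirection object, are exactly the open
questions):

    import hashlib

    def store_blocks(path, repo_put, repo_has, block_size=8 * 1024 * 1024):
        """Split a file into fixed-size blocks, storing each under its hash.

        repo_put/repo_has are made-up backend primitives. Returns the
        ordered list of block hashes, i.e. the per-file indirection that
        the metadata (or a separate indirection object) has to record.
        """
        hashes = []
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                digest = hashlib.sha512(block).hexdigest()
                if not repo_has(digest):   # only upload content not already stored
                    repo_put(digest, block)
                hashes.append(digest)
        return hashes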

The cool thing, though, is that this does mean semi-efficient handling
of some simple cases, like "stuff got appended to a really large file",
even if it is nowhere near as efficient as rsync.

And of course there are the obvious issues of hash collisions and
their potential security impact.

Encryption can be supported pretty easily at the storage level; the
content addressability need not be affected: one can use a "secret"
prefix to the block contents when hashing (assuming a proper
cryptographic analysis doesn't shoot that down).
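
The "secret" prefix is essentially a keyed hash; in a sketch I would
just reach for HMAC, which is the standard construction for that -
whether a bare prefix would also be fine is exactly what the
cryptographic analysis would have to answer. The block itself would
additionally be encrypted (e.g. with GnuPG) before upload; only the
naming is shown here:

    import hashlib, hmac

    def block_name(secret_key, block):
        """Content address for a block when the repository is untrusted.

        secret_key and block are bytes; someone who can read the
        repository cannot confirm guesses about plaintext blocks without
        the key, yet identical blocks still dedup to the same name.
        """
        return hmac.new(secret_key, block, hashlib.sha512).hexdigest()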

Another cool bit is that the format is efficiently self-verifying:
simply running an incremental backup *without* regarding file
modification times ends up being a complete verification of all the
hashes. So one can easily recover from "hmm, the time of day may have
been fubar'ed on this machine - let me re-sync".
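
In other words the verification pass is just the normal per-file backup
step with the mtime shortcut switched off; roughly, with prev being the
(mtime, digest) pair recorded in the previous backup's metadata and
repo_has()/repo_put() made up as before:

    import hashlib, os

    def backup_file(path, prev, repo_has, repo_put, trust_mtime=True):
        """One file's worth of an incremental run.

        With trust_mtime=False every file is re-read and re-hashed, so
        the run doubles as a full verification of the recorded hashes.
        """
        mtime = int(os.lstat(path).st_mtime)
        if trust_mtime and prev is not None and prev[0] == mtime:
            return mtime, prev[1]      # assume unchanged, reuse the old hash
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha512(data).hexdigest()
        if not repo_has(digest):       # only genuinely new content is uploaded
            repo_put(digest, data)
        return mtime, digest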

Opinions?

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <address@hidden>'
Key retrieval: Send an E-Mail to address@hidden
E-Mail: address@hidden Web: http://www.scode.org


