Re: [rdiff-backup-users] How much metadata to store
From: Ben Escoto
Subject: Re: [rdiff-backup-users] How much metadata to store
Date: Mon, 02 Dec 2002 23:38:08 -0800
>>>>> "DG" == dean gaudet <address@hidden>
>>>>> wrote the following on Mon, 2 Dec 2002 15:08:39 -0800 (PST)
DG> i noticed improved performance by enabling noatime,nodiratime in
DG> the mount options for the mirror fs... but this was ages ago
DG> with 0.6.x or 0.7.x i forget which. these options eliminate
DG> disk writes to update the atimes on files/directories which are
DG> accessed -- and directories are considered accessed by
DG> opendir().
DG> i suspect that the real benefit is in not having to traverse the
DG> mirror filesystem to get the filelist...
Yes, that makes sense. Plus it would probably be easier to write if
all the metadata were stored in a single separate file; that way we
wouldn't have to keep switching back and forth. I haven't thought
much about corruption issues, though: what happens when the computer
crashes while rdiff-backup is writing the metadata file, or when the
mirror gets out of sync with the metadata?
I don't think there will be any insurmountable problems, but
there may be tricky cases. In fact, this could raise the complexity
level a few notches. Right now, when updating the destination
directory, rdiff-backup tries to change the mirror and the increments
"simultaneously" by writing everything first and then moving both
files into position one after another. If something goes wrong in
the meantime, I think rdiff-backup tries to back the process out,
and failing that something reasonable still happens. With the
metadata file there would be four things that should happen
"simultaneously": writing to the mirror, making an increment file,
writing to the current metadata store, and writing to the metadata
increment.
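The write-first, move-into-position idea above can be sketched in
Python. This is only an illustration of the pattern, not rdiff-backup's
actual code; the function name and the ".tmp" suffix are made up here:

```python
import os

def atomic_write(path, data):
    """Write data to a temporary file, then rename it into place.

    On POSIX filesystems rename() within one filesystem is atomic,
    so a crash mid-write leaves either the old complete file or the
    new complete file, never a half-written one.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk first
    os.rename(tmp, path)      # atomic replacement of the target
```

The tricky part the paragraph above points at is that four such
renames (mirror, increment, metadata store, metadata increment) still
cannot be made atomic as a group; a crash between renames leaves the
set inconsistent, which is where the back-out logic comes in.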
DG> and even better would be if you could avoid recalculating all
DG> the signatures and retransmitting them. it seems like you could
DG> keep a copy of the mirror metadata on the mirror and the
DG> primary, and use a signature comparison of the two at the
DG> beginning of the backup to speed up the file selection. this
DG> would help a mirror scale to hundreds of primaries (i suspect
DG> that the code today won't scale because the mirror has to parse
DG> all of its files for every primary it has a mirror of).
When doing profiling I've never noticed signature calculation time to
be significant. Of course it could be, for instance if there is one
huge file which changes all the time. But I'm not sure whether
speeding up signature calculation would actually help anyone. (Do
tell me if you've noticed something, though; that noatime trick is
good to know.)
DG> it'd be pretty cool to do a filesystem extension which allows
DG> you to store an md5/sha1 of the file as an extended attribute
DG> which is removed whenever the file is modified :)
Good idea. Or, instead of removing it, this increasingly improbable
filesystem could just have meta-metadata: the hash could be dated, and
we could assume the hash was up to date if the hash date (measured in
nanoseconds, of course, ignoring the fact that my computer's clock
seems to lose 5,000,000,000 nanoseconds every day) matched the file
modification date. Maybe some rsync signature data could be stored
as metadata too.
DG> it sure is convenient to have all the files available in the
DG> mirror and to push the compression/packing problems onto the
DG> filesystem. (*)
Can we come up with some rule for when we would want to avoid the
filesystem and when we want to use it? Right now it seems we have two
extremes:
Old rdiff-backup <----------------------------------------> duplicity
and are discussing moving rdiff-backup further to the right.
Duplicity bypasses the filesystem entirely and can be used against,
for instance, an FTP server. The original rdiff-backup assumed that
the destination filesystem would be used to store all data and
metadata.
The rule I had in mind was that we can bypass the filesystem
when it doesn't provide the necessary services (like certain metadata
functionality). But now we are discussing bypassing it to speed
things up. There's nothing necessarily wrong with that, but it might
be nice to have some firmer conceptual ground for making sense of
these choices.
DG> (*) i'd even extend this to encryption. but i'm not sure there
DG> are any really secure encrypted filesystems on free unix
DG> yet... on linux, using the encrypted loopback mount is not
DG> secure for large filesystem because such a filesystem has a vast
DG> amount of predictable data (consider that a typical linux
DG> install has about ~1GB of exe/lib/etc. data which is easy to
DG> predict), which allows "known-plaintext" attacks against the
DG> cipher.
I think encryption is fundamentally different. Even if you had an
encrypted file system, if you don't trust the remote host, the data
would have to be sent to you encrypted, and then you would decrypt
it. Plus to avoid giving away, for instance, the size and number of
various files, the system would have to send you blocks of encrypted
data. So there would be no way to apply the rsync algorithm unless
the signatures were pre-computed.
About known-plaintext: is this a big issue? Whenever they do one
of those RSA or similar challenges they tell everyone the message is
"The password is xxxxxxxxxx", and the key still ends up getting
brute-forced. Anyway, I think modern ciphers are generally expected
to be resistant to these kinds of attacks.
--
Ben Escoto