
Re: [rdiff-backup-users] Proposal: Storing excess file information


From: Bud P . Bruegger
Subject: Re: [rdiff-backup-users] Proposal: Storing excess file information
Date: Mon, 2 Dec 2002 11:10:15 +0100

Hi Ben and Dave,

[Ben, apologies for still not having gotten back to you on your patch--I've
been interrupted in my planned work, out of town a lot, etc.]

I thought I'd throw some more ideas into the discussion.  Thinking aloud...

Assumed objectives:
===================

  * simple (to implement)
  * fast
  * no tool lockin (but migration is infrequent)

Implementation Idea:  
====================

Python Shelves 
(http://python.org/doc/current/lib/module-shelve.html)

I recently used shelves for the first time and quite liked them.  Basically, you
put multiple pickled Python objects in a file and keep a (string) index on them
for fast extraction.  From Python, it looks just like a dictionary, but it is
persistent and file based.  Underneath, it uses some dbm implementation (gdbm
or similar; which one is platform dependent, I believe).  I believe (but could
be wrong) that the cdb proposed by Dave is basically an efficient and small
implementation of the dbm API.  (If I remember correctly, the c stands for
constant??? and it is optimized for querying, not bothering about updates???).
I don't believe there are big performance differences between the various dbm
implementations--and surely, all of them are MUCH faster than a text file based
approach.

So, applied to the problem at hand: file metadata would be defined in a Python
class with attributes such as ownerName, grpName, permissions, etc.  These
objects are stored in a shelve (dictionary API) using the path to the file as
the key.
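In code, the idea would look something like this (the FileMetadata class and
its attribute names are just illustrative, not a proposal for the actual
schema):

```python
import shelve

class FileMetadata:
    """Illustrative metadata record; real attributes would differ."""
    def __init__(self, ownerName, grpName, permissions):
        self.ownerName = ownerName
        self.grpName = grpName
        self.permissions = permissions

# Store a metadata object, keyed by the file's path.
db = shelve.open("metadata.shelve")
db["/home/bud/report.txt"] = FileMetadata("bud", "users", 0o644)
db.close()

# Later: a fast lookup that returns a ready-to-use Python object.
db = shelve.open("metadata.shelve")
meta = db["/home/bud/report.txt"]
db.close()
```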

So if you want to ask what metadata a certain file has, you get very fast
answers that are ready-to-use Python objects.  It is easy:

* apart from the open and close, you deal with it like a dictionary 

* it is already part of python, no additional code to distribute 

* no need to parse, marshal, or bridge the impedance gap between some database
  format and objects...

Another advantage of shelves is that the metadata can become huge; it does not
need to fit into memory.

As for tool lockin, it is somewhat less ideal:

* it is surely Python specific

* since the choice of underlying dbm depends on the platform, the index file
  cannot simply be copied

* BUT: migration (platform, tool) is not all that frequent, so it may be
  acceptable to cover these issues with a migration tool.

So the migration tool I had in mind would again be simple:

* a dumper that uses an existing XML pickler (by Gnosis?) to write out the
  whole metadata object (shelve/dictionary) as an XML file.  The XML may not be
  what you would hand-design--but it comes free of charge and it is XML--so
  that will make any manager on the team happy..

* a parser that reads this XML file back in to get the dictionary/shelve.
  While I haven't tried it, I believe this is again just a call to the existing
  XML unpickler that comes in the same package.

So anyhow, you have an XML solution that basically comes free of charge.
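A sketch of the dump/restore round trip.  I haven't tried the Gnosis pickler
myself, so I use the standard pickle module as a stand-in here; if the Gnosis
API mirrors pickle's dump/load calls, as I believe it does, swapping it in
should be trivial:

```python
import pickle
import shelve

def dump_metadata(shelve_path, dump_path):
    # Copy the shelve into a plain dict, then serialize that dict to a
    # single platform-independent file (this is where the XML pickler
    # would be substituted for pickle).
    db = shelve.open(shelve_path)
    snapshot = dict(db)
    db.close()
    with open(dump_path, "wb") as f:
        pickle.dump(snapshot, f)

def restore_metadata(dump_path, shelve_path):
    # Read the dump back in and rebuild a shelve on the new platform.
    with open(dump_path, "rb") as f:
        snapshot = pickle.load(f)
    db = shelve.open(shelve_path)
    db.update(snapshot)
    db.close()
```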

Variation
=========

If it is all right for all the metadata to always fit in memory, simply
pickling a dictionary object may have some advantages:

* pickles seem to be platform independent
* a configuration option could decide the pickling format:
  - ascii (less space efficient)
  - binary 
  - xml (see above)

Lower Case File Names
=====================

Since we would already have a metadata repository of some kind, it should be
easy to also treat filenames as metadata.  This would decouple filenames in the
backup storage from those in the original filesystem.

Examples of filename choices in the backup repository include:

* use a randomly generated filename (some number?  Maybe rather ugly, and it
  makes debugging and use with other tools difficult)

* a simple coding scheme where a filename is mapped into two parts:
  - lowercase name
  - capitalization info
  An example is quicker:
    FileName -> filename_0_4
    where 0 and 4 are the string indices of the characters to capitalize

There may be much better solutions...  And obviously, there is some quoting
that has to be done in either case...
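A quick sketch of such a coding scheme (deliberately ignoring the quoting
issue, so it misbehaves on names that themselves end in underscore-digit
sequences):

```python
def encode_name(name):
    # Lowercase the name and append the indices of the characters that
    # were capitalized, e.g. FileName -> filename_0_4.
    caps = [str(i) for i, c in enumerate(name) if c.isupper()]
    return "_".join([name.lower()] + caps)

def decode_name(coded):
    # Split trailing numeric fields off; everything before them is the
    # lowercased name.  (Real code would need quoting for names that
    # contain underscores followed by digits -- see the caveat above.)
    parts = coded.split("_")
    indices = []
    while parts and parts[-1].isdigit():
        indices.append(int(parts.pop()))
    chars = list("_".join(parts))
    for i in indices:
        chars[i] = chars[i].upper()
    return "".join(chars)
```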


Compression
===========

If compression is really necessary (pickling is probably not overly space
hungry), this could be done after closing and before reopening the shelve
file, or after pickling and before unpickling a pickle file, respectively.
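For the pickle-file case this could be as simple as wrapping the file in gzip
(gzip is just one choice of compressor here):

```python
import gzip
import pickle

def save_compressed(metadata, path):
    # Pickle straight into a gzip stream: compression happens after
    # pickling, as described above.
    with gzip.open(path, "wb") as f:
        pickle.dump(metadata, f)

def load_compressed(path):
    # Decompress transparently before unpickling.
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```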

I was thinking of a more transparent approach using PyVFS
(http://www.pycage.de/coding_pyvfs.html) but that seems to work only on Unix
platforms.  


Side Note
=========

A metadata approach seems to open the way for compressed mirrors...


Summary
=======

I believe that KISS is the best feature of the above implementation ideas,
followed by speed and flexibility (configurable pickle format).  The major
downside is surely the implementation-specific file format (particularly for
shelves).  If pickles were used instead of shelves, this drawback could be
configured away (XML pickling) at the cost of some more filespace overhead.

Well, just my two cents...

cheers
--bud



