[Duplicity-talk] Speeding up duplicity


From: Yuri D'Elia
Subject: [Duplicity-talk] Speeding up duplicity
Date: Thu, 07 Feb 2013 15:30:37 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130116 Icedove/10.0.12

Hi everyone.

I was trying to use duplicity to perform a full system backup of our system here (~7TB of data to be backed up) on our EMC ATMOS "local cloud" storage.

The first problem I encountered is that duplicity performs poorly on large files (>2 GB), due to the small block size.

I updated this bug report:

  https://bugs.launchpad.net/duplicity/+bug/897423

but I wanted to know if you see anything against this approach beyond the increased diff size (that is, is the backup compatible with deltas that have a different block size, etc.?), and whether there's anything I can do to get this patch integrated.

With this patch I can now increase the block size and run duplicity on files up to ~50 GB (tested so far) without maxing out the CPU.
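To make the idea concrete, here is a rough sketch of the kind of scaling I have in mind (this is not the patch itself; the square-root heuristic and the --max-blocksize name are just placeholders):

  # Sketch only: scale the signature block size with the file size so that
  # very large files don't end up with millions of tiny blocks.
  import math

  def pick_blocksize(file_size, min_blocksize=512, max_blocksize=2 * 1024 * 1024):
      """Grow roughly with the square root of the file size, clamped."""
      size = int(math.sqrt(file_size))
      size = max(min_blocksize, min(size, max_blocksize))
      return size - (size % 512)        # keep blocks 512-byte aligned

  for s in (1 << 20, 2 << 30, 50 << 30):    # 1 MB, 2 GB, 50 GB
      print(s, pick_blocksize(s))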

The second problem is that the local metadata is still quite big. I end up with nearly 80 GB of metadata (~1% of the original data), which is OK, but I would like to reduce it to less than 1 GB (if possible) for practicality. For this I was also thinking about introducing a --min-blocksize option.
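The reasoning is just back-of-the-envelope arithmetic: signature size is roughly (data size / block size) times a per-block overhead, so a block-size floor directly caps the metadata. A quick sketch (the 12 bytes/block figure is only my assumption about the per-block signature overhead):

  # Rough estimate of signature metadata as a function of block size.
  def signature_size(total_data, blocksize, per_block=12):
      return (total_data // blocksize) * per_block

  TB = 1 << 40
  for bs in (512, 64 * 1024, 1 << 20):
      est = signature_size(7 * TB, bs)
      print("blocksize %8d: ~%.1f GiB of signatures" % (bs, est / float(1 << 30)))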

Network bandwidth, though, is still not optimally utilized. I'm using the --asynchronous-upload option, but only one upload at a time is performed, which limits the upload speed to that of a single node of the ATMOS storage.

Since we're still talking about weeks of backup, I really need to be able to issue concurrent upload requests, so that each request is potentially handled by a different node of the storage. Currently it takes ~10 days to perform a full backup, which is unacceptable, as I would like to compute daily deltas and perform a full backup at least once a month.

I would like pointers from the developers here on whether you think this approach is feasible with the current code (I was perusing the async scheduler before, but thought I'd ask before going forward). I would simply like to allow the producer to create volumes up to a requested number (say, --async-requests 5), so that uploads can happen concurrently.
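Something with this producer/consumer shape is what I mean (only a sketch; upload_volume() and the volume names are placeholders, not the actual async scheduler API):

  # Sketch: keep up to --async-requests uploads in flight at once, so that
  # different requests can be served by different ATMOS nodes.
  from concurrent.futures import ThreadPoolExecutor

  def upload_volume(path):
      # placeholder for the backend's actual upload call
      print("uploading %s" % path)

  def backup(volume_paths, async_requests=5):
      with ThreadPoolExecutor(max_workers=async_requests) as pool:
          futures = [pool.submit(upload_volume, p) for p in volume_paths]
          for f in futures:
              f.result()    # re-raise any upload error

  backup(["duplicity-full.vol%d.difftar.gpg" % i for i in range(1, 11)])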

Another issue I have is that duplicity is not threaded, and thus it sometimes stalls on CPU even when I have >32 CPUs available. I can read at ~200 MB/s from my local storage, but not with duplicity, due to CPU contention. I would like to separate the compression/encryption stage here (which is currently performed by gpg) into a simple pipeline instead, so that I can arbitrarily choose the compressor and get better CPU usage through separate processes. Again, this sounds simple enough, but would such a patch be accepted? Any major problems?
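To be clear about what I mean by a pipeline, here is the rough shape (a sketch only; pigz is just an example compressor, and the symmetric/passphrase-file gpg setup is one possibility, not what duplicity does today):

  # Sketch: compression and encryption as separate processes, so they can run
  # on different CPUs while duplicity keeps producing data.
  import subprocess

  def compress_and_encrypt(volume_path, out_path, passphrase_file):
      with open(volume_path, "rb") as src, open(out_path, "wb") as dst:
          gz = subprocess.Popen(["pigz", "-c"], stdin=src, stdout=subprocess.PIPE)
          gpg = subprocess.Popen(
              ["gpg", "--batch", "--symmetric",
               "--passphrase-file", passphrase_file, "-o", "-"],
              stdin=gz.stdout, stdout=dst)
          gz.stdout.close()     # let gpg see EOF when pigz finishes
          if gpg.wait() != 0 or gz.wait() != 0:
              raise RuntimeError("compression/encryption pipeline failed")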

I will also be releasing the ATMOS backend, which uses a (modified) version of the atmos-python code from http://code.google.com/p/atmos-python/, as soon as it is battle-tested enough. If anybody is interested in testing, please don't hesitate to ask.

Also, any pointers on using duplicity for large-scale backups would be appreciated. I was using it for some small systems, but since we gained access to the EMC ATMOS storage at another facility, it suddenly became quite useful for this larger task.

Best regards.


