From: Tim Churches
Subject: Re: [Gnumed-devel] Approaches to maintain clinical data uptime
Date: Thu, 04 May 2006 04:02:06 +1000
User-agent: Thunderbird 1.5.0.2 (Windows/20060308)

James Busser wrote:
> 
> On May 3, 2006, at 5:59 AM, Karsten Hilbert wrote:
> 
>> On Wed, May 03, 2006 at 01:04:08AM -0700, Jim Busser wrote:
> 
>>> - can Syan or Karsten, if the work involved is limited, attempt a
>>> load test of writing the dump file, so we might know how long this
>>> would require and what its size might be...
>> This is entirely dependant on the database content. The dump
>> file may grow huge, indeed (several GB, eventually)
> 
> this would mean it cannot be copied onto single CDs, hopefully they
> would fit a DVD else each dump would have to be split over several
> pieces of media, which would be an incredible workflow pain (and
> pertinent to office procedures).

You should plan on needing DVDs eventually. If you store a lot of
poorly-compressible BLOBs in the database (e.g. JPEG image files), then
possibly quite soon. Double-layer writable DVDs can hold about 8.5GB,
though, can't they?

> Also pertinent to workflow is the amount of time required for the backup
> dump to be written to disk so that the media could be taken away. Sure,
> it can be written instead at 0200h. It wouldn't matter provided the
> dumps had been encrypted AND the purpose of that media was purely
> notary, but if it may have to be the source of restoration (because
> there was no slave, or because the master and slave got corrupted) then
> that media would have needed to be taken away "live". So where office
> activity might end at 5:30 pm daily, the amount of time required for the
> dump to be generated, notarized, encrypted and written to media
> (hopefully automated) becomes of great interest to the last employee who
> may need to be scheduled to wait 10 vs 20 vs 40 minutes to be able to
> leave, if the office policy is built around being able to take away the
> media. But if the dump does not depend on a "clean" work stoppage it
> could be scheduled to begin earlier, if any server performance penalty
> would be tolerable.

Yes, if the pg_dump takes 10 minutes and the encryption and burning to
disc another 20 minutes, then just start the pg_dump at 5:00pm, and
remember that patient transactions after 5pm will be on the next day's
backup.
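
To make the timing concrete, here is a rough sketch of the
dump-and-encrypt step that cron could kick off at 5:00pm (the database
name "gnumed", the GnuPG recipient "clinic-backup" and the paths are
just placeholders, not anything GNUmed ships):

import datetime
import subprocess

stamp = datetime.date.today().isoformat()
dump_file = "/var/backups/gnumed-%s.sql" % stamp

# 1. Dump the database to a plain SQL file.
subprocess.check_call(["pg_dump", "--file", dump_file, "gnumed"])

# 2. Encrypt the dump for the practice's backup key;
#    gpg writes dump_file + ".gpg".
subprocess.check_call(["gpg", "--recipient", "clinic-backup",
                       "--output", dump_file + ".gpg",
                       "--encrypt", dump_file])

# 3. Burning dump_file + ".gpg" to DVD and deleting the plaintext dump
#    would follow here, once that part is automated.

Since pg_dump takes a consistent snapshot without blocking normal use
of the database, the dump itself does not need a "clean" work stoppage
- only the burn-and-take-away step does.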

Can I mention another interesting option? Have a look at Amazon S3
(Simple Storage Service). Amazon is offering online storage of any file
or BLOB of up to 5GB in size (and as many of them as you wish, so no
total data limit), accessible via simple Web APIs (they provide a
Python library to help, but it also works fine with just HTTP and
curl). Cost is US$0.15 per GB per month for storage, and US$0.20 per GB
for upload or download - all billed to your credit card. They promise
best-effort reliability using the same distributed, highly redundant
storage that Amazon uses itself. After applying for an account, you can
store data in "buckets" and control who can access those buckets
(remembering that Amazon staff can potentially access everything, since
they run the servers). Thus you could set up a series of shell scripts
which run pg_dump, encrypt the dump and upload it to Amazon S3 in a
bucket, where it remains for a month. The next day the encrypted dump
is written to a different bucket, and so on, until after one month the
first one is overwritten (more elaborate grandfather-father-son (GFS)
rotation/archiving schemes would be possible and desirable, of course).
If each encrypted DB dump is about 1GB, then you are storing on average
30GB per month and uploading 30GB - at a total cost of, um, US$10.50
per month. Not bad for off-site backup storage which is likely to be
accessible 24x7 if you need to retrieve a backup to restore - in a
different location, if needed. The limiting factors are your Internet
upload and download speeds, plus any volume limits or charges your
Internet provider might impose - though uploads are often unmetered. If
you have, say, a 512/512 symmetrical ADSL connection, you should be
able to upload at about 150MB per hour (512 kbit/s is roughly 230MB per
hour raw, so 150MB per hour sustained is realistic), so even a 1GB DB
dump could be uploaded overnight. I suspect your DB dumps will be much
smaller than that.
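
The upload call itself depends on which client you use (Amazon's sample
Python library, or a signed HTTP PUT driven by curl), so the sketch
below leaves that as a placeholder and only shows the rotation idea, in
a slight variation of the above: one bucket with a key per day of the
month, so each slot is silently overwritten about a month later. The
bucket name and key pattern are just illustrations:

import datetime

def s3_put(bucket, key, path):
    # Hypothetical placeholder, not a real API: PUT the file at `path`
    # into bucket/key using whichever S3 client you settle on.
    raise NotImplementedError

def upload_todays_dump(encrypted_dump, bucket="clinic-backups"):
    # The key cycles through 01..31, so roughly one month of dumps is
    # retained before each slot is overwritten; a proper GFS scheme
    # would extend this with weekly and monthly slots.
    key = "gnumed-dump-day-%02d.sql.gpg" % datetime.date.today().day
    s3_put(bucket, key, encrypted_dump)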

If anyone is interested in coding up a module in Python to handle PG
dumps, encryption and uploading/downloading/overwriting of encrypted DB
dump archives on Amazon S3, could they talk to me offline (hint, hint,
Syan)? We would love to include such a facility in NetEpi. There is no
reason why the same module, dual-licensed under Mozilla and GPL,
couldn't be included in both NetEpi and GNUmed and/or offered as a
stand-alone module which would be useful for lots of projects. Only a
very small portion would be PostgreSQL-specific (just the pg_dump
command), so it would be easy to let it handle MySQL, Interbase etc. as
well, and to provide a command-line/shell-scripting interface as well
as exposing a Python API. The chef of such would receive the people's
acclaim forever, as they say on Iron Chef.
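
Purely as a straw man for discussion (none of these names exist yet),
the Python API of such a module might look something like this, with
the dump command swappable so a MySQL or Interbase back end just passes
a different command line:

class DumpArchiver:
    """Straw-man interface: dump, encrypt and ship a DB to remote storage."""

    def __init__(self, dump_cmd, gpg_recipient, store):
        self.dump_cmd = dump_cmd            # e.g. ["pg_dump", "gnumed"] or a mysqldump call
        self.gpg_recipient = gpg_recipient  # GnuPG key the dumps are encrypted to
        self.store = store                  # object with put()/get()/delete(), e.g. an S3 wrapper

    def backup(self, key):
        """Run dump_cmd, encrypt the output and upload it under `key`."""
        raise NotImplementedError

    def restore(self, key, target_path):
        """Download the archive stored under `key` and decrypt it to target_path."""
        raise NotImplementedError

A thin optparse front end over the same class would then cover the
command-line/shell-scripting side.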

Note that competing services to Amazon S3, which are expected to match
the S3 pricing, will also be starting soon (e.g. http://www.omnidrive.com),
and Google are rumoured to be launching "Gdrive" soon, presumably with a
scriptable API as well as a Web interface. There are lots of other
online storage providers, but they have all been one to three orders of
magnitude more expensive than Amazon S3.

Of course, you can also abuse Gmail and use it to store encrypted DB
dumps - Gmail allows over 2.5GB per account - but the maximum message
attachment size is only 10MB, so you would need to split each DB dump
over several files. Such use also violates the Gmail usage agreement
and is easily detectable by Google (they just need to calculate the
entropy of a mail attachment to decide whether it is an encrypted
file), and they may elect to delete such files or lock accounts which
contain lots of them and not much human-readable email. By contrast,
Amazon S3 is designed from the outset for things like encrypted backup
storage, and Amazon **want** users to use it for exactly that.
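
For what it's worth, the entropy check mentioned above really is only a
few lines - encrypted (or well-compressed) data sits close to 8 bits of
Shannon entropy per byte, while ordinary English email text sits around
4-5 - so this is trivial for Google to spot. A sketch (nothing Google
has published, just the standard calculation):

import math
from collections import Counter

def bits_per_byte(data):
    # Shannon entropy of a byte string, in bits per byte.
    counts = Counter(data)
    total = float(len(data))
    return -sum((n / total) * math.log(n / total, 2)
                for n in counts.values())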

Tim C



