From: Jörg F. Wittenberger
Subject: Re: [Chicken-hackers] pastiche db drop
Date: Tue, 04 Feb 2014 13:26:56 +0100
User-agent: Mozilla/5.0 (X11; Linux armv7l; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
As this thread is already a bit off-topic I feel compelled to add a
"me too" story here as well. You know what: the whole Askemos/BALL thing we have been working on for the past decade is made that way (and I actually felt sorry when I read about the incident thinking "could not have happened here"). Every object has some script code (though we call that code a mutually audited contract), which in turn is an object itself. Every BLOB is stored under it's SHA256 hash. Every object's state is made from BLOBs. Either plain data or a sqlite3 database (both might have a substructure, which would be Merkle tree). Every object has a list of replica. So in case anything breaks we'll just replace the machine and wait for the system to restore the state from those other peers (kinda bittorrent style). To keep a backup we simply keep a copy of the state's hash elsewhere. E.g. the http://askemos.org website: simply a sqlite3 database plus a tree of BLOBs we felt we wanted to store in a WebDAV directory without the help of SQLite. To keep a backup there's a button. At that point I feel I should add a THANK YOU to you guys: in practice it's being built by Chicken!! /Jörg BTW: These days I found the first "competition" on the web. http://ethereum.org want's to build the closest system I know of; at least they are after the code=contract analogy too. Just I'm not so optimistic about the performance they'll get out of it. After all ethereum is built on Bitcoin technology. Askemos/BALL has no memory overhead (like this block chain) and transactions take about 0.5 seconds over WAN. I don't know yet how much faster than Bitcoins 10 minutes plus ethereal will become. Am 03.02.2014 15:23, schrieb Alaric
On 03.02.2014 at 15:23, Alaric Snell-Pym wrote:
On 03/02/14 14:13, John Cowan wrote:

> On the other hand, that was the *only* time the system went down in a serious way. It was a Mickey-Mouse-watch design: if you drop it, it stops; but if you pick it up and shake it, it works again. In particular, if the web and FTP sites were messed up, I would just say "Wait an hour for reprocessing", and everything would be right again.

Yeah! I once had the delight of looking after a somewhat distributed system (an online service composed of many complex parts, with various back-end shared components such as databases, file stores and "business logic" RPC servers, as well as various front-end systems), on a shoestring (read: nearly no hardware budget, growing usage requirements, growing feature requirements, no proper sysadmins: just me writing the code and maintaining the software/hardware/network).

With firefighting being a constant threat to my time, I tried hard to put fallbacks and retries into the system wherever I could, so that the (frequent) component failures didn't translate into observed system failures very easily!

A big part of that was making as many actions as possible asynchronous and putting them into persistent queues, while daemons pulled jobs from the queues in such a way that a failed attempt re-queues the job but increments a try counter, so a bad job doesn't wedge the queue forever. This meant it was hard to overwhelm the system with load spikes (they just consumed disk space in the queue), and if components went down, jobs just waited until the component came back up (a sketch of this pattern follows below). I should write up all of the tips and tricks I used in a blog post some day!

I did some fun stuff with system monitoring to figure out where bottlenecks or deadlocks were, which I've talked about a bit at http://www.snell-pym.org.uk/archives/2012/12/27/logging-profiling-debugging-and-reporting-progress/ - but not so much on the fault-tolerance side.

ABS
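The retry-queue pattern described above can be sketched in a few lines. This is only an illustration under assumed details (a sqlite3-backed jobs table, a MAX_TRIES limit, and the enqueue/work_once names are all made up for the example), not how the actual system was built:

    import sqlite3

    MAX_TRIES = 5   # assumed limit; a job past this is parked, not retried

    db = sqlite3.connect("queue.db")
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                      id      INTEGER PRIMARY KEY,
                      payload TEXT    NOT NULL,
                      tries   INTEGER NOT NULL DEFAULT 0,
                      state   TEXT    NOT NULL DEFAULT 'ready')""")
    db.commit()

    def enqueue(payload):
        # Load spikes only consume disk space here until a daemon catches up.
        db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
        db.commit()

    def work_once(handler):
        # Pull one ready job; the row survives crashes because it lives on disk.
        row = db.execute("SELECT id, payload, tries FROM jobs "
                         "WHERE state = 'ready' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return False
        job_id, payload, tries = row
        try:
            handler(payload)
            db.execute("UPDATE jobs SET state = 'done' WHERE id = ?", (job_id,))
        except Exception:
            # Failed attempt: re-queue with an incremented try counter, or
            # park it so one bad job cannot wedge the queue forever.
            new_state = 'ready' if tries + 1 < MAX_TRIES else 'failed'
            db.execute("UPDATE jobs SET tries = ?, state = ? WHERE id = ?",
                       (tries + 1, new_state, job_id))
        db.commit()
        return True

    # Example: the job waits in the queue until the handler succeeds; a real
    # daemon would sleep or back off between passes instead of busy-looping.
    enqueue("send-welcome-email:42")
    while work_once(lambda p: print("processed", p)):
        pass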