monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] Re: Scalability question


From: Nathaniel Smith
Subject: Re: [Monotone-devel] Re: Scalability question
Date: Sun, 13 Aug 2006 14:43:42 -0700
User-agent: Mutt/1.5.12-2006-07-14

On Wed, Aug 09, 2006 at 08:45:51AM -0700, Steven E. Harris wrote:
> Nathaniel Smith <address@hidden> writes:
> 
> > To make our life simpler, we always send stuff in such an order that
> > we can simply stream it straight to the db and consistency is always
> > maintained; that way we can simply hit "commit" at any time.
> 
> So the sender orders items with respect to the dependency graph such
> that any referenced items arrive before any items referencing them?

Yes.

> > Well, merkle tries or the rsync algorithm
> 
> I've been thinking about this for several days now, and it's probably
> fodder for a new thread.
> 
> Can you explain how monotone uses Merkle Tr(e|i)es during
> synchronization? Specifically, what are the items or data stream being
> summarized by the hash tree? My first guess is the revision graph,
> flattened somehow, but feel free to correct and elaborate on that.

The merkle trie synchronization algorithm is a set synchronization
algorithm -- given a set of items here and a set of items there, each
with a unique id, it lets each side figure out what it needs to send
so that the other side will end up with everything in either set.

We use it 4 times, all in parallel:
  -- on keys
  -- on epochs
  -- on certs
  -- on revisions
The first three of these are genuinely sets; the last one has some
topology that in principle could be exploited for more efficient
synchronization, but it's not clear how to actually do this in
practice.  We just treat them as a homogenous set, and then sort
everything between the merkle refinement and the actual sending.

> > to make transmissions idempotent
> 
> I can see how using hash trees avoids having to retransmit data
> already held on both sides, but I take idempotent to mean that
> transmitting the data (or doing anything) twice has no distinguishable
> effect as opposed to doing it only once. Did you mean "avoid sending
> more than we need to" or "don't worry if we send more than we need
> to"?

Yeah, robustness against accidentally sending things multiple times is
indeed nice.  I more meant the property that everything is stateless,
and you can continue interrupted transfers easily and cheaply (this is
like how if you need to do a file transfer over a lousy connection,
you use rsync, because if you just keep repeating the command long
enough it eventually finishes).

Now that you mention it, I'm not actually sure why I called that
"idempotent"..  But anyway, yeah -- content addressing makes storage
easier and reduces the severity of bugs, stateless protocols make
everything easier and reduce the occurrence of bugs, the ability to
magically pick up where one left off, without any special code paths,
definitely reduces user frustration and increases error recovery.

> > plus content-addressing
> 
> This one has proved to be a stumper trying to find references on the
> Web, trying to avoid having to ask here. My best guess is that you're
> referring to using hash values for identity, so that the same content
> maps to the same identity value, making it easy to discover equal
> content. But the phrase also suggests being able to look something up
> with a special key. Can you clarify?

That's it exactly.  See
  http://en.wikipedia.org/wiki/Content_addressed_storage
for instance (though that is indeed somewhat different than what we
do... citeseer might be better than google in this case, if you really
want background reading).

-- Nathaniel

-- 
"If you can explain how you do something, then you're very very bad at it."
  -- John Hopfield




reply via email to

[Prev in Thread] Current Thread [Next in Thread]