Re: [Monotone-devel] Scalability question


From: Nathaniel Smith
Subject: Re: [Monotone-devel] Scalability question
Date: Fri, 4 Aug 2006 20:10:16 -0700
User-agent: Mutt/1.5.12-2006-07-14

On Fri, Aug 04, 2006 at 05:19:04PM -0400, Jonathan S. Shapiro wrote:
> On Fri, 2006-08-04 at 14:47 -0500, Timothy Brownawell wrote:
> > We have seen some slowness, yes. Our current thinking is to store our
> > rosters as table rows.
> 
> Yes. This seems good, but it raises a question at the replication layer.
> If the fundamental unit of replication at the sync layer is rows, you
> can end up in a situation where a replication does not complete, leaving
> you with 4 out of 5 rows. This needs to be detectable.
> 
> Does MTN sync at row granularity, or at some higher granularity? How
> does it deal with incomplete replication?

Monotone makes a very strict separation between internal and external
representations.  The external representation is heavily optimized to
be simple and understandable -- there are just files, manifests,
revisions, and certs.  Files are file snapshots, revisions are tree
deltas located in history, manifests are end-to-end checks on the
delta application logic, and certs are signed metadata on revisions.
All of these have an as-straightforward-as-possible, documented,
human-readable textual form (except certs, but this is a bug we'll
probably fix in the next n months, and even they're just
revision id/key/value/signer/signature tuples).
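
Concretely, you can picture a cert as nothing more than a small record
along these lines (a sketch of the concept only; the names here are
illustrative, not monotone's actual declaration):

    #include <string>

    // Rough sketch of what a cert carries, conceptually.
    struct cert
    {
      std::string revision_id;  // hash of the revision the cert is attached to
      std::string name;         // e.g. "branch", "author", "date", "tag"
      std::string value;        // the metadata value
      std::string signer;       // id of the key that made the signature
      std::string signature;    // signature over the fields above
    };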

This external representation is the normative one; if some piece of
information is not in those formats, then it is not tracked.

Internally, we use some potentially totally different-looking
representations -- manifests don't exist, "rosters" do, we open-code
the revision graph on disk, files are not stored as snapshots but as
delta chains, etc., etc.  But that's all internal implementation
details, and subject to change at any time.

The advantages of this separation are partly just the ordinary
advantages of decoupling -- it lets us change internal structures
without disturbing external ones (and since the external ones are
hashed, changes there are very expensive), and we can optimize each
separately (one for clarity and documentability, the other for
performance).

_But_, the other major purpose this distinction serves is that it
supports our paranoia.  Our general rule is that if something could
possibly go wrong, we have to check that it doesn't.  Checks like
"does this revision actually make sense?" (e.g., no patches to files
that don't exist, no renaming two files to the same name, etc. etc.)
happen at the boundary between an individual monotone process and the
rest of the world, when we convert between our internal formats and
our external formats.  Data within a database is assumed to be subject
to disk corruption, but that's it; whenever we talk to the user or the
network, we assume arbitrary things can go wrong.
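
To give the flavor of those boundary checks, here is a sketch with
illustrative types and names only (the real checks live in monotone's
roster code and are much more thorough):

    #include <map>
    #include <set>
    #include <stdexcept>
    #include <string>

    struct revision_sketch
    {
      std::map<std::string, std::string> patches;  // path -> new content id
      std::map<std::string, std::string> renames;  // old path -> new path
    };

    void check_sane(revision_sketch const & rev,
                    std::set<std::string> const & parent_paths)
    {
      std::set<std::string> rename_targets;
      for (auto const & r : rev.renames)
      {
        if (parent_paths.count(r.first) == 0)
          throw std::runtime_error("rename of a nonexistent file");
        if (!rename_targets.insert(r.second).second)
          throw std::runtime_error("two files renamed to the same name");
      }
      // (a real check also has to account for adds, deletes, and
      //  patches to freshly renamed paths; omitted for brevity)
      for (auto const & p : rev.patches)
        if (parent_paths.count(p.first) == 0)
          throw std::runtime_error("patch to a nonexistent file");
    }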

So this is a really really long-winded way of saying, network
operations always and only happen with external formats -- file
snapshots, revisions, and certs.  (Manifests are not transmitted over
the network, only their hashes, because as mentioned above they're
just a sanity-check on the revision logic.)

> I have a related question about certificates: at the end of a sync, how
> do I know that I have in hand all of the relevant certificates
> concerning some given file? For Q/A certificates I probably can't, but
> for file type information I really need to know this. How is this
> handled in the mtn architecture?

Certs are on revisions, not on files.  (Files and directories can have
arbitrary attributes attached to them, which are versioned and part of
our model of filesystem trees.  This is probably where you'd put file
type information anyway.)

At the end of a sync, you don't know you have all relevant certs.  We
actually enforce that when you run a sync operation, you don't get
your prompt back until both sides have committed (in the ACID sense)
everything transferred to disk, and the whole communication is MAC'ed,
so you can be reasonably certain you really got everything the other
side wanted to send you, and vice-versa.  But that's just frosting;
we wouldn't lose anything fundamental without those guarantees,
because the whole design assumes that we only ever have a partial
window onto history.  In particular, maybe you got everything the
server had, but there's no way to stop someone from having just issued
a new cert that they haven't even told the server about yet...

We simply maintain consistency during a sync -- we don't store a
revision until we have all the files it references, and we don't store
a cert until we have the revision it references.  An interrupted sync
can leave you with less complete information than usual, but only by a
matter of degree; nothing breaks.
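
In sketch form (an in-memory stand-in with made-up names, not the real
netsync code), the invariant is just:

    #include <set>
    #include <string>
    #include <vector>

    // Hold a revision until every file it references is present, and
    // hold a cert until its revision is present.  Illustrative only.
    struct db_stub
    {
      std::set<std::string> files, revisions;
      std::multiset<std::string> certs;   // keyed by revision id, for the sketch

      bool try_store_revision(std::string const & rev_id,
                              std::vector<std::string> const & file_ids)
      {
        for (auto const & f : file_ids)
          if (files.count(f) == 0)
            return false;               // caller keeps it pending and retries
        revisions.insert(rev_id);
        return true;
      }

      bool try_store_cert(std::string const & rev_id)
      {
        if (revisions.count(rev_id) == 0)
          return false;                 // hold it until the revision arrives
        certs.insert(rev_id);
        return true;
      }
    };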

> > This lets us really store one as only the rows
> > that are different from its parent(s?), which will speed up
> > taking/applying deltas.
> 
> Storing deltas like this was our early approach in OpenCM. We did it
> naively, and the results were *horrible*. To make a long story short,
> you want to have some constant upper bound on how many deltas may need
> to be applied in order to reconstruct any given object.
> 
> Both OpenCM and XDFS (and a lot of other systems) resolve this by
> periodically storing completely expanded versions -- roughly every 20th
> revision is stored in expanded form (the number "20" was determined by
> measurement).
> 
> Does the MTN storage layer do something comparable to this?

We actually use a relatively naive method at the moment -- linear
chaining with no breaks.  However, we use backwards chaining -- leaf
versions are stored in full.

Surprisingly, this does _not_ cause problems in itself.  It turns out
no-one ever complains that it takes a long time to check out a version
from 5 years ago.
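
Reconstruction under this scheme just walks from the version you asked
for toward a leaf stored in full, then applies reverse deltas on the
way back.  Roughly (with a hypothetical apply_delta() and an
illustrative storage layout):

    #include <map>
    #include <string>
    #include <utility>

    // Hypothetical delta application; declared only, for the sketch.
    std::string apply_delta(std::string const & base, std::string const & delta);

    // Leaves are stored in full; every other version is stored as a
    // reverse delta against a newer version.
    struct version_store
    {
      std::map<std::string, std::string> fulls;   // id -> full text (leaves)
      std::map<std::string, std::pair<std::string, std::string>> deltas;
                                                  // id -> (newer id, reverse delta)

      std::string get(std::string const & id) const
      {
        auto f = fulls.find(id);
        if (f != fulls.end())
          return f->second;
        // recurse toward the leaf, then apply reverse deltas coming back
        auto const & d = deltas.at(id);
        return apply_delta(get(d.first), d.second);
      }
    };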

Where we run into scalability issues with this scheme is that during
sync, we have to send forward deltas over the network, and handling
the delta reversing while maintaining incremental database
consistency is expensive.

This issue is driving work currently:
  http://venge.net/monotone/wiki/DeltaStorageStrategies

> If you do, then the only issue remaining is what to do with really big
> objects like Change/Manifest objects. The problem here is: even in the
> local case it takes a surprising amount of time to fetch 20,000
> records from a database. You end up (de)serializing at several layers,
> and the costs add up quickly.

Fetching 20,000 individual objects from a database, in a context where
you don't have other IO going on, is going to kill you on seeks if
nothing else...
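
(Back-of-the-envelope, and assuming a very rough ~8ms per random seek
on rotating disk: 20,000 seeks is already on the order of 160 seconds
of pure disk latency, before you pay for any (de)serialization at all.)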

> However, there is a catch hiding here for schema design. If the blobs
> stored in the repository are (logically) expanded, you end up with one
> set of protocols, schemas, and interaction patterns. If the blobs stored
> in the repository are (logically) deltas, you end up with a different
> architecture.
>
> OpenCM initially said "we (logically) store deltas", and immediately ran
> into linear chaining issues. We very quickly moved to a "we (logically)
> store objects", and then went ahead and did deltas in the store -- but
> that was purely an internal implementation decision within the store.

Right.  The object/snapshot approach has pretty much swept the field
in modern VCSes, with the sole exception of darcs.

> This has implications for the sync protocol, because it means that the
> sync layer may want to manage/reduce bandwidth by generating deltas on
> the fly.
> 
> In OpenCM, this appears in the client/server protocol, but a similar
> issue would seem to exist in the mtn sync protocol. After trying a bunch
> of other things, OpenCM ended up adopting the following protocol:
> 
>   requestor:  sendme(sha1:SomethingOld, sha1:SomethingNew)
> 
> where the responder can respond with *either* a completely expanded
> object or a delta relative to the object named by "sha1:SomethingOld"

We've actually been moving away from this approach.  We used to do
this, but we found that in practice the sending side always knew what
should be sent (so you can skip some latency by just sending it
instead of sending the information used to derive this knowledge and
then waiting for a request), and it's pretty much always the same
thing (so you can use this information when choosing efficient storage
layer designs).

This is because files are always transferred as a consequence of
transferring revisions, and when you send a revision you know that
the recipient already has all the files that were in the revision's
parent, and the revision tells you exactly what files are new and
which are changed relative to that parent.
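
In sketch form (hypothetical names, not the actual netsync code), the
sender-side decision per revision is roughly:

    #include <map>
    #include <set>
    #include <string>

    struct revision_info
    {
      std::set<std::string> added_files;                 // ids of brand-new files
      std::map<std::string, std::string> changed_files;  // old id -> new id
    };

    // Hypothetical transmission primitives; declared only.
    void send_file_full(std::string const & id);
    void send_file_delta(std::string const & old_id, std::string const & new_id);
    void send_revision_text(revision_info const & rev);

    void send_revision(revision_info const & rev)
    {
      // The recipient already has everything in the revision's parent,
      // so new files go in full and changed files go as forward deltas.
      for (auto const & id : rev.added_files)
        send_file_full(id);
      for (auto const & c : rev.changed_files)
        send_file_delta(c.first, c.second);
      send_revision_text(rev);
    }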

-- Nathaniel

-- 
  /* Tell the world that we're going to be the grim
   * reaper of innocent orphaned children.
   */
-- Linux kernel 2.4.5, main.c



