[Monotone-devel] long RFC: "contexts"
From: graydon hoare
Subject: [Monotone-devel] long RFC: "contexts"
Date: Tue, 25 May 2004 17:01:43 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)
hi,
I've been having a number of off-list discussions about monotone's
ancestry graph, metadata, and ability to "behave like arch".
these discussions, and some of the recent hacking (esp. on netsync) have
been suggesting to me that monotone might benefit from having a
"changeset" or "context" added as a "first class" (named) object.
the idea -- in case you missed it in all the other VC systems! -- would
be to add a textual object to monotone which describes (all at once) the
contents of a number of certs and a certain amount of
currently-synthetic information:
manifest: <manifest-sha1>
date: <contents-of-current-date-cert>
author: <contents-of-current-author-cert>
summary: "line of text"
parent: <first-parent-context-sha1> {
manifest: <manifest-sha1>
renames: [<filename>, <filename>] ...
adds: [<filename>, <file-sha1>] ...
dels: <filename> ...
patches: [<filename> <file-sha1> <file-sha1>] ...
}
parent: <second-parent-context-sha1> {
manifest: <manifest-sha1>
renames: [<filename>, <filename>] ...
adds: [<filename>, <file-sha1>] ...
dels: <filename> ...
patches: [<filename> <file-sha1> <file-sha1>] ...
}
remainder is changelog
^D
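as a rough illustration of the layout above, here is a minimal python sketch that renders such a blob as text and derives an ID from it with SHA1. the function names (serialize_context, context_id), the field order, the whitespace, and the sample values are all assumptions made for illustration; none of this is monotone's actual code or a settled format.

```python
import hashlib


def serialize_context(manifest, date, author, summary, parents, changelog):
    # Render a context as text, loosely following the layout above.
    # Field order and whitespace are assumptions; a real format would have
    # to pin them down exactly, because the hash covers every byte.
    lines = [
        f"manifest: {manifest}",
        f"date: {date}",
        f"author: {author}",
        f'summary: "{summary}"',
    ]
    for parent_id, edge in parents:
        lines.append(f"parent: {parent_id} {{")
        lines.append(f"manifest: {edge['manifest']}")
        for old_name, new_name in edge.get("renames", []):
            lines.append(f"renames: [{old_name}, {new_name}]")
        for name, file_id in edge.get("adds", []):
            lines.append(f"adds: [{name}, {file_id}]")
        for name in edge.get("dels", []):
            lines.append(f"dels: {name}")
        for name, old_id, new_id in edge.get("patches", []):
            lines.append(f"patches: [{name} {old_id} {new_id}]")
        lines.append("}")
    lines.append(changelog)
    return "\n".join(lines) + "\n"


def context_id(text):
    # The context ID is simply the SHA1 of the serialized text.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical sample data: a root context and one child that names it.
root = serialize_context("m0", "2004-05-25", "alice", "initial import",
                         [], "first commit")
root_id = context_id(root)

edge = {"manifest": "m0", "patches": [("README", "f0", "f1")]}
child = serialize_context("m1", "2004-05-26", "alice", "fix typo",
                          [(root_id, edge)], "fix a typo in README")
child_id = context_id(child)
```

because the child's text embeds its parent's ID, the child's own ID depends on its whole ancestry: changing anything in the root would change every descendant ID as well.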
we would then hash this blob of text to produce a "context ID", and you
could attach certs to either context IDs or manifest IDs. the context
*contains* its ancestry (as a "fact"), and the current concept of
"approval" would be rephrased as certifying a context ID as a member of
a branch. there are a bunch of reasons for wanting to do this. I'll list
them here, I'd like you to read them and think about them before leaping
to an immediate knee-jerk reaction. I know it looks like an arch
changeset; that's intentional. they have made some valid points. it
will involve a fair bit of reorganization on my side to implement,
but I think it's a good idea. such a change will:
- kill, finally, any worries about cycles or accidental shared
lineage in the ancestry graph. you might share storage, but you will
essentially *never* (save a collision in SHA1) share context
ancestry. there would be no more manifest ancestry.
- kill, finally, any of the seeming paranoia that monotone can't or
doesn't reason about "first class" changesets. so far I've been
reasonably comfortable with the idea of managing content alone, but
I get a lot of feedback suggesting the desire to see a written,
tangible, formal object (with a name) called "a change". this would
be it. the only remaining "missing" concept would be "file GUIDs",
which I consider mostly meaningless anyways; imo if you have enough
shared history to have a shared GUID, you probably have enough to
work out the naming relationship by tracing through rename history.
- give a name to a particular change. this makes it easier to talk
about cherry-picking commands; easier to list in a sort of
"what am I about to get during this update" command; and easier to
write as arguments for approve, disapprove, and similar commands.
- require a somewhat robust printer/parser which can serialize this
information to a human-friendly and email-friendly form. this will
make it easier to interoperate with the patch-and-email approach.
nearly every hacker I've ever discussed VC with says this sort of
email interoperability is a practical necessity.
- make a clear future distinction between certs which are about
a change (context certs) and certs which are about a particular
tree state (manifest certs). this difference is evident for example
in the difference between approval (context) and testresults
(manifest), but it's not really as clear at the moment.
- simplify future interoperability issues. if we import CVS archives
at the moment, we will get some cycles. this is just a symptom, it
will get worse if we try to read or write other VC formats. this
change forces monotone to keep *a* level of history which is a
strict DAG, which is how most systems organize their history
anyways. it may even become possible (?) to read or write linear
sub-DAGs as arch archives, if we're careful.
- add a small extra dimension of integrity checking: the synthetic
analysis of a pair of manifests should match the written contents
of a context edge. though you could also see this as an extra
dimension for integrity *failure*. good or bad. in any case, it'll
do away with separate rename certs, which are a bit of a hack.
- remove a small class of potential bug where you have, for example,
two disagreeing "rename" certs on the same edge. this is currently
possible in monotone, and I suspect is not handled nicely.
- trade some space for speed:
- I'd have an excuse to unpack and index the fields which I know
the substructure of (author, date, ancestor, etc.) which would
speed and simplify a lot of local operations.
- things like netsync or log require analysis of two manifests to
synthesize the change edge. netsync would speed up a fair bit if
it could hazard a guess at the prerequisites for a change the
instant it received the "change object", rather than waiting for
constructability of the pre-image followed by set-wise analysis, as
it currently does.
yet, despite the seeming "increase in space", it would ...
- take no more space. all these items are generated each time we do
a commit already, but as *separate* certs. the certs aren't free:
generally there are about 300 extra bytes of cryptographic data
along for the ride on each one. that makes a commit cost about
1500 bytes in crypto; this data object would probably weigh no more
than that, possibly even less.
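to make the "synthesize the change edge" and integrity-checking points above a little more concrete, here is a small python sketch. it assumes a manifest can be modelled as a plain mapping from filename to file SHA1, and that a written edge is a dict with the same field names as the proposed format; synthesize_edge and edge_matches are hypothetical names, not existing monotone functions. renames are taken from the written edge rather than guessed, since they can't be reliably recovered from content alone.

```python
def synthesize_edge(old, new, renames=()):
    # Apply the recorded renames to a copy of the old manifest first,
    # so a renamed file is not mistaken for a delete plus an add.
    renamed = dict(old)
    for src, dst in renames:
        renamed[dst] = renamed.pop(src)

    adds = sorted((name, h) for name, h in new.items() if name not in renamed)
    dels = sorted(name for name in renamed if name not in new)
    patches = sorted(
        (name, renamed[name], new[name])
        for name in renamed
        if name in new and renamed[name] != new[name]
    )
    return {"adds": adds, "dels": dels, "patches": patches}


def edge_matches(written, old, new):
    # Integrity check: the edge written in the context should agree with
    # what set-wise analysis of the two manifests produces.
    derived = synthesize_edge(old, new, written.get("renames", ()))
    return all(
        sorted(written.get(key, [])) == derived[key]
        for key in ("adds", "dels", "patches")
    )


# Hypothetical manifests: filename -> file SHA1 (shortened for readability).
old_manifest = {"a.c": "h1", "b.c": "h2", "old.txt": "h3"}
new_manifest = {"a.c": "h1x", "c.c": "h4", "new.txt": "h3"}

written_edge = {
    "renames": [("old.txt", "new.txt")],
    "adds": [("c.c", "h4")],
    "dels": ["b.c"],
    "patches": [("a.c", "h1", "h1x")],
}

ok = edge_matches(written_edge, old_manifest, new_manifest)
```

with the edge written into the context, a receiver can read the prerequisites (the old file IDs in "patches") straight off the object instead of first reconstructing both manifests; the comparison above then serves only as a cross-check.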
now, the downsides are:
- the user would see "divergence" slightly more often. for example, if
njs and I both merge the same fork, we'd see two different context
IDs, which (as far as "heads" is concerned) would be different
nodes. but they would have the same manifest, so "heads" (or "merge")
could be made smart enough to say "different context, same content"
and not make you do any extra work.
- there would be a certain distinction between "core" and "auxiliary"
metadata: the stuff mentioned in the context will have a seeming
primacy over additional, 3rd party certs hung on the side. the
experience so far seems to suggest that nobody ever sticks 3rd party
author, date, or rename certs on a manifest anyways, so I'm not sure
how much would be lost there.
- compressing history gets a bit harder. you either need to keep a full
context graph on hand, or make an auxiliary cert or context which
says "this set of contexts is included here". on the other hand, that
sort of facility is potentially something arch interoperability would
need anyways, and is something commonly requested, as a "trail" left
by a cherry-picking command. so maybe it makes sense anyways.
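the first downside above ("different context, same content") could be handled along these lines. this is only a sketch: it assumes the set of heads is available as a mapping from context ID to manifest ID, and the names group_heads_by_manifest and needs_merge are made up for illustration.

```python
from collections import defaultdict


def group_heads_by_manifest(heads):
    # heads: mapping of head context ID -> manifest ID.
    # Heads that point at the same manifest differ only in metadata.
    groups = defaultdict(list)
    for ctx, manifest in heads.items():
        groups[manifest].append(ctx)
    return {manifest: sorted(ctxs) for manifest, ctxs in groups.items()}


def needs_merge(heads):
    # Real merge work is only needed when the heads disagree on content,
    # i.e. when more than one distinct manifest is present.
    return len(group_heads_by_manifest(heads)) > 1


# Hypothetical case: two people merged the same fork independently.
same_content = {"ctx-graydon": "m1", "ctx-njs": "m1"}
diverged = {"ctx-graydon": "m1", "ctx-njs": "m1", "ctx-other": "m2"}

trivial = needs_merge(same_content)
real = needs_merge(diverged)
```

in the first case a "heads" or "merge" command could report the two contexts as equivalent in content and ask for nothing further; in the second, there are genuinely two tree states to reconcile.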
there might be more. I'd appreciate some public discussion now that I've
sort of stewed on the issue for a couple weeks.
I had some more outrageous approaches in mind too -- overhaul the whole
manifest format, switch to versioned directories, etc. -- but I find
myself unable to imagine the complete extent of implications for those,
and unable to justify them given the greatly increased scope of work.
this approach at least seems, er, small enough and similar enough to
what we already do, yet sufficient to cover the main points.
anyways, no matter what I will do any such work on a branch and provide
some sort of sensible migration path from existing DBs, but it might at
worst require re-issuing all existing manifest certs.
(aside: yes, technically this could be more lightweight. really all we
*need* to do for most of the "hard" goals is to make a context which
contains parent context IDs and manifest ID, and then you can do all
the rest hanging certs on the context ID. but I thought I'd cheat and
kill multiple "efficiency" birds with one stone. feel free to reject
the latter concept and argue that all we need is a simple context
object)
-graydon