[Monotone-devel] Re: long RFC: "contexts"
From: Jerome Fisher
Subject: [Monotone-devel] Re: long RFC: "contexts"
Date: Thu, 27 May 2004 07:48:55 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040514
graydon hoare wrote:
> the idea -- in case you missed it in all the other VC systems! --
> would be to add a textual object to monotone which describes (all at
> once) the contents of a number of certs and a certain amount of
> currently-synthetic information:
What do you mean by "currently-synthetic information"? I think you're
referring to storing the changes to each parent, which currently will be
derived from manifest comparison and file rename certs. However, I'd
like to be sure I'm not misinterpreting this.
> manifest: <manifest-sha1>
> date: <contents-of-current-date-cert>
> author: <contents-of-current-author-cert>
> summary: "line of text"
> parent: <first-parent-context-sha1> {
>   manifest: <manifest-sha1>
>   renames: [<filename>, <filename>]
>   adds: [<filename>, <file-sha1>] ...
>   dels: <filename> ...
>   patches: [<filename> <file-sha1> <file-sha1>] ...
> }
> parent: <second-parent-context-sha1> {
>   manifest: <manifest-sha1>
>   renames: [<filename>, <filename>] ...
>   adds: [<filename>, <file-sha1>] ...
>   dels: <filename> ...
>   patches: [<filename> <file-sha1> <file-sha1>] ...
> }
> remainder is changelog
> ^D
I have a few problems with this specific textual representation of a
context (commas, braces, etc.), and the names of some elements, but I
don't think that needs to be discussed yet. I think your main aim was in
showing what information would be included, anyway.
I think that getting the definition of a context right the first time is
quite important. Context definitions and IDs are going to be so
pervasively used that it will be very difficult to change them in future
without great disturbance. I think it's best to keep only essential
information, and eliminate - as much as is practical - that which does
not directly relate to the primary goals. I consider these goals to be:
- Uniquely identifying a location in the history DAG.
- Allowing the associated changes to be accurately determined.
- Allowing the resultant state to be determined.
Essential properties, as I see it:
(1) Referencing each parent context.
- In the case of merges, this partially addresses the question of
how the new state was reached.
- It has the effect that contexts with different ancestry will have
different IDs, which is more or less essential for reasons that have
been covered in other mails.
(2) Specifying, for each parent context, whatever changes were performed
to get from its state to the new state that CAN'T be derived by simply
comparing those states.
(currently only renames)
- This provides a partial set of changes between states. Extra
information regarding these types of changes would otherwise be lost.
- It has the effect that changes to the same parent(s) resulting in
the same new state but produced in different ways will result in
different context IDs. This is almost certainly a good thing.
- See "EXPLICIT CHANGES" below.
(3a) Specifying, for each parent context, whatever changes were
performed to get from its state to the new state that CAN be derived by
simply comparing those states.
(currently adds, dels and patches)
- This allows the full set of changes to be known immediately.
- It's redundant if you can determine every parent's state and the
new state.
- You never have to go through the expense of working out the
changes through state comparison. This speeds up operations like netsync
and log.
OR
(3b) Referencing the absolute representation (manifest) of the new state.
- This allows the new state to be known immediately.
- It's redundant if you have full knowledge of the changes and can
determine the state of one parent (if there are any parents).
- You never have to go through the expense of applying the changes
to a parent state to determine the new one.
- It allows for the stripping of old contexts, manifests and file
data to save space.
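To make the redundancy between (3a) and (3b) concrete, here is a minimal sketch. It assumes a manifest can be modelled as a plain mapping from filename to file SHA1; that model, the function names and the placeholder hashes are all made up for illustration and are not monotone's actual data structures. Renames are passed in explicitly because, as in (2), they cannot be recovered by comparing states.

```python
def derive_changes(parent, child, renames=()):
    """Derive adds, dels and patches by comparing two manifests (3a).

    Manifests are modelled as dicts mapping filename -> file sha1
    (an assumption for this sketch). Renames are given explicitly as
    (old_name, new_name) pairs because they cannot be derived.
    """
    # Apply the known renames first, so that a renamed file is not
    # mistaken for a delete followed by an add.
    base = dict(parent)
    for old, new in renames:
        base[new] = base.pop(old)

    adds = {f: h for f, h in child.items() if f not in base}
    dels = sorted(f for f in base if f not in child)
    patches = {f: (base[f], child[f])
               for f in base if f in child and base[f] != child[f]}
    return adds, dels, patches


def apply_changes(parent, renames, adds, dels, patches):
    """Rebuild the new manifest from one parent plus the full change set."""
    state = dict(parent)
    for old, new in renames:
        state[new] = state.pop(old)
    for f in dels:
        del state[f]
    for f, (old_hash, new_hash) in patches.items():
        # A patch only makes sense against the hash it was made from.
        assert state[f] == old_hash
        state[f] = new_hash
    state.update(adds)
    return state


# Placeholder hashes ("h1", "h2", ...) stand in for real file SHA1s.
parent = {"a.c": "h1", "b.c": "h2", "old.txt": "h3"}
child = {"a.c": "h1x", "new.txt": "h3", "c.c": "h4"}
renames = [("old.txt", "new.txt")]

adds, dels, patches = derive_changes(parent, child, renames)
rebuilt = apply_changes(parent, renames, adds, dels, patches)
```

Given both states and the renames, the adds/dels/patches fall out of a comparison; given one parent state and the full change set, the new manifest can be rebuilt. Storing both simply trades space for not having to do either computation.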
Additional properties in your proposal:
(4) Specifying the author of the change, the author's idea of time when
making the change, the author's summary of the change, and the author's
full description of the change.
- This has the effect that exactly the same changes to the same
parent(s) will result in multiple nodes in the history DAG if any of
these attributes differ. This will happen quite often (especially with
people auto-merging), and I consider this to be unnecessary and probably
bad. The "badness" suspicion is mostly gut feeling, but I'm thinking
about being able to correct, append to or enhance this change metadata
later - this shouldn't have to use a completely different system like
certs, and certainly shouldn't result in a change of context ID.
(5) Referencing, for each parent context, the absolute representation
(manifest) of its state.
- I don't see how this is useful at all unless the parent's context
is stripped or not yet downloaded, and then what do you want with the
manifest ID? I think I'm missing something (I have no clue about the
internals of netsync, or anything else in monotone for that matter).
Only one of (3a) and (3b) is strictly necessary. As each provides very
important benefits, I think they should both remain.
Unless there's a good reason to have them that I'm not aware of, I think
(4) and (5) are unnecessary, and in the case of (4) possibly evil.
So I would suggest:
- Removing the "manifest" field from the "parent" sections.
- Removing the "date", "author" and "summary" fields, and the changelog
area.
- Attaching the "date", "author", "summary" and "changelog" information
to the context independently (using certs).
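The effect of (4) on context IDs can be sketched as follows. The line-based rendering below is a made-up stand-in, not the format from the RFC, and only renames are included per parent to keep the sketch short; the names `context_id`, `alice` and `bob` are hypothetical. The only point is that anything fed into the hash changes the ID.

```python
import hashlib


def context_id(manifest, parents, metadata=None):
    """SHA1 of a line-based rendering of a context (hypothetical format)."""
    lines = [f"manifest: {manifest}"]
    # Optional author/date/summary fields, as in the original proposal.
    for key in sorted(metadata or {}):
        lines.append(f"{key}: {metadata[key]}")
    for parent_id, renames in parents:
        lines.append(f"parent: {parent_id}")
        for old, new in renames:
            lines.append(f"  rename: {old} {new}")
    text = "\n".join(lines) + "\n"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Two people perform exactly the same merge of the same two parents.
manifest = "m" * 40
parents = [("p" * 40, [("old.txt", "new.txt")]), ("q" * 40, [])]

alice = {"author": "alice@example.com", "date": "2004-05-27T07:48:55"}
bob = {"author": "bob@example.com", "date": "2004-05-27T09:00:00"}

# With author/date inside the context, the IDs diverge.
full_alice = context_id(manifest, parents, alice)
full_bob = context_id(manifest, parents, bob)

# With the metadata moved out into certs, both get the same ID.
trimmed_alice = context_id(manifest, parents)
trimmed_bob = context_id(manifest, parents)
```

With the metadata hashed in, the identical merge produces two DAG nodes; with it attached separately, both authors arrive at the same node and their certs simply accumulate on it.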
EXPLICIT CHANGES
I think it's important to note that it's highly desirable to store as
well as possible the changes that _were actually_ performed, not merely
changes that _can be_ performed to get from one state to another. It's a
subtle but important distinction. The only place where we currently
recognise this is in the support of "rename". It would be possible to
define rename in terms of "add" and "delete", but we would then lose
important information on what the author of the change actually did.
In future, for example, we might have:
replaces: [<filename>, <file-sha1>] ...
for completely replacing a file (meaning that the files are not
related, they just have the same path - diffs and auto-merges don't make
sense).*
copies: [<original_filename>, <copy_filename>]
for cloning a file. This is important for merging as well as
documenting the author's intention.
cherrypicks: [<context>, <parent_context>] ...
for auto-merging all changes from an edge into the current state.
Unlike the other examples, this potentially affects multiple files.
And 3rd party change types like:
xyzzypatches: [<filename>, <xyzzypatch-sha1>] ...
for when a file's changes have been stored in a magic patch format
that accurately documents exactly what a user did (e.g. renamed this
variable, added a parameter to this function). Generation of these
patches would be done by the author's tools (e.g. a refactoring editor).
It would not necessarily be possible to extract the same information on
what was changed, how and why by generic textual comparison (e.g. diff)
of the former and latter states.
Note that the order in which changes are applied is significant, and the
same change type could be used multiple times with different change
types in between. It may be clearer (though less efficient) to define
change types in the singular and list them one by one separately.
* The "replaces" change type could equally well be represented by a
"dels" of the filename, followed by an "adds" of the same filename with
the new hash. It's just an example.
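The points about recording what was actually done, and about ordering, can be illustrated with a small sketch that lists change types in the singular, one per entry. The tuple encoding, the `apply_ops` name and the placeholder hashes are assumptions for this example only.

```python
def apply_ops(manifest, ops):
    """Apply an ordered list of singular change records to a manifest.

    The manifest is modelled as a dict of filename -> file sha1; each
    op is a tuple whose first element names the change type.
    """
    state = dict(manifest)
    for op in ops:
        kind = op[0]
        if kind == "rename":
            _, old, new = op
            state[new] = state.pop(old)
        elif kind == "add":
            _, name, sha = op
            state[name] = sha
        elif kind == "delete":
            _, name = op
            del state[name]
        elif kind == "copy":
            _, src, dst = op
            state[dst] = state[src]
        else:
            raise ValueError(f"unknown change type: {kind}")
    return state


start = {"a.txt": "h1"}

# Same resulting state, but a different record of what the author did.
as_rename = [("rename", "a.txt", "b.txt")]
as_del_add = [("delete", "a.txt"), ("add", "b.txt", "h1")]
same_state = apply_ops(start, as_rename) == apply_ops(start, as_del_add)

# Same two changes, different order, different resulting state.
rename_then_add = apply_ops(
    start, [("rename", "a.txt", "b.txt"), ("add", "a.txt", "h2")])
add_then_rename = apply_ops(
    start, [("add", "a.txt", "h2"), ("rename", "a.txt", "b.txt")])
```

The first pair ends in the same manifest, so only the stored change record can tell a rename from a delete plus add. The second pair shows why grouping all changes of one type together loses information unless an order is also defined.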
> be it. the only remaining "missing" concept would be "file GUIDs",
> which I consider mostly meaningless anyways; imo if you have enough
> shared history to have a shared GUID, you probably have enough to
> work out the naming relationship by tracing through rename history.
I agree with this, though currently it's not possible to do things like
"resurrect" a file in a way that allows accurate tracking of that file
through history (though unreliable heuristics could be used). There are
ways to do this perfectly without file GUIDs, though (e.g. through new
change types).
> - make a clear future distinction between certs which are about
>   a change (context certs) and certs which are about a particular
>   tree state (manifest certs). this difference is evident for example
>   in the difference between approval (context) and testresults
>   (manifest), but it's not really as clear at the moment.
I'm still not convinced that there's a need for manifest certs... I
think even testresults certs should apply to a context. Branch certs can
only sensibly apply to a context, not a manifest; different branches can
be completely different projects; completely different projects can have
completely different procedures for determining testresults. Of course,
this example isn't very clever (it's unlikely that you'd get the same
manifest in different projects), but there are several other reasons I
don't think it makes sense to apply any certs to a manifest.
> - I'd have an excuse to unpack and index the fields which I know
>   the substructure of (author, date, ancestor, etc.) which would
>   speed and simplify a lot of local operations.
This information could equally well be extracted from certs for
indexing, right?
> - take no more space. all these items are generated each time we do
>   a commit already, but as *separate* certs. the certs aren't free:
>   generally there are about 300 extra bytes of cryptographic data
>   along for the ride on each one. that makes a commit cost about
>   1500 bytes in crypto; this data object would probably weigh no more
>   than that, possibly even less.
I don't remember whether I brought this up before, but I think that
having a way to bundle certs together is quite important. These "cert
bundles" would contain several properties, and be timestamped and signed
as a whole. There are a number of reasons I'd like this, the least
important of which is that it would reduce the signature overhead.
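A rough sketch of such a cert bundle follows. Everything in it is an assumption made for illustration: the `name: value` payload layout, the function names, and above all the use of HMAC-SHA1, which is only a stand-in so the example runs with the standard library and is not how monotone signs certs. The 300-byte figure is the approximate per-cert overhead quoted above.

```python
import hashlib
import hmac

PER_SIGNATURE_BYTES = 300  # approximate per-cert crypto overhead quoted above


def sign(key, payload):
    # Stand-in only: HMAC-SHA1 instead of a real public-key signature.
    return hmac.new(key, payload, hashlib.sha1).hexdigest()


def make_bundle(key, context_id, timestamp, properties):
    """Serialize several properties together and sign them once."""
    # Values are assumed to be single-line; a real format would need
    # quoting or length prefixes for multi-line changelogs.
    lines = [f"context: {context_id}", f"time: {timestamp}"]
    lines += [f"{name}: {value}" for name, value in sorted(properties.items())]
    payload = "\n".join(lines).encode("utf-8")
    return {"payload": payload, "signature": sign(key, payload)}


def verify_bundle(key, bundle):
    return hmac.compare_digest(sign(key, bundle["payload"]),
                               bundle["signature"])


props = {
    "author": "someone@example.com",
    "branch": "example.branch",
    "changelog": "fix typo",
    "testresult": "pass",
}
bundle = make_bundle(b"demo-key", "abc123", "2004-05-27T07:48:55", props)

# One signature covers every property, instead of one signature each.
separate_cost = len(props) * PER_SIGNATURE_BYTES
bundled_cost = PER_SIGNATURE_BYTES
```

Because the timestamp and all properties are inside the signed payload, changing any one of them invalidates the whole bundle, which is what "signed as a whole" buys beyond the smaller overhead.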
> - there would be a certain distinction between "core" and "auxiliary"
>   metadata: the stuff mentioned in the context will have a seeming
>   primacy over additional, 3rd party certs hung on the side. the
>   experience so far seems to suggest that nobody ever sticks 3rd party
>   author, date, or rename certs on a manifest anyways, so I'm not sure
>   how much would be lost there.
I think an awful lot would be lost in flexibility and simplicity. I can
think of a whole lot of custom certs I'd like to add myself at commit
time. I'd certainly mourn the loss of a consistent approach to metadata.
Jerome
(Graydon: Sorry about the bad quoting in my last email, I was a little
overexcited)