[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Monotone-devel] Re: Improving the performance of annotate
From: |
Nathaniel Smith |
Subject: |
Re: [Monotone-devel] Re: Improving the performance of annotate |
Date: |
Wed, 19 Jul 2006 00:32:16 -0700 |
User-agent: |
Mutt/1.5.11+cvs20060403 |
On Tue, Jul 18, 2006 at 09:48:10PM -0700, Graydon Hoare wrote:
> This would attack most of our worst remaining bottlenecks:
>
> - Turning off the manifest hash check during netsync would make
> netsync a no-hash, no-xdelta operation; it would still spend a
> bit of time applying revs to an in-memory roster and working out
> markings, but the internal i/o costs could go away. Manifest checks
> could be run offline / periodically for users who don't mind
> relaxing a bit.
Can we just agree not to do this until after we've done all the other
parts, and seen how much bottleneck remains? :-)
> - Annotate, log, etc. could apply a restriction to head rosters
> at first and then walk backwards through database queries
> against the affected nodes alone, skipping the parsing
> and hashing of intermediate rosters *and* skipping between
> revs that actually affect the file(s) in question.
>
> You'd effectively be getting "per file DAGs" out of this, since the
> changes-to-markings *are* the DAG edges you care about.
I don't quite follow this part. Above you seemed to say we would just
re-code the contents of revisions into db rows; but revisions contain
no information that lets you "skip between revs that actually affect"
a given file. I guess you store some more information than that?
Like what?
> I'm not sure how far off this scheme is from what the existing "per file
> DAGs" work does; I've not had time to investigate it. But I think it's
> completely compatible with how we do things now, and would not involve
> changing network protocol or external formats. So I'm quite happy if
> someone wants to spend time playing with it.
There isn't much per-file DAG work, really, certainly not in code :-).
Thomas Moschny has done some prototyping in python; I don't know how
functional it is. For me, "per file DAG" just means "some way to get
those back pointers that let you skip things". Your plan seems to
also tackle a few other problems:
-- make it cheap to write rosters to the db (no more
serialize + xdelta overhead)
-- make it cheap to get information on just a single node from a
historical roster
If I were going to do a simplest-thing-that-could-possibly-work thing
for only the original per-file DAG problem, what I'd do is something
like:
-- add two new fields to each node in the roster. One says "this
node is marked in this roster", the other says "*(my parents) is
<rev1> [<rev2>]".
-- in the previous, "marked" and "*()" are defined as in
unique-*-merge, treating each node as a whole as a single scalar
(and for directories, probably treating their entire contents as
part of that scalar too).
-- to traverse, look at this info for whatever rev you want to start
with. Use the *(my parents) links to repeatedly jump back to the
next relevant roster.
-- the "this node is marked" field is important both so we can tell
whether the node we examine is interesting, and so that these
fields are cheap to calculate incrementally in the marking phase.
-- using unique-*-merge means that the graph is guaranteed to be
non-surprisingly formed (in particular, you cannot have nodes with
arbitrary numbers of parents, which a multi-*-merge reduction
would allow! Also, only merge nodes can have multiple parents).
This seems like a nice property if people want to, say, display
single file history graphs.
-- whatever code is consuming this graph gets to decide how to
filter it further; "annotate" will probably ignore all nodes that
do not involve content changes, for instance.
I don't know how well this would work, just that I can't prove it
doesn't, and I can't think of anything simpler that has that property
:-). (For instance, the cost of loading each of the rosters might
still be prohibitive; this scheme does nothing at all to speed up
annotating the ChangeLog file. But maybe skipping things helps
enough that it doesn't matter, or maybe then there would be an
obvious way to fetch single nodes from a roster.)
> You're absolutely right that rosters are a "local cache"; they were
> designed that way, and we expressly kept their format private-to-the-db
> so that we could change it as efficiency demanded. Efficiency is now
> demanding, and changing the format to suit is quite legitimate.
Mm-yep. If we're going to pay the abstraction tax, we should at least
get the benefits :-).
-- Nathaniel
--
"On arrival in my ward I was immediately served with lunch. `This is
what you ordered yesterday.' I pointed out that I had just arrived,
only to be told: `This is what your bed ordered.'"
-- Letter to the Editor, The Times, September 2000