Re: [Monotone-devel] Re: Improving the performance of annotate


From: Nathaniel Smith
Subject: Re: [Monotone-devel] Re: Improving the performance of annotate
Date: Wed, 19 Jul 2006 00:32:16 -0700
User-agent: Mutt/1.5.11+cvs20060403

On Tue, Jul 18, 2006 at 09:48:10PM -0700, Graydon Hoare wrote:
> This would attack most of our worst remaining bottlenecks:
> 
>   - Turning off the manifest hash check during netsync would make
>     netsync a no-hash, no-xdelta operation; it would still spend a
>     bit of time applying revs to an in-memory roster and working out
>     markings, but the internal i/o costs could go away. Manifest checks
>     could be run offline / periodically for users who don't mind
>     relaxing a bit.

Can we just agree not to do this until after we've done all the other
parts, and seen how much bottleneck remains? :-)

>   - Annotate, log, etc. could apply a restriction to head rosters
>     at first and then walk backwards through database queries
>     against the affected nodes alone, skipping the parsing
>     and hashing of intermediate rosters *and* skipping between
>     revs that actually affect the file(s) in question.
> 
> You'd effectively be getting "per file DAGs" out of this, since the 
> changes-to-markings *are* the DAG edges you care about.

I don't quite follow this part.  Above you seemed to say we would just
re-code the contents of revisions into db rows; but revisions contain
no information that lets you "skip between revs that actually affect"
a given file.  I guess you store some more information than that?
Like what?

> I'm not sure how far off this scheme is from what the existing "per file 
> DAGs" work does; I've not had time to investigate it. But I think it's 
> completely compatible with how we do things now, and would not involve 
> changing network protocol or external formats. So I'm quite happy if 
> someone wants to spend time playing with it.

There isn't much per-file DAG work, really, certainly not in code :-).
Thomas Moschny has done some prototyping in Python; I don't know how
functional it is.  For me, "per file DAG" just means "some way to get
those back pointers that let you skip things".  Your plan seems to
also tackle a few other problems:
  -- make it cheap to write rosters to the db (no more
     serialize + xdelta overhead)
  -- make it cheap to get information on just a single node from a
     historical roster

If I were going to do a simplest-thing-that-could-possibly-work thing
for only the original per-file DAG problem, what I'd do is something
like this (there's a rough code sketch after the list):
  -- add two new fields to each node in the roster.  One says "this
     node is marked in this roster", the other says "*(my parents) is
     <rev1> [<rev2>]".
  -- in the previous, "marked" and "*()" are defined as in
     unique-*-merge, treating each node as a whole as a single scalar
     (and for directories, probably treating their entire contents as
     part of that scalar too).
  -- to traverse, look at this info for whatever rev you want to start
     with.  Use the *(my parents) links to repeatedly jump back to the
     next relevant roster.
  -- the "this node is marked" field is important both so we can tell
     whether the node we examine is interesting, and so that these
     fields are cheap to calculate incrementally in the marking phase.
  -- using unique-*-merge means that the graph is guaranteed to be
     non-surprisingly formed (in particular, you cannot have nodes with
     arbitrary numbers of parents, which a multi-*-merge reduction
     would allow!  Also, only merge nodes can have multiple parents).
     This seems like a nice property if people want to, say, display
     single file history graphs.
  -- whatever code is consuming this graph gets to decide how to
     filter it further; "annotate" will probably ignore all nodes that
     do not involve content changes, for instance.
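
To pin down the shape, here's a rough Python sketch of the two fields
and the traversal.  It's purely illustrative; none of these names
(NodeMarkInfo, load_node_mark_info, the "interesting" predicate) exist
in monotone, and the db lookup is hand-waved:

  # Sketch only.  Every name here is made up for illustration; nothing
  # below is existing monotone code.
  from dataclasses import dataclass
  from typing import List

  @dataclass
  class NodeMarkInfo:
      marked_here: bool            # "this node is marked in this roster"
      star_of_parents: List[str]   # "*(my parents) is <rev1> [<rev2>]"

  def load_node_mark_info(rev_id: str, node_id: int) -> NodeMarkInfo:
      # Hypothetical db lookup: fetch just this node's info for rev_id,
      # without parsing or hashing the whole roster.
      raise NotImplementedError

  def walk_file_dag(start_rev: str, node_id: int, interesting=None):
      # Yield revisions where node_id is marked, following the
      # *(parents) links and skipping everything in between.
      # 'interesting' is an optional extra filter supplied by the
      # consumer (e.g. annotate's "content actually changed" check).
      seen = set()
      frontier = [start_rev]
      while frontier:
          rev = frontier.pop()
          if rev in seen:
              continue
          seen.add(rev)
          info = load_node_mark_info(rev, node_id)
          if info.marked_here and (interesting is None or interesting(rev)):
              yield rev
          # unique-*-merge keeps this to at most two entries, and more
          # than one only at merge nodes.
          frontier.extend(info.star_of_parents)

annotate would then pass an "interesting" predicate that drops marks
not involving content changes, per the last bullet above.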

I don't know how well this would work, just that I can't prove it
doesn't, and I can't think of anything simpler that has that property
:-).  (For instance, the cost of loading each of the rosters might
still be prohibitive; this scheme does nothing at all to speed up
annotating the ChangeLog file.  But maybe skipping things helps
enough that it doesn't matter, or maybe then there would be an
obvious way to fetch single nodes from a roster.)

> You're absolutely right that rosters are a "local cache"; they were 
> designed that way, and we expressly kept their format private-to-the-db 
> so that we could change it as efficiency demanded. Efficiency is now 
> demanding, and changing the format to suit is quite legitimate.

Mm-yep.  If we're going to pay the abstraction tax, we should at least
get the benefits :-).

-- Nathaniel

-- 
"On arrival in my ward I was immediately served with lunch. `This is
what you ordered yesterday.' I pointed out that I had just arrived,
only to be told: `This is what your bed ordered.'"
  -- Letter to the Editor, The Times, September 2000



