monotone-devel

Re: [Monotone-devel] Improving the performance of annotate


From: Eric Anderson
Subject: Re: [Monotone-devel] Improving the performance of annotate
Date: Wed, 19 Jul 2006 00:32:34 -0700

Daniel Carosone <address@hidden> writes:
 > Subject: Re: [Monotone-devel] Improving the performance of annotate
 > 
 > > 1) only parsing the
 > > portion of the roster that was relevant to the file being annotated.
 > 
 > my suggestion and preference here would be storing the roster details
 > in open sql, or at least caching the relevant details in sql. This is
 > similar to the 'per-file DAG' information discussed previously.  The
 > idea of inventing and storing our own additional indexes when we
 > already have a storage layer with these capabilities just seems
 > incongruous.

The index would be pretty simple: a 4-byte file id plus a 4-byte offset into
the roster. But putting it properly into SQL is definitely the more elegant
path. You could have a per-file, per-revision roster entry, or you could have
a table of (type, name, content_hash, ident, birth, path_mark, content_mark);
the main downside is that you lose a lot of the compression that comes from
eliminating the redundant hashes that show up in the roster text:

 > The implications of this on storage size certainly need some
 > examination.

./mtn get_roster 341e4a18c594cec49896fa97bd4e74de7bee5827 | gzip -9v | wc -c
==> 61241
./mtn get_roster 341e4a18c594cec49896fa97bd4e74de7bee5827 | \
  perl -e 'use Compress::Zlib;
    while(<STDIN>) {
      if (/^$/o) { print compress($buf,9); $buf = ""; }
      else { $buf .= $_ }
    }
    print compress($buf,9)' | wc -c
==> 195291
(the second command splits the roster into records and compresses each record
separately)

195291/61241 = 3.2, which is a substantial size increase. I don't know how to
measure the db space taken up by the other revision data, but the overall
increase could be smaller.
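The effect those two pipelines measure can be reproduced with a toy example:
compressing one concatenated stream versus compressing each record on its own
(the sample records below are invented; only the overhead effect is the
point):

```python
import zlib

# Fabricated roster-like records with lots of shared structure.
records = [
    'file "src/%d.cc"\ncontent [%040d]\n' % (i, i) for i in range(200)
]

# One compression stream over the whole text can exploit redundancy
# across records; one stream per record cannot, and each stream also
# pays its own header/checksum overhead.
whole = len(zlib.compress("".join(records).encode(), 9))
per_record = sum(len(zlib.compress(r.encode(), 9)) for r in records)

print(whole, per_record)
assert per_record > whole
```

The ratio won't match the 3.2x above exactly, since that depends on the real
roster contents, but the direction of the effect is the same.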

 > > 2) skipping the version hash check in database.cc
 > 
 > At the moment, you're skipping the check of the roster as it's
 > retrieved, before you start parsing it, right?

I'm skipping the check that roster_id = sha1(roster_data)
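In sketch form, the check being skipped amounts to the following (the function
name and calling convention are illustrative, not database.cc's actual
interface):

```python
import hashlib

def get_roster_data(roster_id: str, raw: bytes, verify: bool = True) -> bytes:
    """Return roster data, optionally verifying it hashes to its id.

    This mirrors the idea of the roster_id == sha1(roster_data) check;
    with verify=False the SHA-1 computation is skipped entirely.
    """
    if verify and hashlib.sha1(raw).hexdigest() != roster_id:
        raise ValueError("roster data does not match its id")
    return raw

data = b'format_version "1"\n'
rid = hashlib.sha1(data).hexdigest()

assert get_roster_data(rid, data) == data                    # check passes
assert get_roster_data("bogus", data, verify=False) == data  # check skipped
```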

 > So you save lots of
 > checking of rosters you don't necessarily end up using.  Furthermore,
 > most of your remaining time is spent, uh, "express-parsing" rosters to
 > see if they're relevant. If we could find relevant rosters quickly,
 > the remaining saving for both changes could be much less significant.

The current annotate code considers every roster to be relevant.
There is a comment in the code about this:

  // If we have content-marks which are *not* equal to the current
  // rev, we can jump back to them directly. If we have only a
  // content-mark equal to the current rev, it means we made a
  // decision here, and we must move to the immediate parent revs.
  //
  // Unfortunately, while this algorithm *could* use the marking
  // information as suggested above, it seems to work much better
  // (especially wrt. merges) when it goes rev-by-rev, so we leave it
  // that way for now.

But indeed, if you could skip the non-relevant rosters, it could be a
big win.
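To illustrate the shortcut the code comment describes, here is a toy sketch
(the history and mark assignments below are invented, real rosters can carry
several content marks per node, and annotate's actual traversal is more
involved):

```python
# Tiny linear history D -> C -> B -> A; the file's content was set in A
# and changed in B, so C and D both carry content-mark B.
parents = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}
content_mark = {"D": "B", "C": "B", "B": "B", "A": "A"}

def revs_visited_jumping(start):
    """Visit revisions for one file, jumping via content-marks."""
    visited, rev = [], start
    while True:
        visited.append(rev)
        mark = content_mark[rev]
        if mark != rev:
            # Content unchanged here: jump straight to the marked rev,
            # skipping every intermediate revision.
            rev = mark
        elif parents[rev]:
            # A content decision was made here: step to the parent.
            rev = parents[rev][0]
        else:
            return visited

print(revs_visited_jumping("D"))  # ['D', 'B', 'A'] -- C is skipped
```

A rev-by-rev walk would visit all of D, C, B, A; the mark-based walk visits
only the revisions where the file's content could actually have changed.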

 > Some thoughts on this to throw into the philosophical debate that may
 > follow:
 > 
 >  - I'm very supportive of the validate-everything approach taken by
 >    monotone, the reasons previously stated are and will remain sound.
 >    It does come at a cost, and some operations may not wish to pay
 >    that cost or need the assurances it provides.

Me too; I generally consider skipping validation only when validation time
is a large fraction of total runtime and the validation isn't doing
much. In this case it's validating that SQLite didn't retrieve the
wrong bits, and that the disk/filesystem didn't silently corrupt
them. That's why, in this case, I think that an option to disable the
check, with the check off by default for annotate, could be a good
choice.

 > From: Graydon Hoare <address@hidden>
 > Subject: [Monotone-devel] Re: Improving the performance of annotate
 > 
 > [ nifty rewriting of the roster storage approach ]

Any guess as to how difficult that would be?  Given your description,
I don't think I would be qualified to make the change.  If it would
take a while to make this change, is it worth transitioning through a
relatively ugly hack to make it tolerably fast while work on the
correct fix progresses?
        -Eric



