monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] [patch] roster-deltas (40% pull speedup)


From: Nathaniel Smith
Subject: [Monotone-devel] [patch] roster-deltas (40% pull speedup)
Date: Mon, 21 Aug 2006 01:22:20 -0700
User-agent: Mutt/1.5.12-2006-07-14

The branch I've been working on the last few weeks,
.roster-no-hash.roster-delta, is just about ready for mainline.

The change is very large -- 32 files changed, 2387 insertions(+), 1073
deletions(-) -- and messes about with some very important pieces.  In
particular, if there are bugs in roster reconstruction, that could
jeopardize our sanity checking generally.  So, I'd really appreciate
review of this code before it lands.

The basic goal is to remove the very expensive writing and hashing
of roster full-texts that we currently do for every revision.  The
basic problem here is that each and every revision has a full roster
associated with it.  All of these rosters together, for
net.venge.monotone*, amount to somewhere around 1.5 gigabytes of
data.  That's a lot of data to serialize into a textual format in
memory, then run xdelta over.

The basic strategy we use is:
  -- Write a roster-specific "delta" format; basically a souped up
     cset.  roster_deltas are to csets as rosters are to manifests.
     These are much, much cheaper to calculate and apply than xdeltas,
     especially when you factor in the cost of serialization and
     parsing.
  -- Hash the _deltas_, rather than the full-texts in the database.
     This gives us protection against disk corruption, and that's all
     we need for rosters anyway.
  -- Wrap a write-back LRU cache around roster access in the
     database.  We used to have "pending", "vcache", and "rcache" that
     we would check in before going to disk; these are now unified,
     with a single memory bound, and the cache can store fully parsed
     roster_t's.

On my machine, this branch as it stands gives a 41.8% reduction in cpu
cycles (measured by oprofile) used by the client during initial pull
of net.venge.monotone*.  The rough breakdown is something like:
roster-delta's alone, without a specialized roster cache (but with
delayed writing of roster full-texts), give ~25%; adding the
write-back LRU cache gives another ~5%; and then teaching
make_roster_for_merge to read parent rosters directly out of the cache
rather than copying them gives the rest.

It probably does significantly better on more straight-line histories,
such as those produced by cvs_import; the code that generates rosters
for merge revisions is significantly less efficient than the code that
produces rosters for non-merge revisions, and this code is now taking
35% of time on nvm*.


In more detail, we make the following changes:
  - In the database layer, refer to rosters by local ids, rather than
    SHA1 of the full roster text:
    - Rearrange database.hh into individual sections that make sense.
    - Add the ability to bind integers (i.e., use
      sqlite3_bind_int64) to our query interface.
    - Modify schema so that rosters are referred to by an INTEGER
      PRIMARY KEY, and have a checksum that is on the delta or data
      stored directly in each row, rather than the data one can
      eventually reconstruct from that row.
    - Factor formerly-shared reconstruction code into two different
      functions; one for files/manifests, and one for rosters.
    - To facilitate this factoring, Move reconstruction path finding
      algorithm out to new file graph.{hh,cc} (more generic graph
      algorithms could and probably should move here).

  - A new delta format for rosters:
    - New files roster_delta.{hh,cc}, implementing a cset-like format,
      but expressed in terms of node_id's rather than paths, using
      full attrs, and including markings.
    - Add new methods on roster_t's that are more direct and less
      checked, for these deltas to use.
    - In all the places in the unit_tests where we generate rosters
      for testing, also test the roster_delta functionality.

  - Unified roster cache:
    - Clean up lru_cache.hh to better match local coding style, remove
      unused functionality, and fix some bugs.
    - Clone it to lru_writeback_cache.hh, and then modify this to
      implement dirty-tracking and writeout of dirty objects when they
      are flushed from the cache.
    - The old pending_writes is now pending_files, and database.cc now
      uses lru_writeback_cache.hh to implement pending rosters.
    - A new get_roster method in the database interface, that returns
      a const shared_ptr to the cached roster -- if you only need
      read-only access to a roster, this avoids a copy.
    - make_roster_for_nonmerge now uses this.
    - Note: maintaining cache coherency through transaction begin /
      commit / rollback is somewhat tricky; these paths would be good
      to double-check.

  - Migration code for the above:
    - New command 'db regenerate_rosters'.
    - 'db migrate' simply fixes schema, and empties the roster tables;
      users must run 'db regenerate_rosters' after 'db migrate'.
    - 'db migrate' is now much smarter about telling users what extra
      fixups they may need to run.
    - Add detection of this new possible case to the database schema
      checking code, so that monotone detects it at startup and
      refuses to run until rosters have been regenerated.

  - Other:
    - Resurrect the manifest_data type, for code that actually
      produces manifests (e.g., write_manifest_of_roster, which used
      to claim to return a roster_data).

Still needs to be done:
  - The tests database_check_for_normalization and
    _MTN_case-folding_security_patch both have hard-coded, invalid
    databases; the migration to the new roster format notices this
    invalid-ness and fails, preventing the test proper from working.
    Need to migrate these forward by hand somehow, probably, I guess?
  - Need to add the appropriate subtest to schema_migration -- just
    waiting until I'm sure we don't be making any more changes to the
    format.

Big question: are we confident that these changes cannot lead to data
corruption?  In particular, our defenses against broken revisions are
all done using rosters.  The kind of thing I'm worried about is, we
put a roster in the database, and due to a bug in the
delta/application code, we later get a different one back out.  We use
this corrupted roster to generate a corrupted revision, or to validate
an incoming revision, and we end up spreading around nonsensical
revisions (as has happened in the past, pre-rosters).  Is this a
reasonable worry?  What defenses can we make against it?

One thing we could do is re-check manifest ids whenever we reconstruct
a roster.  I don't _think_ this would have a noticeable effect on
performance; we would only have to do this on cache misses, and cache
misses are 1) already expensive, 2) pretty rare, at least on pull.
(It wouldn't help 'annotate' any, but that's a different story.)  The
main problem is that it's rather annoying to get to the manifest_id
from get_roster... if we do this, perhaps we should store the
manifest_id in the rosters/roster_deltas tables, next to the on-disk
checksum?

(I've attached the patch, but for the database.{cc,hh} changes you may
find it much easier to just check out the branch and look at them,
since whole chunks of those files got rewritten.)

-- Nathaniel

-- 
"If you can explain how you do something, then you're very very bad at it."
  -- John Hopfield

Attachment: roster-no-hash.roster-delta.patch
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]