From: Zack Weinberg
Subject: [Monotone-devel] some observations about roster node IDs
Date: Thu, 28 Jun 2007 15:07:47 -0700
Performance testing on the file_path/split_path changes led me to thinking about node_id lookups. Some things I've noticed:

regenerate_caches (on a monotone database) spends nearly seven percent of its runtime in the dynamic_cast<> implementation, converting node_t to dir_t and/or file_t.

We are currently doing node ID lookups with a hybrid_map, which is both an STL map (i.e. a red-black tree) and a hashtable. Lookups do not show up in my profiles, but I suspect they would for other workloads.

Node IDs are 32-bit integers, so I decided to figure out how dense they are in the number space. The attached graphs show what that looks like. Start with nvm.png, which has one dot for each node ID appearing in each revision found by 'mtn automate select b:net.venge.monotone'. Revisions increase vertically, in toposort order. You can clearly see some milestones in development; for instance, the sudden spike in directories near roster #3200 probably represents the new test system landing.

'all.png' plots the same thing for *all* revisions in my local database, ordered alphabetically by revision ID; this goes some way toward explaining the huge gap between nids 100 and 4000 (give or take) -- revisions not on nvm have taken that part of the number space.

So this doesn't look all that dense, but the thing to realize is that most of the gaps are small. 'gaphist.png' shows a histogram of the base-10 log of the size of all gaps (in the n.v.m data), and 'runhist.png' shows the same histogram for the runs (i.e. stretches of consecutive in-use numbers). Note the differing y-axis scales.

----

Some conclusions:

* It would not be totally crazy to use two vectors for the node ID -> node_t map: one for true node IDs, one for temp node IDs. (There are no temp node IDs in any of the graphs, of course. Temp node IDs start at 2^31, so clearly we cannot use one vector for both.) The true-node-ID vector would have a lot of unused entries, but I suspect its memory usage would actually be better than what we are currently doing, and it doesn't cost too much to skip 3000 unused slots in an iteration over an 8000-element vector. One does worry about scaling; some sort of sparse array structure might work better.

* We should seriously consider separating dir nodes from file nodes so we're not blowing so much time in dynamic_cast. I made a stab at this locally; it gets messy, but not, I think, *too* messy.

* We might want to think about packing node IDs better. I can think of several algorithms of varying degrees of cleverness, but suspect someone else has a better sense of how to do it.

Data sets and/or crazy shell one-liners to extract them from your very own database are available on request. Be warned that they get huge; the data set corresponding to 'all.png' has 12 billion records!

zw
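The two-vector idea in the first bullet could look roughly like the sketch below. All names here (node, two_vector_node_map, and so on) are illustrative stand-ins, not monotone's actual types; the only fact taken from the mail is that temp node IDs start at 2^31, so the high bit selects the vector.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for monotone's node_t; purely illustrative.
struct node { };
using node_t = std::shared_ptr<node>;

// Two plain vectors instead of a hybrid_map: true node IDs index one,
// temp node IDs (which start at 2^31) index the other after masking
// off the high bit. Each gap in the ID space costs one null pointer.
class two_vector_node_map {
  static const std::uint32_t temp_bit = 0x80000000u;
  std::vector<node_t> true_nodes;
  std::vector<node_t> temp_nodes;

  std::vector<node_t> & pick(std::uint32_t nid, std::uint32_t & idx) {
    if (nid & temp_bit) {
      idx = nid & ~temp_bit;
      return temp_nodes;
    }
    idx = nid;
    return true_nodes;
  }

public:
  void insert(std::uint32_t nid, node_t n) {
    std::uint32_t idx;
    std::vector<node_t> & v = pick(nid, idx);
    if (idx >= v.size())
      v.resize(idx + 1);        // unused slots stay null
    v[idx] = n;
  }

  node_t lookup(std::uint32_t nid) {
    std::uint32_t idx;
    std::vector<node_t> & v = pick(nid, idx);
    return idx < v.size() ? v[idx] : node_t();
  }
};
```

Lookup is then one branch plus one bounds-checked index, and iterating over all nodes just skips null slots, which is what makes the sparseness cheap as long as the gaps stay small.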
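On the dynamic_cast point, one shape the dir/file split might take (a hedged sketch; the type names here are invented and do not match roster.hh) is an explicit type tag in the base node, so a downcast becomes one integer compare instead of an RTTI walk:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

enum node_type { ntype_dir, ntype_file };

struct node {
  node_type type;
  explicit node(node_type t) : type(t) {}
  virtual ~node() {}
};

struct dir_node : node {
  std::map<std::string, std::shared_ptr<node> > children;
  dir_node() : node(ntype_dir) {}
};

struct file_node : node {
  std::string content_id;  // illustrative stand-in for a file_id
  file_node() : node(ntype_file) {}
};

// Cheap downcasts: check the tag, then static_cast. Returns null on a
// type mismatch, mirroring dynamic_cast's pointer behavior.
inline dir_node * as_dir(node * n) {
  return n->type == ntype_dir ? static_cast<dir_node *>(n) : 0;
}
inline file_node * as_file(node * n) {
  return n->type == ntype_file ? static_cast<file_node *>(n) : 0;
}
```

The messiness the mail alludes to presumably shows up in keeping the tag and the dynamic type in sync at every construction site; the win is removing dynamic_cast from the hot path entirely.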
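For the packing bullet, the least clever scheme would be an order-preserving renumbering that closes the gaps. This is only a sketch of the mapping step; a real repacking pass would also have to rewrite every roster that mentions the old IDs, which is where the actual difficulty lives.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Build an old-ID -> new-ID mapping that closes all gaps while
// preserving relative order. Starting new IDs at 1 is an assumption
// made for illustration, not monotone's actual convention.
std::map<std::uint32_t, std::uint32_t>
pack_node_ids(std::set<std::uint32_t> const & live_ids)
{
  std::map<std::uint32_t, std::uint32_t> remap;
  std::uint32_t next = 1;
  for (std::set<std::uint32_t>::const_iterator i = live_ids.begin();
       i != live_ids.end(); ++i)
    remap[*i] = next++;
  return remap;
}
```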
Attachments: nvm.png, all.png, gaphist.png, runhist.png (PNG images)