[Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

monotone-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)

From:	Graydon Hoare
Subject:	[Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
Date:	Tue, 22 Aug 2006 14:24:00 -0700
User-agent:	Thunderbird 1.5.0.5 (Windows/20060719)

Nathaniel Smith wrote:

The change is very large -- 32 files changed, 2387 insertions(+), 1073
deletions(-) -- and messes about with some very important pieces.  In
particular, if there are bugs in roster reconstruction, that could
jeopardize our sanity checking generally.  So, I'd really appreciate
review of this code before it lands.


First off: fantastic work!

I only have minor nit-picky comments at this point; as far as "the plan"we've been discussing, this looks basically perfect, I'm surprised yougot it done so fast.

I haven't carefully gone over the transaction commit / writeback / flushordering in database.cc, which I think you probably want. I'll trytaking a whole-file read soon, but I'd be very surprised if any bugs inthere would avoid tripping a fatal assertion due to a missing db row.

Good testcase: download a cvs archive from sourceforge, import it, andsync it back and forth a few times.


Nit picking follows:

--- lru_writeback_cache.hh      7e0017442df3ec9e2a386acb3a424ffffd60a6a7
+++ lru_writeback_cache.hh      7e0017442df3ec9e2a386acb3a424ffffd60a6a7
+/**
+ * @brief Template cache with an LRU removal policy.
+ * @class LRUCache
+ *
+ * @par
+ * This template creats a simple collection of key-value pairs that grows
+ * until the size specified at construction is reached and then begins
+ * discard the Least Recently Used element on each insertion.


Modify this comment to describe the dirty set.

--- roster_delta.cc     1d44ccf2b5bf2911ce0fe1264bed6a33c285173f
+++ roster_delta.cc     1d44ccf2b5bf2911ce0fe1264bed6a33c285173f
@@ -0,0 +1,483 @@
+#include <set>
+#include <map>
+
+#include <boost/lexical_cast.hpp>
+
+#include "safe_map.hh"
+#include "parallel_iter.hh"
+#include "roster_delta.hh"
+#include "basic_io.hh"
+#include "paths.hh"
+
+using boost::lexical_cast;
+

+namespace+{

+
+  struct roster_delta_t
+  {
+    typedef std::set<node_id> nodes_deleted_t;
+    nodes_deleted_t nodes_deleted;
+    typedef std::map<std::pair<node_id, path_component>, node_id> dirs_added_t;
+    dirs_added_t dirs_added;
+    typedef std::map<std::pair<node_id, path_component>, std::pair<node_id, 
file_id> > files_added_t;
+    files_added_t files_added;
+    typedef std::map<node_id, std::pair<node_id, path_component> > 
nodes_renamed_t;
+    nodes_renamed_t nodes_renamed;
+    typedef std::map<node_id, file_id> deltas_applied_t;
+    deltas_applied_t deltas_applied;
+    typedef std::set<std::pair<node_id, attr_key> > attrs_cleared_t;
+    attrs_cleared_t attrs_cleared;
+    typedef std::set<std::pair<node_id, std::pair<attr_key, std::pair<bool, attr_value> 
> > > attrs_changed_t;
+    attrs_changed_t attrs_changed;

Stylistic: I think I'd prefer all the typedefs first, then the slotdefs. Also wrap at 80 cols please.

Semantic: I'm slightly concerned that this is the non-invertablerepresentation. The version I was working on (which, granted, ishorribly incomplete) was invertable. You can only reconstruct backwards.Does this concern you at all? The only inversion cases I could imaginewere just that: imaginary. I guess it's private so we can always revisit.

+
+    // nodes_deleted are automatically removed from the marking_map; these are
+    // all markings that are new or changed
+    typedef std::map<node_id, marking_t> markings_changed_t;
+    markings_changed_t markings_changed;
+
+    void
+    apply(roster_t & roster, marking_map & markings) const;
+  };
+
+  void
+  roster_delta_t::apply(roster_t & roster, marking_map & markings) const
+  {
+    // detach everything that should be detached
+    for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != 
nodes_deleted.end(); ++i)
+      roster.detach_node(*i);
+    for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != 
nodes_renamed.end(); ++i)
+      roster.detach_node(i->first);
+    // delete the delete-able things
+    for (nodes_deleted_t::const_iterator i = nodes_deleted.begin(); i != 
nodes_deleted.end(); ++i)
+      roster.drop_detached_node(*i);
+    // add the new things
+    for (dirs_added_t::const_iterator i = dirs_added.begin(); i != 
dirs_added.end(); ++i)
+      roster.create_dir_node(i->second);
+    for (files_added_t::const_iterator i = files_added.begin(); i != 
files_added.end(); ++i)
+      roster.create_file_node(i->second.second, i->second.first);
+    // attach everything
+    for (dirs_added_t::const_iterator i = dirs_added.begin(); i != 
dirs_added.end(); ++i)
+      roster.attach_node(i->second, i->first.first, i->first.second);
+    for (files_added_t::const_iterator i = files_added.begin(); i != 
files_added.end(); ++i)
+      roster.attach_node(i->second.first, i->first.first, i->first.second);
+    for (nodes_renamed_t::const_iterator i = nodes_renamed.begin(); i != 
nodes_renamed.end(); ++i)
+      roster.attach_node(i->first, i->second.first, i->second.second);

Stylistic: a bit more whitespace and wrapping, please. Also I'm shiftingcommenting style towards full sentences starting with a capital letterand ending with a period. I know, ghastly. What's happening?!

+
+    // roight, all that tricky tree-rearranging done, just have to do some
+    // individual node edits now


"roight"?

+  void
+  delta_only_in_to(node_t new_n, roster_delta_t & d)
+  delta_in_both(node_t old_n, node_t new_n, roster_delta_t & d)


Smells like duplicated code from roster.cc. Is it refactorable?

+    for (roster_delta_t::nodes_deleted_t::const_iterator i = 
d.nodes_deleted.begin(); i != d.nodes_deleted.end(); ++i)
+    for (roster_delta_t::nodes_renamed_t::const_iterator i = 
d.nodes_renamed.begin(); i != d.nodes_renamed.end(); ++i)
+    for (roster_delta_t::dirs_added_t::const_iterator i = 
d.dirs_added.begin(); i != d.dirs_added.end(); ++i)


... wrap.

--- roster_delta.hh     1939a0f9715abe7b3e01f3c6887fecd5fbb05323
+++ roster_delta.hh     1939a0f9715abe7b3e01f3c6887fecd5fbb05323
@@ -0,0 +1,30 @@
+#ifndef __ROSTER_DELTA_HH__
+
+//////////
+// Experimental roster delta stuff


Axe this comment.

         lru_cache.h                                                    \
+        lru_writeback_cache.h                                          \

Do we really need lru_cache.h to exist? Why not use lru_writeback_cachein all cases and ignore the dirty bits when irrelevant?

-#undef _REENTRANT
-#include "lru_cache.h"
+#include "lru_cache.hh"

If you're using lru_cache.hh (not .h), make sure you change the makefileand commit it. Or just toss lru_cache.{h,hh} as discussed above.

@@ -77,9 +78,10 @@ namespace
 {
   struct query_param
   {
-    enum arg_type { text, blob };
+    enum arg_type { text, blob, s64 };
     arg_type type;
     string data;
+    ::s64 integer;
   };

Hm. I accept that sqlite prefers signed integers, but I sure don't. Canwe push the conversion to and from unsigned down inside this interface,so it only happens in database.cc?

@@ -748,6 +784,12 @@ database::fetch(results & res,
                               SQLITE_STATIC);
           }
           break;
+        case query_param::s64:
+          {
+            sqlite3_bind_int64(i->second.stmt(), param,
+                               idx(query.args, param - 1).integer);
+          }
+          break;
         default:
           I(false);
         }


I guess here, in particular. You could isolate it in one line here.

@@ -783,10 +825,19 @@ database::fetch(results & res,

i->second.count++;- E(want_rows == any_rows || want_rows == nrow,

-    F("wanted %d rows got %d in query: %s") % want_rows % nrow % 
query.sql_cmd);
+  I(want_rows == any_rows || want_rows == nrow);
 }

Is there a reason for this hunk? I prefer the former message: moredetail for diagnostics.

+unsigned long
+database::roster_size_estimator::operator()(cached_roster const & cr)
+{
+  I(cr.first);
+  I(cr.second);
+  // do estimate using a totally made up multiplier, probably wildly off
+  return cr.first->all_nodes().size() * 175;
+}

Maybe pick this more precisely and move it to constants.{cc,hh} (moregenerally, maybe those numbers should be come hooks). For example:Assume dir nodes make up a trivial portion of the roster, and file nodesgenerally have no attributes. Then:


       - 40 bytes of sha1
       - average path component length, say 10 bytes
       - 5 * sizeof(void*) for internal fields
         (+/- a factor of 2 depending on how strings, shared_ptrs, maps
          are implemented)

Should be more like 70-80 bytes.

-database::remove_version(hexenc<id> const & target_id,
-                         string const & data_table,
-                         string const & delta_table)


I wonder if it's wise to remove this function. Isn't it useful?

--- database.hh 5fd8600ce422d017284baa8fa0e38b6899a32e02
+++ database.hh 441546750699828b5f3c25d69e27427009b29d6d
@@ -30,6 +30,9 @@ int sqlite3_finalize(sqlite3_stmt *);
 #include "selectors.hh"
 #include "vocab.hh"

+// hmm, would be better not to include this everywhere...

+#include "lru_writeback_cache.hh"
+


Indeed. Maybe move the caches to a pimpl (can be done later, I don't mind).

Also might want to take the opportunity to remove the hilarious commentabout reverse deltas being the most confusing part of the program.Hohoho happy days.

============================================================
--- numeric_vocab.hh    87e847c2ff9af6c90dbe248c244859e4cab7d34e
+++ numeric_vocab.hh    319e12738bcd5ad090a7a0d994a10f9ab7a2a8ff
@@ -20,6 +20,11 @@ typedef uint64_t u64;
 typedef uint32_t u32;
 typedef uint64_t u64;

+typedef int8_t s8;

+typedef int16_t s16;
+typedef int32_t s32;
+typedef int64_t s64;
+


Signed integral types are the devil.

--- roster.cc   d03806d489e8b30fc4564b9a4f025669770403d6
+++ roster.cc   4996a8586b27a298548151e97c3b59e966bcb704
+void
+roster_t::set_delta(node_id nid, file_id const & new_id)


s/set_delta/set_content/

@@ -2439,7 +2486,6 @@ parse_marking(basic_io::parser & pa,
           pa.str(k);
           pa.hex(rev);
           attr_key key = attr_key(k);
-          I(n->attrs.find(key) != n->attrs.end());
           safe_insert(marking.attrs[key], revision_id(rev));
         }


Slightly nervous about this. Why is it ok?

--- roster.hh   43a7100415b7fe11037a50fbd8cdd0cf260b926d
+++ roster.hh   2f33287325fd542a21217aeba673015b0c271912
@@ -14,6 +14,7 @@

#include <boost/shared_ptr.hpp>+#include "basic_io.hh"

You can just forward-declare the namespace basic_io with incompletetypes parser and printer in it.


Otherwise looks good.

-graydon

[Prev in Thread]

Current Thread

[Next in Thread]

[Monotone-devel] [patch] roster-deltas (40% pull speedup), Nathaniel Smith, 2006/08/21
- Re: [Monotone-devel] [patch] roster-deltas (40% pull speedup), Eric Anderson, 2006/08/21
  - Re: [Monotone-devel] [patch] roster-deltas (40% pull speedup), Nathaniel Smith, 2006/08/21
    - Re: [Monotone-devel] [patch] roster-deltas (40% pull speedup), Eric Anderson, 2006/08/22
- [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup), Graydon Hoare <=
  - Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup), Nathaniel Smith, 2006/08/23
    - Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup), Eric Anderson, 2006/08/23

Prev by Date: Re: [Monotone-devel] [patch] roster-deltas (40% pull speedup)
Next by Date: [Monotone-devel] RFC: Unit test improvements
Previous by thread: Re: [Monotone-devel] [patch] roster-deltas (40% pull speedup)
Next by thread: Re: [Monotone-devel] Re: [patch] roster-deltas (40% pull speedup)
Index(es):
- Date
- Thread