monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] database compaction


From: Markus Schiltknecht
Subject: [Monotone-devel] database compaction
Date: Wed, 17 Oct 2007 21:35:35 +0200
User-agent: Icedove 1.5.0.12 (X11/20070730)

Hi,

I've recently been trying to implement some of the ideas from the database compaction wiki page [1]. I'd like some feedback before continuing. You can find code in the new branch n.v.m.experiment.db-compaction. Note, that there are currently *three heads* for different things. Please do not merge them, yet!


== Compact Heights ==

As the wiki says, uleb128 wouldn't work too well, because compressed heights couldn't be compared with memcmp(). As I've had to do a lot with JPEGs recently, I've stumbled across Huffman encoding. Given a hand-crafted Huffman Tree, the memcmp() property could be maintained. But even though that idea fascinated me, and I've already begun to compile statistics about distribution of height values, I had to realize that the potential gain is very limited. As an example, my current working database is 618.1 MiB. Trimming all heights to a single-byte value - which is way too optimistic - results in a 612.8 MiB file. Maximum savings: less than 1% :-( Thus overly complicated things like Huffman are definitely not worth the effort. I didn't write any code here.


== Putting heights in the revisions table ==

I'm not sure why exactly heights haven't been stored in the revisions table from day one on, but I certainly think it's a good idea. It saves lookups and increases locality in case one needs the heights. And I don't think it hurts much if you don't. To check confirm or reject these assumptions, please refer to revision 45501e62.

Does anybody know any hard facts about sqlite3's per-row header size? There's the row id, which is a 64bit. If that's the per-row header size we store 8 bytes of headers and 40 bytes of hex hashes. On my working database with 54311 revisions, that should give a 2.5 MiB, plus yet another index on heights(revision) we save.

I didn't complete the schema migration, but some SQL code is already there. Manual migration results in 610.8 MiB, a saving of about 1.2%. Not that much either, but I'd vote for that small change anyway.


== Storing hashes in binary ==

Now, this one could really save us space and disc I/O. Every hash currently takes 40 bytes because it's stored in hex. Binary hashes could save 50% for every hash in the database. And we maintain lots of hashes there!

I've added encode_hexenc() and decode_hexenc() calls in database.cc where appropriate. I didn't change the internal type of revision_id. So this really only changes the database storage format. One could possibly remove lots of these hex conversions if we were using binary hashes internally, too.

Anyway, please see revision fd4fcc55, which already passes most tests and does a perfect schema migration. I think I've converted pretty much all hashes. This time no estimates, only bare numbers: 565.6 MiB, 8.4% saving here.

BTW: querying the database needs some more keystrokes to not cripple the terminal with binary output. But it's simple enough, IMO. An example:

  SELECT hex(id) FROM revisions
    WHERE id = x'fd4fcc55bcd79b26fa77ebb12e07b89b49e6826b';


== Using rowids or lookaside tables ==

The wiki has several paragraphs about using rowids or lookaside tables. I don't particularly like that idea, because a) it increases the number of lookups needed, b) savings are dubious and c) it is database specific (okay, that probably doesn't count...). I might rethink b) as soon as we switch to longer hashes, but 20 bytes hardly justify the extra row headers (and indices) necessary.


== A better, combined index for revision_certs ==

As Ralf and Lapo recently figured out, the indices on revision_certs aren't optimal. In revision 212d7e28 I've replaced the combined unique index on (name, id, value, keypair, signature) with one on (name, value, id, keypair, signature). That allows us to drop the extra index on (name, value) called revision_certs__name_value index.

That change saves us about 2.7% at 600.9 MiB. Complete schema migration is already included.



As a quick check, I've again run the commit and annotate benchmarks. See below for the results. For the commit benchmark, all three variants show small performance gains, mostly below 5% - but only rarely worse than reference. The annotate benchmark gives a slightly different picture: while 'better-index' and 'included-heighs' aren't any faster, rather slightly slower, the binary hashes make for a > 10% gain. (I'd have expected the 'included-heights' to gain on annotate, because I thought fewer seeks were necessary. But as long as the database fits in memory, that might not really be true. Other explanations?).

I'd like to merge these changes and implement a single schema migration step. That shouldn't be too much work anymore.

What do you think? Any objections, comments, questions?

Regards

Markus


[1]: http://www.venge.net/mtn-wiki/DatabaseCompaction

[2]: the benchmark results. These were the monotone revisions measured:

ref:              4c5b60a4cb72f99d6ec676bb80b7cdc041099685
included_heights: 45501e62d710ba58d60fcbd203966dc653911a17
better_index:     212d7e289d89fac869924e7753a06871fb73c204
bin_hashes:       fd4fcc55bcd79b26fa77ebb12e07b89b49e6826b

examples/commit.sh, for the 5 MB test case I've run 20 iterations, for 50 MB 10, and 500 only 3 iters:

address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 5MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv 5MBtxt-ref-memtime.csv 5MBtxt-included_heights-memtime.csv 5MBtxt-better_index-memtime.csv 5MBtxt-bin_hashes-memtime.csv p
commit-avg-resident-MiB                   3.47                                
3.47                            3.49                          3.46  0.99
    commit-avg-size-MiB                   9.41                                
9.43                            9.40                          9.40  0.99
commit-max-resident-MiB                   5.66                                
5.62                            5.65                          5.66  0.00
    commit-max-size-MiB                  10.95                               
10.93                           10.94                         10.96  0.00
     commit-num-samples                 180.35                              
174.95                          182.05                        181.30  0.69
     commit-system-time                   0.03                                
0.02                            0.02                          0.02  0.11
       commit-user-time                   0.44                                
0.43                            0.44                          0.43  0.65
       commit-wall-time                   0.98                                
0.94                            0.97                          0.98  0.64
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 5MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv 5MBrand-ref-memtime.csv 5MBrand-included_heights-memtime.csv 5MBrand-better_index-memtime.csv 5MBrand-bin_hashes-memtime.csv p
commit-avg-resident-MiB                    3.50                                 
3.53                             3.54                           3.59  0.87
    commit-avg-size-MiB                    9.45                                 
9.45                             9.49                           9.51  0.90
commit-max-resident-MiB                    5.73                                 
5.71                             5.73                           5.74  0.05
    commit-max-size-MiB                   11.02                                
11.00                            11.02                          11.04  0.00
     commit-num-samples                  187.60                               
180.85                           179.45                         178.05  0.57
     commit-system-time                    0.02                                 
0.02                             0.03                           0.02  0.64
       commit-user-time                    0.46                                 
0.45                             0.45                           0.44  0.01
       commit-wall-time                    1.01                                 
0.97                             0.96                           0.96  0.64
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 50MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv 50MBtxt-ref-memtime.csv 50MBtxt-included_heights-memtime.csv 50MBtxt-better_index-memtime.csv 50MBtxt-bin_hashes-memtime.csv p
commit-avg-resident-MiB                    2.87                                 
2.79                             2.67                           2.67  0.21
    commit-avg-size-MiB                    9.17                                 
8.90                             9.01                           8.72  0.07
commit-max-resident-MiB                    5.54                                 
5.52                             5.53                           5.55  0.03
    commit-max-size-MiB                   10.81                                
10.82                            10.83                          10.85  0.29
     commit-num-samples                  307.00                               
278.20                           298.20                         291.40  0.41
     commit-system-time                    0.02                                 
0.03                             0.03                           0.03  0.61
       commit-user-time                    0.47                                 
0.46                             0.46                           0.44  0.04
       commit-wall-time                    1.65                                 
1.56                             1.68                           1.58  0.61
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# ../compare.py 50MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv 50MBrand-ref-memtime.csv 50MBrand-included_heights-memtime.csv 50MBrand-better_index-memtime.csv 50MBrand-bin_hashes-memtime.csv p
commit-avg-resident-MiB                     2.72                                
  2.85                              2.70                            2.90  0.16
    commit-avg-size-MiB                     8.83                                
  9.02                              8.63                            9.04  0.14
commit-max-resident-MiB                     5.76                                
  5.71                              5.73                            5.77  0.02
    commit-max-size-MiB                    10.93                                
 10.93                             10.94                           10.95  0.38
     commit-num-samples                   284.60                                
285.00                            284.20                          284.00  1.00
     commit-system-time                     0.03                                
  0.04                              0.03                            0.04  0.85
       commit-user-time                     0.46                                
  0.47                              0.47                            0.44  0.38
       commit-wall-time                     1.64                                
  1.55                              1.59                            1.56  0.80
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# 
../compare.py 
500MBtxt-{ref,included_heights,better_index,bin_hashes}-memtime.csv
                        500MBtxt-ref-memtime.csv 
500MBtxt-included_heights-memtime.csv 500MBtxt-better_index-memtime.csv 
500MBtxt-bin_hashes-memtime.csv     p
commit-avg-resident-MiB                     9.62                                
  9.62                              9.80                            9.65  0.84
    commit-avg-size-MiB                    15.56                                
 15.54                             15.73                           15.60  0.80
commit-max-resident-MiB                    20.76                                
 20.76                             20.79                           20.33  0.69
    commit-max-size-MiB                    26.30                                
 26.26                             26.15                           26.04  0.78
     commit-num-samples                   512.00                                
505.00                            481.33                          482.00  0.34
     commit-system-time                     0.06                                
  0.07                              0.08                            0.08  0.13
       commit-user-time                     1.35                                
  1.34                              1.32                            1.32  0.06
       commit-wall-time                     2.70                                
  2.71                              2.54                            2.57  0.32
address@hidden:/home/markus/projects/monotone/nvm.cbench/myresults# 
../compare.py 
500MBrand-{ref,included_heights,better_index,bin_hashes}-memtime.csv
                        500MBrand-ref-memtime.csv 
500MBrand-included_heights-memtime.csv 500MBrand-better_index-memtime.csv 
500MBrand-bin_hashes-memtime.csv     p
commit-avg-resident-MiB                     10.00                               
    9.56                              10.02                             9.93  
0.59
    commit-avg-size-MiB                     15.95                               
   15.52                              15.88                            15.85  
0.58
commit-max-resident-MiB                     22.55                               
   22.18                              22.43                            22.69  
0.90
    commit-max-size-MiB                     27.97                               
   27.69                              27.83                            28.18  
0.84
     commit-num-samples                    490.67                               
  499.67                             505.67                           477.33  
0.77
     commit-system-time                      0.09                               
    0.09                               0.07                             0.08  
0.23
       commit-user-time                      1.27                               
    1.27                               1.27                             1.28  
0.73
       commit-wall-time                      2.64                               
    2.67                               2.69                             2.55  
0.79


examples/annotate.sh, with 5 iterations for each binary. All binaries got a copy of my working database, migrated to their schema.

address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/a-ref-memtime.csv
                          ref/a-ref-memtime.csv 
included-heights/a-ref-memtime.csv better-index/a-ref-memtime.csv 
bin-hashes/a-ref-memtime.csv     p
annotate-avg-resident-MiB                 20.35                              
20.27                          20.31                        19.78  0.05
    annotate-avg-size-MiB                 26.15                              
26.09                          26.11                        25.59  0.05
annotate-max-resident-MiB                 31.22                              
31.18                          31.21                        31.21  0.00
    annotate-max-size-MiB                 36.69                              
36.68                          36.68                        36.70  0.00
     annotate-num-samples               3193.00                            
3204.40                        3256.60                      2993.40  0.00
     annotate-system-time                  1.08                               
1.06                           1.07                         0.99  0.05
       annotate-user-time                  4.25                               
4.28                           4.24                         5.66  0.00
       annotate-wall-time                 18.55                              
18.66                          18.76                        16.83  0.00
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/b-ref-memtime.csv
                          ref/b-ref-memtime.csv 
included-heights/b-ref-memtime.csv better-index/b-ref-memtime.csv 
bin-hashes/b-ref-memtime.csv     p
annotate-avg-resident-MiB                 10.06                              
10.16                          10.28                         9.54  0.01
    annotate-avg-size-MiB                 15.96                              
16.08                          16.13                        15.45  0.01
annotate-max-resident-MiB                 16.71                              
16.87                          16.70                        15.34  0.00
    annotate-max-size-MiB                 22.16                              
22.32                          22.14                        20.85  0.00
     annotate-num-samples               1183.60                            
1290.20                        1232.80                      1034.00  0.01
     annotate-system-time                  0.24                               
0.24                           0.25                         0.21  0.28
       annotate-user-time                  0.82                               
0.82                           0.79                         0.97  0.00
       annotate-wall-time                  7.01                               
7.43                           7.07                         5.96  0.00
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/c-ref-memtime.csv
                          ref/c-ref-memtime.csv 
included-heights/c-ref-memtime.csv better-index/c-ref-memtime.csv 
bin-hashes/c-ref-memtime.csv     p
annotate-avg-resident-MiB                 11.15                              
11.43                          11.16                        11.14  0.66
    annotate-avg-size-MiB                 17.04                              
17.34                          17.05                        17.02  0.58
annotate-max-resident-MiB                 19.32                              
19.60                          19.31                        17.68  0.00
    annotate-max-size-MiB                 24.79                              
25.02                          24.72                        23.14  0.00
     annotate-num-samples               1438.40                            
1393.60                        1427.60                      1305.40  0.40
     annotate-system-time                  0.38                               
0.37                           0.38                         0.35  0.28
       annotate-user-time                  1.46                               
1.46                           1.47                         1.76  0.00
       annotate-wall-time                  8.42                               
8.18                           8.29                         7.19  0.10
address@hidden:/home/markus/projects/monotone/nvm.cbench# ./compare.py 
{ref,included-heights,better-index,bin-hashes}/d-ref-memtime.csv
                          ref/d-ref-memtime.csv 
included-heights/d-ref-memtime.csv better-index/d-ref-memtime.csv 
bin-hashes/d-ref-memtime.csv     p
annotate-avg-resident-MiB                 23.02                              
23.33                          23.12                        22.54  0.10
    annotate-avg-size-MiB                 28.84                              
29.16                          28.93                        28.37  0.09
annotate-max-resident-MiB                 32.78                              
32.57                          32.75                        32.39  0.00
    annotate-max-size-MiB                 38.16                              
38.07                          38.14                        37.88  0.00
     annotate-num-samples               3800.20                            
3878.00                        3903.20                      3529.40  0.01
     annotate-system-time                  0.95                               
1.04                           1.02                         0.98  0.23
       annotate-user-time                  7.52                               
7.55                           7.60                         8.98  0.00
       annotate-wall-time                 21.77                              
22.05                          22.19                        19.44  0.00








reply via email to

[Prev in Thread] Current Thread [Next in Thread]