monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] mtndumb & public cert_id


From: Zbigniew Zagórski
Subject: Re: [Monotone-devel] mtndumb & public cert_id
Date: Wed, 17 Dec 2008 13:14:55 +0100

2008/12/17 Thomas Keller <address@hidden>:
> Zbigniew Zagórski schrieb:
>> Hello,
>>
>> I'm slowly hacking monotone dumb transport for my personal
>> use (i don't have time to polish it to publishable state)
>> and found that it's very slow in creating merkle tree of
>> revisions and certs. It's because there is no access to
>> cert_id (it's SHA1 of all cert values joined with colon).
>>
>> For dumb protocol and any other tool that could simulate
>> monotone server it's a must to have possibility to query for:
>>
>> - packet of specific cert
>> - ids of certificates for given selector (or only revision)
>>
>> Currently cert packet api provids only "get all cert packets
>> of revision". It's usually redundant information because we
>> usually need only cert_id and packet is needed only if
>> algrorithm decides that it's to be transfered.
>>
>> So my proposition is to add new public entity 'cert_id' and
>> commands to retrieve it. For sake of simplicity it will be the
>> same as database cert_id (hash field of revision_certs table).
>
> I never dived much into the merkle tree thing, ...but if I understand you
> correctly this needs the signature / sha1 of a single cert in order to
> determine if it needs to be send over the wire or not, right?

Yes.

Synchronization by merkle trees works in two phases:
 - find-the-difference
 - to the sync only missing/extra elements

All i'm talking is the first step. Finding differences.
In first step you must collect all ids (revision ids, cert ids, key ids) that
exist in database (let's say L).

Remote tree set built from remote site (lat's call it R).

 - pushing you have to push elements from set "L-R" to remote.
 - when pulling you have to retrieve elements from set "R-L"

And regarding pushing, actual data is retrieved and serialized only for pushed
set and not for all certs.

> And you're
> basically looking for something which is faster than packets_for_certs
> REVID which only takes revisions, but not selectors, and which outputs
> more things than you actually need, correct?

No in fact i look for "get all certs" and "give me specific cert" ...

> While adding new commands for this sounds reasonable at a first glance
> (100% backwards compatible with any automation implementation), I wonder
> if it wouldn't be better to just hack packets_for_certs (in a
> backwards-compatible way) f.e. by changing its first parameter, the
> revision ID, to a selector. Now you then still get all cert packets in
> return, not just the IDs which you need for the merkle tree, but is this
> really such a huge speed penalty?

Well after thinking a little bit i need to clarify. The biggest
penalty is cost of thousands of automate calls. It's rather big when
it comes tho  thousands (or tens of thousands) invocations. And it's
kind of waste when you _always_ want all and only id, but not the packet.

[BTW mtndumb is quite stupid, so it synchronizes whole database, so
there is no point for restricting cert set or revision set. ]

Regarding timing:

For my private database (~1200 revs):
   old approach      15s    (~1200 automate calls)
   new approach      2s     (~15 automate calls)

For net.venge.monotone db (14348 revs):
   old approach      180s   (~14400 automate calls)
   new approach      10s    (~70 automate calls)

These are _very_ manual tests and take into account whole process
running, including spawning python, monotone. BTW, on linux timings
are usually twice as faster for old approach so win32 is
also an issue.

All i want is to have ~constant number of automate calls that
return me bulk data from which i can safely build merkle tree.

With old approach i need 2+N calls (two to get revisions and toposort
them, N to get all certs for each revision).

With new i have 3 calls (all revs, toposort, all certs).

[Ah also there are keys, but it's minor issue]

BTW, that's the whole problem with merkle tree: you must get _all_
_ids_ of entities in  order to start synchronization. This is same
with netsync and this is why nuskool  has emerged.

-- 
Zbigniew Zagórski
/ software developer / geek / happy daddy /

reply via email to

[Prev in Thread] Current Thread [Next in Thread]