circle-discuss
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [circle] How are data items stored/forwarded?


From: Paul Campbell
Subject: Re: [circle] How are data items stored/forwarded?
Date: Mon, 4 Oct 2004 08:52:44 -0700
User-agent: Mutt/1.5.6i

On Mon, Oct 04, 2004 at 08:59:12AM +0200, Thomas Voegtlin wrote:
> nice to see some activity on this list again!
> I have been moving to a new place, new job, etc...
> I'm afraid I will not have time to look closely at 
> circle for a few weeks (or months)
> 
> if you find a bug or want to improve something, 
> please feel free to submit a patch

The reason that I'm asking so many questions (and checking into various
things) is that I wrote a lot of code and tinkered with the idea of writing
my own DHT from scratch in the twisted framework. I had mostly written
the core. Then I started looking closely at thecircle again, and I realized
that even after I finish rewriting the core, I've STILL got the work of
doing the user interface, all the services, etc., etc.

It became pretty clear to me that it would be easier/faster to simply
refactor thecircle in Twisted after all.

I started with node.py this week. Sorry Thomas, but your giant polling thread
had to go! I know you are so disappointed. :)

I've completely removed threading from the core code (node.py) and done a lot
of work moving things in and out of various classes to straighten it up.
Right now, it looks like this:

UDPRPC -- Code to run the RPC's. I did away with the packet format from the
original thecircle (decided not to really preserve much except calling
semantics to minimize everything). Now the packet format is like this:

Byte #1: Future use flag, Request/Response flag, Long/Short packet Flag, Pkt #
Byte #2-5: Nonce (ticket #)
And then the rest is the usual safe_pickled info, although you don't need
to encode the nonce in the packet anymore, and I'm removing all references
to the nonces outside of the UDPRPC class.

This substantially shrinks the packet size down smaller than TCP packets,
EVEN with the UDP header encapsulating it.

The protocol works like this:
1. Every RPC must have a unique nonce (ticket # in a lot of the existing
code base). That is how the receiver separates them. Although there is
absolutely nothing wrong with intermixing messages. There is a large time
window however on receiving a complete message (30 seconds).

2. The response *IS* the ACK. So no acking is necessary.

3. Out of order packets and lost packets are not recovered. Basically, the
usual semantics is to retransmit anyways and packet loss rates are typically
very low, so all that TCP ordering/acking silliness is unnecesary.

4. For short messages that fit within a single UDP packet (I arbitrarily
limited the payload to 560 bytes or less since with the header, this nicely
fits within the 576 byte limit on lots of machines), simply send the packet
as-is with a packet # of zero.

5. For "long" packets, send the packets in order, starting with packet #0
as the first packet. Set the "LONG" bit on all packets except for the very
last one. Increment the packet number (4 bits) for each packet sent. This
limits the total message length to 8959 bytes which I felt is more than
sufficient for an RPC protocol (you should be doing something more like
TCP if you want to send more per packet).  All "long" packets must have a
560 byte payload or the protocol will toss it. The LAST packet in the chain
has the "SHORT" bit set (actually, the long bit is simply 0 instead). I didn't
fix it yet, but if the message length just so happens to be exactly a
multiple of 560 bytes, it will end up sending a "short" packet with a length
of zero.

I moved the entire "handler" code into a separate subclass of class Node,
and I allowed two formats for functions (via "add_handler"). The first
format is as before. The second format is to add a bare function. I did this
for a simple reason. After creating a nice dispatcher, the existing code
totally BREAKS the idea of a dispatcher with the giant "perform standard
function" call which is a giant nest of if/elif's! This simple change to the
dispatcher added flexibility and got rid of the standard function.

I loaded all of the various standard calls into a separate class (subclassed
to Node) as well for clarity.

The "call" semantics have changed drastically. Now instead of getting a thread
return and then calling "get_reply" later (often busy waiting), the code
uses the twisted way...simply make the call and you get a "deferred" back.
Then handle the deferred in continuation-passing style.

I'm slowly moving all the "bare" functions from the node.py module back into
the Node class where they belong as well.

I added code to the Peer and Link classes to do their OWN polling & updating
on a per-instance basis. Also, since the Peer class holds everything necessary
to make activate/deactivate hashtable functions, I let the Peer polling tasks
handle that function as well. Also, I moved the invariant checking code out
into Peer/Link where it belongs.

One thing I've noticed a lot is that there is a LOT of invariant checking. And
when you start refactoring the code, you start to notice that "hey, how many
times do I really need to check if it's a valid IP address?" all over the
code. So I've done away with a lot of invariant checking, when it is simply
redundant and doesn't break semantics.

There are no more locks. Everything is single-threaded in the reactor-style
so far. So that should help performance dramatically.

I made a small change (although I haven't really used it yet) in the
safe_pickle code that I stole from banana/jelly in twisted's Perspective
Broker (and yes, I considered using PB before scratching the idea for a lot
of reasons). It now contains a code book. If safe_pickle reads a string that
is in the thesaurus, it codes it as 128+index instead of "S<string>". That
means that all of the usual handler calls ("who are you") are coded in a
single byte instead of a short string. I left the parameters alone so far.
The only trick here is that the code book has to be identical everywhere
unless I code something into the pickled results for version-identification
later on.

So...the reason that I'm asking a lot of questions about the data model and
the search semantics is that those are the last routines in node.py that I
have left to refactor. Everything else is done enough that I can probably
test the code out on a couple machines within a week or two. After that,
it's mostly a matter of refactoring everything else to both get rid of the
threading code and also to adapt it to the twisted way of doing things
(mostly, using continuation-passing-style). And then after that...I can
componentize it to allow for hot loading/unloading of code, changing plugins
on the fly, etc., etc.

I'm not entirely sure that I'm even going to stick with thecircle's current
data model and node structure (it's just a ring with some vaguely
Kademlia-like features right now as far as I can tell). Just that for the
first few trials it may be easier to simply refactor the code rather than
come up with something entirely new and different.

Once I get that far along, if I can find a copy of the old "circleget" code
or make something up similar that relies on just the node.py code and I get
it working more or less with a few nodes, I'll be happy to release a
preliminary version of the code. Then it's "just" a matter of refactoring
everything else with the new structure.

I'm not entirely throwing out PB by the way. I think that Perspective
Broker is a truly awesome and simple way to handle the "daemon/client"
setup that thecircle is moving towards. PB would make the transition almost
trivial since it already has security semantics and handles both remote
objects and remote procedure calls. It is much more tightly integrated than
my own RPC code. It should work fine both over pipes and TCP/IP (where it
was designed). Also, twisted's web interface (and/or Nevow) would make the
HTML interface to thecircle a bit easier to construct.

So in short, if I can get the base "node.py" code off the ground along with
simple translations the other key modules, twistifying thecircle should bring
a lot of benefits.


BTW...I looked at the idea of writing node.py itself as a component and trying
to "componentize" the handler interface. I can't really figure out a good
way to do it yet. If anyone has any ideas, I'll be happy to swap some code.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]