From: Paul Campbell
Subject: [circle] More structural changes
Date: Fri, 15 Oct 2004 21:50:49 -0700
User-agent: Mutt/1.5.6i

Well, I've gotten some time to work on thecircle code again. The more I
looked, the more I realized that once node.py is twistified, the rest is
almost "candy"...direct conversion is easy.

Well, I've got the core node.py nearly done. However, I've decided that more
structural changes are in order.

New organization:

dht.py:
 Defines a component architecture (IDHT) and a base class.
ring.py:
 Defines a pure ring structure.
gfinger.py:
 Defines a global finger structure. Calls ring.py for basic functionality.

These are defined based on some serious thinking about DHT routing tables.
Here's the short version:

1. Most of the theoretical papers on DHT routing tables focus on whether the
optimal minimal structure should use O(1) fingers or O(log N) fingers to
achieve O(log N) or O(log N/log log N) routing length, respectively.

2. There are a few that try to push for a complete list of all N nodes.
Several use gossip protocols to achieve this. This achieves O(1) routing at
a cost of O(N log N) messages (yuck!) under the assumption that most users are
behind high speed connections (not likely!).

3. Borrowing ideas from "XRing", neither approach is optimal. In practice, the
bandwidth required to maintain a few thousand links is in the 1-2 kbps range,
which is easily tolerable for virtually ANY Internet service. With a limit
of a few thousand links, it is also easily possible to achieve a 2 hop routing
length for 1 million nodes, or a 3 hop routing length for networks somewhat
larger than that.

4. The bandwidth to achieve sub-2 hop routing becomes prohibitive for most
connections.

5. The point where the minimal architectures cross the "3 hop" boundary is at
just 100 peers (assuming a base of e...so ln(N)/ln(ln(N)) = 3 when N=100).
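
Just to sanity-check those numbers (this is my own back-of-the-envelope
arithmetic, not anything from the papers):

  import math

  # With K links per node, 2 hops reach roughly K**2 nodes and 3 hops roughly K**3.
  K = 1000
  print(K ** 2)    # ~1,000,000 nodes covered in 2 hops
  print(K ** 3)    # ~1,000,000,000 nodes covered in 3 hops

  # Where the minimal O(log N/log log N) designs cross the 3-hop line (natural log base):
  for N in (50, 100, 200, 1000):
      print(N, round(math.log(N) / math.log(math.log(N)), 2))
  # N=100 gives ~3.0; much beyond ~100 peers, those designs already need more than 3 hops.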

So...maintain a very large and dense peer neighborhood set of left/right
neighbors. This is made as reliable as possible. Something close to 1000
nodes. Then, maintain a table of essentially random peers with the
remaining bandwidth, up to a few thousand of them. Theoretically, one could
create a finger list like Chord, but in reality (as per the XRing paper)
it's not really necessary.

The ring DHT implementation represents the dense neighborhood. However, the
usual thecircle polling has to be removed. Reason: with on the order of 1000
peers, if every neighbor were polled, we'd run into serious bandwidth issues;
imagine 1000 peers all simultaneously polling each other. Each node would have
to handle 1000 poll requests (for a total of 1,000,000 messages across the
neighborhood)!

To avoid this, each node is responsible for its immediate successor (just
like normal). If it detects a change in the successor, it broadcasts this
change, which divides the load somewhat across the broadcast. The broadcast is
done like this:

1. Broadcast to right[1], which is the second right-hand neighbor (right[0]
represents the closest neighbor...who gets its information from the polling
function).
2. Broadcast to right[2] which is the next neighbor. However, it is also
responsible for passing the information on to right[3].
3. Broadcast to right[4]. Right[4] is responsible for contacting right[5],
and right[6]. In turn, right[6] is responsible for contacting right[7].
4. Broadcast to right[8]. Right[8] in turn creates yet another binary tree
of broadcasts to right[9] to right[15].
...and so forth until every successor within range receives the message.
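
Roughly, the forwarding rule looks like this (a sketch only; the function name
and the range bookkeeping are mine, not ring.py's):

  def forward_targets(i, j):
      """right[i], holding responsibility for successors (i, j], returns
      (target, target_end) pairs: right[target] relays on to (target, target_end]
      by the same rule, so each successor receives exactly one message."""
      targets = []
      step = 1
      while i + step <= j:
          nxt = i + step
          targets.append((nxt, min(i + 2 * step - 1, j)))
          step *= 2
      return targets

  def show_tree(i, j, depth=0):
      for nxt, end in forward_targets(i, j):
          print("  " * depth + "right[%d] -> right[%d], covering (%d, %d]" % (i, nxt, nxt, end))
          show_tree(nxt, end, depth + 1)

  # Originator with 16 successors (right[0] is covered by the normal successor polling):
  show_tree(0, 15)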

Exactly one message is generated for each successor/predecessor when a peer
change occurs. This reduces the polling load to just one message per neighbor
change...which gets us down to the bounds mentioned above.

The whole point of creating the broadcast tree is simply that although the
tree isn't necessary (one could simply forward the information around the
ring in both directions), doing so takes O(K) steps, where K is the number of
neighbors, for the information to reach all destinations. Using the broadcast
tree, only O(log2 K) steps are necessary. Some nodes are more loaded than
others on a per-peer-change basis, but it averages out since they take
turns being either leaves or interior nodes in the broadcast tree.

As to the global finger list, there are three ways this can be handled:

1. Peers can simply generate random polls to search the DHT space for
responses. It works, but....err, we're back to the problem we had with
handling large numbers of peers (lots and LOTS of polling messages). The
polls don't have to be random but the results will be similar either way.
2. Peers can do space-walks around the ring. The idea here is to query a
known peer for a random address that is known to be in its peer list. Then
query that peer, and so forth until "wrapping" around the ring back to the
local peer list. Again...same result as #1. The load is only lower because
key lookups are not necessary (as they would be when simply making blind
stabs).
3. Peers can establish a broadcast net. It can be done by gossipping (as
with XRing); that way a somewhat diminished message load occurs...O(log N)
messages. Or they can do something more deterministic. I'm thinking about the
idea of affine points...say a peer has a 160 bit ID. It truncates it at 150
bits. Then the 1024 points formed by prepending each 10 bit prefix
(0, 1, 2, 3, ...1023) to that <postfix> represent the peers that it maintains
communication with (its "global" list). It becomes a matter of constructing a
distributed broadcast between those points, updating each other about any
status changes. The format is again a broadcast tree as above, except that
this time, the exact address may require a lookup (with asymptotically 2-3
hops).
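
The point construction itself is trivial. This is just my reading of the idea;
the names and the SHA-1 ID assumption are illustrative, not gfinger.py's actual
code:

  import hashlib

  ID_BITS = 160
  POSTFIX_BITS = 150    # "truncate the 160 bit ID at 150 bits"

  def affine_points(node_id):
      """Keep the low 150 bits as the <postfix> and pair it with every possible
      10 bit prefix, giving 1024 points spread evenly around the ID space."""
      postfix = node_id & ((1 << POSTFIX_BITS) - 1)
      return [(prefix << POSTFIX_BITS) | postfix
              for prefix in range(1 << (ID_BITS - POSTFIX_BITS))]

  my_id = int.from_bytes(hashlib.sha1(b"some peer address").digest(), "big")
  print(len(affine_points(my_id)))    # 1024 "global" targets; locating the live
                                      # node nearest each one may take a 2-3 hop lookup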

gfinger represents this approach. At this time, the one issue I see with
gfinger is that somehow it should also organize itself according to the
physical latencies of the underlying network. If nodes organize themselves
individually, this is easy to handle: simply execute a localized search
in the neighborhood of the appropriate point. However...this will again
cause overloads due to polling.

The end result is that for a "small" DHT (less than about 1000 nodes),
ring.py will directly track all available peers in a complete routing table
and all messages will take exactly one hop. As the network expands beyond
about 2000 nodes, gfinger.py will take just one hop until the network expands
into millions of nodes, at which point it will take a maximum of 3 hops.
Given the reported sizes of the file sharing networks, it will be quite a
while before an optimal 3-hop network has to be contemplated.

At the node.py level, node searches work like this:

For each DHT implementation, do a lookup for a key. Use the nearest result
from all defined DHTs (two above, but more could be added if improvements
are made).

Use that result to search again recursively until the appropriate node is
found (or the same result gets returned...indicating a terminal point has been
reached).
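
In pseudo-ish Python, the loop would look something like this (the method names,
the RPC hook, and the distance metric are placeholders, not the actual node.py
or IDHT interfaces):

  def ring_distance(a, b, bits=160):
      """Clockwise distance from a to b on the ID ring."""
      return (b - a) % (1 << bits)

  def iterative_find(local_dhts, key, remote_lookup, distance=ring_distance):
      """Ask every DHT component for its best candidate for `key`, keep the
      closest, then ask that peer to repeat the search until the answer stops
      changing."""
      best = min((dht.lookup(key) for dht in local_dhts),
                 key=lambda peer: distance(key, peer.id))
      while True:
          nxt = remote_lookup(best, key)    # RPC: remote peer runs the same multi-DHT lookup
          if nxt.id == best.id:             # same result returned: terminal point reached
              return best
          best = nxt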


As to the organization of the data cache itself, I'm still working on it
conceptually. I believe that cache.py is the right step. The data cache
should be taken out of node.py since it is a "core" module but shouldn't be
part of node.py directly. I was originally enamoured with the "Beehive"
proactive caching idea, but I've discovered a few problems with it:

1. Since the above DHT implementation drops the number of hops down to 1-3,
it completely screws up the basic theory behind "Beehive". "Beehive" relied
on the fact that a search took log(N) steps to reach the destination. In
the above design, to proactively cache even a single data item, it would have
to be replicated across a thousand nodes. I was thinking of something a little
finer-grained than that!

2. The authors are completely hung up on the idea of broadcasting almost exact
information among all replicas. I don't believe this amount of tight
modelling is strictly necessary. It should be possible for a given node to
recognize that item(X) is causing a query overload and to replicate the item
further out across the network, pushing it backwards along all of its
peer links that point in the opposite direction from the "home" node of the
data item, independently of the activities taking place elsewhere for that
data item (a rough sketch follows after this list). At worst, neighborhood
statistics are necessary.

3. It doesn't handle nodes with differing capacities. It seems to require
one to work with the lowest common denominator among differing node resources.
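
Here's the rough shape of what I mean by reacting locally to a query overload
(purely a sketch; the counter, the threshold, and the "opposite direction" test
are all invented for illustration, not anything in cache.py):

  from collections import Counter

  class HotItemReplicator:
      def __init__(self, my_id, distance, threshold=100):
          self.my_id = my_id
          self.distance = distance          # e.g. clockwise distance on the ID ring
          self.queries = Counter()
          self.threshold = threshold

      def on_query(self, key, item, home_id, peers, push):
          """Count queries per item; once an item looks hot, push one extra copy
          along every peer link pointing away from the item's home node, i.e. to
          peers whose queries for this item would tend to route through us."""
          self.queries[key] += 1
          if self.queries[key] < self.threshold:
              return
          self.queries[key] = 0
          my_dist = self.distance(home_id, self.my_id)
          for peer in peers:
              if self.distance(home_id, peer.id) > my_dist:
                  push(peer, key, item)     # each replica decides this purely locally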

I have a lot of problems with the existing 'thecircle' data cache:
1. It leaves a "trail of breadcrumbs" from sources to the eventual nodes
responsible for storing it. This is because it caches the data items locally,
and then gradually forwards them along peer links towards the nodes
responsible for storing the data items. If it were implemented on Chord,
for instance, replication (without the K-factor replication built into
thecircle) would naturally end up O(log N).

2. I haven't exactly figured out how thecircle implements file searching,
but it at least appears to me that it maintains a set of "directories" which
store keywords. This structure is essentially a basic inverted index, right?
If so, then if it weren't for the "salt" trick to mildly permute entries across
a few nodes, thecircle would saturate in a few spots. As it stands, I don't
think even this will save it for long. It has little protection against hot
spots (either store-based or query-based), except for the fact that it likes
to cache everything almost everywhere.

Physically, what I'm beginning to end up with is a collection of about a dozen
modules that are somewhat self-contained and a central "node.py" module that
acts as a controller/container object gluing all the pieces together. So far
I have:

The DHT itself:
dht.py
ring.py
gfinger.py

The networking functions:
rpc.py
udprpc.py
safe_pickle.py

The glue functions:
localserver.py (manages responses to RPC's)
cache.py (data handling functions)
node.py (everything else)
peer.py (manages data cache related to other nodes)

naming.py (actually a catch-all for utility functions)

Right now, I'm keeping them in a subdirectory called "core" since everything
but safe_pickle.py was originally in a single module called "node.py".



