
[circle] Load balancing and hashing keywords


From: Paul Campbell
Subject: [circle] Load balancing and hashing keywords
Date: Mon, 12 Apr 2004 12:07:53 -0700
User-agent: Mutt/1.5.6i

Has anyone paid attention to the recent crop of load-balancing articles?

For instance:
http://iptps04.cs.ucsd.edu/papers/karger-load-balance.pdf

This one points out that virtual peers are a bad thing in general, because
they drive up the load on individual nodes (more neighbors to keep track of).

There is one slight refinement to be had. There are essentially two types of
load swapping. In the first type, a node swaps load with its nearest neighbor.
In the second type, it swaps with a random node somewhere else on the ring.
The second type is far more expensive than the first, since the random node
has to first offload its load onto its neighbors, and then transfer the
entire necessary load after executing both a disconnect and a join operation.
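The two moves can be contrasted with a toy model (purely illustrative; the Node class and key handling here are stand-ins, far simpler than any real DHT):

```python
# Toy model of the two load-swapping moves on a ring (illustrative only).
class Node:
    def __init__(self, pos):
        self.pos = pos        # position on the unit ring
        self.keys = []        # keys this node is responsible for

def neighbor_swap(light, heavy):
    """Type 1: shift the boundary between two adjacent nodes.
    Only the keys between the old and new boundary move."""
    half = len(heavy.keys) // 2
    moved, heavy.keys = heavy.keys[:half], heavy.keys[half:]
    light.keys.extend(moved)

def random_reposition(node, neighbor, heavy):
    """Type 2: a random node leaves (dumping its keys on its neighbor)
    and rejoins next to an overloaded node -- two full key transfers
    plus a disconnect and a join."""
    neighbor.keys.extend(node.keys)   # offload everything before leaving
    node.keys = []
    half = len(heavy.keys) // 2       # rejoin and take half the hot range
    node.keys, heavy.keys = heavy.keys[:half], heavy.keys[half:]
```

Note that the type 2 move touches every key the repositioning node held, plus half the overloaded node's keys, which is where the extra expense comes from.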

It also turns out that in spite of the expense, the second type is a necessity.
See:
http://dbpubs.stanford.edu:8090/pub/2004-18

It seems to me that most of the "virtual node" ideas just accomplish
indirectly what the technique outlined above does directly. And it gets
around the "Britney Spears" problem completely.

Also, I have a second suggestion along the same lines. It seems to me that
an index/search network with an underlying DHT-type setup need not bother
with hashing, since hashing does nothing for the load once the load-balance
problem is fixed. So entries can be indexed directly. Hashing has value in
some other algorithms, but not for pure index/search functions.

Also, I believe that there are basically three "load imbalance" problems to
be solved:
1. Nonuniform key/hash space. Even with perfectly organized nodes and uniform
hashing, it is recognized that some node will see O(log N) times the average
load with high probability.

2. Nonuniform queries over the key/hash space. Most suggestions to solve this
involve passive caching. However, this doesn't really solve the problem. Since
keywords normally follow a Zipfian distribution, this reference shows how to
cache pre-emptively, to the point where query response times can actually beat
traditional internet queries (say, DNS or HTTP):
http://www.cs.cornell.edu/People/egs/beehive/
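The reason pre-emptive caching works so well under a Zipfian query distribution can be sketched in a few lines (illustrative parameters; the actual replication-level math in Beehive is more refined):

```python
def zipf_coverage(num_keys, alpha, replicated):
    """Fraction of queries absorbed by replicating the `replicated`
    most popular keys, when popularity follows Zipf(alpha)."""
    weights = [1 / (rank ** alpha) for rank in range(1, num_keys + 1)]
    return sum(weights[:replicated]) / sum(weights)

# With alpha around 0.9 (typical for keyword queries), replicating just
# 1% of 100,000 keys already absorbs roughly half the query load.
```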

3. The final problem is the "Britney Spears" effect mentioned previously, where
a single key is heavily overloaded compared to the others. I have one further
suggestion for this one. A DHT (or in this case, simply an inverted index)
stores entries of the form:
key, value1, value2, value3, value4...

Borrowing from:
http://dbpubs.stanford.edu:8090/pub/2004-18

One could instead index the entries as:
key+value1, value2, value3, value4...

The size of each key/value-list entry is limited to a small number of values.
This makes entries easy to pass around and easy to answer query and store
operations against. One can then search both by the original key and over a
range of hash values (which subdivides the nodes handling a common keyword,
again the "Britney Spears" problem), simply by changing the semantics of the
(key+value1, value list) / (key, value list) definitions.
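A minimal sketch of the composite-key idea (a sorted list stands in for the ring; the keyword and doc names are made up):

```python
from bisect import bisect_left, bisect_right

# Entries stored as (key, value1) composite keys, so a hot keyword's
# postings spread across the key space instead of piling onto one node.
index = sorted([
    ("britney", "doc17"), ("britney", "doc3"), ("britney", "doc42"),
    ("chord", "doc8"),
])

def lookup(key):
    """Search by the original keyword: a range scan over composite keys.
    Assumes value strings stay below "\uffff" (fine for ASCII doc ids)."""
    lo = bisect_left(index, (key, ""))
    hi = bisect_right(index, (key, "\uffff"))
    return [v for _, v in index[lo:hi]]

def lookup_subrange(key, v_lo, v_hi):
    """Search only part of a hot keyword's value range; different nodes
    can own different sub-ranges of the same keyword."""
    lo = bisect_left(index, (key, v_lo))
    hi = bisect_left(index, (key, v_hi))
    return [v for _, v in index[lo:hi]]
```

The point is that both query styles use the same ordered structure; only the interpretation of the composite key changes.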

Note also that recently a lot has been passed around in academia about the
idea of using LSI/VSM as a sort of document fingerprint rather than doing
outright keyword searches. Having read this paper, I don't believe that is
the right direction to go:
http://portal.acm.org/citation.cfm?id=277632&dl=ACM&coll=GUIDE



