[Vrs-development] Initial simplified cluster manager -- proposal (for di

From:	Eric Altendorf
Subject:	[Vrs-development] Initial simplified cluster manager -- proposal (for discussion Sunday)
Date:	Sun, 14 Jul 2002 04:13:53 -0700
User-agent:	KMail/1.4.1

It's pretty simple, other than two places that require a
two-phase-commit (TPC) transaction protocol.  I threw that in
because A) it's important, and B) we need to figure out how we're
going to do transactions.  I'm going to look through the JBoss code
to see how they handle transactions; see if there's anything we can
use there.  Also, let me know if anyone else knows of any free
transaction processing systems / transaction monitors / etc that we
could leverage.  I don't think we want to write our own custom TP
system...(a) it would take a long time, and (b) it would probably
not be correct. :-)

I'm actually not 100% sure that the following plan will yield
correct behavior as nodes go on and offline, as I am not an expert
in this stuff.  It's also a bit sloppy, as I wrote it up at 4AM. 
However, I think it is approximately the idea we want.  It's at 
least in the right direction. :-)

I'd like to talk about this at the meeting tomorrow.  If possible I'd 
like to try to meet a bit before the auth meeting -- e.g. 16:00 UTC.  
I have to do stuff tomorrow afternoon (my time) so the sooner I get 
the meeting over with the better.

---

Plans for an initial, preliminary, simplified cluster manager
(proposal for a task to be completed in the next month or so)

The cluster manager (CM) is a service that runs on each node of a
VRS cluster.  The CM is primarily responsible for maintaining
information about nodes in the system.  All nodes are equal;
specifically, the CM process on each node is (for now) essentially
identical.  No one node or CM has special privileges or
responsibilities above the others.

Each CM should maintain a (static during runtime) list of
authentication information (e.g. public keys) for nodes that may
join the cluster.  These lists can be hand-edited when a cluster
administrator wants to allow a new node or remove (disallow) an
existing node.  The administrator should ensure that all nodes have
the same authentication list, so that when a machine wants to sign
on to the cluster, it can connect to and authenticate with any of
the nodes currently in the cluster.  Adding a new node to the
authentication information list requires generating the node's
authentication keys, then adding the private key to the node and the
public key to the authentication information list.
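Since the lists are hand-edited, it is easy for them to drift apart.
Here is a rough sketch of one way an administrator could check that
two nodes hold the same list.  The file format (one "node-name
public-key" pair per line, with # comments) and the function names
parse_auth_list and fingerprint are assumptions made up for
illustration; nothing like this exists in VRS yet.

```python
import hashlib


def parse_auth_list(text):
    """Parse 'node-name public-key' lines (hypothetical format).

    Blank lines and lines starting with '#' are skipped.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, key = line.split(None, 1)
        entries[name] = key
    return entries


def fingerprint(entries):
    """Return a digest of the list that does not depend on line order."""
    canonical = "\n".join(f"{n} {k}" for n, k in sorted(entries.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    on_node_1 = "# cluster auth list\nalpha KEY-ALPHA\nbeta KEY-BETA\n"
    on_node_2 = "beta KEY-BETA\nalpha KEY-ALPHA\n"
    same = fingerprint(parse_auth_list(on_node_1)) == fingerprint(
        parse_auth_list(on_node_2)
    )
    print("lists match" if same else "lists differ")
```

Comparing one short digest per node is easier than diffing every list
by hand, and it ignores harmless differences such as line order.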

The CM should also maintain a (dynamic) list of nodes which are
currently active in the cluster.  This list must always be
consistent among the various nodes currently in the cluster, as
various services (in particular, the resource manager) will use
transactions that depend on *all* the nodes in the cluster agreeing
on an update.  If there is any ambiguity in terms of which nodes are
currently connected to the cluster and which are not, the
transaction manager won't know which nodes to contact and from which
to require a 'commit' signal.

When a machine (call it A) wants to sign in to the cluster, it must
contact one of the nodes currently in the cluster (call it B), and
send its authentication information.  Assuming that machine A is on the
cluster authentication information list and its authentication info
is valid, node B will transactionally update the active node list on
ALL machines currently on the active node list.  (Cue TPC protocol
code here. :-)
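
To make the sign-in step concrete, here is a minimal in-memory sketch
of a two-phase update of the active node list.  It is not a real TP
system: there is no network, no logging, and no crash recovery.  The
names Node, two_phase_update and sign_in are hypothetical, the
credential check is a plain string comparison standing in for real
authentication, and including the newcomer among the participants (so
that it also receives the list) is my assumption.

```python
class Node:
    """One cluster member holding its own copy of the active node list."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.active = set()   # committed active node list
        self.pending = None   # proposed list awaiting commit

    def prepare(self, proposed):
        # Phase 1: vote.  A dead node never answers; model that as "no".
        if not self.alive:
            return False
        self.pending = set(proposed)
        return True

    def commit(self):
        # Phase 2: the proposed list becomes the real one.
        self.active, self.pending = self.pending, None

    def abort(self):
        # Phase 2 (failure case): discard the proposal.
        self.pending = None


def two_phase_update(nodes, participants, proposed):
    """Install `proposed` on every participant, or on none of them."""
    members = [nodes[name] for name in sorted(participants)]
    if all(m.prepare(proposed) for m in members):
        for m in members:
            m.commit()
        return True
    for m in members:
        m.abort()
    return False


def sign_in(nodes, auth_list, coordinator, newcomer, credential):
    """Coordinator (node B) admits newcomer (machine A) if it checks out."""
    if auth_list.get(newcomer) != credential:
        return False
    proposed = nodes[coordinator].active | {newcomer}
    return two_phase_update(nodes, proposed, proposed)


if __name__ == "__main__":
    nodes = {name: Node(name) for name in "ABC"}
    nodes["B"].active = {"B", "C"}
    nodes["C"].active = {"B", "C"}
    ok = sign_in(nodes, {"A": "key-A"}, "B", "A", "key-A")
    print(ok, sorted(nodes["C"].active))
```

The point of the two phases is that no node changes its committed
list until every participant has agreed to the change.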

It may happen during a sign-in (for example) that one or more
of the nodes does not respond.  In this case, the transaction
obviously fails, and cannot be successfully retried until the active
node list has been modified to remove the dead node.

Thus, when a node (node A) notices that another node (node B) has
disappeared from the cluster (by whatever means seem appropriate;
e.g.  no ping response), node A must then coordinate a transaction
among all nodes except node B to update the active node list to
remove node B (re-cue TPC protocol code :-).  If it had been
attempting a transaction (e.g. sign-in) before, it may retry that
transaction at that point.
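
The fail, remove, retry sequence can be sketched in a few lines.  This
is a toy model under stated assumptions: each node's active list is an
entry in a dict, a set called alive stands in for "answers pings", and
the function names transact, remove_dead and sign_in_with_retry are
hypothetical.  It removes all unresponsive nodes in one transaction,
which goes slightly beyond the single-node case described above, and
it ignores nodes that die in the middle of a transaction.

```python
def transact(lists, alive, participants, proposed):
    """All-or-nothing update of the active node list on `participants`."""
    # Phase 1: every participant must answer; a dead node never does.
    if not all(p in alive for p in participants):
        return False
    # Phase 2: everyone voted yes, so install the new list everywhere.
    for p in participants:
        lists[p] = set(proposed)
    return True


def remove_dead(lists, alive, detector, dead):
    """Detector (node A) drops the `dead` nodes from the survivors' lists."""
    survivors = lists[detector] - set(dead)
    return transact(lists, alive, survivors, survivors)


def sign_in_with_retry(lists, alive, coordinator, newcomer):
    """Try a sign-in; if it fails, evict unresponsive nodes and retry once."""

    def attempt():
        proposed = lists[coordinator] | {newcomer}
        return transact(lists, alive, proposed, proposed)

    if attempt():
        return True
    dead = lists[coordinator] - alive
    if not dead or not remove_dead(lists, alive, coordinator, dead):
        return False
    return attempt()


if __name__ == "__main__":
    members = {"B", "C", "D"}
    lists = {name: set(members) for name in members}
    alive = {"A", "B", "C"}          # D has silently disappeared
    print(sign_in_with_retry(lists, alive, "B", "A"))
    print(sorted(lists["C"]))
```

Note that the removal transaction only involves the surviving nodes,
which is what lets it succeed where the original sign-in could not.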


-- 
"First they ignore you.  Then they laugh at you.
 Then they fight you.  And then you win."             -Gandhi
