Re: [Gluster-devel] Glusterd: A New Hope


From: Anand Avati
Subject: Re: [Gluster-devel] Glusterd: A New Hope
Date: Fri, 22 Mar 2013 10:51:09 -0700



On Fri, Mar 22, 2013 at 7:09 AM, Jeff Darcy <address@hidden> wrote:
The need for some change here is keenly felt
right now as we struggle to fix all of the race conditions that have
resulted from the hasty addition of synctasks to make up for poor
performance elsewhere in that 44K lines of C.

Synctasks were not added for performance at all. Being single threaded, glusterd could not serve a volfile for a GETSPEC command or assign a port for a PORTMAP query while the very process it had spawned (glusterfs/glusterfsd) was asking glusterd for exactly that, waiting for the result before finishing daemonizing (so that a proper exit status could be returned). Meanwhile glusterd would wait for glusterfsd to return before getting back to epoll() and picking up the portmap/getspec request -- a deadlock.
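
[To make the deadlock concrete, here is a toy model of the pattern --
not glusterd code -- where the parent blocks waiting for its child to
finish starting up while the child blocks waiting for an answer only
the parent's single thread could have provided. Running it hangs:]

    /* Toy model of the start-up deadlock: the parent plays "glusterd",
     * the child plays "glusterfsd".  Each blocks on the other. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    int main(void)
    {
        int sv[2];
        char buf[64];

        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        if (fork() == 0) {
            /* child: ask the parent for a port, and only report
             * readiness once the answer comes back */
            write(sv[1], "PORTMAP?", 8);
            read(sv[1], buf, sizeof(buf));   /* blocks forever */
            write(sv[1], "READY", 5);
            return 0;
        }

        /* parent (single threaded): refuse to return to the event
         * loop until the child says READY -- so the PORTMAP request
         * above is never answered and neither side makes progress */
        memset(buf, 0, sizeof(buf));
        while (!strstr(buf, "READY")) {
            ssize_t n = read(sv[0], buf, sizeof(buf) - 1);
            if (n <= 0)
                break;
            buf[n] = '\0';
        }
        return 0;
    }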

Making it multi-threaded was inevitable if we wanted to get even the "basic" behavior right -- i.e., have "gluster volume start" return success only if glusterfsd actually started, and fail if it could not (we would _always_ return success before).

But this is yet another example of how retrofitting threads onto a single-threaded program can cause problems. It's not unusual to see races. Most of them are fixable by applying a general scheme of locking in a few places.
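
[The shape of such a fix is usually no more exotic than the following
illustration -- again not actual glusterd code -- where one coarse lock
restores the serialization the single-threaded epoll loop used to give
for free:]

    #include <pthread.h>

    /* stand-in for a real op handler */
    static int process_op(void *req) { (void)req; return 0; }

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    static int handle_op(void *req)
    {
        int ret;

        /* every mutating handler takes the same lock, so handlers
         * are atomic with respect to each other, as they were when
         * a single thread ran them all */
        pthread_mutex_lock(&big_lock);
        ret = process_op(req);
        pthread_mutex_unlock(&big_lock);
        return ret;
    }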

That being said, I'm open to exploring other projects which are a "good fit" with the rest of glusterfs. It would certainly be nice to make this "someone else's problem".

Avati


 
Delegating as much of this functionality as possible to mature code that
is mostly maintained elsewhere would be very beneficial.  I've done some
research since those
meetings, and here are some results.

The most basic idea here is to use an existing coordination service to
store cluster configuration and state.  That service would then take
responsibility for maintaining availability and consistency of the data
under its care.  The best known example of such a coordination service
is Apache's ZooKeeper[1], but there are others that don't have the
noxious Java dependency - e.g. doozer[2] written in Go, Arakoon[3]
written in OCaml, ConCoord[4] written in Python.  These all provide a
tightly consistent generally-hierarchical namespace for relatively small
amounts of data.  In addition, there are two other features that might
be useful.

* Watches: register for notification of changes to an object (or
directory/container), without having to poll.

* Ephemerals: certain objects go away when the client that created them
drops its connection to the server(s).

Here's a rough sketch of how we'd use such a service.

* Membership: a certain small set of servers (three or more) would be
manually set up as coordination-service masters (e.g. via "peer probe
xxx as master").  Other servers would connect to these masters, which
would use ephemerals to update a "cluster map" object.  Both clients and
servers could set up watches on the cluster map object to be notified of
servers joining and leaving.
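
[For concreteness, membership might look roughly like this against
ZooKeeper's C client -- the real API depends on which service wins, and
all paths, addresses, and names below are invented for illustration:]

    #include <stdio.h>
    #include <unistd.h>
    #include <zookeeper/zookeeper.h>

    /* one-shot callback: fires when the server set changes; a real
     * program would re-register the watch from here to keep watching */
    static void map_watcher(zhandle_t *zh, int type, int state,
                            const char *path, void *ctx)
    {
        printf("cluster map changed: %s\n", path ? path : "(session)");
    }

    int main(void)
    {
        char created[256];
        struct String_vector servers;
        zhandle_t *zh = zookeeper_init("coord1:2181,coord2:2181,coord3:2181",
                                       map_watcher, 30000, NULL, NULL, 0);
        sleep(1);  /* crude wait for the session; real code would not */

        /* ephemeral: this entry disappears if our session dies, so the
         * cluster map self-corrects when a server goes away (assumes
         * the parent path was created at cluster setup) */
        zoo_create(zh, "/cluster/servers/server1", "up", 2,
                   &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL,
                   created, sizeof(created));

        /* watch: learn of joins and leaves without polling */
        zoo_wget_children(zh, "/cluster/servers", map_watcher, NULL,
                          &servers);
        deallocate_String_vector(&servers);

        pause();  /* in glusterd this would be the normal event loop */
        zookeeper_close(zh);
        return 0;
    }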

* Configuration: the information we currently store in each volume's
"info" file as the basis for generating volfiles (and perhaps the
volfiles themselves) would be stored in the configuration service.
Again, servers and clients could set watches on these objects to be
notified of changes and do the appropriate graph switches, reconfigures,
quorum actions, etc.
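
[As a strawman, the namespace could simply mirror what /var/lib/glusterd
holds today; every path name here is invented for illustration:]

    /cluster/servers/<uuid>                    (ephemeral; membership, above)
    /cluster/vols/<name>/info                  (what the "info" file holds now)
    /cluster/vols/<name>/bricks/<brick-id>
    /cluster/vols/<name>/volfiles/<graph>      (optionally, generated volfiles)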

* Maintenance operations: these would still run in glusterd (which isn't
going away).  They would use the coordination service for leader election to
make sure the same activity isn't started twice, and to keep status
updated in a way that allows other nodes to watch for changes.
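
[Leader election comes almost for free once you have ephemerals plus
sequence numbers.  A simplified version of the standard ZooKeeper
recipe is sketched below -- the full recipe also watches the next-lowest
node rather than re-polling, and the path is again made up:]

    #include <string.h>
    #include <zookeeper/zookeeper.h>

    /* everyone creates an ephemeral+sequential node under the op's
     * directory; whoever holds the lowest sequence number leads, and
     * a crashed leader's node vanishes with its session (assumes
     * /ops/rebalance already exists) */
    static int am_i_leader(zhandle_t *zh)
    {
        char me[256];
        struct String_vector peers;
        int i, leader = 1;

        zoo_create(zh, "/ops/rebalance/member-", "", 0,
                   &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL | ZOO_SEQUENCE,
                   me, sizeof(me));

        zoo_get_children(zh, "/ops/rebalance", 0, &peers);
        for (i = 0; i < peers.count; i++)
            if (strcmp(peers.data[i], strrchr(me, '/') + 1) < 0)
                leader = 0;  /* someone registered before us */

        deallocate_String_vector(&peers);
        return leader;
    }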

* Status queries: these would be handled entirely by querying objects
within the coordination service.

Of the alternatives available to us, only ZooKeeper directly supports
all of the functionality we'd want.  However, the Java dependency is
decidedly unpleasant for us and would be totally unacceptable to some of
our users.  Doozer seems the closest of the remainder; it supports
watches but not ephemerals, so we'd either have to synthesize those on
top of doozer itself or find another way to handle membership (the only
place where we use that functionality) based on the features it does
have.  The project also seems reasonably mature and active, though we'd
probably still have to devote some time to developing our own local
doozer expertise.
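
[Synthesizing ephemerals on a store that only has plain writes and
watches would presumably mean heartbeating: each server keeps
refreshing a timestamp under its key, and everyone else treats a stale
timestamp as a departure.  A sketch follows, with dz_set() standing in
for whatever C shim to doozer we'd end up writing -- doozer's own
client is Go, so the names below are entirely hypothetical:]

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* hypothetical shim over doozer -- NOT a real API */
    static int dz_set(const char *path, const char *value)
    {
        (void)path; (void)value;  /* talk to doozerd here */
        return 0;
    }

    /* refresh our liveness stamp well inside the agreed expiry window;
     * watchers seeing a stamp older than the window declare us gone */
    static void heartbeat(const char *self)
    {
        char key[128], val[32];

        snprintf(key, sizeof(key), "/cluster/servers/%s", self);
        for (;;) {
            snprintf(val, sizeof(val), "%ld", (long) time(NULL));
            dz_set(key, val);
            sleep(5);   /* e.g. with a 30-second expiry window */
        }
    }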

In a similar vein, another possibility would be to use *ourselves* as
the coordination service, via a hand-configured AFR volume.  This is
actually an approach Kaleb and I were seriously considering for HekaFS
at the time of the acquisition, and it's not without its benefits.
Using libgfapi we can prevent this special volume from having to be
mounted, and we already know how to secure the communications paths for
it (something that would require additional work with the other
solutions).  On the other hand, it would probably require additional
translators to provide both ephemerals and watches, and might require
its own non-glusterd solution to issues like failure detection and
self-heal, so it doesn't exactly meet the "make it somebody else's
problem" criterion.

In conclusion, I think our best (long term) way forward would be to
prototype a doozer-based version of glusterd.  I could possibly be
persuaded to try a "gluster on gluster" approach instead, but at this
moment it wouldn't be my first choice.  Are there any other suggestions
or objections before I forge ahead?

[1] http://zookeeper.apache.org/
[2] https://github.com/ha/doozerd
[3] http://arakoon.org/
[4] http://openreplica.org/doc/

_______________________________________________
Gluster-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/gluster-devel

