
Re: [Gluster-devel] ZkFarmer


From: Anand Avati
Subject: Re: [Gluster-devel] ZkFarmer
Date: Tue, 8 May 2012 18:33:50 -0700



On Mon, May 7, 2012 at 9:33 PM, Anand Babu Periasamy <address@hidden> wrote:
On Mon, May 7, 2012 at 7:43 AM, Jeff Darcy <address@hidden> wrote:
> I've long felt that our ways of dealing with cluster membership and staging of
> config changes are not quite as robust and scalable as we might want.
> Accordingly, I spent a bit of time a couple of weeks ago looking into the
> possibility of using ZooKeeper to do some of this stuff.  Yeah, it brings in a
> heavy Java dependency, but when I looked at some lighter-weight alternatives
> they all seemed to be lacking in more important ways.  Basically the idea was
> to do this:
>
> * Set up the first N (e.g. N=3) nodes in our cluster as ZooKeeper servers, or
> point everyone at an existing ZooKeeper cluster.
>
> * Use ZK ephemeral nodes as a way to track cluster membership ("peer probe"
> merely updates ZK, and "peer status" merely reads from it).
>
> * Store config information in ZK *once* instead of regenerating volfiles etc.
> on every node (and dealing with the ugly cases where a node was down when the
> config change happened).
>
> * Set watches on ZK nodes to be notified when config changes happen, and
> respond appropriately.
>
> I eventually ran out of time and moved on to other things, but this or
> something like it (e.g. using Riak Core) still seems like a better approach
> than what we have.  In that context, it looks like ZkFarmer[1] might be a big
> help.  AFAICT someone else was trying to solve almost exactly the same kind of
> server/config problem that we have, and wrapped their solution into a library.
>  Is this a direction other devs might be interested in pursuing some day,
> if/when time allows?
>
>
> [1] https://github.com/rs/zkfarmer
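
For illustration, here is a minimal sketch of the scheme Jeff describes, using the kazoo ZooKeeper client for Python. The znode paths, server names, and the handle_config callback are hypothetical placeholders, not actual GlusterFS structures:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# "peer probe" becomes: register this server as an ephemeral znode.
# Ephemeral nodes vanish when the session dies, so membership tracking
# (and failure detection) comes for free.
zk.ensure_path("/gluster/peers")
zk.create("/gluster/peers/server-1", b"10.0.0.1", ephemeral=True)

# "peer status" becomes: read the children of the membership znode.
print(zk.get_children("/gluster/peers"))

# Store the volume config *once* instead of regenerating volfiles
# on every node.
zk.ensure_path("/gluster/volumes/testvol")
zk.set("/gluster/volumes/testvol", b"<volfile contents>")

# Set a watch to be notified when the config changes, and respond.
@zk.DataWatch("/gluster/volumes/testvol")
def handle_config(data, stat):
    if stat is not None:
        print("config now at version %d; regenerate graph" % stat.version)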

The real issue here is: GlusterFS is a fully distributed system. It is
OK for config files to be in one place (centralized). That is easier to
manage and back up. Avati still claims that making distributed copies
is not a problem (volume operations are fast, versioned, and
checksummed). Also, the code base for replicating 3-way or to all nodes
is the same. We all need to come to an agreement on the demerits of
replicating the volume spec on every node.

My claim is somewhat similar to what you said literally, but slightly different in meaning. What I mean is: while it is true that keeping multiple copies of the volfile is more expensive/resource-consuming in theory, what is the breaking point in terms of number of servers where it begins to matter? There are trivial (low-hanging) enhancements possible (e.g., store the volfiles of a volume only on participating servers instead of all servers) which could address a class of concerns. There are clear advantages in having volfiles on at least all the participating nodes - it removes any dependency on the order in which the servers in your data centre boot. If volfiles are available locally, you don't have to wait/retry for the "central servers" to come up first. Whether it is volfiles managed by glusterd or the "storage servers" of ZK, it is a big advantage to have the startup of a given server decoupled from the others (of course, the coupling comes in at an operational level at the time of volume modifications, but that is much more acceptable).
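A rough sketch of that startup decoupling: if the volfile is on local disk, the server can start immediately; only a missing local copy forces a (retried) fetch from another server. The directory layout and the fetch_remote helper here are illustrative assumptions, not glusterd's actual code path:

import os, time

def load_volfile(volname, fetch_remote, base="/var/lib/glusterd/vols"):
    local = os.path.join(base, volname, volname + ".vol")
    if os.path.exists(local):
        # Local copy available: no dependency on boot order at all.
        with open(local, "rb") as f:
            return f.read()
    # Otherwise this server is coupled to whichever node holds the config.
    while True:
        try:
            return fetch_remote(volname)
        except ConnectionError:
            time.sleep(1)  # central server not up yet; keep retrying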

If storing volfiles on all servers really seems unnecessary, we should first come up with real hard numbers - number of servers vs. latency of volume operations - and then figure out at what point it starts becoming unacceptably slow. Maybe a good solution is to just propagate the volfiles in the background, while still retaining version info, rather than introducing a more intrusive change? But we really need the numbers first.
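A rough sketch of that background-propagation idea: push a volfile to a peer only when the peer's stored version is older, mirroring the version info glusterd already keeps per volume. The in-memory peers dict below stands in for real servers; all names here are hypothetical, not glusterd APIs:

peers = {
    "server-2": {"testvol": (3, b"old volfile")},
    "server-3": {"testvol": (5, b"current volfile")},
}

def propagate_volfile(volname, local_version, volfile_bytes):
    """Ship the volfile only to peers whose copy is stale."""
    for name, volumes in peers.items():
        version, _ = volumes.get(volname, (0, b""))
        if version < local_version:
            volumes[volname] = (local_version, volfile_bytes)
            print("pushed %s v%d to %s" % (volname, local_version, name))

propagate_volfile("testvol", 5, b"current volfile")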
 

If we are convinced to keep the config info in one place, ZK is
certainly a good idea. I personally hate the Java dependency. I still
struggle with Java dependencies for the browser and Clojure. I could
digest it if we are going to adopt Java over Python for future external
modules. Alternatively, we can also look at creating a replicated meta
system volume. Whatever we adopt, we should keep dependencies and
installation steps to a bare minimum and keep them simple.


It is true that other projects have figured out the problem of membership and configuration management and specialize at doing that. That is very good for the entire computing community as a whole. If there are components we can incorporate to build upon their work, that is very desirable. At the same time, we also need to check what other baggage we inherit along with the specialized expertise we take on. One of the biggest strengths of Gluster has been its "lightweight"-ness and lack of dependencies, which in turn has driven our adoption significantly, which in turn results in more feedback, bug reports, etc. (i.e., it is not an isolated strength in itself). Forcing a Java dependency down the throats of users who want a simple distributed filesystem is a slippery slope towards Gluster becoming "yet another" distributed filesystem (yes, the moment we stop thinking of gluster as a "simple" distributed filesystem - even though that may technically be an oxymoron, I guess you know what I mean :). The simplicity is what "makes" gluster, to a large extent, what it is. This makes the developer's life miserable to a fair degree, but it always is anyway, one way or another ;)

I am not against adopting external projects. There are often good reasons to do so. If there are external projects which are "compatible in personality" with gluster and help us avoid reinventing the wheel, we should definitely adopt them. If they are not compatible, I'm sure there are lessons and ideas we can adopt, if not code.

Avati

