gluster-devel

Re: [Gluster-devel] Architecture advice


From: Gordan Bobic
Subject: Re: [Gluster-devel] Architecture advice
Date: Mon, 12 Jan 2009 18:30:49 +0000
User-agent: Thunderbird 2.0.0.19 (X11/20090107)

Martin Fick wrote:

Not on the client, anyway. But if you're AFR-ing on
server side, then your client always talks to one server
anyway. The traditional way to handle server failure in that
case is to set up Heartbeat or RHCS to fail over the IP
address resource to the surviving server.

The TCP connection will reset when the fail-over occurs -
I'm not sure how gracefully/transparently GlusterFS
reconnects.
...

1.4 supports a new HA translator that is meant for clients to contact servers
that AFR each other. Like this:


       Client
         |
        HA
       /   \
      /     \
     /       \
Server A   Server B
    |         |
   AFR       AFR
    | \     / |
    |  \   /  |
    |   \ /   |
    |    X    |
    |   / \   |
    |  /   \  |
   Vol A   Vol B
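
As a rough illustration of the idea in the diagram, here is a minimal Python sketch of a client-side layer that talks to one server and moves to the other when a call fails. It is not GlusterFS code and does not reflect how the HA translator is actually implemented; the names Backend, HAClient and ServerDown are hypothetical and exist only for this example.

```python
class ServerDown(Exception):
    """Raised by a backend that cannot be reached."""


class Backend:
    """Hypothetical stand-in for one server exporting an AFR volume."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable

    def read(self, path):
        if not self.reachable:
            raise ServerDown(self.name)
        return f"{path} served by {self.name}"


class HAClient:
    """Sends each call to the active backend; on failure, tries the next one."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.active = 0  # index of the backend currently in use

    def read(self, path):
        # Try every backend at most once, starting with the active one.
        for _ in range(len(self.backends)):
            backend = self.backends[self.active]
            try:
                return backend.read(path)
            except ServerDown:
                # Remember the switch so later calls skip the dead server.
                self.active = (self.active + 1) % len(self.backends)
        raise ServerDown("no backend reachable")
```

Note that this only covers the case where the client loses one server; it says nothing about what the two servers do if they lose sight of each other.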


I wasn't aware of there being a HA translator built
into GlusterFS, but unless you have proper fencing in place,
failing over IP addresses won't work. Without proper
cluster fencing in place you can easily find yourself in a
split-brain situation where both servers think they have the
same IP address and neither can talk to any of the clients.
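
To make the split-brain argument concrete, here is a deliberately simplified Python model of two nodes sharing one floating IP. The function claim_ip and its parameters are made up for this illustration; it is not how Heartbeat or RHCS are implemented, and it assumes the secondary node always wins the fencing race.

```python
def claim_ip(nodes, link_up, fencing):
    """Return the set of node names that end up holding the shared IP.

    nodes:   pair of node names; the first one holds the IP initially.
    link_up: whether the heartbeat link between the two nodes works.
    fencing: whether a node must power off its peer before taking over.
    """
    primary, secondary = nodes
    holders = {primary}
    if link_up:
        # Heartbeats arrive, so the secondary leaves the IP alone.
        return holders
    # Heartbeats stopped. The secondary cannot tell a dead peer
    # from a dead link, so it starts a takeover either way.
    if fencing:
        # Assumption: the secondary powers off the primary first,
        # which forces the primary to drop the IP.
        holders.discard(primary)
    holders.add(secondary)
    return holders
```

With the link down and no fencing, claim_ip(("A", "B"), False, False) returns both nodes, which is the split-brain case. With fencing enabled, the same call returns only "B".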

...
No need for fencing simply because you now use the HA translator.
The assumption in this case is that the servers can still talk
to each other but that one server's connection to the clients
may have died.

That means that 50% of the scope for failure will still wipe you out, because you'll start split-braining. Not the way forward at all. A fencing setup will at least preserve the data integrity. The correct way to handle comms channel failure between client and server is to have bonded interfaces going via different physical paths. Dealing _ONLY_ with the situation where both servers are alive and connected to each other, but the client can only reach one of them due to an obscure failure somewhere in the network (e.g. a failed switch port or a failed NIC in the server), covers a pretty half-arsed edge case.

Why re-invent the wheel when the tools to deal with these failure modes already exist?

Any failures on the server side may still warrant a fencing setup,
but AFR is not yet set up to work cooperatively with a fencing setup.

It doesn't have to be. If one server in AFR dies, nothing spectacular happens. Things time out and carry on. I don't see what cooperation there would need to be. RHCS does its own heart-beating and fencing. Mix and match as required.

Gordan



