Re: [Gluster-devel] Architecture advice


From: Martin Fick
Subject: Re: [Gluster-devel] Architecture advice
Date: Mon, 12 Jan 2009 14:34:25 -0800 (PST)

> > Why is that the correct way?  There's nothing
> > wrong with having "bonding" at the glusterfs
> > protocol level, is there?
> 
> The problem is that it only covers a very narrow edge case
> that isn't all that likely. A bonded NIC over separate
> switches all the way to both servers is a much more sensible
> option. Or else what failure are you trying to protect
> yourself against? It's a bit like fitting a big padlock
> on the door when there's a wall missing.

I think you need to be more specific than using 
analogies.  My best guess from your assertions is 
that you have a very narrow, specific use case /
setup / terminology in mind that does not 
necessarily mesh with my narrow use case ... :)

So, the HA translator supports talking to two
different servers with two different transport
mechanisms and two different IPs.  Bonding does 
not support anything like this as far as I can 
tell.  So, it seems like you are assuming a
different back end use case, one where the 
servers share the same IP, perhaps using round 
robin or perhaps in an active/passive way.  Both
of these are very different beasts and I would
need to know which you are talking about to
understand what you are getting at.  But the HA
translator setup is closer to the round robin
(active/active) setup, and I am guessing you 
are talking about an active/passive setup.
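
To make that concrete, here is a minimal volfile
sketch of the kind of setup I mean.  The volume
names, the addresses, and the cluster/ha type
name are my own illustrative assumptions, not
confirmed syntax (option spellings vary between
glusterfs versions):

  # Client-side volfile sketch (hypothetical
  # names and addresses).  Two protocol/client
  # volumes, each with its own IP and its own
  # transport -- something NIC bonding cannot
  # express.

  volume server-a
    type protocol/client
    option transport-type tcp       # plain TCP
    option remote-host 192.168.1.10
    option remote-subvolume brick
  end-volume

  volume server-b
    type protocol/client
    option transport-type ib-verbs  # InfiniBand
    option remote-host 10.0.0.20
    option remote-subvolume brick
  end-volume

  volume ha
    type cluster/ha                 # assumed name
    subvolumes server-a server-b
  end-volume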



> > That is somewhat what the HA translator is, except
> > that it is supposed to take care of some additional
> > failures.  It is supposed to retransmit "in
> > progress" operations that have not succeeded because of
> > comm failures (I have yet to figure out where in the code
> > this happens though).
> 
> This is a reinvention of a wheel. NFS already handles this
> gracefully for the use-case you are describing.

I am lost, what does NFS have to do with it?


> >> Why re-invent the wheel when the tools to deal
> >> with these failure modes already exist?
> > 
> > Are you referring to bonding here? If so, see above
> > why HA may be better (or additional benefit).
> 
> My original point is that it doesn't add anything new
> that you couldn't achieve with tools that are already
> available.


Well, I was trying to explain to you that it
does, but then you brought up NFS, and now I am
confused.

How do current tools achieve the following
setup?  Client A talks to Server A and 
submits a read request.  The read request 
is received on Server A (TCP ACKed to the 
client), and then Server A dies.  How will
that request be completed without
glusterfs returning an "endpoint not 
connected" error?

No, I have not confirmed that this actually
works with the HA translator, but I was told
that the following would happen if it were 
used.  In the same scenario as above, Client A
talks to Server A and submits a read request,
the request is received on Server A (TCP ACKed
to the client), and then Server A dies.  Client
A will then, in theory, retry the read request
on Server B.  Bonding cannot do anything
like this (since the read was TCP ACKed).
Neither can heartbeat/failover
of an active/passive backend, since on the
first failure the client will get a 
connection error (the glusterfs client
protocol does not retransmit).

I think that this is quite different from
any bonding solution.  Not better, different.
If I were to use this, it would not preclude 
me from also using bonding; it solves a 
somewhat different problem.  It is not a 
complete solution, it is a piece, but not
a duplicated piece.  If you don't like it,
or it doesn't fit your backend use case, 
don't use it! :)
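
In plain terms, what I am describing is
application-level retransmission.  A toy sketch
(my own illustration with made-up helper names,
ordinary C, not actual glusterfs internals):

  #include <stdio.h>

  /* Sketch only -- not glusterfs code.  The
   * point: a TCP ACK only says the request
   * bytes arrived; only an application reply
   * says the operation completed.  So when the
   * connection dies, the client itself must
   * resend the whole request to a backup. */

  struct server { const char *name; int alive; };

  /* Hypothetical helper: send the request and
   * wait for a reply.  Returns 0 on a completed
   * reply, -1 if the connection died first. */
  static int send_and_await_reply(struct server *s,
                                  const char *req)
  {
      if (!s->alive)
          return -1;  /* died mid-operation */
      printf("reply from %s for '%s'\n",
             s->name, req);
      return 0;
  }

  static int ha_read(struct server *srv, int n,
                     const char *req)
  {
      for (int i = 0; i < n; i++) {
          /* Retransmit the request verbatim to
           * the next server -- the step that
           * layer-2 bonding can never do. */
          if (send_and_await_reply(&srv[i], req) == 0)
              return 0;
      }
      return -1;  /* all servers failed */
  }

  int main(void)
  {
      struct server srv[] = {
          { "server-a", 0 },  /* died after ACK */
          { "server-b", 1 },
      };
      return ha_read(srv, 2, "read block 42");
  }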


> > Yes, if a server goes down you are fine (aside from the
> > scenario where the other server then goes down followed
> > by the first one coming back up).  But, if you are using
> > the HA translator above and the communication goes down
> > between the two servers you may still get split brain
> > (thus the need for heartbeat/fencing).
> 
> And therein lies the problem - unless you are proposing
> adding a complete fencing infrastructure into glusterfs,
> too.

No. I am proposing adding a complete transactional 
model to AFR so that if a write fails on one node, 
some policy can decide whether the same write 
should be committed or rolled back on the other 
nodes.  Today, the policy is to simply apply it to 
the other nodes regardless.  This is a recipe for 
split brain.  

In the case of a network partition, some policy 
should decide to allow writes to be applied
on one side of the partition and denied on the 
other.  This does not require fencing (though it
would be better with it); it could be a simple 
policy like "apply writes only if a majority of 
nodes can be reached", and otherwise fail (or, 
even better, block).
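
A minimal sketch of that majority policy (my own
illustration in C; the function names are made
up, this is not AFR code):

  #include <stdbool.h>
  #include <stdio.h>

  /* Commit a write only when a strict majority
   * of the configured nodes is reachable, so
   * neither side of a partition can diverge on
   * its own. */
  static bool have_quorum(int reachable, int total)
  {
      return reachable * 2 > total;
  }

  static int policy_write(int reachable, int total)
  {
      if (!have_quorum(reachable, total)) {
          fprintf(stderr,
                  "no quorum (%d/%d): refusing "
                  "write\n", reachable, total);
          return -1;  /* fail (or block/retry) */
      }
      printf("quorum ok (%d/%d): committing on "
             "all reachable nodes\n",
             reachable, total);
      return 0;
  }

  int main(void)
  {
      policy_write(2, 3);  /* majority: commits */
      policy_write(1, 3);  /* minority: refused */
      return 0;
  }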


> > AFR needs to be able write all or nothing to all
> > servers until some external policy machine (such as
> > heartbeat) decides that it is safe (because of fencing or
> > other mechanism) to proceed writing to only a portion of the
> > subvolumes (servers).  Without this I don't see how you
> > can prevent split brain?
> 
> With server-side AFR, splitbrain cannot really occur (OK,
> there's a tiny window of opportunity for it if the
> server isn't really totally dead since there's no
> total FS lock-out until fencing is completed like on GFS,
> but it's probably close enough). If the server's
> can't heartbeat to each other, they can't AFR to
> each other, either. So either the write gets propagated, or
> it doesn't. The machine that remained operational will
> have more up to date files and as necessary those will get
> synced back. It's not quite as tight as GFS in terms of
> ensuring data consistency like a DRBD+GFS solution would be,
> but it is probably close enough for most use-cases.


I guess what you call tiny, I call huge.  Even if 
you have your heartbeat fencing occur in under a
tenth of a second, that is time enough to split 
brain a major portion of a filesystem.  I would 
never trust it.

To borrow your analogy, adding heartbeat to the 
current AFR:  "It's a bit like fitting a big 
padlock on the door when there's a wall missing."
:)  

Every single write needs to ensure that it will 
not cause split brain for me to trust it.  
If not, why would I bother with glusterfs over
AFR instead of glusterfs over DRBD?  Oh right, 
because I cannot get glusterfs to fail over
without incurring connection errors on the
client! ;) (not your beef, I know, from another
thread)

This is one reason I was hoping that the HA
translator would address this, but the HA
translator is useless in an active/passive
backend setup; it only works in active/active.
If you try using it in an active/passive setup,
during failover it will retry too quickly on
the second server, causing connection errors
on the client!!!  This is the primary reason
that I am suggesting that the HA translator
block until the connection is restored; that
would allow failovers to occur.
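
What I mean by "block" is roughly this (again a
sketch of my own with made-up helpers, not
translator code):

  #include <stdio.h>
  #include <unistd.h>

  /* Instead of failing the first attempt while
   * a passive backup is still taking over, keep
   * retrying until a connection is
   * re-established, so the failover completes
   * invisibly to the application. */

  /* Hypothetical helper: pretend the takeover
   * finishes by the third attempt. */
  static int try_connect(int attempt)
  {
      return attempt >= 3 ? 0 : -1;
  }

  static int block_until_connected(void)
  {
      for (int attempt = 1; ; attempt++) {
          if (try_connect(attempt) == 0) {
              printf("reconnected on attempt %d\n",
                     attempt);
              return 0;  /* now resend request */
          }
          fprintf(stderr,
                  "attempt %d failed, waiting\n",
                  attempt);
          sleep(1);  /* block, do not error out */
      }
  }

  int main(void)
  {
      return block_until_connected();
  }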


But, to be clear, I am not disagreeing with you
that the HA translator does not solve the split
brain problem at all.  Perhaps this is what is 
really "upsetting" you: not that it is
"duplicated" functionality, but rather that it 
does not help AFR solve its split brain 
personality disorders; it only helps make them 
more available, thus making split brain even 
more likely!! ;(

Excited/disgruntled about the new HA 
translator, ;)

-Martin

