Re: [Gluster-devel] ping timeout

From:	Gordan Bobic
Subject:	Re: [Gluster-devel] ping timeout
Date:	Thu, 25 Mar 2010 12:13:17 +0000
User-agent:	Thunderbird 2.0.0.22 (X11/20090625)

Stephan von Krawczynski wrote:

On Thu, 25 Mar 2010 10:43:10 +0000
Gordan Bobic <address@hidden> wrote:
Stephan von Krawczynski wrote:
On Thu, 25 Mar 2010 09:56:24 +0000
Gordan Bobic <address@hidden> wrote:
If I have your mentioned scenario right, including what you believe should happen:
    * First node goes down. Simple enough.
    * Second node has new file operations performed on it that the first
      node does not get.
    * First node comes up. It is completely fenced from all other
      machines to get itself in sync with the second node.
    * Second node goes down. Is it before/after first node is synced?
          o If it is before then you have a fully isolated FS that is
            not accessible.
          o If it is after then you don't have a problem.
I would suggest writing a script and performing some firewalling to perform the fencing.
This is not really good enough - you need an out-of-band fencing device that you can use to forcibly down the node that disconnected, e.g. remote power-off by power management (e.g. UPS or a network controllable power bar) or remote server management (Dell DRAC, Raritan eRIC G4, HP iLO, Sun LOM, etc.). When the node gets rebooted, it has to notice there are other nodes already up and specifically set itself into such a mode that it will lose any contest on being the source node for resync until it has fully checked all the files' metadata against its peers.
I believe you can run ls -R on the file-system to get it in sync. You would need to mount glfs locally on the first node, get it in sync, then open the firewall ports afterward. Is that an appropriate solution?
The problem is that firewalling would have to be applied by every node other than the node that dropped off, and this would need to be communicated to all the other nodes, and they would have to confirm before the fencing action is deemed to have succeeded. This is a lot more complex and error prone compared to just using a single point of fencing for each node such as a network controlled power bar (e.g. http://www.linuxfordevices.com/c/a/News/Entrylevel-4port-IP-power-switch-runs-Linux/).
Let me add some thoughts here:
First it looks obvious to me that fencing is not needed for glusterfs in the
described cases. If your first node comes up again it will not deliver data
that is not in-sync with the second node, that is what glusterfs is all about.
Not quite - there are a lot of failure modes that involve network partitioning that WILL cause split-brain and unhealable files.
I was talking about the given example. Of course you may create any number of
setups that have a potential to explode without chance for restoration.
My general advice would be to try to keep the network setup as simple as
possible, because this is obviously one major source of destruction.
Creating a really fault-tolerant setup cannot only depend on the cluster fs
used, because whatever you use, none will save you in every case.
But if you design the setup carefully around glusterfs chances are you get
away with fault scenarios where others are just plain dead.

Or at least partially alive / partially potentially corrupted vs. plain dead. For some use cases that is an advantage. For others no service is preferable to potentially corrupted files.

And there is a
good bunch of troubles you cannot run into per design, namely storage issues.

There's an extra layer of recoverability, but most of the storage failure modes still exist, albeit one step further down the stack.

Now, when your second node goes down while the first is not completely synced
you only have these choices:
1. Blow up the setup and deliver nothing
2. Deliver what the first node actually has.
It looks obvious that the second choice is preferred because whatever the
out-of-sync data is, there is likely in-sync data too to be served. And so you
are at least partly saved.
But you are opening yourself to the prospect of having files that cannot be healed. I can think of plenty of cases where this is a worse case scenario than just blocking/fencing.


In fact this is only a matter of how paranoid you want to be. You can reduce
the risk of seeing these cases by adding additional bricks to your
replication. Since you only need one brick out of X staying alive for
eliminating the risk it is in fact all up to you.

That would only be the case if there was the concept of quorum in glfs, which AFAIK, there isn't. You'd need some sort of a quorum based voting mechanism on which node to kick out of the cluster/fence, and then arrange that in such a way that it fulfills the requirements for availability and fault tolerance. If your cluster has to be quorate, then there can be no split-brain.
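
To make the quorum idea concrete, here is a minimal sketch of the majority test such a voting mechanism would rest on. It is not something glfs provides as far as this thread knows; the function name and arguments are made up for illustration.

```python
def has_quorum(reachable_nodes, cluster_size):
    """Return True if this partition holds a strict majority.

    reachable_nodes counts the local node plus every peer it can
    still talk to. Hypothetical helper, not part of glfs.
    """
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


# Three-node cluster split 2/1: only the larger side may keep serving.
print(has_quorum(2, 3), has_quorum(1, 3))

# Two-node cluster split 1/1: neither side is quorate, so both must stop.
print(has_quorum(1, 2))
```

Because a strict majority is required, two disjoint partitions can never both pass the test, which is what rules out split-brain. The cost is that a plain two-node mirror cannot survive the loss of either node without some extra tie-breaker.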

All glfs does in a way is massively improve the granularity of cluster fs operations, from fs level down to file level. But all the basic clustering concepts and requirements remain the same (fencing, quorum, split-brain, etc.).

You are also forgetting that the failure mode you are describing involves a previous failure, too. If A isn't in sync with B and B goes down, that means A went down first, but came back up.
I don't quite get the argument here. Isn't it intended somehow that A comes
back? It should come back anyway, at least by admin interaction. Still the
service should be kept up and the original setup restored, or not?

I would argue that A shouldn't be allowed to become an active (or perhaps this could be relaxed to not being allowed to become the ONLY) participant in the cluster until it is fully up to date with its peer(s).

The real hot topic here is how the time between the first node coming back and
the second node going down is used for an optimal self heal procedure. The
risk of split brain is lower the faster the self heal procedure works.
I'd say that any risk of split brain needs to be suitably addressed. A solution that includes fencing (to prevent split-brain from occurring in the first place) plus keeping a separate list of files that are "dirty" so they can be resynced explicitly before a node is allowed to fully re-join might be a reasonable way to go. This is similar to what DRBD does (it keeps a bitmap of dirty blocks for fast resync).
The re-join is implicit in glusterfs. For files not needing self-heal the
re-join time is equal to the startup time of glusterfsd. For files needing
self-heal the re-join takes place right after their healing. And you don't have
to do anything, it is simply glusterfs behaviour.

The problem is that it lacks guarantees about the node providing service being up to date if a more up to date node goes down. This may be unacceptable in a lot of cases. The resync is done lazily on-access, so the only way to deal with the resync is to issue ls -laR. As you pointed out that can be very slow on a large data set, so some way of each node keeping a list of dirty files for each disconnected node would provide a potentially quicker way to resync, in addition to providing a mechanism by which a decision could be made on whether a node is ready to take over the work if all of its peers were to go down (i.e. ensure that a node cannot provide service for a file that has been marked on it as dirty by another node).
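
A rough sketch of what such a per-peer dirty list could look like follows. The class and method names are invented for illustration and nothing like this exists in glusterfs as discussed here. How the list reaches the returning node, and how it survives a crash of the node holding it, is left open.

```python
from collections import defaultdict


class DirtyTracker:
    """Files changed on this node while each peer was disconnected.

    Hypothetical bookkeeping only; not a glusterfs data structure.
    """

    def __init__(self):
        # peer name -> set of paths that peer has not yet received
        self.dirty = defaultdict(set)

    def record_write(self, path, disconnected_peers):
        """Note that every currently disconnected peer now lacks this write."""
        for peer in disconnected_peers:
            self.dirty[peer].add(path)

    def mark_healed(self, peer, path):
        """Clear the entry once the peer's copy has been resynced."""
        self.dirty[peer].discard(path)

    def pending_for(self, peer):
        """Exactly the files to heal for this peer, with no full crawl."""
        return sorted(self.dirty[peer])

    def peer_may_serve(self, peer, path):
        """A peer must not serve a file marked dirty for it."""
        return path not in self.dirty[peer]

    def peer_is_current(self, peer):
        """True once the peer could safely take over on its own."""
        return not self.dirty[peer]


# Node B keeps the list while node A is down.
b = DirtyTracker()
b.record_write("/data/report.txt", ["A"])
print(b.pending_for("A"), b.peer_is_current("A"))
b.mark_healed("A", "/data/report.txt")
print(b.peer_is_current("A"))
```

The resync cost then scales with the number of dirty files rather than with the size of the volume, and peer_is_current gives the yes/no answer needed before a node is allowed to become the only one providing service.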

It is obvious that the optimal strategy has to know exactly what files to
heal. And I just made a proposal for that in another post.
Doing ls -lR will be no good strategy for simple runtime reasons if you have
large amounts of data.

I agree, although I'm pretty sure there can be failure modes where it is necessary.
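
For the cases where a full crawl cannot be avoided, this is roughly what ls -lR amounts to, written as a Python sketch. It assumes, as stated earlier in the thread, that accessing a file through the glusterfs mount is what triggers its self-heal; the mount point is whatever path the volume is mounted on, and the function name is made up.

```python
import os


def crawl(mount_point):
    """Stat every entry under mount_point, much like `ls -lR` does.

    Returns the number of entries successfully stat'ed. Entries that
    vanish or fail mid-crawl are skipped rather than aborting the run.
    """
    touched = 0
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                touched += 1
            except OSError:
                # Skip entries that disappeared or cannot be read.
                continue
    return touched
```

The run time grows with the total number of entries on the volume, not with the number of files that actually need healing, which is the runtime objection raised above.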


Well, the good thing about it is, it's all your choice. If you want to check
out the situation you can ls at any suitable time. But if runtime is a risk
factor you need a dirty file list.


Definitely agree on the dirty file list.

Then again, if you have that big a data set, you should be partitioning it in smaller RAID1 mirrors with RAID0 stripes on top. That way the time to resync any server to its peer is kept manageable. Simply running a 100TB mirror isn't sensible. Keeping 100 1TB mirrors is much more workable come resync time.
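
A back-of-the-envelope illustration of the resync argument, for the worst case where a whole mirror has to be copied. The 100 MB/s sustained rate is purely an assumption for the arithmetic, not a figure from this thread.

```python
def full_resync_hours(mirror_bytes, throughput_bytes_per_s):
    """Hours needed to copy an entire mirror at a sustained rate."""
    return mirror_bytes / throughput_bytes_per_s / 3600


TB = 10**12
RATE = 100 * 10**6  # assumed 100 MB/s sustained, for illustration only

one_big = full_resync_hours(100 * TB, RATE)
one_small = full_resync_hours(1 * TB, RATE)

print(f"100 TB mirror: {one_big:.1f} h")    # 277.8 h, about 11.6 days
print(f"1 TB mirror:   {one_small:.1f} h")  # 2.8 h
```

With 100 separate 1TB mirrors, a failed server only needs the small resync, and the window in which its peer is the sole good copy shrinks accordingly.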
In my eyes design and implementation should allow both, and they should be
equally manageable. And there must be no difference in resync time if you
have an equal number of dirty files. Since glusterfs has kind of a local
file-by-file design the total fs size should not make a difference.

I'm not so sure about that. Both should be implementable, but expecting both to be equally manageable from the performance and resilience perspective is misguided. You wouldn't prefer a RAID 01 over RAID 10, would you?
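
To spell out the RAID comparison: with 2n disks, RAID 10 is a stripe across n mirrored pairs, while RAID 01 is a mirror of two n-disk stripes. The sketch below computes the chance that a second, randomly chosen disk failure destroys the array once one disk has already failed. The function name is made up, and the model assumes every surviving disk is equally likely to fail next.

```python
from fractions import Fraction


def second_failure_fatal(layout, n):
    """Probability that a second random disk failure loses the array.

    n is the number of mirrored pairs (RAID 10) or the number of disks
    in each stripe (RAID 01), so the array has 2 * n disks in total.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    survivors = 2 * n - 1
    if layout == "raid10":
        fatal = 1  # only the failed disk's mirror partner is fatal
    elif layout == "raid01":
        fatal = n  # any disk in the one remaining stripe is fatal
    else:
        raise ValueError(f"unknown layout: {layout}")
    return Fraction(fatal, survivors)


for n in (2, 4, 8):
    print(n, second_failure_fatal("raid10", n), second_failure_fatal("raid01", n))
```

For four disks this gives 1/3 for RAID 10 against 2/3 for RAID 01, and the gap widens as disks are added. RAID 10 also only has to resync one disk after a failure, while RAID 01 has to rebuild a whole stripe.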


Gordan

