
Re: [Gluster-devel] solutions for split brain situation


From: Michael Cassaniti
Subject: Re: [Gluster-devel] solutions for split brain situation
Date: Fri, 18 Sep 2009 13:52:05 +1000

2009/9/18 Mark Mielke <address@hidden>
On 09/17/2009 06:47 PM, Stephan von Krawczynski wrote:
Way above in this discussion I said that we only talk about the first/primary
subvolume/backend for simplicity. It makes no sense to check a journal if I
can stat the real file, which I have to do anyway when an open/create arrives -
and that is exactly what we are talking about. So please explain where your
assumed race is. Really, only a braindead implementation can race on an open.
You can delay a flush on close (like writebehind), but you obviously cannot
delay an open - neither read, read-write, nor create - because you have to know
whether the file a) exists and b) can be created if it does not. As long as you
don't touch the backend you will not find out whether a create may fail for
disk-full or the like. It may just as well fail because of access privileges.
Whatever it is, you will not get a trusted answer without asking the backend;
no journal will save you.
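
A minimal sketch of that point in Python (the mount path is hypothetical,
and the errno cases are just the ones named above - the observation being
that only the actual syscall against the backend can report them):

    import errno
    import os

    def try_create(path):
        # Only the open(2) call against the backend can tell us whether
        # the create succeeds, the file already exists, access is denied,
        # or the disk is full.
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            os.close(fd)
            return "created"
        except FileExistsError:
            return "already exists"
        except PermissionError:
            return "permission denied"
        except OSError as e:
            if e.errno == errno.ENOSPC:
                return "disk full"
            raise

    print(try_create("/mnt/glusterfs/a.c"))  # hypothetical mount point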
 

Like most storage backends, the backend here includes the data pages, the metadata, AND the journal. "Without asking the backend" and "no journal will save you" fail to recognize that the backend *includes* the journal.

A scenario which should make this clear: let's say the file a.c is removed from a 2-node replication cluster. Something like the following should occur: step 1 is to lock the resource. Step 2 is to record the intent to remove on each node. Step 3 is to remove on each node. Step 4 is to clear the intent from each node. Step 5 is to unlock the resource. Now, let's say that one node is not accessible during this process and it comes back up later. After it comes back up, a process may see that the file does not exist on node 1 but does exist on node 2. Should the file exist or not? I don't know if GlusterFS even does this correctly - but if it does, the file should NOT exist. There should be sufficient information, probably in the journal, to show that the file was *removed*, and therefore, even if one node still has the file, the journal tells us that the file was removed. The self-heal operation should remove the file from the node that was down as soon as the discrepancy is detected.
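
A sketch of those five steps, with hypothetical node objects standing in
for the replicas (an illustration of the intent-journal pattern, not
GlusterFS code):

    def replicated_remove(path, nodes):
        # nodes: replica handles with hypothetical lock/journal/unlink methods
        for n in nodes:
            n.lock(path)                          # step 1: lock the resource
        try:
            for n in nodes:
                n.journal_intent("remove", path)  # step 2: record intent to remove
            for n in nodes:
                n.unlink(path)                    # step 3: remove on each node
            for n in nodes:
                n.journal_clear("remove", path)   # step 4: clear the intent
        finally:
            for n in nodes:
                n.unlock(path)                    # step 5: unlock the resource

On recovery, a node that missed steps 3 and 4 still has the recorded intent
on its peers, so self-heal can tell "this file was removed while I was down"
apart from "this file was created and I never saw it".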

Correct me if I am wrong, but GlusterFS uses extended attributes on the directory to note whether direct children of the directory have been updated. For instance, if you remove a file while one node is down, self-heal will find that the last directory change on the down node is older than that of the other nodes, and will bring any create/unlink operations into line with the other nodes.
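
Those attributes can be inspected on the backend directly; a sketch using
Python's os.listxattr (the trusted.afr.* naming is from memory and may
differ between releases, and the backend path is hypothetical):

    import os

    def afr_xattrs(path):
        # Collect the replication translator's bookkeeping attributes.
        # The exact names (e.g. trusted.afr.<volume>-client-0) depend on
        # the volume configuration, so we only filter on the prefix.
        # Reading trusted.* attributes requires root.
        return {name: os.getxattr(path, name)
                for name in os.listxattr(path)
                if name.startswith("trusted.afr")}

    print(afr_xattrs("/data/export/somedir"))  # hypothetical backend path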
 
The point here is that the journal SHOULD be consulted. If you think otherwise, I think you are not looking for a reliable replication cluster that implements POSIX guarantees.

I think GlusterFS doesn't provide all of these guarantees as well as it should, but I have not done the full testing to expose how correct or incorrect it is in various cases. As it is, I just ran into a problem where a Java program trying to use file locking failed on a GlusterFS mount point but succeeded in /var/tmp, so although I still think GlusterFS has potential, I'm slowly backing down from what production data I am willing to store in it. It's unfortunate that this solution space seems so immature. I'm still switching back and forth between wondering whether I should push / help GlusterFS into solving all of these problems, or just write my own solution.
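
That class of failure is easy to probe with a minimal POSIX lock test; a
sketch (paths hypothetical - on Linux, Java's FileChannel.lock() should end
up taking the same fcntl locks this exercises):

    import fcntl

    def can_lock(path):
        # Try to take and release an exclusive, non-blocking POSIX lock.
        with open(path, "w") as f:
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                fcntl.lockf(f, fcntl.LOCK_UN)
                return True
            except OSError:
                return False

    print(can_lock("/mnt/glusterfs/locktest"))  # fails in the case reported above
    print(can_lock("/var/tmp/locktest"))        # succeeds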

My favourite solution is a mostly asynchronous master-master approach, where each node can fall out of date with respect to the others as long as they touch different data, while changes that do touch the same data become serialized. Unfortunately, this also requires the most clever implementation strategy, and clever can take time or exceptional talent.
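
One standard building block for "serialize only when the same data is
touched" is a per-object version vector; a toy sketch of the conflict test
(not anything GlusterFS implements):

    def compare(vv_a, vv_b):
        # Compare two version vectors (dicts mapping node id -> counter).
        # Returns 'equal', 'a_newer', 'b_newer', or 'conflict'.
        nodes = set(vv_a) | set(vv_b)
        a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
        b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
        if a_ahead and b_ahead:
            return "conflict"
        if a_ahead:
            return "a_newer"
        if b_ahead:
            return "b_newer"
        return "equal"

    print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # -> 'conflict'

Two updates that each advanced a counter the other has not seen are
concurrent and must be reconciled; everything else can replicate
asynchronously.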


Read again: I said "and not going over glusterfs for some unknown reason."
"Unknown reason" means that I can think of some for myself but tend to
believe there may be lots of others. My personal reason number 1 is the
soft-migration situation.
     
See my comment about writing a program to set up the xattr metadata for you
   
How about using the code that is there - inside glusterfsd.
It must be there, else you would not be able to mount an already populated
backend for the first time. Did you try? I did.
 


This could mean that GlusterFS is too lax with regard to consistency guarantees. If files can appear in the background and magically be shown, this indicates that GlusterFS is not enforcing use through the mount point, which introduces the potential for inconsistent or faulty results. You are asking for it to guess what you want, without seeing that what you are asking for is incompatible with any guarantee of a consistent view. That "it works" actually concerns me more than it justifies your position. To me it says it's one more potential problem that I might hit in the future. A file that should be removed magically re-appears - how is this a good thing?


Cheers,
mark

I guess the last question is a good one for the developers. If the required extended attributes do not exist on the backend, should the files/directories (excluding the root directory) show up in a stat() call? That may be a blessing or a curse for new users, especially since this thread has been going on about automatic creation of extended attributes for pre-existing files in the backend.
