Re: [Gluster-devel] solutions for split brain situation


From: Joe Landman
Subject: Re: [Gluster-devel] solutions for split brain situation
Date: Wed, 16 Sep 2009 18:49:24 -0400
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

Some comments as a user of the open source version, and as a reseller of the commercial version, including having provided emergency support to users ... take these with a grain of salt if you wish.

Mark Mielke wrote:
> On 09/16/2009 05:45 AM, Gordan Bobic wrote:
>> It's not my project (I'm just a user of it), but having done my

[...]

> I came to a slightly different conclusion, but similar effect. Of the projects available, GlusterFS is the closest to production *today*.

As a user of many file systems over (quite) a span of time, I have yet to see "the one true file system that is really bug free, always works, and never fails." All software is buggy. Some more so than others, but all software is buggy. Anyone telling you otherwise is trying to sell you something.

> The world has waited a long time for this. It is imperfect, but right now it's still high on the list of solutions that can be used today and have potential for tomorrow.

For every storage design and implementation you do, you need to ask yourself "if this went away, what would be the impact on me and my work?" You then need to design to that answer. Failure to do so ... well ...

> In case it is of any use to others, here is the list I had worked out earlier when doing my analysis:

> - GlusterFS (http://gluster.com/community/index.php) - Very promising shared nothing architecture, production ready software supported commercially, based on FUSE (which provides insulation from the kernel at a small performance cost). Simple configuration. Very cute implementation where each "brick" in a "cluster/replication" setup is just a regular file system that can be accessed natively, so the data is always safe and can be inspected using UNIX commands or backed up using rsync. Most logic is client side, including replication, and they use file system extended attributes to journal changes and "self-heal". But, very recently there have been some problems, possibly with how GlusterFS calls Linux, triggering a Linux problem that causes the system to freeze up a bit. My own first test froze things up. The GlusterFS support people want to find the problem and I will be working with them to see whether this can be resolved.
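As an aside on the self-heal mechanism mentioned above: because the replication logic keeps its change log in extended attributes on the backend files, you can inspect a brick with ordinary tools. Below is a rough Python sketch of that kind of inspection; the trusted.afr.* attribute names are only illustrative and vary by GlusterFS version, the brick path is hypothetical, and reading the trusted.* namespace generally requires root.

    #!/usr/bin/env python3
    # Sketch: dump the extended attributes (e.g. trusted.afr.*) that the
    # replication logic keeps on a brick's backend files. Attribute names
    # are illustrative and version dependent. Run against the brick
    # directory itself, not the client mount. Linux only; reading the
    # trusted.* namespace usually requires root.
    import os
    import sys

    def dump_trusted_xattrs(brick_path):
        for dirpath, _dirnames, filenames in os.walk(brick_path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    attrs = os.listxattr(path)
                except OSError:
                    continue  # unreadable or vanished file; skip it
                for attr in attrs:
                    if attr.startswith("trusted."):
                        value = os.getxattr(path, attr)
                        print(f"{path}\t{attr}\t{value.hex()}")

    if __name__ == "__main__":
        dump_trusted_xattrs(sys.argv[1] if len(sys.argv) > 1 else ".")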

> - Ceph (http://ceph.newdream.net/) - Very promising shared nothing architecture that has kernel module support instead of FUSE (better performance), but it is not ready for production. They say they will stabilize it by the end of 2009, but do not recommend using it for production even at that time.

Ceph is very interesting, and should be one to watch over time. Sage and his group seem to have fewer resources at their disposal than Z-Research, so evolution may take longer.


> - PVFS (http://www.pvfs.org/) - Very promising architecture. Widely used in production. V1 has a shared metadata server; in V2 they are changing to a shared nothing architecture. Has kernel module support instead of FUSE (better performance). However, PVFS does not provide POSIX guarantees. In particular, they do not implement advisory locking through flock()/fcntl(). This means that use of this system would probably require an architecture that does master/slave fail over as opposed to master/master fail over. Most file system accesses do not need this level of locking, but Dovecot in particular probably does. The Dovecot locking through .lock files might work, but I need to look a little closer.

PVFS is not a POSIX file system, and you shouldn't try to use it as one. PVFS2 is the current release, and as Dan from Synthetic Genomics might note, it has some issues with codes that want to use it as a parallel POSIX file system. PVFS2 is purpose-built for MPI-IO and related codes. There is nothing wrong with this; in fact, it is a good thing, as MPI-IO capabilities are very important in the HPC sector.
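To make the MPI-IO point concrete, here is a minimal mpi4py sketch of the access pattern PVFS2 is built for: every rank writing its own disjoint block at an explicit offset through a collective call, with no POSIX locking anywhere. The file name and block size are arbitrary, and nothing in the sketch is PVFS-specific; it runs under mpiexec on any file system with an MPI-IO driver.

    # Sketch of an MPI-IO collective write (mpi4py). Each rank writes a
    # disjoint block at an explicit byte offset -- the pattern PVFS2 is
    # optimized for -- and no POSIX advisory locking is involved.
    # File name and block size are arbitrary for illustration.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1024, rank, dtype=np.int32)   # this rank's data
    offset = rank * block.nbytes                  # disjoint byte offsets

    fh = MPI.File.Open(comm, "demo.out",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(offset, block)                # collective write
    fh.Close()

Run it with something like "mpiexec -n 4 python demo.py".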

Probably not so important for Dovecot.
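On the .lock file point in the quoted notes: dotlock-style locking only needs open() with O_CREAT|O_EXCL to be atomic, not fcntl()/flock(), which is why it can survive on file systems that lack advisory locking. A rough Python sketch of the idea follows; the paths and timeout are made up, and this says nothing about what Dovecot actually does internally.

    # Sketch of dotlock-style locking: atomically create "<file>.lock"
    # with O_CREAT | O_EXCL. The create either succeeds (we own the lock)
    # or fails because the lock already exists. No fcntl()/flock()
    # advisory locking is needed, only atomic exclusive create.
    # Paths and timeout values are arbitrary for illustration.
    import os
    import time

    def acquire_dotlock(path, timeout=10.0, poll=0.1):
        lock = path + ".lock"
        deadline = time.monotonic() + timeout
        while True:
            try:
                fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.write(fd, str(os.getpid()).encode())  # record owner pid
                os.close(fd)
                return lock
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {path}")
                time.sleep(poll)

    def release_dotlock(lock):
        os.unlink(lock)

    # usage (hypothetical mailbox path):
    #   lock = acquire_dotlock("/var/mail/user/inbox")
    #   try:
    #       ...modify the mailbox...
    #   finally:
    #       release_dotlock(lock)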

[...]

> - Lustre (http://www.lustre.org/) - Seems to be the focus of the commercial world. Currently based on ext3/ext4, to be based on ZFS in 2010. Weakness seems to be in having a single shared metadata server that must be made highly available using a shared disk solution such as GFS or OCFS. Due to this architecture, I do not consider this solution to meet our requirement of a shared nothing architecture where any server can completely die and the other server takes over the load without intervention.

Lustre is dependent upon Sun, and there are, to put it mildly, concerns over its future within Oracle. Oracle isn't really in the high performance computing market, which is where Lustre plays. I won't go into more depth here on its future.

Lustre is predominantly an object based storage system. It depends critically upon features that require very specific kernels and kernel patches, which tends to make it incompatible with a requirement to keep stock, distro-supplied kernels.

The migration to ZFS has been seen in some circles (people have mentioned this to us) as a migration over to Solaris, which has caused numerous users to start looking at transition plans off of Lustre. Which is hard when you have petabytes of data ... moving it ain't easy.

> - CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre, although development of this solution seems slow and this system is not ready for production. Development of both has effectively stalled since 2008. If these are ever released, I think they will be great solutions, but they are apparently having design problems (either the developers are not good enough, or the design is too complicated, probably both).

BTRFS has most definitely not stalled. It is now in the Linux kernel as of 2.6.29, and is the target file system for a number of well known distros going forward. Ext4 is simply not viable for the storage sizes people are contemplating. XFS, a venerable file system, has most of its developers at SGI, which carries obvious risks. JFS may not be actively developed anymore. Chris Mason has been very actively doing BTRFS work, as far as I can tell from the various sources: http://btrfs.wiki.kernel.org/index.php/Main_Page#News

CRFS is dependent upon BTRFS, so CRFS is more of a placeholder.

With Sun owning ZFS and Oracle owning BTRFS, and given that the latter is GPL licensed and the former is not (and is patent encumbered), I expect more work on BTRFS going forward for Linux, an important platform for Oracle. Solaris is not increasing its installed base; rather, it is rapidly doing the opposite, and this trend isn't likely lost on Oracle.

Of course, we could be wrong, and our biases are in part due to what we sell, resell, and support, so take what I say with a grain of salt if you wish.

I do expect GlusterFS to work well atop BTRFS in the not so distant future.

You did neglect pNFS in your notes. It's sort of the "pink elephant" in the room. There are good things about it, and some ... er ... challenging things about it. I expect the Kerberos requirements (and all that they imply) aren't going to help its adoption. If you haven't dealt with a Kerberos installation and management situation, you might not get this.

Also, POHMELFS was included in Linux 2.6.29. It is an interesting parallel file system, but we haven't played with it much yet.

Finally, among the other file systems you should pay attention to, nilfs2 looks to be quite interesting. Continuous snapshotting is a compelling feature, though how it could be used from within GlusterFS (GlusterFS atop nilfs2) isn't completely apparent yet. It could make for some very powerful capabilities in GlusterFS if the developers go this route.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: address@hidden
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615



