Re: [Gluster-devel] solutions for split brain situation


From: Joe Landman
Subject: Re: [Gluster-devel] solutions for split brain situation
Date: Wed, 16 Sep 2009 18:49:24 -0400
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

Some comments as a user of the open source version, and as a reseller of the commercial version, including having provided emergency support to users ... take these with a grain of salt if you wish.

Mark Mielke wrote:
> On 09/16/2009 05:45 AM, Gordan Bobic wrote:
>> It's not my project (I'm just a user of it), but having done my

[...]

> I came to a slightly different conclusion, but similar effect. Of the projects available, GlusterFS is the closest to production *today*.

As a user of many file systems over (quite) a span of time, I have yet to see "the one true file system that is really bug free, always works, and never fails." All software is buggy. Some more so than others, but all software is buggy. Anyone telling you otherwise is trying to sell you something.

> The world has waited a long time for this. It is imperfect, but right now it's still high on the list of solutions that can be used today and have potential for tomorrow.

For every storage design and implementation you do, you need to ask yourself "if this went away, what would be the impact on me and my work?" You then need to design to that answer. Failure to do so ... well ...

> In case it is of any use to others, here is the list I had worked out earlier when doing my analysis:

> - GlusterFS (http://gluster.com/community/index.php) - Very promising shared nothing architecture, production ready software supported commercially, based on FUSE (which provides insulation from the kernel at a small performance cost). Simple configuration. Very cute implementation where each "brick" in a "cluster/replication" setup is just a regular file system that can be accessed natively, so the data is always safe and can be inspected using UNIX commands or backed up using rsync. Most logic is client side, including replication, and they use file system extended attributes to journal changes and "self-heal". But, very recently there have been some problems, possibly with how GlusterFS calls Linux, triggering a Linux problem that causes the system to freeze up a bit. My own first test froze things up. The GlusterFS support people want to find the problem and I will be working with them to see whether this can be resolved.
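As an aside on the self-heal mechanism mentioned above: because the replication logic keeps its change log in extended attributes on the backend files, you can inspect a brick with ordinary tools. Below is a rough Python sketch of that kind of inspection; the trusted.afr.* attribute names are only illustrative and vary by GlusterFS version, the brick path is hypothetical, and reading the trusted.* namespace generally requires root.

    #!/usr/bin/env python3
    # Sketch: dump the extended attributes (e.g. trusted.afr.*) that the
    # replication logic keeps on a brick's backend files. Attribute names
    # are illustrative and version dependent. Run against the brick
    # directory itself, not the client mount. Linux only; reading the
    # trusted.* namespace usually requires root.
    import os
    import sys

    def dump_trusted_xattrs(brick_path):
        for dirpath, _dirnames, filenames in os.walk(brick_path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    attrs = os.listxattr(path)
                except OSError:
                    continue  # unreadable or vanished file; skip it
                for attr in attrs:
                    if attr.startswith("trusted."):
                        value = os.getxattr(path, attr)
                        print(f"{path}\t{attr}\t{value.hex()}")

    if __name__ == "__main__":
        dump_trusted_xattrs(sys.argv[1] if len(sys.argv) > 1 else ".")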

> - Ceph (http://ceph.newdream.net/) - Very promising shared nothing architecture that has kernel module support instead of FUSE (better performance), but it is not ready for production. They say they will stabilize it by the end of 2009, but do not recommend using it for production even at that time.

Ceph is very interesting, and should be one to watch over time. Sage and his group seem to have fewer resources at their disposal than Z-Research, so evolution may take longer.


> - PVFS (http://www.pvfs.org/) - Very promising architecture. Widely used in production. V1 has a shared metadata server; in V2 they are changing to a shared nothing architecture. Has kernel module support instead of FUSE (better performance). However, PVFS does not provide POSIX guarantees. In particular, they do not implement advisory locking through flock()/fcntl(). This means that use of this system would probably require an architecture that does master/slave fail over as opposed to master/master fail over. Most file system accesses do not need this level of locking, but Dovecot in particular probably does. The Dovecot locking through .lock files might work, but I need to look a little closer.

PVFS is not a POSIX file system, and you shouldn't try to use it as one. PVFS2 is the current release, and as Dan from Synthetic Genomics might note, it has some issues with codes that want to use it as a parallel POSIX file system. PVFS2 is purpose-built for MPI-IO and related codes. There is nothing wrong with this; in fact, it is a good thing, as MPI-IO capabilities are very important in the HPC sector.
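To make the MPI-IO point concrete, here is a minimal mpi4py sketch of the access pattern PVFS2 is built for: every rank writing its own disjoint block at an explicit offset through a collective call, with no POSIX locking anywhere. The file name and block size are arbitrary, and nothing in the sketch is PVFS-specific; it runs under mpiexec on any file system with an MPI-IO driver.

    # Sketch of an MPI-IO collective write (mpi4py). Each rank writes a
    # disjoint block at an explicit byte offset -- the pattern PVFS2 is
    # optimized for -- and no POSIX advisory locking is involved.
    # File name and block size are arbitrary for illustration.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1024, rank, dtype=np.int32)   # this rank's data
    offset = rank * block.nbytes                  # disjoint byte offsets

    fh = MPI.File.Open(comm, "demo.out",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(offset, block)                # collective write
    fh.Close()

Run it with something like "mpiexec -n 4 python demo.py".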

Probably not so important for Dovecot.
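On the .lock file point in the quoted notes: dotlock-style locking only needs open() with O_CREAT|O_EXCL to be atomic, not fcntl()/flock(), which is why it can survive on file systems that lack advisory locking. A rough Python sketch of the idea follows; the paths and timeout are made up, and this says nothing about what Dovecot actually does internally.

    # Sketch of dotlock-style locking: atomically create "<file>.lock"
    # with O_CREAT | O_EXCL. The create either succeeds (we own the lock)
    # or fails because the lock already exists. No fcntl()/flock()
    # advisory locking is needed, only atomic exclusive create.
    # Paths and timeout values are arbitrary for illustration.
    import os
    import time

    def acquire_dotlock(path, timeout=10.0, poll=0.1):
        lock = path + ".lock"
        deadline = time.monotonic() + timeout
        while True:
            try:
                fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.write(fd, str(os.getpid()).encode())  # record owner pid
                os.close(fd)
                return lock
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {path}")
                time.sleep(poll)

    def release_dotlock(lock):
        os.unlink(lock)

    # usage (hypothetical mailbox path):
    #   lock = acquire_dotlock("/var/mail/user/inbox")
    #   try:
    #       ...modify the mailbox...
    #   finally:
    #       release_dotlock(lock)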

[...]

> - Lustre (http://www.lustre.org/) - Seems to be the focus of the commercial world. Currently based on ext3/ext4, to be based on ZFS in 2010. Weakness seems to be in having a single shared metadata server that must be made highly available using a shared disk solution such as GFS or OCFS. Due to this architecture, I do not consider this solution to meet our requirement of a shared nothing architecture where any server can completely die and the other server takes over the load without intervention.

Lustre is dependent upon Sun, and there are, to put it mildly, concerns over its future within Oracle. Oracle isn't really in the high performance computing market, which is where Lustre plays. I won't go into more depth here on its future.

Lustre is predominantly an object based storage system. It depends critically upon features that require very specific kernels and kernel patches, which tends to make it incompatible with a requirement to keep stock, distro-supplied kernels.

The migration to ZFS has been seen in some circles (people have mentioned this to us) as a migration over to Solaris, which has caused numerous users to start looking at transition plans off of Lustre. Which is hard when you have petabytes of data ... moving it ain't easy.

> - CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre, although development of this solution seems slow and this system is not ready for production. Development of both has effectively stalled since 2008. If these are ever released, I think they will be great solutions, but they are apparently having design problems (either the developers are not good enough, or the design is too complicated, probably both).

BTRFS has most definitely not stalled. It is now in the Linux kernel as of 2.6.29, and is the target file system for a number of well known distros going forward. Ext4 is simply not viable for the storage sizes people are contemplating. XFS, a venerable file system, has most of its developers at SGI, which carries obvious risks. JFS may not be actively developed anymore. Chris Mason has been very actively doing BTRFS work, as far as I can tell from the various sources: http://btrfs.wiki.kernel.org/index.php/Main_Page#News

CRFS is dependent upon BTRFS, so CRFS is more of a placeholder.

With Sun owning ZFS and Oracle owning BTRFS, and given that the latter is GPL licensed and the former is not (and is patent encumbered), I expect more work on BTRFS going forward for Linux, an important platform for Oracle. Solaris is not increasing its installed base; rather, it is rapidly doing the opposite, and this trend isn't likely lost on Oracle.

Of course, we could be wrong, and our biases are in part due to what we sell, resell, and support, so take what I say with a grain of salt if you wish.

I do expect GlusterFS to work well atop BTRFS in the not so distant future.

You did neglect pNFS in your notes. It's sort of the "pink elephant" in the room. There are good things about it, and some ... er ... challenging things about it. I expect the Kerberos requirements (and all that they imply) aren't going to help its adoption. If you haven't dealt with a Kerberos installation and management situation, you might not get this.

Also, POHMELFS was included in Linux 2.6.29. It is an interesting parallel file system, but we haven't played with it much yet.

Finally, among the other file systems you should pay attention to, nilfs2 looks to be quite interesting. Continuous snapshotting is a compelling feature, though how it could be used from within GlusterFS (GlusterFS atop nilfs2) isn't completely apparent yet. It could make for some very powerful capabilities in GlusterFS if the developers go this route.


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: address@hidden
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615



