
Re: [Gluster-devel] Harddisk economy alternatives


From: Gordan Bobic
Subject: Re: [Gluster-devel] Harddisk economy alternatives
Date: Wed, 09 Nov 2011 17:51:59 +0000
User-agent: Roundcube Webmail/0.4.2

On Wed, 09 Nov 2011 17:50:00 +0100, Magnus Näslund <address@hidden> wrote:
[...]

We want the data replicated at least 3 times physically (box-wise),
so we've ordered 3 test servers with 24x3TB "enterprise" SATA disks
each with an areca card + bbu. We'll probably be running the tests
feeding raid volumes to glusterfs, and from what I've seen this seems
to be a standard.

With that amount of space I hope you are going to be using something like ZFS rather than normal RAID. Otherwise you are likely to find the error rate will slowly and silently eat your data.

Possible future:

Since our storage system will be in it for a really long term, we're
looking at the total economics of the solution vs. the data safety
concerns.

We've seen suggestions on letting glusterfs manage the disk directly.

What exactly do you mean by that? GlusterFS requires a normal xattr capable FS underneath it. Thus I presume you are referring to using GLFS instead of RAID (i.e. stripe+distribute).

The way I see it, this would give a win in that
1) We would be using all disks, no RAID/spare storage overhead
2) No RAID-rebuilds
3) ...
4) Profit

Also, we know that any long time system we build should be planned
with replacing disks continuously.

My main concern with such data volumes would be the error rates of modern disks. If your FS doesn't have automatic checking and block-level checksums, you will suffer data corruption, silent or otherwise. The quality of modern disks is pretty appalling these days. One of my experiences is here:
http://www.altechnative.net/?p=120
but it is by no means the only one.
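
As a rough illustration of why the error rate matters at this scale, the sketch below works out how many unrecoverable read errors one would expect when reading a full 24x3TB box once. The rate of one error per 10^14 bits is an assumed, typical datasheet figure and does not come from this thread; it also covers only errors the disk itself reports, so silent corruption would come on top of it.

```python
# Back-of-envelope estimate. Only the 24 x 3 TB layout comes from the thread;
# the error rate is an ASSUMED datasheet-style figure, not a measured one.
DISKS_PER_BOX = 24
DISK_BYTES = 3 * 10**12   # 3 TB, decimal as marketed
URE_PER_BIT = 1e-14       # assumption: 1 unrecoverable read error per 10^14 bits


def expected_read_errors(disks: int, disk_bytes: int, ure_per_bit: float) -> float:
    """Expected number of unrecoverable read errors when every bit is read once."""
    total_bits = disks * disk_bytes * 8
    return total_bits * ure_per_bit


if __name__ == "__main__":
    errors = expected_read_errors(DISKS_PER_BOX, DISK_BYTES, URE_PER_BIT)
    print(f"Expected unrecoverable read errors per full read of one box: {errors:.2f}")
```

Under that assumption, every full pass over a box (a rebuild, a scrub, a backup) would be expected to hit several bad sectors, which is why something has to check the data independently of the disk.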

Currently the only FS that meets all of my reliability criteria is ZFS (and the linux port works quite well now), and it has saved me from data corruption, silent and otherwise, a number of times by now, in cases where normal RAID wouldn't have helped.
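
The idea behind block-level checksums can be sketched in a few lines. This is a toy model, not how ZFS is implemented: the block size, the use of SHA-256, and the function names are all assumptions made for illustration. The point it shows is that a plain mirror holds two copies but cannot tell which one is right, whereas a stored checksum can.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a ZFS default


def checksum(block: bytes) -> str:
    """Checksum kept separately from the block it describes."""
    return hashlib.sha256(block).hexdigest()


def write_blocks(data: bytes):
    """Split data into blocks and record one checksum per block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    sums = [checksum(b) for b in blocks]
    return blocks, sums


def scrub(primary: list, mirror: list, sums: list) -> list:
    """Verify every primary block; repair from the mirror when it fails.

    Returns the indices of the blocks that were repaired.
    """
    repaired = []
    for i, expected in enumerate(sums):
        if checksum(primary[i]) == expected:
            continue
        if checksum(mirror[i]) != expected:
            raise IOError(f"block {i}: both copies fail verification")
        primary[i] = mirror[i]  # the mirror copy is proven good by its checksum
        repaired.append(i)
    return repaired


if __name__ == "__main__":
    data = bytes(range(256)) * 64          # 16 KiB of sample data, 4 blocks
    primary, sums = write_blocks(data)
    mirror = list(primary)

    # Simulate silent corruption: flip one bit in block 2 of the primary copy.
    damaged = bytearray(primary[2])
    damaged[0] ^= 0x01
    primary[2] = bytes(damaged)

    print("repaired blocks:", scrub(primary, mirror, sums))
    print("data intact:", b"".join(primary) == data)
```

A periodic scrub is simply this loop run over every block on the pool, so latent errors are found while a good copy still exists.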

So in my mind we could buy quality boxes with 24-36 disks run by 3-4
SATA controller cards (Marvell?),

My experience with Marvell cards is limited. Do they have 8-port cards?
I use 8-port LSI cards without any serious problems. The only issue I have seen is that they tend to reset the bus when a disk is slow to respond (specifically while it is running a SMART self-test). That means you lose the SMART short/long self-test as a monitoring option, but this is mitigated by weekly ZFS scrubs, which I trust more anyway.

using cheap and large desktop disks
(maybe not the "green" variety).

I would suggest you at the very least use disks that have Write-Read-Verify capabilities. My recent experience shows that only Seagates include this feature, even though, as it turns out, Samsung seems to own the patent on it (and my Samsungs definitely don't have that feature). If you do this, you may want to look into the WRV patch for hdparm I submitted upstream, too, but there hasn't been a release of it since then.

Another good idea is to use disks of similar spec from a different manufacturer in different machines, and make sure that your glfs bricks are mirrored so that each mirror has disks of a different make under it.

We could have a reporting system on
top of glusterfs that reports defective disks that would be replaced
as part of our on-duty maintenance. Since the storage is replicated
over 3+ boxes, the breakage of a single disk would not compromise the
data safety as long as the disks are replaced in timely manner.

Bear in mind that your network bandwidth is unlikely to be as good as your internal disk bandwidth, and restoring a 3TB brick by doing a "ls -laR" is likely to take a very long time. So you may be better off with RAIDZ2/RAIDZ3 or even just mirrored volumes in each of the machines, distributed using glfs, in terms of single disk failure recovery time.
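
To put a number on the recovery time, here is a small sketch of the best case for copying one 3TB brick over the network. The gigabit link speed and the efficiency factor are assumptions, since the thread does not say what network is in use; a real self-heal walks the files one by one and would normally be slower than this lower bound.

```python
def transfer_hours(size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link at the given usable fraction."""
    bytes_per_s = link_bits_per_s / 8 * efficiency
    return size_bytes / bytes_per_s / 3600


if __name__ == "__main__":
    BRICK_BYTES = 3 * 10**12   # one 3 TB disk used as a brick
    GIGABIT = 10**9            # ASSUMED 1 Gbit/s network, not stated in the thread

    print(f"at full line rate: {transfer_hours(BRICK_BYTES, GIGABIT):.1f} h")
    print(f"at 50% efficiency: {transfer_hours(BRICK_BYTES, GIGABIT, 0.5):.1f} h")
```

Even the ideal figure is several hours per failed disk, during which that brick has one fewer replica, whereas a local RAIDZ or mirror resilver stays on the internal disk bus.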

Anyway, to summarize:
1) With large volumes of data, you need something other than the disk's sector checksums to keep your data correct, i.e. a checksum checking FS. If you don't, expect to see silent data corruption sooner or later.
2) Don't use the same make of disk in all the servers - I have seen multiple disks from the same manufacturer fail minutes apart more than once.
3) Use WRV features if they are available.
4) Make sure your glfs bricks are mirrored between machines in such a way that the underlying disks are different (e.g. say you have 24 disks in each box, divided into 3x 8-disk RAIDZ3 volumes. Use each one of those 8-disk volumes as a brick, and mirror it to another similar machine so that the 8 disks on the other server are from a different manufacturer).
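
The layout in point 4 can be sketched as a small planning check. The host names, brick paths and manufacturer labels below are hypothetical, and the usable-capacity figure is nominal (data disks times disk size) and ignores filesystem overhead. The sketch pairs the three RAIDZ3 bricks of one server with those of another and refuses any pair that shares a disk make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brick:
    host: str
    path: str
    disk_make: str
    disks: int = 8      # disks in the RAIDZ3 volume backing this brick
    parity: int = 3     # RAIDZ3 keeps three disks' worth of parity
    disk_tb: int = 3

    @property
    def usable_tb(self) -> int:
        """Nominal capacity: data disks times disk size, overhead ignored."""
        return (self.disks - self.parity) * self.disk_tb


def mirror_pairs(left: list, right: list) -> list:
    """Pair bricks one-to-one, rejecting same-host or same-make pairs."""
    if len(left) != len(right):
        raise ValueError("both servers need the same number of bricks")
    pairs = list(zip(left, right))
    for a, b in pairs:
        if a.host == b.host:
            raise ValueError(f"{a.path} and {b.path} are on the same host")
        if a.disk_make == b.disk_make:
            raise ValueError(f"{a.path} and {b.path} share disk make {a.disk_make}")
    return pairs


if __name__ == "__main__":
    # Hypothetical servers: 24 disks each, split into 3 x 8-disk RAIDZ3 bricks.
    server_a = [Brick("server-a", f"/tank{i}/brick", "make-X") for i in range(3)]
    server_b = [Brick("server-b", f"/tank{i}/brick", "make-Y") for i in range(3)]

    pairs = mirror_pairs(server_a, server_b)
    for a, b in pairs:
        print(f"{a.host}:{a.path} <-> {b.host}:{b.path}")
    print("nominal usable TB across the mirrored pairs:", sum(a.usable_tb for a, _ in pairs))
```

With these numbers each box gives up 9 of its 24 disks to parity, and the mirror halves the remainder again; that is the price of surviving both disk failures within a box and a make-specific batch failure across boxes.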

The glfs part on top is relatively straightforward and will "just work" provided you use a reasonably sane configuration. It is the layers underneath that you will need to get right to keep your data healthy.

Gordan


