Re: [Gluster-devel] Performance Translators' Stability and Usefulness

From:	Gordan Bobic
Subject:	Re: [Gluster-devel] Performance Translators' Stability and Usefulness
Date:	Sat, 04 Jul 2009 17:43:54 +0100
User-agent:	Thunderbird 2.0.0.22 (X11/20090625)

Geoff Kassel wrote:

(If it wasn't for that migrating to another solution would cause considerable, business-destroying downtime for my client base, I would have done so quite some time ago.)

There is an argument somewhere in there about deploying things that aren't production ready at time of deployment. But that's a different story.

All I see instead is this constant drive towards new features, with little to no signs that functionality that should be complete by now is actually so.

I can understand your point of view, but at the same time I'm assuming that the feature expansion is being done at the request of the paying customers they have, whose priorities and use cases may well be sufficiently different that the issues we are running into aren't as critical for them.

AFR is *the* key feature of GlusterFS in my mind - and the only point (I feel) for using it. Yet it's still this unstable after two plus years of development?

It is the only feature of it that I am looking into using, too, but it is plausible that somebody with a large distributed server farm focused on performance rather than redundancy may see it differently.

I have been using GlusterFS since the v1.3.x days, and I have yet to see
a version since then that doesn't crash at least once a day from just
load on even the simplest configurations.

I wouldn't say daily, but occasionally, I have seen lock-ups recently
during multiple glusterfs resyncs (separate volumes) on the new/target
machine. I have only seen it once, however, forcefully killing the
processes fixed it and it didn't re-occur. I have a suspicion that this
was related to the mounting order. I have seen weirdness happen when
changing the server order cluster-wide, and when servers rejoin the
cluster.

Well, I see one to two crashes nightly, when I rotate logs or perform backups that are stored on the GlusterFS exported drive. (It's hit and miss which processes run to completion on the first go before the crash, which should never be an issue with a reliable storage medium.)


There's a strong argument there for implementing syslog based logging.

How do you do log rotation, BTW? Do you have to issue a HUP? Or restart the glusterfsd process? As I said, I have seen issues with restarting server processes in different orders. Sometimes things will lock up and the glusterfsd process has to be killed and restarted. It seems to work when servers come up in priority order, but other orderings can be hit and miss.

The only common factor identifiable is higher-than-average I/O load.
I don't run any performance translators, because they make the situation much worse. It's just a straight AFR/posix-locks/dataspace/namespace setup, as I've posted quite a few times before.


Why do you use namespaces for straight AFR?

I've had to institute server scripting to restart GlusterFS and any processes that touch replicated files (i.e. nearly everything running on my servers) because of these crashes, to try to minimise the downtime to my clients.

Sounds like a lot of effort and micro-downtime compared to a migration to something else. Have you explored other options like PeerFS, GFS and SeznamFS? Or NFS exports with failover rather than Gluster clients, with Gluster only server-to-server?

Yes, that was bad, 2.0.2 is pretty good. Sure, there is still that
annoying settle-time bug that consistently fails the first attempt to
access the file system immediately after mounting (the time gap is
pretty tight, but if you script it, it is 100% reproducible). But other
than that I'm finding that all the other issues I had with it have been
resolved.

After two major data integrity bugs in two major releases in a row, I'm taking very much a wait-and-see attitude with any and all GlusterFS releases.

My use-case is somewhat unusual because I'm working on shared-rootfs clusters, and I need WAN functionality which cripples solutions like DRBD+GFS. But for data-only storage, there are probably alternatives out there. I'm intending to implement SeznamFS for bulk data, for example, because its MySQL-like round-robin file replication distributes the bandwidth usage much more effectively (at the expense of having no locking capability and the replication ring being cut off if any one node fails). I'll probably stick with Gluster for /home for now because SeznamFS seemed to cause X and/or KDE to fail to start when /home was on SeznamFS.

What exactly do you mean by "regression test"? Regression testing means
putting in a test case to check for all the bugs that were previously
discovered and fixed to make sure a further change doesn't re-introduce
the bug. I haven't seen the test suite, so have no reason to doubt that
there is regression testing being carried out for each release. Perhaps
the developers can clarify the situation on the testing?

I meant it in the same sense that you do. I have not seen any framework - automated or otherwise - in the repository or release files to run through tests for previous and/or foreseeable bugs and corner cases.

OK, I haven't actually checked. A "make test" feature listing all bugs by bugzilla ID as it goes through the testing process would go a long way toward providing some quality reassurance.
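To make the "listing all bugs by bugzilla ID" idea concrete, here is a rough sketch of what such a runner could look like. Everything in it is an assumption for illustration: the `regression` decorator, the `BUG-0001`/`BUG-0002` IDs and the trivial test bodies are placeholders, not anything that exists in the GlusterFS tree.

```python
import unittest


def regression(bug_id):
    """Tag a test method with the (hypothetical) tracker ID it guards against."""
    def mark(func):
        func.bug_id = bug_id
        return func
    return mark


class RegressionTests(unittest.TestCase):
    # Placeholder IDs and checks; real tests would exercise a mounted volume.

    @regression("BUG-0001")
    def test_write_then_read_roundtrip(self):
        data = b"payload"
        self.assertEqual(bytes(bytearray(data)), data)

    @regression("BUG-0002")
    def test_empty_file_has_zero_length(self):
        self.assertEqual(len(b""), 0)


def run_and_report():
    """Run each tagged test on its own and return (bug_id, passed) pairs."""
    suite = unittest.TestLoader().loadTestsFromTestCase(RegressionTests)
    report = []
    for test in suite:
        result = unittest.TestResult()
        test.run(result)
        # test.id() looks like "module.RegressionTests.test_name".
        method = getattr(test, test.id().rsplit(".", 1)[-1])
        report.append((method.bug_id, result.wasSuccessful()))
    return report


if __name__ == "__main__":
    for bug_id, passed in run_and_report():
        print(bug_id, "ok" if passed else "FAILED")
```

A "make test" target would then only need to invoke the script and fail the build if any line reports a failure.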

A test to compare cryptographic hashes of files before, after, and during storage/transfer between GlusterFS clients and backends should surely exist if there's any half-serious attempt at regression testing going on.
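For reference, the single-node version of such a hash check is small. The sketch below is only an illustration under stated assumptions: `target_dir` stands in for a GlusterFS mount point, the function names are made up here, and the dry run uses temporary directories so it says nothing about multi-node behaviour.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, target_dir):
    """Copy source into target_dir and check that the copy hashes identically.

    target_dir is assumed to be a GlusterFS mount point; any writable
    directory works for a dry run.
    """
    before = sha256_of(source)
    copied = shutil.copy(source, target_dir)
    after = sha256_of(copied)
    return before == after


def demo():
    """Dry run against temporary directories rather than a real mount."""
    with tempfile.TemporaryDirectory() as src_dir, \
            tempfile.TemporaryDirectory() as dst_dir:
        source = os.path.join(src_dir, "sample.bin")
        with open(source, "wb") as handle:
            handle.write(os.urandom(4096))
        return copy_and_verify(source, dst_dir)
```

Checking the backend copies as well would mean hashing the same relative path under each server's export directory, which needs access to every node.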

One of the problems is that some tests in this case are impossible to carry out without having multiple nodes up and running, as a number of bugs have been arising in cases where nodes join/leave or cause race conditions. It would require a distributed test harness which would be difficult to implement so that they run on any client that builds the binaries. Just because the test harness doesn't ship with the sources doesn't mean it doesn't exist on a test rig the developers use.

Surely, though, if tests like these existed and were being used, after the debacle with 2.0.0, they would have picked up at least the issue reported in 2.0.1 before release?

That depends. There are always going to be borderline or unusual use cases that wouldn't have been foreseen. For example, I tripped several issues with my usage of it for the root file system that would have been unlikely to arise for most people. The most odd one was the fact that glusterfsd wouldn't start without /tmp existing and being writable even though it doesn't seem to keep anything in there after startup. I only twigged that was what was happening when I was working on debugging it with Harha and for him the mounting worked when he mounted under /tmp, when I was mounting under /mnt. He thought it was something about /mnt having some kind of weird permissions issue, but then I twigged that I didn't actually have /tmp on my initrd bootstrap where this was being done on my setup. To this day I haven't seen an explanation of why /tmp is required and if it is a fuse requirement or gluster requirement or something else entirely.

That leads me to ask - where's the unit tests that are meant to exist, according to http://www.gluster.org/docs/index.php/GlusterFS_QA? If they exist, why (apparently) aren't tests like these still not part of them?

As I explained before, you can't sensibly come up with QA tests for timing based issues and race conditions, because those will always be heisenbuggy to some extent. I'm not saying such tests shouldn't exist, and at least perform some hammering for extended periods that was known to trigger the known issues, but that only counts statistically, it won't provide conclusive evidence of absence of the bug.
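To put a number on "only counts statistically": if a hammering run triggers a given race with some fixed probability and runs are independent (both simplifying assumptions, and the rate itself is usually unknown), the chance of catching it grows with the number of runs but never reaches certainty. A rough sketch, with made-up function names:

```python
import math


def detection_probability(per_run_failure_rate, runs):
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


def runs_needed(per_run_failure_rate, confidence):
    """Smallest number of independent runs that gives at least `confidence`
    chance of triggering the bug once. Both arguments must lie strictly
    between 0 and 1."""
    if not 0.0 < per_run_failure_rate < 1.0:
        raise ValueError("per_run_failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - per_run_failure_rate)
    )
```

For a bug that shows up in 1% of runs, `runs_needed(0.01, 0.95)` comes out at about 300 runs, and a clean pass after that still leaves roughly a 5% chance that the bug is present and was simply missed.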


Gordan
