Re: [Gluster-devel] Gluster health/status

2010/2/23 Harald Stürzebecher <address@hidden>

2010/2/22 Samuel Hassine <address@hidden>:

> I'm also looking for a way to monitor gluster nodes.
>
> Any solutions ?
>
> On Monday, 22 February 2010 at 10:12 +0500, Anton wrote:
>> Hello!
>>
>>
>>
>> I'm looking for a way to determine the health of a Gluster
>> cluster. Is there any way to determine whether any of the nodes has
>> failed? In the log files it is possible to grep for "remotexx:
>> disconnected", but that is not suitable for monitoring. There should
>> be a simple way to query the cluster against the .vol file, see
>> whether any node/brick failed to attach, and trigger an alarm. Is
>> there anything like "gluster --reporthealth"?

Checking whether a connection to the GlusterFS TCP server port (6996
IIRC) is possible might indicate whether a server is working or has
failed, at least for setups that use TCP. I don't know if anything
like that is possible for InfiniBand-only setups.
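
A minimal sketch of such a port check in Python, in case it helps. The port 6996 is only the one recalled above and may differ per setup, and the function name port_is_open is made up for illustration. A successful connect shows only that something accepts connections on that port, not that GlusterFS itself is healthy.

```python
import socket

# Assumption: 6996 is the port recalled in this thread; verify it against your .vol file.
GLUSTERFS_PORT = 6996


def port_is_open(host, port=GLUSTERFS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling port_is_open("server1.example.org") (a hypothetical host name) from the monitoring host would then give a yes/no answer per server.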

Perhaps via IPoIB (IP over InfiniBand)?

IIRC, Nagios can check if a port is open on a remote machine. That
won't find something like disk/filesystem problems on the server, but
it could report crashed GlusterFS server processes and machines that
are not working at all.

Nagios can also run checks remotely:

http://www.logix.cz/michal/devel/nagios/
http://blogs.techrepublic.com.com/opensource/?p=321

So it can check the real status of glusterfsd, or whatever else we want, on the remote host.
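
As a rough sketch of what such a check could look like when run on the server itself, the Python below reports whether a process named glusterfsd exists. It assumes a Linux /proc layout that provides /proc/<pid>/comm. The function names find_pids and check_process are made up for illustration, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def find_pids(name, proc_root="/proc"):
    """Return sorted PIDs whose command name equals `name` (Linux /proc layout)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited during the scan, or its entry is unreadable.
            continue
    return sorted(pids)


def check_process(name="glusterfsd", proc_root="/proc"):
    """Return (exit_code, status_line) in the shape a Nagios plugin reports."""
    try:
        pids = find_pids(name, proc_root)
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - cannot read %s: %s" % (proc_root, exc)
    if pids:
        return OK, "OK - %s running (pids: %s)" % (name, ", ".join(map(str, pids)))
    return CRITICAL, "CRITICAL - %s is not running" % name
```

A plugin wrapper would print the status line and exit with the returned code. This only confirms that the process exists; whether it is actually serving requests would need a deeper check.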

I know that this simple method won't provide a positive status (=it
works) which would be preferable, but at least it can provide a
negative status (=_something_ failed on _that_ machine) in some cases.

The glusterfsd port can be taken over by another process, so checking for an open port is an indirect and unreliable way to check status.

@gluster.org:
IIRC, some time ago someone requested a syslog feature to debug
problems with GlusterFS as root filesystem for a diskless cluster -
is there any news on that?
Having the clients report problems to a central logging server might
be useful for monitoring.
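
For illustration only, here is a small Python sketch of how collected log lines could be scanned on such a logging server. It assumes that a failure appears as "<name>: disconnected" somewhere in the line, which is based solely on the "remotexx: disconnected" fragment quoted earlier in this thread. The real log format may differ, and the function name count_disconnects is made up.

```python
import re
from collections import Counter

# Assumption: failures look like "<name>: disconnected", as in the
# "remotexx: disconnected" fragment quoted in this thread.
DISCONNECT_RE = re.compile(r"(\S+): disconnected")


def count_disconnects(lines, pattern=DISCONNECT_RE):
    """Count matching log lines per reported name."""
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_disconnects would give a per-name count that a monitoring script could compare against a threshold.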

Monitoring glusterfs daemons from the client side is unreliable, because monitoring errors can be caused by faults on the client itself. (I assume the Nagios server hosts are reliable.)

I insist on remote checks because:
1) glusterfsd should abort if a non-recoverable error happens, in which case a remote check of the real status is the most reliable check;
2) if glusterfsd or any FS-related service continues to work in an unhealthy state after a non-recoverable error, it can lead to damage and irreversible loss of data. Non-recoverable errors should be investigated and fixed only by a system administrator with a complete set of system tools at hand.

Regards,

Alexey.

Regards,

Harald

_______________________________________________
Gluster-devel mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/gluster-devel

From:	Alexey Filin
Subject:	Re: [Gluster-devel] Gluster health/status
Date:	Wed, 24 Feb 2010 20:40:45 +0300