gluster-devel

[Gluster-devel] Spurious disconnections / connectivity loss


From: Gordan Bobic
Subject: [Gluster-devel] Spurious disconnections / connectivity loss
Date: Fri, 29 Jan 2010 18:41:10 +0000
User-agent: Thunderbird 2.0.0.22 (X11/20090625)

I'm seeing entries like this in the logs, coupled with operations locking up until the timeout expires:

[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
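
To see how often this happens and against which servers, the timeout entries could be pulled out of the client log with something like the sketch below. The regular expression is written against the line format quoted above only; the function names find_ping_timeouts and count_by_server are hypothetical helpers, not part of glusterfs.

```python
import re
from collections import Counter

# Pattern for the ping-timeout entries quoted above (an assumption based on
# those two lines only; other glusterfs versions may format this differently).
PING_TIMEOUT = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] E "
    r"\[client-protocol\.c:\d+:client_ping_timer_expired\] "
    r"(?P<volume>\S+): Server (?P<server>\S+) has not responded "
    r"in the last (?P<secs>\d+) seconds"
)


def find_ping_timeouts(text):
    """Return (timestamp, volume, server, seconds) for each timeout entry."""
    return [
        (m["ts"], m["volume"], m["server"], int(m["secs"]))
        for m in PING_TIMEOUT.finditer(text)
    ]


def count_by_server(text):
    """Count timeout entries per server address."""
    return Counter(server for _, _, server, _ in find_ping_timeouts(text))
```

Running count_by_server over the contents of a client log and comparing the timestamps with periods of heavy load would show whether the disconnections cluster around resyncs or recursive listings.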

The thing is, I know for a fact that there is no network outage of any sort. All the machines are on a local gigabit ethernet, and there is no connectivity loss observed anywhere else. ssh sessions going to the machines that are supposedly "not responding" remain alive and well, with no lag.

The NICs in all the servers are a mix of Marvell (using the Marvell sk98lin driver) and Realtek (using the Realtek r8168 driver) - none of which have exhibited any other observable problems in use.

In 42 seconds, TCP would have retransmitted if the packets had really been lost, so I'm not convinced it's packet loss (glfs uses TCP, right?). If it's not packet loss, that implies the glfs daemons get stuck somewhere and either miss or ignore the packets in question. It smells like a bug, and not a new one, either: I have observed this in 2.0.x, too. It typically happens under heavy load (e.g. resyncing a volume to an empty server, or doing "ls -laR" on a volume to make sure it's up to date on all servers). In such cases, the network bandwidth used is nowhere near what the network can handle, nor are the CPUs in the servers anywhere near maxed out; most of the time is spent waiting on latency (ping round trips and context switches). So I don't think it's a load (CPU or network) issue.
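
One rough way to separate "network problem" from "daemon stuck" would be to time plain TCP connects to the server port while the heavy operation is running. This is only a sketch: tcp_connect_latency is a hypothetical helper, and because the kernel completes the TCP handshake for a listening socket, it measures network and kernel responsiveness rather than whether the glfs daemon is actually servicing requests.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=5.0):
    """Return seconds taken to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling tcp_connect_latency("10.2.0.10", 6997) once a second and logging the result next to the client log would show whether connects stay fast during the 42 seconds in which the ping timer expires. If they do, that would point at the daemon rather than the network.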

Is there a way to help debug this further?

Gordan



