[Gluster-devel] Spurious disconnections / connectivity loss

From:	Gordan Bobic
Subject:	[Gluster-devel] Spurious disconnections / connectivity loss
Date:	Fri, 29 Jan 2010 18:41:10 +0000
User-agent:	Thunderbird 2.0.0.22 (X11/20090625)

I'm seeing things like this in the logs, coupled with things locking up for a while until the timeout is complete:

[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.

The thing is, I know for a fact that there is no network outage of any sort. All the machines are on a local gigabit ethernet, and there is no connectivity loss observed anywhere else. ssh sessions going to the machines that are supposedly "not responding" remain alive and well, with no lag.

The NICs in all the servers are a mix of Marvell (using the Marvell sk98lin driver) and Realtek (using the Realtek r8168 driver) - none of which have exhibited any other observable problems in use.

In 42 seconds, TCP would have re-transmitted if the packets had really been lost, so I'm not convinced it's packet loss (glfs uses TCP, right?). If it's not packet loss, then that implies that glfs daemons get stuck somewhere and either miss or ignore the packets in question. It smells like a bug, and it's not a new one, either - I have observed this in 2.0.x, too. It typically happens under heavy load (e.g. resyncing a volume to an empty server, or doing "ls -laR" on a volume to make sure it's up to date on all servers). In such cases, the network bandwidth used is nowhere near what the network can handle, nor are the CPUs in the servers anywhere near being maxed out - most of the time is spent waiting for the latencies (ping and context switches) to catch up. So I don't think it's a load (CPU or network) issue.
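
To put a rough number on the retransmission argument, here is a minimal sketch that counts how many times TCP would resend an unacknowledged segment inside the 42-second window from the log line. It assumes the retransmission timeout simply doubles after each attempt; the initial timeout values tried (0.2 s, 1 s, 3 s) and the 120 s cap are assumptions for illustration, since the real timeout depends on the measured round-trip time and the kernel's limits. The function name is hypothetical.

```python
def retransmit_times(initial_rto, window, max_rto=120.0):
    """Return the times (seconds after the first send) at which an
    unacknowledged segment would be retransmitted within `window`
    seconds, assuming the timeout doubles after every attempt."""
    times = []
    elapsed = 0.0
    rto = initial_rto  # assumed starting timeout, not a measured value
    while elapsed + rto <= window:
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)  # exponential backoff, capped
    return times


# 42 seconds is the timeout reported in the log line.
for initial_rto in (0.2, 1.0, 3.0):
    times = retransmit_times(initial_rto, 42.0)
    print(f"initial RTO {initial_rto}s: {len(times)} retransmissions in 42s")
```

Even with the most pessimistic starting value tried here, a lost packet would be resent several times before the ping timer expires, which is why a single dropped packet seems an unlikely explanation.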


Is there a way to help debug this further?
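
One thing that might help narrow it down is tallying the ping timeouts per volume and server, so they can be lined up against what the machines were doing at the time. The following is only a sketch: the regular expression is written against the single log line quoted above, so it is an assumption that other client-protocol.c messages of this kind follow the same layout, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this message.
PING_TIMEOUT = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s*E\s*"
    r"\[client-protocol\.c:\d+:client_ping_timer_expired\]\s*"
    r"(?P<volume>\S+):\s*Server\s*(?P<server>[\d.]+:\d+)\s+"
    r"has not responded in the last (?P<secs>\d+) seconds"
)


def ping_timeouts(lines):
    """Yield (timestamp, volume, server, seconds) for each ping timeout found."""
    for line in lines:
        for m in PING_TIMEOUT.finditer(line):
            yield m.group("ts"), m.group("volume"), m.group("server"), int(m.group("secs"))


def count_by_server(lines):
    """Count ping timeouts per (volume, server) pair."""
    return Counter((volume, server) for _, volume, server, _ in ping_timeouts(lines))


sample = [
    "[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] "
    "home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting."
]
print(count_by_server(sample))
```

Passing the lines of a client log file instead of `sample` would give a per-server count; the timestamps from `ping_timeouts` can then be compared with the times of the resync or "ls -laR" runs.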

Gordan
