gluster-devel

Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not con


From: Fred Hucht
Subject: Re: [Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected
Date: Tue, 25 Nov 2008 14:22:35 +0100

Hi,

Crawling through all the /var/log/messages files, I found the following on one of the failing nodes (node68):

Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20052 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x D ffff81027c3d5d68 0 20052 1
Nov 25 04:04:12 node68 kernel: ffff81027c3d5d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel: ffff81041f14e800 000000038022a7ae ffff81041f314238 ffff81041f314000
Nov 25 04:04:12 node68 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:
Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20053 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x D ffff8101c5083d68 0 20053 1
Nov 25 04:04:12 node68 kernel: ffff8101c5083d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel: ffff81041f14a800 000000008022a7ae ffff81021d8b9238 ffff81021d8b9000
Nov 25 04:04:12 node68 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:

The other two failing nodes had nothing related in the logs. Note that pw.x:20052 and pw.x:20053 are the two parallel jobs running on this node.
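A scan like this could be automated across nodes. The following is only a minimal sketch: the regular expression and the helper name `find_hung_tasks` are assumptions made for illustration, based on the syslog layout of the lines quoted above, and are not part of any GlusterFS or kernel tooling.

```python
import re
import sys

# Assumed line layout, taken from the log excerpts in this message:
# "<Mon> <day> <hh:mm:ss> <host> kernel: INFO: task <comm>:<pid> blocked for more than <N> seconds."
HUNG_TASK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (host, timestamp, command, pid) for each hung-task report in lines."""
    hits = []
    for line in lines:
        m = HUNG_TASK.match(line)
        if m:
            hits.append((m["host"], m["stamp"], m["comm"], int(m["pid"])))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python find_hung_tasks.py /var/log/messages
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in find_hung_tasks(fh):
                print(*hit)
```

The pattern expects one syslog entry per line, so it would need adjusting for logs with a different timestamp format.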

A similar error was logged during the crash two days ago on node22:

Nov 23 14:16:43 node22 kernel: INFO: task pw.x:32355 blocked for more than 120 seconds.
Nov 23 14:16:43 node22 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 14:16:43 node22 kernel: pw.x D ffff8102049c1d68 0 32355 1
Nov 23 14:16:43 node22 kernel: ffff8102049c1d48 0000000000000082 ffff81013e0e1c60 0000000000000000
Nov 23 14:16:43 node22 kernel: ffff81021e4ea000 000000038022a7ae ffff81021f004a38 ffff81021f004800
Nov 23 14:16:43 node22 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 23 14:16:43 node22 kernel: Call Trace:
Nov 23 14:16:43 node22 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 23 14:16:43 node22 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 23 14:16:43 node22 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 23 14:16:43 node22 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 23 14:16:43 node22 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 23 14:16:43 node22 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 23 14:16:43 node22 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 23 14:16:43 node22 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 23 14:16:43 node22 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 23 14:16:43 node22 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 23 14:16:43 node22 kernel:

That's all there is in /var/log/messages. Remember that the program "pw.x" runs without problems via NFS, and that it is currently the only program used for testing.
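To check whether traces from different nodes list the same frames, the symbols can be extracted and compared. This is a rough sketch only: the frame pattern is an assumption derived from the "[<address>] :module:symbol+0xOFF/0xSIZE" layout seen in the traces above, the names `call_chain` and `same_chain` are hypothetical, and labelling frames without a module prefix as "kernel" is a convention chosen here. Note that the listed frames are what the kernel printed from the stack and are not guaranteed to be an exact call sequence.

```python
import re

# Assumed frame layout: "[<hex address>] [:module:]symbol+0xOFFSET/0xSIZE".
# Optional whitespace is tolerated around "+" and "/" because mail clients
# sometimes wrap log lines at those points.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)"
    r"\s*\+0x[0-9a-f]+/\s*0x[0-9a-f]+"
)


def call_chain(trace_text):
    """Return the (module, symbol) frames found in trace_text, in order."""
    return [
        (m["module"] or "kernel", m["symbol"])
        for m in FRAME.finditer(trace_text)
    ]


def same_chain(trace_a, trace_b):
    """True if both traces list the same frames in the same order."""
    return call_chain(trace_a) == call_chain(trace_b)
```

Comparing only module and symbol names ignores timestamps, host names, and task addresses, which differ between nodes even when the blocked code path is the same.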

Fred

On 25.11.2008, at 13:42, Joe Landman wrote:

Fred Hucht wrote:
Hi!
The glusterfsd.log files on all nodes are virtually empty. The only entry on 2008-11-25 reads
2008-11-25 03:13:48 E [io-threads.c:273:iot_flush] sc1-ioth: fd context is NULL, returning EBADFD
on all nodes. I don't think that this is related to our problems.
Regards,
    Fred

Hi Fred

Could you post the complete /var/log/messages file on pastebin? I have seen something like this before when fuse crashes. Fuse crashing could be due to a bug in fuse, the kernel, etc. It could also be failing hardware.

 Does an unmount/remount fix the problem?

Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: address@hidden
web  : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

Dr. Fred Hucht <address@hidden>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany




