From: Gordan Bobic
Subject: Re: [Gluster-devel] Weird lock-ups
Date: Mon, 27 Oct 2008 18:58:28 +0000
User-agent: Thunderbird 2.0.0.17 (X11/20081001)
Krishna Srinivas wrote:
> On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <address@hidden> wrote:
>> I'm starting to see lock-ups when using a single-file client/server setup.
>>
>> machine1 (x86):
>> =================================
>> volume home2
>>   type protocol/client
>>   option transport-type tcp/client
>>   option remote-host 192.168.3.1
>>   option remote-subvolume home2
>> end-volume
>>
>> volume home-store
>>   type storage/posix
>>   option directory /gluster/home
>> end-volume
>>
>> volume home1
>>   type features/posix-locks
>>   subvolumes home-store
>> end-volume
>>
>> volume server
>>   type protocol/server
>>   option transport-type tcp/server
>>   subvolumes home1
>>   option auth.ip.home1.allow 127.0.0.1,192.168.*
>> end-volume
>>
>> volume home
>>   type cluster/afr
>>   subvolumes home1 home2
>>   option read-subvolume home1
>> end-volume
>>
>> machine2 (x86-64):
>> =================================
>> volume home1
>>   type protocol/client
>>   option transport-type tcp/client
>>   option remote-host 192.168.0.1
>>   option remote-subvolume home1
>> end-volume
>>
>> volume home-store
>>   type storage/posix
>>   option directory /gluster/home
>> end-volume
>>
>> volume home2
>>   type features/posix-locks
>>   subvolumes home-store
>> end-volume
>>
>> volume server
>>   type protocol/server
>>   option transport-type tcp/server
>>   subvolumes home2
>>   option auth.ip.home2.allow 127.0.0.1,192.168.*
>> end-volume
>>
>> volume home
>>   type cluster/afr
>>   subvolumes home1 home2
>>   option read-subvolume home2
>> end-volume
>> ==================
>>
>> Do those configs look sane?
>>
>> When one machine is running on its own, it's fine. Other client-only
>> machines can connect to it without any problems. However, as soon as the
>> second client/server comes up, typically the first ls access on the
>> directory will lock the whole thing up solid. Interestingly, on the x86
>> machine, the glusterfs process can always be killed. Not so on the x86-64
>> machine (the 2nd machine that comes up). kill -9 doesn't kill it. The only
>> way to clear the lock-up is to reboot.
>>
>> Using the 1.3.12 release compiled into an RPM on both machines (CentOS 5.2).
>>
>> One thing worthy of note is that machine2 is nfsrooted / network booted.
>> It has local disks in it, and a local dmraid volume is mounted under
>> /gluster on it (machine1 has a disk-backed root). So:
>>
>> on machine1: / is local disk
>> on machine2: / is NFS, /gluster is local disk
>>
>> /gluster/home is exported in the volume spec for AFR.
>>
>> If /gluster/home is newly created, it tends to get a little further, but
>> still locks up pretty quickly. If I try to execute find /home once it is
>> mounted, it will eventually hang, and the only thing of note I could see
>> in the logs is that it said "active lock found" at the point where it
>
> Do you see this error on server1 or server2? Any other clues in the logs?
Access to the FS locks up on both server1 and server2.

I have split up the setup to separate client and server on server2 (x86-64), and have tried to get it to sync up just the file placeholders (find . at the root of the glusterfs mounted tree), and this, too, causes a lock-up. I have managed to kill the glusterfsd process, but only after killing the glusterfs process first.
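For reference, the split spec files on server2 look like this (the file names, and using 127.0.0.1 to reach the now-separate local server, are just my choices, nothing mandated):

server2, server spec (started with something like glusterfsd -f /etc/glusterfs/server.vol):
=================================
volume home-store
  type storage/posix
  option directory /gluster/home
end-volume

volume home2
  type features/posix-locks
  subvolumes home-store
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes home2
  option auth.ip.home2.allow 127.0.0.1,192.168.*
end-volume

server2, client spec (started with something like glusterfs -f /etc/glusterfs/client.vol /home):
=================================
volume home1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.1
  option remote-subvolume home1
end-volume

# home2 is now reached over loopback instead of living in the same process
volume home2
  type protocol/client
  option transport-type tcp/client
  option remote-host 127.0.0.1
  option remote-subvolume home2
end-volume

volume home
  type cluster/afr
  subvolumes home1 home2
  option read-subvolume home2
end-volume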
This ends up in the logs on server2, in the glusterfs (client) log:

2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing transport
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(36) address@hidden
2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] mirror: op_ret=-1 op_errno=107
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 19915: (34) /gordan/bin => -1 (5)
2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing transport
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(0) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2) op_ret=-1 op_errno=107
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport] home1: EOF from peer (192.168.0.1:6996)
2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup] home1: forced unwinding frame type(2) op(5) address@hidden
2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: no proper reply from server, returning ENOTCONN
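For what it's worth, op_errno=107 in the afr callbacks above is ENOTCONN ("Transport endpoint is not connected"), and the (5) that fuse_entry_cbk returns for /gordan/bin is EIO. A quick way to double-check the errno mapping (assuming Python is installed, as it is on stock CentOS 5):

python -c 'import os; print os.strerror(107)'
# Transport endpoint is not connected
python -c 'import os; print os.strerror(5)'
# Input/output error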
I think this was generated in the logs only after the server2 client was forcefully killed, not when the lock-up occurred, though.
If I merge the client and server config into a single volume definition on server2, the lock-up happens as soon as the FS is mounted. If server2's server gets brought up first, then server1's combined process, then server2's client, it seems to last a bit longer.
I'm wondering now if it fails on a particular file/file type (e.g. a socket).
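A quick way to check whether there are any sockets (or FIFOs) in the tree is to run find against the backend directory rather than the glusterfs mount, so the check itself can't hang:

find /gluster/home -type s   # unix sockets
find /gluster/home -type p   # named pipes / FIFOs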
But whatever is causing it, it is completely reproducible. I haven't been able to keep it running under these circumstances for long enough to finish loading X with the home directory mounted over glusterfs with both servers running.
Gordan