gluster-devel

Re: [Gluster-devel] Weird lock-ups


From: gordan
Subject: Re: [Gluster-devel] Weird lock-ups
Date: Mon, 27 Oct 2008 19:55:03 +0000 (GMT)
User-agent: Alpine 2.00 (LRH 1167 2008-08-23)



On Mon, 27 Oct 2008, Gordan Bobic wrote:

Krishna Srinivas wrote:
 On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <address@hidden> wrote:
> I'm starting to see lock-ups when using a single-file client/server
> setup.
>
> machine1 (x86):
> =================================
>  volume home2
>         type protocol/client
>         option transport-type tcp/client
>         option remote-host 192.168.3.1
>         option remote-subvolume home2
>  end-volume
>
> volume home-store
>         type storage/posix
>         option directory /gluster/home
>  end-volume
>
> volume home1
>         type features/posix-locks
>         subvolumes home-store
>  end-volume
>
> volume server
>         type protocol/server
>         option transport-type tcp/server
>         subvolumes home1
>         option auth.ip.home1.allow 127.0.0.1,192.168.*
>  end-volume
>
> volume home
>         type cluster/afr
>         subvolumes home1 home2
>         option read-subvolume home1
>  end-volume
>
> machine2 (x86-64):
> =================================
>  volume home1
>         type protocol/client
>         option transport-type tcp/client
>         option remote-host 192.168.0.1
>         option remote-subvolume home1
>  end-volume
>
> volume home-store
>         type storage/posix
>         option directory /gluster/home
>  end-volume
>
> volume home2
>         type features/posix-locks
>         subvolumes home-store
>  end-volume
>
> volume server
>         type protocol/server
>         option transport-type tcp/server
>         subvolumes home2
>         option auth.ip.home2.allow 127.0.0.1,192.168.*
>  end-volume
>
> volume home
>         type cluster/afr
>         subvolumes home1 home2
>         option read-subvolume home2
>  end-volume
>
> ==================
>
> Do those configs look sane?
>
> When one machine is running on its own, it's fine. Other client-only
>  machines can connect to it without any problems. However, as soon as the
>  second client/server comes up, typically the first ls access on the
>  directory will lock the whole thing up solid.
>
> Interestingly, on the x86 machine, the glusterfs process can always be
> killed. Not so on the x86-64 machine (the 2nd machine that comes up). kill
>  -9 doesn't kill it. The only way to clear the lock-up is to reboot.
>
> Using the 1.3.12 release compiled into an RPM on both machines (CentOS
> 5.2).
>
> One thing worthy of note is that machine2 is nfsrooted / network booted. It
> has local disks in it, and a local dmraid volume is mounted under /gluster
>  on it (machine1 has a disk-backed root).
>
> So, on machine1:
>  / is local disk
>  on machine2:
>  / is NFS
>  /gluster is local disk
>  /gluster/home is exported in the volume spec for AFR.
>
> If /gluster/home is newly created, it tends to get a little further, but
>  still locks up pretty quickly. If I try to execute find /home once it is
> mounted, it will eventually hang, and the only thing of note I could see in
>  the logs is that it said "active lock found" at the point where it

 Do you see this error on server1 or server2? Any other clues in the logs?

Access to the FS locks up on both server1 and server2.

I have split up the setup to separate client and server on server2 (x86-64), and have tried to get it to sync up just the file placeholders (find . at the root of the glusterfs mounted tree), and this, too, causes a lock-up. I have managed to kill the glusterfsd process, but only after killing the glusterfs process first.
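For concreteness, the split client-side volfile on server2 would look roughly like this. This is a sketch reconstructed from the combined spec quoted above, not the exact file from the report; in particular, reaching the local brick over loopback (127.0.0.1) is an assumption:

```
# server2 client volfile (sketch; reconstructed, not the original file)
volume home1
        type protocol/client
        option transport-type tcp/client
        option remote-host 192.168.0.1
        option remote-subvolume home1
end-volume

# The local brick is now reached over the network instead of being linked
# in-process (remote-host 127.0.0.1 is an assumption).
volume home2
        type protocol/client
        option transport-type tcp/client
        option remote-host 127.0.0.1
        option remote-subvolume home2
end-volume

volume home
        type cluster/afr
        subvolumes home1 home2
        option read-subvolume home2
end-volume
```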

This ends up in the logs on server2, in the glusterfs (client) log:
2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing transport
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(36) address@hidden
2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] mirror: op_ret=-1 op_errno=107
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 19915: (34) /gordan/bin => -1 (5)
2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing transport
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(0) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2) op_ret=-1 op_errno=107
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: forced unwinding frame type(1) op(34) address@hidden
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport] home1: EOF from peer (192.168.0.1:6996)
2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup] home1: forced unwinding frame type(2) op(5) address@hidden 1230
2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: no proper reply from server, returning ENOTCONN

I think this was generated in the logs only after the server2 client was forcefully killed, not when the lock-up occurred, though.

If I merge the client and server config into a single volume definition on server2, the lock-up happens as soon as the FS is mounted. If server2-server gets brought up first, then server1-combined, then server2-client, it seems to last a bit longer.

I'm wondering now if it fails on a particular file/file type (e.g. a socket).

But whatever is causing it, it is completely reproducible. I haven't been able to keep it running under these circumstances for long enough to finish loading X with the home directory mounted over glusterfs with both servers running.

Update - the problem seems to be somehow linked to running the client on server2. If I start up server2-server, and server1 client+server, I can execute a complete "find ." on the gluster mounted volume (from server1, obviously, server2 doesn't have a client running), and instigate a full resync by the usual "find /home -type f -exec head -c1 {} \; > /dev/null". This all works, and all files end up on server2.
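The placeholder-then-content sequence above boils down to two find passes. Here is a minimal sketch; it runs against a throwaway directory standing in for the glusterfs mount (a real run would point at /home on the mounted tree):

```shell
# Stand-in for the glusterfs mount point, for illustration only.
MNT=$(mktemp -d)
printf 'hello' > "$MNT/file1"

# Pass 1: walk the tree. On AFR, the lookup() calls this generates create
# missing file placeholders on the out-of-date subvolume.
find "$MNT" > /dev/null

# Pass 2: read the first byte of every regular file. The open() is what
# triggers AFR self-heal, which then copies the full file contents across.
find "$MNT" -type f -exec head -c1 {} \; > first-bytes.txt
cat first-bytes.txt
```

On a real mount, pass 2 is the expensive step, since every out-of-date file gets synced in full even though only one byte is read.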

But doing this with client up and running on server2 makes the whole process lock up. Sometimes I can only get the first "ls -la" on the base of the mounted tree before everything subsequent locks up and ends up waiting until I "killall glusterfs" on server2. At this point glusterfsd (server) on server2 is unkillable until glusterfs (client) is killed first.

I have just completed a full rescan of the underlying file system on server1 just in case that might have gone wrong, and it passed without any issues.

So, something in the server2 (x86-64) client part causes a lock-up somewhere in the process. :-(

Gordan



