Re: [Gluster-devel] AFR: machine crash hangs other mounts or transport e

gluster-devel


From:	Gerry Reno
Subject:	Re: [Gluster-devel] AFR: machine crash hangs other mounts or transport endpoint not connected
Date:	Wed, 14 May 2008 00:12:54 -0400
User-agent:	Thunderbird 1.5.0.12 (X11/20070530)

Krishna Srinivas wrote:

On Thu, May 8, 2008 at 9:19 PM, Gerry Reno <address@hidden> wrote:

Krishna Srinivas wrote:

Gerry,

In your client spec "client-local" does not have any purpose right?

This is your setup:
server1 and server2 have /home/vmail/mailbrick as storage exports.
on client you have an AFR which connects to server1 and server2.
client mounts it on /home/vmail/mailstore

Can you try mounting on command line instead of fstab?
When you kill one of the servers, can you see if you see anything
in the log files?

Also mention "option transport-timeout 5" in the two "client/protocol"
subvolumes. (so the timeout will be 5 secs)

Thanks
Krishna

 Two machines.
 Each machine has a server storage brick (/home/vmail/mailbrick)
 Each machine also has a client (/home/vmail/mailstore)
 If one of the machines either crashes or needs to be rebooted then it hangs
the client mount on the other machine.

 I'll umount the mount from fstab and remount from command line and let you
know.


 Regards,
 Gerry

Ok, I ran some tests:

First, when I started I noticed that a 'df' showed two client mounts on one machine and one client mount on the other. I unmounted the clients that had been mounted from fstab and then changed client.vol to include "option transport-timeout 5". Then I started the clients from the command line. I see one client mount on each machine. I kill one machine. The other machine still functions. Did this a couple of times.

Then I left the timeout in the vol file and just rebooted both machines. They both came back up, and df shows two client mounts on both machines. ps shows two client processes on both machines. I kill one machine again and the other machine still functions. So I was not able to recreate the hang.
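
For reference, the client.vol change described above might look roughly like the sketch below. It assumes GlusterFS 1.3-style volfile syntax with the two subvolumes declared as type protocol/client. The volume names client1, client2, and afr are the ones that appear in the logs; the remote-host addresses and the remote-subvolume name are placeholders and are not taken from this thread.

```
volume client1
  type protocol/client
  option transport-type tcp/client
  # placeholder address for server1
  option remote-host 192.168.0.1
  # placeholder name of the exported brick volume
  option remote-subvolume mailbrick
  # give up on an unresponsive server after 5 seconds
  option transport-timeout 5
end-volume

volume client2
  type protocol/client
  option transport-type tcp/client
  # placeholder address for server2
  option remote-host 192.168.0.2
  option remote-subvolume mailbrick
  option transport-timeout 5
end-volume

volume afr
  type cluster/afr
  subvolumes client1 client2
end-volume
```

The timeout is set on each protocol/client volume rather than on the afr volume, since it governs how long a single connection waits for its server.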

I checked the logs, and in both of them I can see thousands of lines like the following over the past weeks:

2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(22)
2008-04-26 00:27:55 E [client-protocol.c:3742:client_opendir_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=107
2008-04-26 00:27:55 E [afr_self_heal.c:290:afr_lds_opendir_cbk] afr: op_ret=-1 op_errno=24
2008-04-26 00:27:55 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 11084: (34) /example.com/john => -1 (5)
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)
2008-04-26 00:27:55 E [client-protocol.c:4405:client_lookup_cbk] client2: no proper reply from server, returning ENOTCONN
2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: non-blocking connect() returned: 111 (Connection refused)
2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: not connected at the moment to submit frame type(1) op(34)

2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client2) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [afr.c:2018:afr_open_cbk] afr: (path=/example.com/john/dovecot-uidlist.lock child=client1) op_ret=-1 op_errno=2
2008-04-25 19:47:47 E [fuse-bridge.c:692:fuse_fd_cbk] glusterfs-fuse: 5775: (12) /example.com/john/dovecot-uidlist.lock => -1 (2)

2008-04-25 13:09:02 W [fuse-bridge.c:402:fuse_entry_cbk] glusterfs-fuse: 3883: (34) /example.com/gerryreno/dovecot-keywords => 566935 Rehashing because st_nlink less than dentry maps
2008-04-25 13:09:02 E [fuse-bridge.c:1140:fuse_unlink] glusterfs-fuse: 3894: UNLINK /example.com/gerryreno/dovecot-uidlist (fuse_loc_fill() returned NULL inode)
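
With thousands of such lines, a quick tally by level, translator, and reporting function can show which messages dominate. The following is only a rough Python sketch: the regular expression is inferred from the excerpts above, not from any documented GlusterFS log format, and the helper name tally is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the excerpts above (an assumption, not a documented format):
#   DATE TIME LEVEL [file.c:line:function] translator: message
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>[^:]+): (?P<msg>.*)$"
)


def tally(lines):
    """Count log lines per (level, translator, function); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            counts[(m["level"], m["xlator"], m["func"])] += 1
    return counts


# Three lines copied from the excerpt above.
sample = [
    "2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: "
    "non-blocking connect() returned: 111 (Connection refused)",
    "2008-04-26 00:27:55 W [client-protocol.c:331:client_protocol_xfer] client2: "
    "not connected at the moment to submit frame type(1) op(22)",
    "2008-04-26 00:27:55 E [tcp-client.c:190:tcp_connect] client2: "
    "non-blocking connect() returned: 111 (Connection refused)",
]

# Print the most frequent (level, translator, function) combinations first.
for key, n in tally(sample).most_common():
    print(n, *key)
```

To run it over a real log, pass an open file object instead of the sample list, for example tally(open(path)) with path set to wherever the client log is written.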

Anyway, I wasn't able to reproduce the hang with transport-timeout set. I'm still trying to work out why there are two client mounts when mounting from fstab, though. That seems strange.


Regards,
Gerry
