Re: [Gluster-devel] io recovering after failure

From:	Mickey Mazarick
Subject:	Re: [Gluster-devel] io recovering after failure
Date:	Fri, 30 Nov 2007 08:03:31 -0500
User-agent:	Thunderbird 2.0.0.9 (Windows/20071031)

AFR is being handled on the client... I simplified the specs down to look exactly like the online example, and I'm still seeing the same result. This is an InfiniBand setup, so that may be the problem. We want to run this on a 6-brick, 100+ client cluster over InfiniBand.

Whenever I kill the gluster daemon on RTPST201 it hangs and the client log says:

2007-11-30 07:55:14 E [unify.c:145:unify_buf_cbk] bricks: afrns returned 107
2007-11-30 07:55:14 E [unify.c:145:unify_buf_cbk] bricks: afrns returned 107
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mthca0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaad801000, wc.byte_len = 0, post->reused = 210
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mthca0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaac2bf000, wc.byte_len = 0, post->reused = 168
2007-11-30 07:55:34 E [ib-verbs.c:951:ib_verbs_recv_completion_proc] transport/ib-verbs: ibv_get_cq_event failed, terminating recv thread
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc] transport/ib-verbs: send work request on `mthca0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaabfb9000, wc.byte_len = 0, post->reused = 230
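For anyone reading the log above, here is a small Python sketch that maps the two numeric codes to names. It assumes a Linux client, where errno 107 is ENOTCONN, and it assumes wc.status follows libibverbs' enum ibv_wc_status, where 12 is IBV_WC_RETRY_EXC_ERR. The decode() helper and both lookup tables are made up for illustration; they are not part of GlusterFS.

```python
import re

# Assumption: Linux errno numbering, where 107 is ENOTCONN.
LINUX_ERRNO = {107: "ENOTCONN (transport endpoint is not connected)"}

# Assumption: a subset of libibverbs' enum ibv_wc_status, keyed by value.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def decode(line):
    """Return a readable name for the code in one client log line, or None."""
    # unify lines end in "... returned <errno>"
    m = re.search(r"returned (\d+)\s*$", line)
    if m:
        code = int(m.group(1))
        return LINUX_ERRNO.get(code, f"errno {code}")
    # ib-verbs lines carry "wc.status = <n>"
    m = re.search(r"wc\.status = (\d+)", line)
    if m:
        status = int(m.group(1))
        return IBV_WC_STATUS.get(status, f"wc.status {status}")
    return None


if __name__ == "__main__":
    print(decode("E [unify.c:145:unify_buf_cbk] bricks: afrns returned 107"))
    print(decode("send work request on `mthca0' returned error wc.status = 12"))
```

If that mapping is right, status 12 means the adapter exhausted its transport retries because the peer stopped acknowledging, which is what I would expect once the daemon on RTPST201 is killed. The 107 from afrns would then be the namespace volume reporting the lost connection.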



Storage Bricks are:
RTPST201,RTPST202

########################Storage Brick vol spec:
volume afrmirror
 type storage/posix
 option directory /mnt/gluster/afrmirror
end-volume

volume afrns
 type storage/posix
 option directory /mnt/gluster/afrns
end-volume

volume afr
 type storage/posix
 option directory /mnt/gluster/afr
end-volume

volume server
 type protocol/server
 option transport-type ib-verbs/server # For ib-verbs transport
 option ib-verbs-work-request-send-size  131072
 option ib-verbs-work-request-send-count 64
 option ib-verbs-work-request-recv-size  131072
 option ib-verbs-work-request-recv-count 64
 ##auth##
 option auth.ip.afrmirror.allow *
 option auth.ip.afrns.allow *
 option auth.ip.afr.allow *
 option auth.ip.main.allow *
 option auth.ip.main-ns.allow *
end-volume

#####################Client spec is:
volume afrvol1
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST201
 option remote-subvolume afr
end-volume

volume afrmirror1
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST201
 option remote-subvolume afrmirror
end-volume

volume afrvol2
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST202
 option remote-subvolume afr
end-volume

volume afrmirror2
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST202
 option remote-subvolume afrmirror
end-volume

volume afr1
 type cluster/afr
 subvolumes afrvol1 afrmirror2
end-volume

volume afr2
 type cluster/afr
 subvolumes afrvol2 afrmirror1
end-volume


volume afrns1
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST201
 option remote-subvolume afrns
end-volume
volume afrns2
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST202
 option remote-subvolume afrns
end-volume

volume afrns
 type cluster/afr
 subvolumes afrns1 afrns2
end-volume

volume bricks
 type cluster/unify
 option namespace afrns
 subvolumes afr1 afr2
 option scheduler alu   # use the ALU scheduler
 option alu.order open-files-usage:disk-usage:read-usage:write-usage
end-volume
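To double-check that each AFR pair in the client spec above really spans both bricks, here is a rough Python sketch that parses the volume blocks and resolves each cluster/afr volume to the hosts behind it. parse_spec() and afr_hosts() are hypothetical helper names, not GlusterFS tools, and the parser only understands the handful of keywords used in this spec. Only the data volumes are embedded; the namespace volumes follow the same pattern.

```python
SPEC = """
volume afrvol1
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST201
 option remote-subvolume afr
end-volume
volume afrmirror1
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST201
 option remote-subvolume afrmirror
end-volume
volume afrvol2
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST202
 option remote-subvolume afr
end-volume
volume afrmirror2
 type protocol/client
 option transport-type ib-verbs/client
 option remote-host RTPST202
 option remote-subvolume afrmirror
end-volume
volume afr1
 type cluster/afr
 subvolumes afrvol1 afrmirror2
end-volume
volume afr2
 type cluster/afr
 subvolumes afrvol2 afrmirror1
end-volume
"""


def parse_spec(text):
    """Parse 'volume ... end-volume' blocks into a dict keyed by volume name."""
    volumes, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        words = line.split()
        if words[0] == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "end-volume":
            current = None
        elif current is None:
            continue  # ignore anything outside a volume block
        elif words[0] == "type":
            current["type"] = words[1]
        elif words[0] == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif words[0] == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


def afr_hosts(volumes):
    """Map each cluster/afr volume to its (remote-host, remote-subvolume) pairs."""
    result = {}
    for name, vol in volumes.items():
        if vol["type"] != "cluster/afr":
            continue
        result[name] = [
            (volumes[sub]["options"]["remote-host"],
             volumes[sub]["options"]["remote-subvolume"])
            for sub in vol["subvolumes"]
        ]
    return result


if __name__ == "__main__":
    for name, pairs in afr_hosts(parse_spec(SPEC)).items():
        print(name, pairs)
```

With this spec, afr1 resolves to afr on RTPST201 plus afrmirror on RTPST202, and afr2 to afr on RTPST202 plus afrmirror on RTPST201. So killing the daemon on RTPST201 takes away one side of every AFR pair, including the namespace, but never both sides.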


Krishna Srinivas wrote:

If you have AFR on the server side and that server goes down, then
all the FDs associated with files on that server will return an ENOTCONN
error. (Is that how your setup is?) If you had AFR on the client
side, it would have worked seamlessly. However, this situation will be
handled when we bring out the HA translator.

Krishna

On Nov 30, 2007 3:01 AM, Mickey Mazarick <address@hidden> wrote:

Is this true for files that are currently open? For example, I have a
virtual machine running that has a file open at all times. Errors are
bubbling back to the application layer instead of just waiting. After
that I have to unmount/remount the gluster vol. Is there a way of
preventing this?

(This is the latest tla btw)
Thanks!


Anand Avati wrote:

This is possible already; the only limitation is that files on the node
that is down will not be accessible while the server is down. When the
server is brought back up, the files are made accessible again.

avati

2007/11/30, Mickey Mazarick <address@hidden>:

    Is there currently a way to force a client connection to retry dist io
    until a failed resource comes back online?
    If a disk in a unified volume drops I have to remount on all the
    clients. Is there a way around this?

    I'm using afr/unify on 6 storage bricks and I want to be able to
    change a server config setting and restart the server bricks one at
    a time without losing the mount point on the clients. Is this
    currently possible without doing ip failover?
    --
    _______________________________________________
    Gluster-devel mailing list
    address@hidden

http://lists.nongnu.org/mailman/listinfo/gluster-devel





--
It always takes longer than you expect, even when you take into
account Hofstadter's Law.

-- Hofstadter's Law

