
Re: [Gluster-devel] Blocking client when server is down


From: Martin Fick
Subject: Re: [Gluster-devel] Blocking client when server is down
Date: Wed, 31 Dec 2008 08:52:18 -0800 (PST)

--- On Wed, 12/31/08, Harald Stürzebecher <address@hidden> wrote:

> 2008/12/31 Martin Fick <address@hidden>:
> > --- On Tue, 12/30/08, Basavanagowda Kanur <address@hidden> wrote:
> >
> >> If server is down for transport-timeout time, then client
> >> returns all the calls with 'Transport Endpoint not connected'
> >> error.
> >
> > Yes, this is exactly what I do not want.  I want reads/writes to
> > simply block when the server is down and to complete (the blocked
> > calls) when the server returns.  I do not want my applications to
> > get an error, only a delay.  Without this it is not possible to
> > recover gracefully from a server/network failure.
> >
> > While we are at it, what is the timeout in: seconds or
> > milliseconds?
>
> http://www.gluster.org/docs/index.php/Translators_v1.4#client
> says:
> "# option transport-timeout 30            # seconds to wait for a response
>                                          # from server for each request"
>
> Setting that to 604800 should give you a week to fix the
> server. ;-) I hope it will try to reconnect sometimes to see if the
> server is up again.

Thanks, I missed that.  Unfortunately, it doesn't work the way you are
suggesting (that's why I was asking, to confirm that it was indeed seconds).
If you simply kill the server daemon, the connection fails immediately,
regardless of any long timeout you set.  I suppose that is because killing
the daemon tears down the TCP connection.  It appears that the glusterfs
protocol simply cannot resend requests; I suppose it expects TCP to do that
for it.  But if a server goes down after the request was received and ACKed
at the TCP level, but before it was serviced and answered at the glusterfs
protocol layer, I do not believe that glusterfs knows how to retransmit the
request.  I believe this is where the timeout comes into play, and I think
it is the root cause of why blocking is not currently implemented.
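
To make the "only a delay, not an error" behaviour concrete, here is a
minimal application-side sketch in Python.  It is not part of glusterfs;
the name call_blocking and its parameters are made up for illustration.
It assumes the failure reaches the application as errno ENOTCONN
('Transport endpoint is not connected') and simply retries until the call
succeeds.

```python
import errno
import time


def call_blocking(op, retry_delay=1.0, max_wait=None,
                  sleep=time.sleep, clock=time.monotonic):
    """Run op(), retrying for as long as it fails with ENOTCONN.

    Hypothetical helper: op is any zero-argument callable that performs
    file I/O on the mount.  max_wait=None means "wait forever".
    """
    start = clock()
    while True:
        try:
            return op()
        except OSError as exc:
            # Any error other than "not connected" is passed straight through.
            if exc.errno != errno.ENOTCONN:
                raise
            # Give up only if the caller asked for an upper bound.
            if max_wait is not None and clock() - start >= max_wait:
                raise
            sleep(retry_delay)
```

Note that this is only safe for operations that can be repeated without
harm.  It does nothing for a write that the server received but never
answered, which is exactly the case described above.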

This timeout is only useful when the connection still exists but the server
is not responding.  For example, if you suspend glusterfsd running in the
foreground with ^Z and then resume it with 'fg' in less than the timeout
value, the client will survive this.  I assume a downed network link would
behave the same way, as long as the link is not down long enough for the TCP
connection itself to time out.  So this timeout is useful only if you have a
heavily loaded server or network that cannot respond to you and you actually
want to time out.  And then what?  It is not useful for extending recovery.
I am not sure how timing out in this case really helps anything anyway,
except perhaps when using AFR or the HA translators.
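
The difference between the two cases can be reproduced with plain TCP
sockets, without glusterfs at all.  The sketch below is only an analogy:
probe and server_behavior are invented names, "killed" stands for a
daemon whose connection is torn down, and "stalled" stands for a daemon
suspended with ^Z.

```python
import socket
import threading


def probe(server_behavior, timeout=0.5):
    """Connect to a throwaway local server and report what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)
    release = threading.Event()

    def serve():
        conn, _ = listener.accept()
        if server_behavior != "killed":
            release.wait()               # connection stays open, no reply
        conn.close()                     # "killed": connection goes away at once

    worker = threading.Thread(target=serve)
    worker.start()
    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    try:
        client.sendall(b"request")
        data = client.recv(1024)
        result = "closed" if data == b"" else "reply"
    except socket.timeout:
        result = "timeout"               # only the client-side timer fired
    except ConnectionError:
        result = "closed"                # reset or broken pipe
    finally:
        release.set()
        worker.join()
        client.close()
        listener.close()
    return result
```

Here probe("killed") reports the closed connection without waiting for the
timeout, while probe("stalled") only returns once the client-side timeout
expires.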

More food for the wiki I suppose, :)

-Martin







