
[Gluster-devel] Re: [Gluster-users] 2.0.6


From: Anand Avati
Subject: [Gluster-devel] Re: [Gluster-users] 2.0.6
Date: Fri, 21 Aug 2009 17:42:21 -0500 (CDT)

Stephan,
   Please find replies below. I am merging the thread back onto the ML.

> > Stephan, we need some more info. I think we are a lot closer to
> > diagnosing this issue now. The hang is caused by an io-thread that is
> > stuck, either in a deadlock inside the glusterfsd process code or
> > blocked on disk access for an excessively long time. The following
> > details will be _extremely_ useful for us.
> > 
> > 1. What is your backend FS, kernel version and distro running on
> > server2? Is the backend FS on a local disk or some kind of SAN or
> > iSCSI?
> 
> The backend FS is reiserfs3, kernel version 2.6.30.5, distro openSuSE
> 11.1. The backend FS resides on a local Areca RAID system. See the
> attached output from my earlier email.
> 
> > 2. Was the glusterfsd on server2 taking 100% cpu at the time of the
> > hang?
> 
> I can only try to remember that from the time I took the strace logs.
> I am not 100% sure, but from typing and looking around I would say the
> load was very low, probably next to zero.
> 
> > 3. On server2, now that you have killed it with -11, you should have
> > a core file in /. Can you get the backtrace from all the threads?
> > Please use the following commands -
> > 
> > sh# gdb /usr/sbin/glusterfsd -c /core.X
> > 
> > and then at the gdb prompt
> > 
> > (gdb) thread apply all bt
> > 
> > This should output the backtraces from all the threads.
> 
> The bad news is this: we were not able to shut down the box normally
> because the local (exported) fs hung completely. So shutdown did not
> work, and we had to hard-reset it. When examining the box a few minutes
> ago we found that all logs (and most likely the core dump) were lost.
> I have seen this kind of behaviour before; it originates from reiserfs3
> and is not really unusual. This means we will redo the test and hope we
> can force the problem again. Then we will copy all possible logs, dmesg
> output and cores off the server before rebooting it. I am very sorry we
> lost the important part of the information...

Stephan,
   This clearly points to the root cause of the bonnie hangs you have been 
facing on every release of 2.0.x: the hanging reiserfs export on your server. 
When the backend FS misbehaves like this, what you see is the expected 
behavior of GlusterFS. Not only will you see this in all versions of 
GlusterFS, you will face the same hangs with NFS, or even running bonnie 
directly on your backend FS. All the IO calls get queued and blocked in the 
IO thread that touches the disk, while the main FS thread stays up and keeps 
responding to ping-pong requests, thus keeping the server "alive". All of us 
on this ML could have spent far fewer cycles if the initial description of 
the problem had mentioned that the backend reiserfs3 on one of the servers 
was already known to freeze in this environment. When someone reports a hang 
on the glusterfs mountpoint, the first thing we developers do is try to find 
code paths for what we call "missing frames" (technically a syscall leak, 
somewhat like a memory leak), and this is very demanding and time-consuming 
debugging for us. Any information you can provide will only help us debug 
the issue faster.
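
A quick way to check whether the backend export itself is blocking, 
independent of GlusterFS, is to run a simple write directly on it and watch 
the process state (the path below is only illustrative; use your actual 
export directory, or run bonnie there directly as suggested above):

sh# dd if=/dev/zero of=/data/export/ddtest bs=1M count=100 oflag=direct &
sh# sleep 30; ps -o pid,stat,wchan:32,cmd -p $!

If the dd process stays in "D" (uninterruptible sleep) state, the backend 
filesystem or disk is blocking on its own, and GlusterFS running on top of 
it can only hang along with it.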


All,
   The reason I merged this thread back into the ML is that we want to ask 
anybody reporting issues to give as much information as possible up front. 
In the interest of all of us, the developers and, more importantly, the 
community getting quicker releases, good bug reports are the best thing you 
can offer us. Please describe the FS configuration, environment, application 
and steps to reproduce the issue, with versions, configs and logs of every 
relevant component. And if you can report all this directly on our bug 
tracking site (http://bugs.gluster.com), keeping the MLs for discussion as 
much as possible, that would be the best you can do for us.
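
As a rough example of what to gather before filing (the config and log paths 
below are just common defaults and may differ on your installation):

sh# uname -a                                # kernel and distro
sh# glusterfs --version                     # GlusterFS release
sh# cat /etc/glusterfs/glusterfsd.vol       # server volume spec
sh# cat /etc/glusterfs/glusterfs.vol        # client volume spec
sh# tail -n 200 /var/log/glusterfs/*.log    # recent GlusterFS logs
sh# dmesg | tail -n 100                     # kernel messages from the backend

together with the exact application and the steps that trigger the problem.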

Thank you for all the support!

Avati



