bug-glibc

Problem with pthread_mutex hangs on 16-way Linux SMP


From: Mark Saake
Subject: Problem with pthread_mutex hangs on 16-way Linux SMP
Date: Mon, 07 Oct 2002 15:33:07 -0700

Sorry if this is an inappropriate area to post this.

We are running into a problem with pthread_mutexes in a multi-threaded
application running on RedHat (Linux 2.4.17) on a 16-way SMP IA64
machine. We have seen this with every version of glibc we have tried
(2.2.3, 2.2.4, and 2.2.5).

The symptom of the problem is that a thread waiting on a mutex never
acquires it, even after the mutex has been released. The thread stays in
sigsuspend(). The mutex being waited on has its __m_lock.__status field
set to 0, which means there should be no one left waiting on the mutex
that has not been woken up.

Let me walk through a gdb session where we have a hang, and through some
of the linuxthreads source code. If anyone has any ideas, we are all
ears.

Debugging a hung program, we see a thread that looks like the following:

(gdb) thread 3
[Switching to thread 3 (Thread 2051 (LWP 13306))]#0  0x20000000002afa22
in rt_sigsuspend () at soinit.c:56
56      soinit.c: No such file or directory.
        in soinit.c
(gdb) where
#0  0x20000000002afa22 in rt_sigsuspend () at soinit.c:56
#1  0x200000000016c100 in __sigsuspend (set=0x80000fffff5ff860)
    at ../sysdeps/unix/sysv/linux/ia64/sigsuspend.c:38
#2  0x200000000009bb80 in __pthread_wait_for_restart_signal (
    self=0x20000000000c1bc0) at pthread.c:942
#3  0x200000000009f420 in __pthread_alt_lock (lock=0x60000000001252f8,
    self=0x80000fffff5ffa60) at restart.h:34
#4  0x2000000000098bc0 in __pthread_mutex_lock
(mutex=0x60000000001252e0)
    at mutex.c:120
#5  0x40000000000d7310 in svc_run () at svc_run.c:72
#6  0x2000000000096f20 in pthread_start_thread (arg=0x80000fffff5ffa60)
    at manager.c:274
#7  0x20000000002af7c0 in __clone2 () at soinit.c:56
#8  0x200000000016c100 in __sigsuspend (set=0x2000000000035218)
    at ../sysdeps/unix/sysv/linux/ia64/sigsuspend.c:38


(Please note: svc_run() is not the glibc version, but our own version,
which can handle multi-threading.)

We are stuck in __pthread_mutex_lock() (frame 4) on a mutex.
__pthread_mutex_lock() (mutex.c in linuxthreads) basically just calls
__pthread_alt_lock() with the __m_lock field of the mutex.
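
For reference, that path looks roughly like this (paraphrased from the
mutex.c we are reading, not verbatim):

/* __pthread_mutex_lock(), default (timed) mutex kind, simplified: */
case PTHREAD_MUTEX_TIMED_NP:
  __pthread_alt_lock(&mutex->__m_lock, NULL);
  return 0;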

__pthread_alt_lock() (spinlock.c in linuxthreads) is compiled with the
HAS_COMPARE_AND_SWAP macro defined. It checks whether lock->__status is
0; if it is, the lock is not being held. It sets newstatus to 1, does a
compare_and_swap(), and if that succeeds, it returns.
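
Paraphrasing that fast path (simplified; __compare_and_swap() here
stands for the linuxthreads internal, in whatever form the port uses):

/* __pthread_alt_lock() fast path, simplified: */
long oldstatus = lock->__status;
if (oldstatus == 0) {
  /* 0 = free, 1 = held with no queued waiters */
  if (__compare_and_swap(&lock->__status, 0, 1))
    return;  /* acquired uncontended */
}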

If the lock is already held (i.e. lock->__status != 0),
__pthread_alt_lock() creates a wait_node, filling in the wait_node.thr
field with the value returned by thread_self(). It then sets newstatus
to the address of this wait_node, in effect putting us at the head of
the linked list of threads waiting for this mutex. It does a
compare_and_swap() to actually publish the wait_node in the list, then
calls suspend(), which is just inline code from restart.h that calls
__pthread_wait_for_restart_signal().
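
In code, again paraphrased and simplified (wait_node, thread_self(),
suspend(), and __compare_and_swap() are all linuxthreads internals):

struct wait_node wait_node;
long oldstatus, newstatus;

do {
  oldstatus = lock->__status;
  if (oldstatus == 0) {
    newstatus = 1;                        /* lock was freed meanwhile */
  } else {
    wait_node.abandoned = 0;
    wait_node.next = (struct wait_node *) oldstatus;  /* old list head */
    wait_node.thr = thread_self();
    newstatus = (long) &wait_node;        /* we become the new head */
  }
} while (!__compare_and_swap(&lock->__status, oldstatus, newstatus));

if (oldstatus != 0)
  suspend(thread_self());  /* restart.h: waits for the restart signal */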

__pthread_wait_for_restart_signal() basically takes the current signal
mask, removes the restart signal from the blocked set, and calls
sigsuspend() in a loop until the restart signal arrives (it checks the
self->p_signal value).
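
That routine is essentially the following (pthread.c, paraphrased;
THREAD_SETMEM/THREAD_GETMEM are the thread-descriptor accessors):

void __pthread_wait_for_restart_signal(pthread_descr self)
{
  sigset_t mask;

  sigprocmask(SIG_SETMASK, NULL, &mask);    /* fetch the current mask */
  sigdelset(&mask, __pthread_sig_restart);  /* unblock the restart signal */
  THREAD_SETMEM(self, p_signal, 0);
  do {
    sigsuspend(&mask);                      /* sleep until a signal lands */
  } while (THREAD_GETMEM(self, p_signal) != __pthread_sig_restart);
}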

So, based on the above code walkthrough, I decided to look at the lock
structure. On a good lock (i.e. one that is not hung), we see something
like the following:

(gdb) p/x *lock
$11 = {__status = 0x2000000000ad3940, __spinlock = 0}

which looks good: the __status value is set to the thread_self() value
of one of the threads.

For this thread, however, the lock structure looks like the following:

(gdb) p/x *lock
$12 = {__status = 0, __spinlock = 0}

Now, this should never happen. The only way lock->__status can get set
back to 0 is in __pthread_alt_unlock() (in spinlock.c), and then only if
lock->__status is already 0 or 1, meaning we are the only ones with any
interest in the lock. If anyone else were on the lock->__status list,
the unlock would find the waiter with the highest priority, dequeue it,
and then wake it up.
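
For clarity, that unlock logic is roughly the following (spinlock.c,
heavily simplified; dequeue_max_prio_waiter() is just a placeholder for
the list walk, not a real function, and the real code also has to cope
with abandoned wait_nodes):

/* __pthread_alt_unlock(), simplified: */
while (1) {
  long oldstatus = lock->__status;
  if (oldstatus == 0 || oldstatus == 1) {
    /* No wait_nodes queued: this is the only path that stores 0. */
    if (__compare_and_swap(&lock->__status, oldstatus, 0))
      return;
  } else {
    /* Walk the wait_node list hanging off __status, unlink the
       highest-priority waiter, and send it the restart signal: */
    pthread_descr thr = dequeue_max_prio_waiter(lock);  /* placeholder */
    restart(thr);  /* should pop it out of sigsuspend() */
    return;
  }
}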

Does anyone out there have any ideas?



