bug-glibc

Handling SIGSEGV/SIGBUS with glibc 2.3.2


From: Yair Lenga
Subject: Handling SIGSEGV/SIGBUS with glibc 2.3.2
Date: Thu, 24 Jun 2004 12:55:50 -0400
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

Hi,

I am working on porting a large program that runs on SGI/SunOS to RedHat Linux AS3.0, using glibc 2.3.2. My question is:

How can I implement a cleanup function in response to SIGSEGV, SIGBUS, and other signals for which the signal handler cannot simply flag the condition and return?

Details:

The server programs use a signal handler to perform various cleanup tasks, with a sequence like:

signal_handler (signo) {
   ...
   if ( serving_request) notify_remote_monitor(RPC_HAS_FAILED) ;
   system("record_crash") ;
   ...
}

notify_remote_monitor is an RPC call to a different server, notifying it that a failure has occurred. This is a MUST requirement.

The system worked correctly for any signal on AS2.1 (glibc-2.2.4), both for user-generated signals (SIGTERM ...) and for software failures (SIGSEGV, SIGBUS). After switching to Advanced Server 3.0 (glibc-2.3.2, using tls/libc.so), we found that in many cases the server hangs after getting a signal. Attaching GDB to the server showed that the signal (in this case SIGCHLD==17) arrived during "free". The handler's attempt to go into "vfork" then spins forever on a mutex that was left locked by the uncompleted free. The problem can be replicated for many signals, and in general the sequence is:

   * The program calls free
   * free locks the arena and calls _int_free to free the memory
   * A signal is received
   * The signal handler is invoked and tries to call
     free/malloc/vfork/... as part of the cleanup
   * The process tries to lock the arena again and waits forever.
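
The sequence above can be sketched without touching the real allocator. This is only an illustration under stated assumptions: arena_locked is a hypothetical stand-in for the arena mutex (it is not glibc's lock), fake_free stands in for free, and SIGUSR1 stands in for whatever signal arrives. It shows that the handler runs while the lock is still held, which is the point at which a real handler calling free/malloc would block.

```c
#include <signal.h>

/* Hypothetical stand-in for the malloc arena lock; NOT glibc's real lock. */
static volatile sig_atomic_t arena_locked = 0;
/* Records what the handler observed when it ran. */
static volatile sig_atomic_t handler_saw_lock = 0;

static void on_signal(int signo)
{
    (void)signo;
    /* A real handler calling free() here would wait forever on the lock. */
    handler_saw_lock = arena_locked;
}

/* Simulates free(): take the lock, get interrupted, release the lock. */
static void fake_free(void)
{
    arena_locked = 1;
    raise(SIGUSR1);      /* the signal arrives while the lock is held */
    arena_locked = 0;
}

/* Returns 1 if the handler ran while the simulated lock was held. */
int demo_interrupted_free(void)
{
    signal(SIGUSR1, on_signal);
    fake_free();
    return handler_saw_lock;
}
```

Calling demo_interrupted_free() from main reports whether the handler observed the lock as held.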

The documentation is very clear that a signal handler should do nothing but flag the error condition and return, with a check for the flag added to the normal program flow. I can implement this (with some effort) for SIGTERM, SIGALRM, and other signals after which processing can resume. But this approach does not work for SIGSEGV, SIGBUS, etc., where the signal handler cannot return.
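
For the signals that can resume, the flag-and-return pattern might look like the minimal sketch below. The names on_sigterm, install_term_flag_handler and shutdown_requested are hypothetical and do not come from the server code; the sketch assumes a POSIX system with sigaction. The handler only stores to a volatile sig_atomic_t, and the normal program flow polls the flag and does the cleanup outside signal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t got_sigterm = 0;

/* The handler does nothing except record that the signal arrived. */
static void on_sigterm(int signo)
{
    (void)signo;
    got_sigterm = 1;
}

/* Installs the handler for SIGTERM; returns 0 on success, -1 on error. */
int install_term_flag_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Checked from the normal program flow, e.g. once per request. */
int shutdown_requested(void)
{
    return got_sigterm != 0;
}
```

A main loop would call shutdown_requested() between requests and run the cleanup (RPC notification, system(), free) from there, where those calls are safe.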

I tried using setjmp and longjmp to resume processing after SIGSEGV, but that could not release the locked mutex.

I hope that other people have some experience and/or ideas on how to deal with SIGSEGV (and similar) signals.

Many thanks for any help,
Yair Lenga


(gdb) where
#0  0xb742c8dc in ptmalloc_lock_all () from /lib/tls/libc.so.6
#1  0xb7461796 in fork () from /lib/tls/libc.so.6
#2  0x0804ee17 in fork_process (c=0xbfff91d8 "/home/sb/book/sbyb/bin/mortserver", argv=0xbfff9128) at fork_process.c:15
#3  0x0804b0b6 in sched_fork_server (serverfile=0xbfff91d8 "/home/sb/book/sbyb/bin/mortserver", dblogin=0x80809bc "", mortdb=0x80809a8 "", port=4200, cpid=0) at yb_sched.c:376
#4  0x0804c18a in restart_server (tbl=0x8080f98, login=0x80809bc "", mortdb=0x80809a8 "", serverfile=0xbfff91d8 "/home/sb/book/sbyb/bin/mortserver", port=4200, cpid=0) at yb_sched.c:658
#5  0x0804c0fa in sig_chld (x=17) at yb_sched.c:642
#6  <signal handler called>
#7  0xb7429eca in _int_free () from /lib/tls/libc.so.6
#8  0xb7428e68 in free () from /lib/tls/libc.so.6
#9  0xb74bb8d8 in xdrrec_destroy () from /lib/tls/libc.so.6
#10 0xb74b8e53 in svctcp_destroy () from /lib/tls/libc.so.6
#11 0xb74b7d49 in svc_getreq_common_internal () from /lib/tls/libc.so.6
#12 0xb74b7b0f in svc_getreqset_internal () from /lib/tls/libc.so.6
#13 0x08053cd6 in yb_svc_run (lsock=3, str=0xbfffd4f4 "/home/sb/book/sbyb/bin/yb_scheduler") at yb_rpc_svc_lib.c:593
#14 0x080525d2 in enter_mainloop (lsock=3, cp=0xbfffd4f4 "/home/sb/book/sbyb/bin/yb_scheduler") at yb_rpc_svc_lib.c:250
#15 0x08058b45 in yb_server_main (argc=2, argv=0xbfff9df4) at yb_rpc_lmain.c:41
#16 0x0805726e in main (argc=2, argv=0xbfff9df4) at yb_rpc_main.cc:14

If someone is interested, attached is a small program to replicate the problem. It hangs for me on RedHat AS3.0; see the stack trace below.

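#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>   /* _exit */

static void catch_me(int signo) ;
static void boom(int signo) ;

/* q was not declared in the posted program; assumed here to be a global pointer */
static char *q = NULL ;

int main(void)
{
   signal(SIGALRM, boom) ;
   signal(SIGSEGV, catch_me) ;
   free("ab") ;   /* deliberately invalid pointer: intended to fault inside free() */
   return 0 ;
}

static void boom(int signo) {
   printf("bam\n") ;
   _exit(0) ;
}

/* Unsafe on purpose: printf/system/free are not async-signal-safe */
static void catch_me(int signo) {
   signal(signo, SIG_DFL) ;   /* restore default action so raise() below terminates */
   printf("bim\n") ;
   system("echo system catch me") ;
   free(q) ;
   printf("bom\n") ;
   raise(signo);
}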

The stack trace:
gdb -p 14911
(gdb) where
#0  0xb758d241 in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#1  0xb7518e64 in _L_mutex_lock_2507 () from /lib/tls/libc.so.6
#2  0xb74e1e84 in system () from /lib/tls/libc.so.6
#3  0x0804848b in catch_me ()
#4  <signal handler called>
#5  0xb7515e6f in _int_free () from /lib/tls/libc.so.6
#6  0xb7514e68 in free () from /lib/tls/libc.so.6
#7  0x08048443 in main ()




