Re: SIGSEGV problem
From: Martin Pala
Subject: Re: SIGSEGV problem
Date: Fri, 15 Aug 2003 20:29:06 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030714 Debian/1.4-2

Jan-Henrik Haukeland wrote:
> Christian Hopp <address@hidden> writes:
>> On Thu, 14 Aug 2003, Jan-Henrik Haukeland wrote:
>>> I ran a quick test with efence and managed to reproduce the SIGSEGV
>>> (there may be more). The SIGSEGV is thrown in
>>> process/common.c:connectchild() from this line:
>>>
>>>   parent->children[parent->children_num - 1] = (struct myprocesstree *) child;
>>>
>>> From my gdb/efence session:
>>>
>>>   Program received signal SIGSEGV, Segmentation fault.
>>>   [Switching to Thread 1024 (LWP 1269)]
>>>   0x0805b340 in connectchild (parent=0x41143fa0, child=0x41144740)
>>>   at process/common.c:232
>>>   (gdb) p *parent->children
>>>   Cannot access memory at address 0x41365fcc
>>>   (gdb) p parent->children[parent->children_num - 1]
>>>   Cannot access memory at address 0x41365ffc
>>>
>>> I suspect it's caused by trying to access something outside the
>>> array. Maybe Christian can debug this since it's his code :) I'm off
>>> to bed, it's late.
>>
>> Strange... I just had a look at the code... and it is IMHO impossible
>> to access memory which is not allocated at this position!
>> I do an xcalloc of parent->children_num pointers, so it has to be
>> possible to access the last one (parent->children_num - 1)... or? Or
>> is it being deleted while this happens... some kind of race condition?
>
> I think it must have been a race condition of some sort. The strange
> thing is that I cannot reproduce the problem after I added the signal
> block code. Maybe that was it and it is fixed!? Do any of you get any
> more SIGSEGVs now? Martin?

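Christian's argument (after allocating children_num pointers, index children_num - 1 must be valid) holds only while the count and the allocation stay in step. The following is a minimal sketch of that invariant, not monit's actual connectchild(): struct node and connect_child() are hypothetical names, and realloc stands in for the xcalloc mentioned above. It only commits the new count after the array has really grown, and it says nothing about concurrent access, which would need locking on top.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for monit's process tree node; only the two
 * fields mentioned in the thread are modelled. */
struct node {
  struct node **children;
  int children_num;
};

/* Append child to parent->children. Returns 0 on success, -1 if the
 * array could not be grown (parent is left unchanged in that case). */
int connect_child(struct node *parent, struct node *child) {
  assert(parent && child);

  size_t new_num = (size_t)parent->children_num + 1;
  struct node **grown = realloc(parent->children, new_num * sizeof *grown);
  if (grown == NULL)
    return -1;

  grown[new_num - 1] = child;           /* last valid index is new_num - 1 */
  parent->children = grown;
  parent->children_num = (int)new_num;  /* count updated only after success */
  return 0;
}
```

If a second thread could free or resize parent->children between the allocation and the store, the same line would still fault, which is consistent with the race suspected above.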
I did another test; the problem remains. There were 152 problem occurrences
in 3848 attempts => error ratio 5.34%
I'm running Debian unstable (sid) with glibc-2.3.2 and gcc-3.3.1
My configuration:
---8<---
set daemon 5
set logfile syslog
set mailserver ms2.dkm.cz
set mail-format { from: address@hidden }
set httpd port 2812 and allow 127.0.0.1 use address 127.0.0.1
check slapd with pidfile /var/run/slapd.pid
  start program = "/etc/init.d/slapd start"
  stop program = "/etc/init.d/slapd stop"
  if failed host 127.0.0.1 port 389 protocol ldap3 then restart
  if cpu usage > 2% for 5 cycles then restart
  group database
  if 2 restarts within 2 cycles then timeout
  mode active
---8<---
To reproduce the problem it is sufficient to:
1.) stop slapd
2.) change the start action of the /etc/init.d/slapd startup script so
that it is not able to start slapd successfully
3.) while true; do strace -f -o monit.strace.`date +%Y%m%d%S%N` ./monit \
    -vc /etc/monitrc validate > monit.out.`date +%Y%m%d%S%N` 2>&1; done
You will quickly obtain a few occurrences of the problem.
As I wrote, it fails in wait_start (see the attached strace):
24065 stat64("XE^^G^H/run/slapd.pid", <unfinished ...>
24063 close(3 <unfinished ...>
24064 read(4, <unfinished ...>
24065 <... stat64 resumed> 0xbf7ff93c) = -1 ENOENT (No such file or directory)
24063 <... close resumed> ) = -1 EBADF (Bad file descriptor)
24064 <... read resumed> "address@hidden,@address@hidden"..., 148) = 148
24065 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
24065 ... wait_start thread
As you can see, instead of /var/run/slapd.pid, the string referenced by
s->path is garbled (in about 5% of cases; see the error ratio above).
Under normal conditions (95% of cases) s->path references correct data.
With the above setup, every time the error occurs it fails in exactly the
same place.
As I wrote, when I inserted some primitive fprintf-based markers around
the critical code, the problem did not occur at all. I think this is
because these calls slowed monit down, so the (possible) race condition
was not triggered.
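One way such a garbled s->path could arise (an assumption, not a diagnosis of monit) is that the data the wait_start thread reads is released or overwritten by another thread while it is still running. The sketch below shows the usual defence: hand the thread a private copy and join it before freeing anything. struct wait_args, wait_worker() and run_with_private_copy() are hypothetical names; only the pthread and libc calls are real.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical argument block; not monit's Service_T. */
struct wait_args {
  char  *path;      /* private copy, owned by this block */
  size_t seen_len;  /* filled in by the worker */
};

/* Worker: touches only its private copy of the path. */
static void *wait_worker(void *p) {
  struct wait_args *args = p;
  args->seen_len = strlen(args->path);
  return NULL;
}

/* Start a worker on a private copy of shared_path, clobber the shared
 * buffer while the worker may still be running, then join before
 * freeing the copy. Returns the length the worker saw, or -1 on error. */
long run_with_private_copy(char *shared_path) {
  struct wait_args args = { NULL, 0 };
  pthread_t tid;

  assert(shared_path);
  args.path = malloc(strlen(shared_path) + 1);
  if (args.path == NULL)
    return -1;
  strcpy(args.path, shared_path);

  if (pthread_create(&tid, NULL, wait_worker, &args) != 0) {
    free(args.path);
    return -1;
  }

  shared_path[0] = '\0';     /* simulate the owner invalidating its data */

  pthread_join(tid, NULL);   /* never free while the worker is alive */
  free(args.path);
  return (long)args.seen_len;
}
```

Because the worker never dereferences the shared buffer, the result does not depend on timing, unlike the fprintf experiment described above.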
It is possible to provoke another kind of SIGSEGV when you:
1.) stop slapd
2.) change the start action of the /etc/init.d/slapd startup script so
that it is not able to start slapd successfully
3.) echo 7777 > /var/run/slapd.pid #some non-existent pid
4.) while true; do strace -f -o monit.strace.`date +%Y%m%d%S%N` ./monit \
    -vc /etc/monitrc validate > monit.out.`date +%Y%m%d%S%N` 2>&1; done
The result is similar but the place is different (though again the same
every time); see the strace output I've sent.
I tried to run gdb on core:
unicorn:~/cvs/monit# gdb ./monit core.24065
GNU gdb 5.3.90_2003-08-01-cvs-debian
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux"...
Core was generated by `./monit -vc /etc/monitrc validate'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.7...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.7
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libnss_compat.so.2...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnss_nis.so.2...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
#0 0x08053b48 in is_process_running (s=0x807cc58) at util.c:880
880 memset(s->procinfo, 0, sizeof *(s->procinfo));
(gdb) bt
#0 0x08053b48 in is_process_running (s=0x807cc58) at util.c:880
#1 0x0804ba8d in wait_start (service=0x807cc58) at control.c:415
#2 0x400258be in pthread_start_thread () from /lib/libpthread.so.0
#3 0x4027b217 in clone () from /lib/libc.so.6
From util.c:
int is_process_running(Service_T s) {
  pid_t pid;

  ASSERT(s);

  errno= 0;
  if((pid= get_pid(s->path))) {
    if(( getpgid(pid) > 0 ) || ( errno == EPERM ))
      return pid;
  }
  memset(s->procinfo, 0, sizeof *(s->procinfo));
  return FALSE;
}
=> It seems we need to take care of the non-thread-safe memset in
is_process_running, which resets procinfo on every call (perhaps move it
somewhere else?)
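For illustration, here is a minimal sketch of one way to make such a reset safe, assuming the structure is shared between threads and may not be allocated yet. The types are reduced, hypothetical versions (monit's real Service_T has no lock member as far as this thread shows), and reset_procinfo() is an invented helper.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical, reduced versions of the types involved. */
struct procinfo {
  int pid;
  int cpu_percent;
};

struct service {
  struct procinfo *procinfo;  /* may be NULL */
  pthread_mutex_t  lock;      /* assumed: guards procinfo */
};

/* Zero s->procinfo while holding the lock. Returns 1 if the structure
 * was reset, 0 if there was nothing to reset. */
int reset_procinfo(struct service *s) {
  int done = 0;

  assert(s);
  pthread_mutex_lock(&s->lock);
  if (s->procinfo != NULL) {
    memset(s->procinfo, 0, sizeof *s->procinfo);
    done = 1;
  }
  pthread_mutex_unlock(&s->lock);
  return done;
}
```

Every other reader and writer of procinfo would have to take the same lock for this to help; moving the reset out of is_process_running, as suggested above, is the alternative.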
Martin
monit.strace.2003081513446770000.gz
Description: application/gzip