help-cfengine

RE: cfagent core dump


From: Wheeler, John
Subject: RE: cfagent core dump
Date: Mon, 26 Jan 2004 09:30:40 -0600

Yep, sorry I didn't mention it before, but this trace was from a Solaris
8 machine.

SunOS 5.8 Generic_108528-19 sun4u sparc SUNW,Ultra-2

> -----Original Message-----
> From: Frank Ranner [mailto:franner@NOSPAM.webone.com.au]
> Sent: Sunday, January 25, 2004 7:22 PM
> To: help-cfengine@gnu.org
> Subject: Re: cfagent core dump
> 
> Let me guess - this comes from a Solaris system?
> 
> I'm having the same problems with cfservd. I have tracked it down to the
> fact that on Solaris cfengine uses threads, but access to the DB
> routines is not thread safe, i.e. no locking is done to prevent
> threads from stomping on each other. Worse, each of the calls to the DB
> routines collects the return code in errno, which is defined as a static
> int in some cases, instead of allowing errno.h and/or pthread.h to
> define it properly.
> 
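A minimal sketch of why errno should come from errno.h rather than a private declaration, assuming a platform where the header supplies a per-thread errno. The names probe_errno and errno_is_per_thread are hypothetical and not cfengine code. Two threads each write a different value to errno, wait until both writes have happened, and then read errno back. With a shared static int the two reads would agree; with the header's errno each thread sees its own value.

```c
#include <errno.h>      /* supplies errno; do not declare your own */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct probe {
    int set_to;     /* value this thread writes to errno */
    int read_back;  /* value this thread reads after both writes */
};

static atomic_int writers_done;

/* Thread body: write errno, wait for the other writer, read errno back.
 * No library call sits between the write and the read. */
static void *probe_errno(void *arg)
{
    struct probe *p = arg;

    errno = p->set_to;
    atomic_fetch_add(&writers_done, 1);
    while (atomic_load(&writers_done) < 2)
        ;   /* spin until both threads have written errno */
    p->read_back = errno;
    return NULL;
}

/* Returns 1 if each thread read back its own value, 0 if the values
 * were clobbered, and -1 if a thread could not be started. */
static int errno_is_per_thread(void)
{
    struct probe a = { EIO, 0 };
    struct probe b = { ENOENT, 0 };
    pthread_t ta, tb;

    atomic_store(&writers_done, 0);
    if (pthread_create(&ta, NULL, probe_errno, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, probe_errno, &b) != 0) {
        atomic_store(&writers_done, 2);   /* release the first thread */
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.read_back == EIO && b.read_back == ENOENT;
}
```

The spin loop is only there to force the two writes to overlap in time; it is not something to copy into production code.
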
> What is happening in your backtrace is that strerror(errno) is
> returning NULL, probably because the DB assignment returned a negative
> value indicating the put failed (because the database has been corrupted
> by uncontrolled access by multiple threads). Supplying a NULL pointer to
> printf as a %s item causes a seg fault in libc exactly as you have
> experienced.
> 
> I am actually trying to come up with a fix. As a temporary workaround I
> patched CfLog's printf argument to:
> 
> strerror(errno) == NULL ? "invalid errno" : strerror(errno)
> 
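A minimal sketch of the workaround above, assuming a hypothetical helper named safe_strerror (not a cfengine function). It calls strerror once and substitutes a fixed string if the result is NULL, so whatever reaches a %s conversion is always a valid pointer. format_log_line is likewise only a stand-in for the real logging call.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: never returns NULL, so the result is always
 * safe to pass to a printf "%s" conversion. */
static const char *safe_strerror(int errnum)
{
    const char *msg = strerror(errnum);

    return msg != NULL ? msg : "invalid errno";
}

/* Hypothetical stand-in for a logging call: writes "<what>: <reason>"
 * into buf and returns the snprintf result. */
static int format_log_line(char *buf, size_t size, const char *what, int errnum)
{
    return snprintf(buf, size, "%s: %s", what, safe_strerror(errnum));
}
```

A caller would copy errno into a local variable right after the failing call and pass that copy as errnum, so later library calls cannot overwrite it. Since strerror is not required to be thread safe, strerror_r may be a better fit in threaded code where it is available.
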
> A better fix would be to use the DB error routines, which translate
> syscall and database errors to strings.
> 
> The long-term fix is to use db->env to set up multi-reader/single-writer
> safe access to the DB, or to put a simple lock around every get/put.
> 
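A minimal sketch of the "simple lock around every get/put" idea, assuming POSIX threads. The struct store below is a toy in-memory table standing in for the real database, and locked_put and locked_get are hypothetical names. In real code the same mutex would wrap the actual DB calls instead.

```c
#include <pthread.h>
#include <string.h>

#define STORE_SLOTS 64
#define KEY_LEN     64

/* Toy stand-in for the lock database, guarded by one mutex. */
struct store {
    pthread_mutex_t lock;
    char keys[STORE_SLOTS][KEY_LEN];
    long values[STORE_SLOTS];
    int used;
};

static struct store g_store = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Insert or update key. Returns 0 on success, -1 if the key is too
 * long or the table is full. */
static int locked_put(struct store *s, const char *key, long value)
{
    int i, rc = -1;

    if (strlen(key) >= KEY_LEN)
        return -1;
    pthread_mutex_lock(&s->lock);
    for (i = 0; i < s->used; i++) {
        if (strcmp(s->keys[i], key) == 0)
            break;
    }
    if (i < s->used) {                  /* existing key: update */
        s->values[i] = value;
        rc = 0;
    } else if (s->used < STORE_SLOTS) { /* new key: append */
        strcpy(s->keys[s->used], key);
        s->values[s->used] = value;
        s->used++;
        rc = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Look up key. Returns 0 and fills *value if found, -1 otherwise. */
static int locked_get(struct store *s, const char *key, long *value)
{
    int i, rc = -1;

    pthread_mutex_lock(&s->lock);
    for (i = 0; i < s->used; i++) {
        if (strcmp(s->keys[i], key) == 0) {
            *value = s->values[i];
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}
```

A single mutex serializes readers as well as writers, which is simpler but coarser than the multi-reader/single-writer setup mentioned above. It also only protects threads within one process, not separate processes sharing the same database file.
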
> Access to the checksum database is very inefficient, as the database
> appears to be opened and closed for every access. It would be better if
> it were opened at the start of a connection and remained open for the
> life of the thread. cfservd does far too much work verifying checksums
> for large directory trees.
> 
> Regards,
> Frank Ranner
> 
> Wheeler, John wrote:
> > (gdb) backtrace
> > #0  0xff1331bc in strlen () from /usr/lib/libc.so.1
> > #1  0xff1861c8 in _doprnt () from /usr/lib/libc.so.1
> > #2  0xff187e04 in printf () from /usr/lib/libc.so.1
> > #3  0x0006d448 in CfLog (level=1570816, string=0xffbed520 "put failed",
> >     errstr=0xf3428 "db->put") at log.c:154
> > #4  0x00062600 in PutLock (
> >     name=0x1837f0 "last.cfagent_conf.100.web001prod.shellcommand.corporaterotate._usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM")
> >     at locks.c:497
> > #5  0x0006223c in GetLastLock () at locks.c:406
> > #6  0x00061b0c in GetLock (operator=0x12b2e0 "shellcommand.corporaterotate",
> >     operand=0x1292e0 "_usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM",
> >     ifelapsed=1, expireafter=120, host=0x182be0 "web001prod", now=1074889804)
> >     at locks.c:208
> > #7  0x00031624 in Scripts () at do.c:1155
> > #8  0x0002d1ac in DoTree (passes=1, info=0xe0660 "Main Tree") at cfagent.c:1276
> > #9  0x0002a858 in main (argc=1572312, argv=0xffbefcb4) at cfagent.c:187
> >
> > Not sure what the issue is. I removed the cfengine_lock_db and it solved
> > the problem. This machine did fill / (root partition), and since /var is
> > not mounted separately this may have contributed to the corruption.
> >
> >
> 
> _______________________________________________
> Help-cfengine mailing list
> Help-cfengine@gnu.org
> http://mail.gnu.org/mailman/listinfo/help-cfengine



