RE: cfagent core dump
From: Wheeler, John
Subject: RE: cfagent core dump
Date: Mon, 26 Jan 2004 09:30:40 -0600
Yep, sorry I didn't mention it before, but this trace was from a Solaris
8 machine.
SunOS 5.8 Generic_108528-19 sun4u sparc SUNW,Ultra-2
> -----Original Message-----
> From: Frank Ranner [mailto:franner@NOSPAM.webone.com.au]
> Sent: Sunday, January 25, 2004 7:22 PM
> To: help-cfengine@gnu.org
> Subject: Re: cfagent core dump
>
> Let me guess - this comes from a Solaris system?
>
> I'm having the same problems with cfservd. I have tracked it down to the
> fact that on Solaris cfengine is using threads, but access to the DB
> routines is not thread safe, i.e. there is no locking done to prevent
> threads from stomping on each other. Worse, each of the calls to the DB
> routines collects the return code in errno, which is defined as a static
> int in some cases, instead of allowing errno.h and/or pthread.h to
> define it properly.
>
> What is happening in your backtrace is that strerror(errno) is
> returning NULL, probably because the DB assignment returned a negative
> value indicating the put failed (because the database has been corrupted
> by uncontrolled access by multiple threads). Supplying a NULL pointer to
> printf as a %s item causes a seg fault in libc exactly as you have
> experienced.
>
> I am actually trying to come up with a fix. As a temporary workaround I
> patched the printf argument in CfLog to:
>
> strerror(errno) == NULL ? "invalid errno" : strerror(errno)
>
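For anyone wanting to try the workaround quoted above, here is a minimal sketch of the same NULL guard wrapped in a helper. The names safe_strerror and format_log are hypothetical and are not cfengine functions; format_log only stands in for the logging call so the guard can be exercised without printing.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of cfengine): never returns NULL, so the
 * result is always safe to hand to a printf "%s" conversion. */
static const char *safe_strerror(int errnum)
{
    const char *msg = strerror(errnum);
    return msg != NULL ? msg : "invalid errno";
}

/* Hypothetical stand-in for the logging call: formats "what: reason"
 * into buf and returns the snprintf result. */
static int format_log(char *buf, size_t len, const char *what, int errnum)
{
    return snprintf(buf, len, "%s: %s", what, safe_strerror(errnum));
}
```

Calling strerror once and testing the saved pointer also avoids evaluating strerror(errno) twice, as the one-line ternary does.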
> A better fix would be to use the db error routines, which translate
> syscall and database errors to strings.
>
> The long term fix is to use db->env to set up multi-reader/single-writer
> safe access to the DB. Or to put a simple lock around every get/put.
>
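The "simple lock around every get/put" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the fixed in-memory table is a hypothetical stand-in for the real database, and locked_put and locked_get are made-up names, not cfengine or Berkeley DB calls. The point is just that every access goes through one pthread mutex.

```c
#include <pthread.h>
#include <string.h>

#define SLOTS 8
#define VAL_LEN 64

/* Hypothetical stand-in for the on-disk database: a tiny fixed table. */
static char table_keys[SLOTS][VAL_LEN];
static char table_vals[SLOTS][VAL_LEN];

/* One mutex serialises every get and put. */
static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Store key -> value. Returns 0 on success, -1 if the table is full or
 * an argument is empty or too long. */
static int locked_put(const char *key, const char *value)
{
    int rc = -1;
    if (key[0] == '\0' || strlen(key) >= VAL_LEN || strlen(value) >= VAL_LEN)
        return -1;
    pthread_mutex_lock(&db_mutex);
    for (int i = 0; i < SLOTS; i++) {
        /* Reuse the slot holding this key, or take the first free one. */
        if (table_keys[i][0] == '\0' || strcmp(table_keys[i], key) == 0) {
            strcpy(table_keys[i], key);
            strcpy(table_vals[i], value);
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&db_mutex);
    return rc;
}

/* Copy the value for key into out. Returns 0 on success, -1 if missing. */
static int locked_get(const char *key, char *out, size_t outlen)
{
    int rc = -1;
    if (outlen == 0)
        return -1;
    pthread_mutex_lock(&db_mutex);
    for (int i = 0; i < SLOTS; i++) {
        if (table_keys[i][0] != '\0' && strcmp(table_keys[i], key) == 0) {
            strncpy(out, table_vals[i], outlen - 1);
            out[outlen - 1] = '\0';
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&db_mutex);
    return rc;
}
```

A single mutex blocks concurrent readers as well as writers, so it is the blunt option compared with the multi-reader/single-writer setup mentioned in the quoted text.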
> Access to the checksum database is very inefficient, as the database
> appears to be opened and closed for every access. It would be better if
> it was opened at the start of a connection, and remained open for the
> life of the thread. cfservd does far too much work verifying checksums
> for large directory trees.
>
> Regards,
> Frank Ranner
>
> Wheeler, John wrote:
> > (gdb) backtrace
> > #0 0xff1331bc in strlen () from /usr/lib/libc.so.1
> > #1 0xff1861c8 in _doprnt () from /usr/lib/libc.so.1
> > #2 0xff187e04 in printf () from /usr/lib/libc.so.1
> > #3 0x0006d448 in CfLog (level=1570816, string=0xffbed520 "put failed",
> > errstr=0xf3428 "db->put") at log.c:154
> > #4 0x00062600 in PutLock (
> > name=0x1837f0 "last.cfagent_conf.100.web001prod.shellcommand.corporaterotate._usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM"
> > ) at locks.c:497
> > #5 0x0006223c in GetLastLock () at locks.c:406
> > #6 0x00061b0c in GetLock (operator=0x12b2e0 "shellcommand.corporaterotate",
> > operand=0x1292e0 "_usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM",
> > ifelapsed=1, expireafter=120,
> > host=0x182be0 "web001prod", now=1074889804) at locks.c:208
> > #7 0x00031624 in Scripts () at do.c:1155
> > #8 0x0002d1ac in DoTree (passes=1, info=0xe0660 "Main Tree") at
> > cfagent.c:1276
> > #9 0x0002a858 in main (argc=1572312, argv=0xffbefcb4) at cfagent.c:187
> >
> > Not sure what the issue is. I removed the cfengine_lock_db and it solved
> > the problem. This machine did fill / (root partition), and since /var is
> > not mounted separately this may have contributed to the corruption.
> >
> >
>