help-cfengine

Re: cfagent core dump


From: Frank Ranner
Subject: Re: cfagent core dump
Date: Mon, 26 Jan 2004 12:21:59 +1100
User-agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)

Let me guess - this comes from a Solaris system?

I'm having the same problems with cfservd. I have tracked it down to the fact that on Solaris cfengine uses threads, but access to the DB routines is not thread safe, i.e. there is no locking to prevent threads from stomping on each other. Worse, each call to the DB routines collects its return code in errno, which in some cases is defined as a static int instead of letting errno.h and/or pthread.h define it properly.

What is happening in your backtrace is that strerror(errno) is returning NULL, probably because the DB call stored a negative return value in errno to indicate that the put failed (because the database has been corrupted by uncontrolled access from multiple threads). Passing a NULL pointer to printf for a %s conversion causes a segfault in libc, exactly as you have experienced.

I am actually trying to come up with a fix. As a temporary workaround I patched CfLog's printf argument to:

strerror(errno) == NULL ? "invalid errno" : strerror(errno)
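A minimal sketch of the same guard wrapped in a helper, so that strerror is only called once per message. The names safe_strerror and format_log are hypothetical and are not part of cfengine; format_log merely stands in for the formatting done inside CfLog.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in cfengine): never returns NULL, so the
   result is always safe to hand to a printf "%s" conversion. */
const char *safe_strerror(int errnum)
{
    const char *msg = strerror(errnum);
    return msg != NULL ? msg : "invalid errno";
}

/* Hypothetical stand-in for the formatting step of a CfLog-style logger.
   Returns the number of characters snprintf wanted to write. */
int format_log(char *buf, size_t len, const char *what,
               const char *where, int errnum)
{
    return snprintf(buf, len, "%s: %s (%s)", where, what,
                    safe_strerror(errnum));
}
```

On a libc whose strerror never returns NULL the guard is simply a no-op, so it costs nothing to leave in place.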

A better fix would be to use the DB error routines, which translate both syscall and database errors to strings.

The long-term fix is to use db->env to set up multi-reader/single-writer safe access to the DB, or to put a simple lock around every get/put.
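The "simple lock" variant could look roughly like the sketch below. It assumes POSIX threads, and it uses a small in-memory table as a stand-in for the real database so that it is self-contained; locked_put, locked_get and the table itself are hypothetical names, not cfengine or Berkeley DB code.

```c
#include <pthread.h>
#include <string.h>

#define MAX_ENTRIES 64
#define KEY_LEN 160

/* Hypothetical in-memory stand-in for the lock database. */
struct entry { char key[KEY_LEN]; long value; };
static struct entry table[MAX_ENTRIES];
static int n_entries = 0;

/* One mutex serialises every get and put. */
static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Store or update a key. Returns 0 on success, -1 if the table is full. */
int locked_put(const char *key, long value)
{
    int i, rc = -1;

    pthread_mutex_lock(&db_mutex);
    for (i = 0; i < n_entries; i++) {
        if (strcmp(table[i].key, key) == 0) {
            table[i].value = value;
            rc = 0;
            break;
        }
    }
    if (rc != 0 && n_entries < MAX_ENTRIES) {
        strncpy(table[n_entries].key, key, KEY_LEN - 1);
        table[n_entries].key[KEY_LEN - 1] = '\0';
        table[n_entries].value = value;
        n_entries++;
        rc = 0;
    }
    pthread_mutex_unlock(&db_mutex);
    return rc;
}

/* Look up a key. Returns 0 and fills *value if found, -1 otherwise. */
int locked_get(const char *key, long *value)
{
    int i, rc = -1;

    pthread_mutex_lock(&db_mutex);
    for (i = 0; i < n_entries; i++) {
        if (strcmp(table[i].key, key) == 0) {
            *value = table[i].value;
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&db_mutex);
    return rc;
}
```

A single mutex also serialises readers, which is the price of its simplicity; the db->env route would allow concurrent readers.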

Access to the checksum database is also very inefficient, as the database appears to be opened and closed for every access. It would be better if it were opened at the start of a connection and remained open for the life of the thread. cfservd does far too much work verifying checksums for large directory trees.

Regards,
Frank Ranner

Wheeler, John wrote:
(gdb) backtrace
#0  0xff1331bc in strlen () from /usr/lib/libc.so.1
#1  0xff1861c8 in _doprnt () from /usr/lib/libc.so.1
#2  0xff187e04 in printf () from /usr/lib/libc.so.1
#3  0x0006d448 in CfLog (level=1570816, string=0xffbed520 "put failed",
    errstr=0xf3428 "db->put") at log.c:154
#4  0x00062600 in PutLock (
    name=0x1837f0 "last.cfagent_conf.100.web001prod.shellcommand.corporaterotate._usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM") at locks.c:497
#5  0x0006223c in GetLastLock () at locks.c:406
#6  0x00061b0c in GetLock (operator=0x12b2e0 "shellcommand.corporaterotate",
    operand=0x1292e0 "_usr_bin_gzip__var_adm_apache_corporate_80_logs_access_2004_01_access_22Jan_12AM", ifelapsed=1, expireafter=120,
    host=0x182be0 "web001prod", now=1074889804) at locks.c:208
#7  0x00031624 in Scripts () at do.c:1155
#8  0x0002d1ac in DoTree (passes=1, info=0xe0660 "Main Tree") at cfagent.c:1276
#9  0x0002a858 in main (argc=1572312, argv=0xffbefcb4) at cfagent.c:187

Not sure what the issue is. I removed the cfengine_lock_db and that solved
the problem. This machine did fill / (the root partition), and since /var is
not mounted separately, this may have contributed to the corruption.



