[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: monit restart bug/race condition (3.1 and current CVS version behavi
From: |
Martin Pala |
Subject: |
Re: monit restart bug/race condition (3.1 and current CVS version behavior) |
Date: |
Mon, 10 Feb 2003 02:08:56 +0100 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021226 Debian/1.2.1-9 |
Jan-Henrik Haukeland wrote:
Martin Pala <address@hidden> writes:
[..]
You can replicate the problem easily when you let monit run in
daemon mode and wakeup every 1s.
I think we will need to add a restart action to check_process() which
does the restart in one atomic operation (instead of the current
check_process(stop) -> wait -> check_process(start). That is, call
do_stop -> do_start in sequence in check_process(). And to avoid any
race condition (if the monit daemon thread has a low poll frequency)
use an action-mutex so only one thread can execute an action in
check_process() at the same time.
Something like
check_process(P, restart) {
LOCK(action_mutex)
do_stop(P);
do_start(P);
END_LOCK;
}
I can look into it, so you can concentrate on the device stuff :-)
I made i fix for the 2.) problem - the code is now simplier little bit.
I though about mutex for race condition too, but there's one problem -
it isn't possible to use it just in check_process() and
d_check_process(), otherwise it will break stop method, because of
following test on the start of do_validate():
if(do_not_validate(p)) return;
In the case, that service stop will be requested (monit running in
daemon mode), monit httpd will lock the action mutex, run program's stop
method and wait for program to stop (which could take some time, because
the application need to finish - databases, etc.). After program stop,
it will set do_validate to false, which will disable monitoring.
However, in the meantime between stop method call and do_validate flag
disabling (by monit httpd), monit main thread passes do_not_validate(p)
test (because do_validate is yet true). It detects that the service
don't answer and tries to start the service. It will block because of
action_mutex - as soon as the mutex is freed, it will run start method
and start application and its monitoring again (which is bad in this case).
I tried it on simple test - in this case monit without action mutex is
less affected then monit with action mutex in check_process(), which
failes every time. To solve the race condition, it will be needed to
implement probably per-process mutex before do_validate flag test and
cervlet check_process() calls. I did it too, but i found a deadlock in
that case => it needs more work (last 'development' patch attached -
probably there could be better way to do it).
I must go to work tomorrow morning, so i will continue either tomorrow
evening (or you can try it :)
Btw. what about to release 3.2 after race condition fix? There were few
bugs in 3.1, so probably we can make bugfix release 3.2 (plus few
extensions which are already made for it) and schedule device tests for
3.3, which i hope will be ready soon :)
Cheers,
Martin
diff -Naur monit/http/cervlet.c monit.race_condition/http/cervlet.c
--- monit/http/cervlet.c 2003-02-10 01:53:36.000000000 +0100
+++ monit.race_condition/http/cervlet.c 2003-02-10 01:42:04.000000000 +0100
@@ -539,7 +539,9 @@
if(p->start)
- check_process(name, action);
+ LOCK(p->mutex)
+ check_process(name, action);
+ END_LOCK;
else {
@@ -555,7 +557,9 @@
if(p->stop)
- check_process(name, action);
+ LOCK(p->mutex)
+ check_process(name, action);
+ END_LOCK;
else {
@@ -570,8 +574,10 @@
if( is(action, "restart") ) {
if(p->start && p->stop) {
-
- check_process(name, action);
+
+ LOCK(p->mutex)
+ check_process(name, action);
+ END_LOCK;
} else {
diff -Naur monit/monitor.h monit.race_condition/monitor.h
--- monit/monitor.h 2003-02-10 01:53:33.000000000 +0100
+++ monit.race_condition/monitor.h 2003-02-10 01:44:00.000000000 +0100
@@ -311,6 +311,7 @@
ProcInfo_T procinfo; /**< Data for the procfs check */
/** For internal use */
+ pthread_mutex_t mutex; /**< Mutex used for action synchronization */
struct myprocess *next; /**< next process in chain */
struct myprocess *next_depend; /**< next depend process in chain */
} *Process_T;
diff -Naur monit/validate.c monit.race_condition/validate.c
--- monit/validate.c 2003-02-10 01:53:35.000000000 +0100
+++ monit.race_condition/validate.c 2003-02-10 01:38:31.000000000 +0100
@@ -130,6 +130,9 @@
ASSERT(p);
+ /* Obtain process action mutex */
+ LOCK(p->mutex)
+
/* First, check for pre-conditions */
if(do_not_validate(p)) return;
@@ -253,6 +256,9 @@
/* Remove the SIGTERM block */
pthread_sigmask(SIG_SETMASK, &os, NULL);
+
+ /* Release process action mutex */
+ END_LOCK;
}