monit-dev

Re: The checksum statement ++


From: Martin Pala
Subject: Re: The checksum statement ++
Date: Fri, 15 Aug 2003 12:43:49 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030711




Jan-Henrik Haukeland wrote:

I'm not at all happy with how the checksum is implemented now. For
instance if I use this entry in monitrc:

check file httpd.conf with path /usr/local/apache/httpd.conf
if failed checksum then exec "/etc/init.d/apache restart"
alert address@hidden

1) If the checksum for httpd.conf was changed then for *every* cycle
  an alert is sent and apache restarted. This is not what we want!
  In this case the old checksum should be set to the new checksum, so
  apache is restarted only *once* and only one alert is sent.
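
A minimal sketch of the "set the old checksum to the new checksum" idea from point 1, so that a change is reported exactly once. The names ChecksumState, checksum_init and checksum_changed are hypothetical and are not monit's actual code; the 32-character sum length is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SUM_LEN 33  /* assumed: 32 hex digits plus the terminating NUL */

/* Hypothetical per-file state: the checksum we last accepted */
typedef struct {
  char stored[SUM_LEN];
} ChecksumState;

/* Remember the checksum computed when monitoring starts */
void checksum_init(ChecksumState *c, const char *initial) {
  assert(c != NULL && initial != NULL);
  snprintf(c->stored, sizeof c->stored, "%s", initial);
}

/* Return true only in the cycle where the checksum differs from the
   stored one, then adopt the new value so later cycles stay quiet. */
bool checksum_changed(ChecksumState *c, const char *current) {
  assert(c != NULL && current != NULL);
  if (strcmp(c->stored, current) == 0)
    return false;
  snprintf(c->stored, sizeof c->stored, "%s", current);
  return true;
}
```

With this, the action bound to the checksum test (restart, alert) would run in the first cycle after the file changed and not again until the file changes once more.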


I think it could be useful to reflect the event-based architecture in the timeout statement. The present syntax is:

 IF number RESTART number CYCLE THEN TIMEOUT

 (such as: "if 3 restarts within 5 cycles then timeout")


It could be better to use the following scheme:

 IF number EVENT number CYCLE THEN TIMEOUT

where EVENT is one of the supported event types, currently:

FAILED
START
STOP
RESTART
CHECKSUM
RESOURCE
TIMEOUT
TIMESTAMP
SIZE
CONNECTION
PERMISSION
UID
GID

New syntax usage example:

check file httpd with path /usr/local/apache/bin/httpd
if failed checksum then alert
if 1 checksum within 1 cycles then timeout
alert address@hidden


You can still use it the traditional way (which is just one case of the new syntax):

check process apache with path /var/run/apache.pid
if failed port 80 then restart
if 3 restart within 5 cycles then timeout
alert address@hidden


This way we can universally solve the problem of monit flooding the user with alert messages for events other than RESTART: it would be possible to set a limit for each particular event.
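
To illustrate the generalized "if N EVENT within M cycles then timeout" rule, here is a rough sketch of a per-event counter over a sliding window of cycles. EventLimit, limit_init, limit_cycle and MAX_WINDOW are hypothetical names, and this is not how monit counts restarts today; it only shows that one small structure per event type would be enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_WINDOW 64  /* hypothetical upper bound on the cycle window */

/* Hypothetical limit: "if <limit> <event> within <window> cycles" */
typedef struct {
  int  limit;              /* number of events that triggers the timeout */
  int  window;             /* number of cycles to look back */
  bool seen[MAX_WINDOW];   /* ring buffer: did the event occur in a cycle? */
  int  pos;                /* slot used for the next cycle */
} EventLimit;

void limit_init(EventLimit *l, int limit, int window) {
  assert(l != NULL);
  assert(limit > 0 && window > 0 && window <= MAX_WINDOW);
  memset(l, 0, sizeof *l);
  l->limit = limit;
  l->window = window;
}

/* Record one cycle. Returns true if the event occurred at least
   `limit` times during the last `window` cycles, i.e. the service
   should time out. */
bool limit_cycle(EventLimit *l, bool event_occurred) {
  int i, count = 0;
  assert(l != NULL);
  l->seen[l->pos] = event_occurred;
  l->pos = (l->pos + 1) % l->window;
  for (i = 0; i < l->window; i++)
    if (l->seen[i])
      count++;
  return count >= l->limit;
}
```

One EventLimit per configured event (CHECKSUM, RESTART, ...) would be updated once per cycle; "if 1 checksum within 1 cycles then timeout" then corresponds to limit_init(&l, 1, 1).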





2) In the example below we will call the apache stop program, but for
  security reasons we do absolutely not want to do that! Instead we
  should only send an alert and then *stop* monitoring the apache
  entry (which was done in the original checksum implementation).

 check apache with pidfile "/usr/local/apache/logs/httpd.pid"
   start program = "/usr/local/apache/bin/http start"
   stop program = "/usr/local/apache/bin/http stop"
   alert address@hidden
   depends on httpd

 check http.bin with path /usr/local/apache/bin/http
   if failed checksum then stop


The solution could be to broadcast the TIMEOUT event (do_monitor flag) to all dependants. I think it is possible to implement this relatively easily as a standalone action in addition to 'start', 'stop' and 'restart'. Here is a "high level" illustration of this. The real implementation will require more changes to the control.c functions, but this is sufficient to show what should be done (not how it is done):

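/*** control.c ***/
void check_service(char *P, char *action) {
...
  /* New "timeout" action. s is presumably the service entry for P,
     set up in the elided code above. */
  if(IS(action, "timeout")) {

    /* Disable monitoring of this service if it is still enabled */
    if(s->do_monitor) {

      LOCK(Run.mutex)
        s->do_monitor= FALSE;
      END_LOCK;

      DEBUG("Monitoring disabled -- process %s\n", s->name);

    }

    /* Pass the timeout on to the services depending on this one */
    do_depend(s, "timeout");

    return;
  }
...
}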

/*** event.c ***/
static void handle_timeout(Event_T E) {

 check_service(E->source->name, "timeout");

}

This way we can time out the whole chain. The question is whether this is desirable behavior. From my point of view yes, because if a service depends on another service and that service has a hard error, it is clear that the dependant has big problems too.

We can think of the TIMEOUT event as a "hard error" and of the rest (FAILED, RESTART, CHECKSUM, etc.) as "soft errors". By default monit handles soft errors with the specified action (ALERT|RESTART|STOP|EXEC), and the user specifies the ratio/condition which causes a soft error to be requalified as a hard error (this presumes that the 'timeout' statement extension described above is implemented).

Example:

 check apache with pidfile "/usr/local/apache/logs/httpd.pid"
   start program = "/usr/local/apache/bin/http start"
   stop program = "/usr/local/apache/bin/http stop"
   alert address@hidden
   depends on httpd.bin

 check httpd.bin with path /usr/local/apache/bin/http
   if failed checksum then alert
   if 1 checksum within 1 cycles then timeout


=> this will cause 'httpd.bin' and its dependant service 'apache' to time out (hard error) without actually trying to execute anything.
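
The broadcast through the dependency tree could look roughly like the standalone sketch below. The Service struct and timeout_chain are hypothetical simplifications (a single depends_on name per service, no locking) and do not match monit's real data structures; the point is only that disabling a service and then recursing into its dependants covers the whole chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified service record; monit's real one differs */
typedef struct {
  const char *name;
  const char *depends_on;  /* name of the service this one needs, or NULL */
  bool        do_monitor;
} Service;

/* Disable monitoring of `name` and, recursively, of every service
   that depends on it. Returns the number of services switched off. */
int timeout_chain(Service *list, size_t n, const char *name) {
  size_t i;
  int disabled = 0;
  assert(list != NULL && name != NULL);

  /* Hard error on the service itself */
  for (i = 0; i < n; i++) {
    if (strcmp(list[i].name, name) == 0 && list[i].do_monitor) {
      list[i].do_monitor = false;
      disabled++;
    }
  }

  /* Broadcast to dependants that are still monitored; skipping the
     already disabled ones also stops circular dependencies */
  for (i = 0; i < n; i++) {
    if (list[i].depends_on != NULL
        && strcmp(list[i].depends_on, name) == 0
        && list[i].do_monitor)
      disabled += timeout_chain(list, n, list[i].name);
  }
  return disabled;
}
```

For the example above, calling timeout_chain() for "httpd.bin" on a list holding 'apache' (depending on 'httpd.bin') and 'httpd.bin' would switch off both entries without running any start or stop program.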



Summary: there are two proposals:
1.) generalization of the timeout statement
2.) classification of the TIMEOUT event as a hard error and its broadcasting through the dependency tree


What do you think?



I do not have a solution to this problem now and it's late. Maybe
tomorrow or maybe others have already thought up a good solution by
then :)

- On another note, please try to keep the code at 80 chars per
line. (Martin :)

I'm sorry, I will try to set some sort of margin in my 'vim'.








