Re: The checksum statement ++
From: Martin Pala
Subject: Re: The checksum statement ++
Date: Fri, 15 Aug 2003 12:43:49 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030711
Jan-Henrik Haukeland wrote:
I'm not at all happy with how the checksum is implemented now. For
instance if I use this entry in monitrc:
check file httpd.conf with path /usr/local/apache/httpd.conf
if failed checksum
then exec "/etc/init.d/apache restart"
alert address@hidden
1) If the checksum for httpd.conf was changed then for *every* cycle
an alert is sent and apache restarted. This is not what we want!
In this case the old checksum should be set to the new checksum, so
apache is restarted only *once* and only one alert is sent.
I think it could be useful to reflect the event-based architecture in the
timeout statement. The present syntax is:
IF number RESTART number CYCLE THEN TIMEOUT
(such as: "if 3 restarts within 5 cycles then timeout")
It could be better to use the following scheme:
IF number EVENT number CYCLE THEN TIMEOUT
where EVENT is one of the supported event types, currently:
FAILED
START
STOP
RESTART
CHECKSUM
RESOURCE
TIMEOUT
TIMESTAMP
SIZE
CONNECTION
PERMISSION
UID
GID
New syntax usage example:
check file httpd with path /usr/local/apache/bin/httpd
if failed checksum then alert
if 1 checksum within 1 cycles then timeout
alert address@hidden
You can still use it the traditional way (which is just one case of the new
syntax):
check process apache with path /var/run/apache.pid
if failed port 80 then restart
if 3 restart within 5 cycles then timeout
alert address@hidden
This way we can universally solve the problem of monit flooding the user
with alert messages for events other than RESTART - it would be possible
to set a limit for each particular event.
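To make the per-event limit more concrete, here is a minimal sketch of how
one "if N EVENT within M cycles then timeout" rule could be counted. It is
only an illustration, not monit code: TimeoutRule, the rule_* functions and
MAX_WINDOW are hypothetical names, and the sliding window is an assumption
about how "within M cycles" should be interpreted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical event types, mirroring the list in this message. */
typedef enum {
    EVENT_FAILED, EVENT_START, EVENT_STOP, EVENT_RESTART, EVENT_CHECKSUM,
    EVENT_RESOURCE, EVENT_TIMEOUT, EVENT_TIMESTAMP, EVENT_SIZE,
    EVENT_CONNECTION, EVENT_PERMISSION, EVENT_UID, EVENT_GID,
    EVENT_TYPE_COUNT
} EventType;

#define MAX_WINDOW 64  /* assumed upper bound on the cycle window */

/* One "if <max_events> <type> within <cycles> cycles then timeout" rule. */
typedef struct {
    EventType type;
    int max_events;
    int cycles;               /* window length, 1..MAX_WINDOW */
    int history[MAX_WINDOW];  /* ring buffer: matching events per cycle */
    int pos;                  /* slot of the current cycle */
} TimeoutRule;

void rule_init(TimeoutRule *r, EventType type, int max_events, int cycles) {
    assert(cycles >= 1 && cycles <= MAX_WINDOW);
    memset(r, 0, sizeof *r);
    r->type = type;
    r->max_events = max_events;
    r->cycles = cycles;
}

/* Record one event in the current cycle; other event types are ignored. */
void rule_record(TimeoutRule *r, EventType type) {
    if (type == r->type)
        r->history[r->pos]++;
}

/* True if the events seen in the last `cycles` cycles reach the limit. */
bool rule_exceeded(const TimeoutRule *r) {
    int total = 0;
    for (int i = 0; i < r->cycles; i++)
        total += r->history[i];
    return total >= r->max_events;
}

/* Advance to the next cycle, forgetting the oldest one in the window. */
void rule_next_cycle(TimeoutRule *r) {
    r->pos = (r->pos + 1) % r->cycles;
    r->history[r->pos] = 0;
}
```

With one such rule per event type attached to a service, "if 3 restart
within 5 cycles then timeout" and "if 1 checksum within 1 cycles then
timeout" would be handled by the same code path.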
2) In the example below we will call the apache stop program, but for
security reasons we do absolutely not want to do that! Instead we
should only send an alert and then *stop* monitoring the apache
entry (which was done in the original checksum implementation).
check apache with pidfile "/usr/local/apache/logs/httpd.pid"
start program = "/usr/local/apache/bin/http start"
stop program = "/usr/local/apache/bin/http stop"
alert address@hidden
depends on httpd
check http.bin with path /usr/local/apache/bin/http
if failed checksum then stop
The solution could be to broadcast a TIMEOUT event (clearing the do_monitor
flag) to all dependants. I think it is relatively easy to implement as a
standalone action in addition to 'start', 'stop' and 'restart'.
Here is a "high level" illustration of this - the real implementation will
require more changes to the control.c functions, but this is sufficient to
show what should be done (not how it is done):
/*** control.c ***/
void check_service(char *P, char *action) {
  ...
  if(IS(action, "timeout")) {
    if(s->do_monitor) {
      LOCK(Run.mutex)
        s->do_monitor= FALSE;
      END_LOCK;
      DEBUG("Monitoring disabled -- process %s\n", s->name);
    }
    do_depend(s, "timeout");
    return;
  }
  ...
}

/*** event.c ***/
static void handle_timeout(Event_T E) {
  check_service(E->source->name, "timeout");
}
This way we can time out the whole chain. It is a question whether this is
desirable behavior - from my point of view yes, because when one service
depends on another and that other service has a hard error, it is clear
that the dependant has big problems too.
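As a self-contained illustration of timing out a whole chain, the sketch
below walks a small service table and clears do_monitor for a service and
everything that depends on it. It does not use monit's real data
structures: Service, depends_on and timeout_chain are hypothetical names,
each service is assumed to have at most one dependency, and locking is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified service entry (not monit's Service_T). */
typedef struct {
    const char *name;
    const char *depends_on;  /* name of the service it depends on, or NULL */
    bool do_monitor;
} Service;

/*
 * Disable monitoring for `name` and, recursively, for every service that
 * depends on it. Returns the number of services newly disabled.
 * A service that is already unmonitored stops the recursion, which also
 * protects against dependency loops.
 */
int timeout_chain(Service *list, size_t n, const char *name) {
    assert(list != NULL && name != NULL);
    int disabled = 0;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].name, name) == 0 && list[i].do_monitor) {
            list[i].do_monitor = false;
            disabled++;
        }
    }
    if (disabled == 0)
        return 0;  /* unknown service or already timed out */

    for (size_t i = 0; i < n; i++) {
        if (list[i].depends_on != NULL &&
            strcmp(list[i].depends_on, name) == 0)
            disabled += timeout_chain(list, n, list[i].name);
    }
    return disabled;
}
```

For a table holding 'httpd.bin' and an 'apache' entry that depends on it,
timeout_chain(list, n, "httpd.bin") would disable both and leave unrelated
services monitored. Note one difference from the illustration above: this
sketch does not revisit dependants when the source service is already
unmonitored.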
We can think of the TIMEOUT event as a "hard error" and of the rest
(FAILED, RESTART, CHECKSUM, etc.) as "soft errors". By default monit
handles soft errors with the specified action (ALERT|RESTART|STOP|EXEC) -
the user specifies the ratio/condition which causes a soft error to be
requalified as a hard error (this presumes that the 'timeout' statement
extension described above is implemented).
Example:
check apache with pidfile "/usr/local/apache/logs/httpd.pid"
start program = "/usr/local/apache/bin/http start"
stop program = "/usr/local/apache/bin/http stop"
alert address@hidden
depends on httpd.bin
check httpd.bin with path /usr/local/apache/bin/http
if failed checksum then alert
if 1 checksum within 1 cycles then timeout
=> this will cause 'httpd.bin' and its dependant 'apache' service to
time out (hard error) without actually trying to execute anything.
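The soft/hard distinction could be expressed roughly like this. Again this
is only a sketch with made-up names: ErrorClass, Action, classify_event and
resolve_action do not exist in monit, and ACTION_UNMONITOR stands for
"clear do_monitor and broadcast to dependants".

```c
#include <assert.h>

/* Hypothetical event types, mirroring the list in this message. */
typedef enum {
    EVENT_FAILED, EVENT_START, EVENT_STOP, EVENT_RESTART, EVENT_CHECKSUM,
    EVENT_RESOURCE, EVENT_TIMEOUT, EVENT_TIMESTAMP, EVENT_SIZE,
    EVENT_CONNECTION, EVENT_PERMISSION, EVENT_UID, EVENT_GID,
    EVENT_TYPE_COUNT
} EventType;

typedef enum { ERROR_SOFT, ERROR_HARD } ErrorClass;

typedef enum {
    ACTION_ALERT, ACTION_RESTART, ACTION_STOP, ACTION_EXEC,
    ACTION_UNMONITOR  /* assumed: stop monitoring, run no program */
} Action;

/* Only TIMEOUT is a hard error; every other event is a soft error. */
ErrorClass classify_event(EventType e) {
    assert((int)e >= 0 && (int)e < EVENT_TYPE_COUNT);
    return e == EVENT_TIMEOUT ? ERROR_HARD : ERROR_SOFT;
}

/*
 * Soft errors keep the action the user configured; a hard error overrides
 * it so that no start/stop program is executed.
 */
Action resolve_action(EventType e, Action configured) {
    return classify_event(e) == ERROR_HARD ? ACTION_UNMONITOR : configured;
}
```

In the example above, the failed checksum first resolves to the configured
'alert', and the following TIMEOUT resolves to unmonitoring regardless of
what was configured.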
Summary: there are two proposals:
1.) generalization of the timeout statement
2.) classification of the TIMEOUT event as a hard error and its
broadcasting through the dependency tree
What do you think?
I do not have a solution to this problem now and it's late. Maybe
tomorrow or maybe others have already thought up a good solution by
then :)
- On another note, please try to keep the code at 80 chars per
line. (Martin :)
I'm sorry - I will try to set some sort of margin in my 'vim'.