monit-dev

Re: monit ./control.c ./event.c ./event.h ./l.l ./m...


From: Martin Pala
Subject: Re: monit ./control.c ./event.c ./event.h ./l.l ./m...
Date: Thu, 28 Aug 2003 00:13:14 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030714 Debian/1.4-2

Jan-Henrik Haukeland wrote:

Martin Pala <address@hidden> writes:

        - fix checksum, gid, uid, permission tests to not timeout after error
        occurrence (this way it will behave more consistently - immediate timeout
        can be caused by unmonitor action, for other cases modified timeout
        statement should fit)

I'm not sure you should change the timeout statement. As I said
before, it is for process (re)starts and other events are not very
interesting in this context. For instance:

if 1 checksum event within 1 cycle then timeout

Is uninteresting, because, either you want checksum to unmonitor once
or you want checksum to report always. Likewise with other events
except (re)starts.

What I expect is that all tests behave consistently for the same action. For example, if you use:

if failed host www.tildeslash.com port 80 protocol http then alert

and

if failed checksum then alert

you will get different behavior. The first case sends alerts indefinitely, until it is restricted by a timeout statement. The second case sends only one alert, but it does not disable monitoring. What is worse, the original checksum is overwritten with the actual (bad) value, so the web interface shows the erroneous checksum as the associated checksum (the original, correct checksum is forgotten). This affects the uid, gid and permission tests too.
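
To illustrate the difference, here is a rough sketch in Python (not monit's actual C code). The class and attribute names are hypothetical, and the test is reduced to a plain string comparison. It only contrasts overwriting the stored value with keeping it:

```python
# Hypothetical model of a checksum test; names are illustrative, not monit's.
class ChecksumTest:
    def __init__(self, expected, overwrite_on_mismatch):
        self.expected = expected
        self.overwrite_on_mismatch = overwrite_on_mismatch

    def check(self, actual):
        """Return True if this cycle should raise an alert."""
        if actual == self.expected:
            return False
        if self.overwrite_on_mismatch:
            # The bad value replaces the reference, so later cycles match it.
            self.expected = actual
        return True


overwriting = ChecksumTest("good", overwrite_on_mismatch=True)
keeping = ChecksumTest("good", overwrite_on_mismatch=False)

# The file stays corrupted for three consecutive cycles.
overwriting_alerts = [overwriting.check(c) for c in ["bad", "bad", "bad"]]
keeping_alerts = [keeping.check(c) for c in ["bad", "bad", "bad"]]

print(overwriting_alerts, overwriting.expected)  # one alert, reference lost
print(keeping_alerts, keeping.expected)          # alert every cycle, reference kept
```

With the reference kept, the test keeps failing every cycle, which is the same behavior as the port test, and a timeout statement can then limit it.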

I think it would be better:

- to keep the original associated value (checksum/uid/gid/permission)
- to provide consistent behavior for all instances of the 'alert' action

The first point is clear. The second can be done in two ways:

1.) support only one timeout statement instance:

 IF number EVENTS WITHIN number CYCLES THEN TIMEOUT

In this case it is pretty simple: every event that triggers an action (restart, timestamp, checksum, gid, uid, permission, etc.) increments the event counter in each cycle in which it fails. As soon as the counter exceeds the limit, the service is timed out (i.e. unmonitored). The advantage is simplicity, but there is no distinction between event types - you can only set one common, shared limit.
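
A minimal sketch of this shared counter, in Python for brevity. The class name, the per-cycle sliding window and the choice to trip as soon as the count reaches the number are my assumptions, not existing monit code:

```python
from collections import deque


class SharedTimeout:
    """Hypothetical model of: IF max_events EVENTS WITHIN cycles CYCLES THEN TIMEOUT."""

    def __init__(self, max_events, cycles):
        self.max_events = max_events
        # Failure count per cycle; the deque drops cycles older than the window.
        self.window = deque(maxlen=cycles)
        self.monitored = True

    def end_cycle(self, failed_events):
        """Record the event types that failed this cycle; return True while monitored."""
        if not self.monitored:
            return False
        self.window.append(len(failed_events))
        # Assumption: the timeout trips when the count reaches the limit.
        if sum(self.window) >= self.max_events:
            self.monitored = False
        return self.monitored


shared = SharedTimeout(max_events=3, cycles=5)
shared_states = [
    shared.end_cycle(["checksum"]),        # 1 event so far
    shared.end_cycle([]),                  # still 1
    shared.end_cycle(["uid", "restart"]),  # 3 events within 5 cycles
]
print(shared_states)
```

All event types feed the same window here, so a checksum failure and a restart count equally toward the limit.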


2.) allow a timeout statement to be specified for each event type (multi-instance statement):

 IF number event WITHIN number CYCLES THEN TIMEOUT

... where event is one of {CHECKSUM|GID|UID|RESTART|TIMESTAMP|SIZE|etc.}

If you want to, you can set a different timeout limit for each event type. The advantage is that you can choose a standalone limit for each service, and you can also leave a specific event type unlimited if you want to (which I think is a rare case).
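
The per-event variant could look roughly like this, again as a Python sketch. The class name, the shape of the `limits` mapping and the trip-at-limit rule are assumptions for illustration only:

```python
from collections import deque


class PerEventTimeout:
    """Hypothetical model of one 'IF number event WITHIN number CYCLES' rule per event type."""

    def __init__(self, limits):
        # limits maps an event type to (max_events, cycles), e.g. {"restart": (3, 5)}.
        self.limits = limits
        self.windows = {
            event: deque(maxlen=cycles) for event, (_, cycles) in limits.items()
        }
        self.monitored = True

    def end_cycle(self, failed_events):
        """Record this cycle's failed event types; return True while monitored."""
        if not self.monitored:
            return False
        for event, window in self.windows.items():
            window.append(1 if event in failed_events else 0)
            # Assumption: the timeout trips when the count reaches the limit.
            if sum(window) >= self.limits[event][0]:
                self.monitored = False
        return self.monitored


per_event = PerEventTimeout({"restart": (3, 5), "checksum": (1, 1)})
per_event_states = [
    per_event.end_cycle({"uid"}),       # uid has no rule, so it is never limited
    per_event.end_cycle({"restart"}),   # 1 restart, below its limit of 3
    per_event.end_cycle({"checksum"}),  # checksum limit of 1 reached
]
print(per_event_states)
```

Event types without a rule, such as uid above, are simply never counted, which covers the case where you do not want to limit a specific event type.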



I think this way the behavior will be consistent enough. Both solutions are possible. What do you think?

Martin