monit-general

Re: Restart timer for checking services


From: David Paper
Subject: Re: Restart timer for checking services
Date: Fri, 9 Aug 2013 08:37:47 -0400

Hi Martin,

Thanks for the detailed reply.  I was expecting to find that I had something 
misconfigured.  I'll keep an eye out for Monit 5.5.2 and the changelog.

-dave

On Aug 9, 2013, at 8:00 AM, Martin Pala <address@hidden> wrote:

> Hello,
> 
> the start timeout waits only for the process itself to start - as soon as the 
> process shows up in the process table, the start command is considered 
> finished and testing resumes. The restart doesn't reset the error record - the 
> "5 cycles" condition will then match immediately, as the cycles before the 
> restart are counted as well.
> 
> We will modify the restart command to reset the pre-restart error 
> cycles. Also, the timeout should temporarily suppress the errors from the same 
> service tests until it expires.
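
The counting behavior described above can be sketched in a few lines of
Python. This is only a toy model, not Monit's actual code: the simulate()
function, its parameters, and the assumption that the port test fails for
three cycles after each restart (roughly a startup of more than 2 minutes
with a 60-second cycle) are all made up for illustration. It also ignores
the extra test that the log shows about 1 second after each restart.

```python
from collections import deque


def simulate(reset_on_restart, cycles=12, window=5, threshold=5,
             startup_cycles=3):
    """Return the cycle numbers at which a restart is triggered.

    Hypothetical model of "failed ... for 5 times within 5 cycles then
    restart". The process starts out hung, so the port test fails until a
    restart; after a restart the test fails for `startup_cycles` more cycles.
    """
    history = deque(maxlen=window)  # True means the port test failed
    cycles_until_up = None          # None means hung, never recovers by itself
    restarts = []

    for cycle in range(1, cycles + 1):
        if cycles_until_up is None:
            failed = True
        else:
            failed = cycles_until_up > 0
            if failed:
                cycles_until_up -= 1

        history.append(failed)

        # Restart once enough failures sit inside the sliding window.
        if sum(history) >= threshold:
            restarts.append(cycle)
            cycles_until_up = startup_cycles
            if reset_on_restart:
                history.clear()  # forget the pre-restart failures

    return restarts


if __name__ == "__main__":
    print("record kept across restart:", simulate(reset_on_restart=False))
    print("record reset on restart:   ", simulate(reset_on_restart=True))
```

Under these assumptions, keeping the old failures triggers a restart on
every cycle from the fifth onward, so the process never gets the three
cycles it needs to come up. Clearing the record at restart gives a single
restart at cycle 5, after which the port test starts passing.
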
> 
> Regards,
> Martin
> 
> 
> 
> 
> On Aug 7, 2013, at 8:04 PM, David Paper <address@hidden> wrote:
> 
>> Greetings,
>> 
>> I've dug through the Monit docs, examples and changelog from 5.2.3 to 5.5.1, 
>> and I am unable to find a reference to this problem.  Here is what I am 
>> seeing, using Monit 5.2.3 on Red Hat Linux 5.4 (x86_64).
>> 
>> I have a Java process that locks up when it runs out of memory, and Monit 
>> tries to stop/start it. When I manually stop/start the process, Monit waits 
>> the 180 seconds before it begins testing, and the test succeeds.  The job 
>> works as defined.  The process takes more than 2 minutes to come online and 
>> start listening for TCP requests.  What doesn't work is Monit's own restart: 
>> it appears to test the port 1 second after the restart and again 1 minute 
>> after the restart, then, sensing the process isn't working correctly, tries 
>> to restart it, and the sequence begins all over.  If I didn't know better, 
>> I would say that Monit is ignoring the defined time/cycle settings on a 
>> restart.
>> 
>> My monit job for this process looks like this:
>> 
>> check process jboss-ssp with pidfile /var/run/jboss/jboss-sspnode.pid
>>      start program = "/opt/jboss/bin/monit_run.sh -c sspnode -b 10.91.51.32 -g ssp-io-lp1 -u 239.255.150.1 -Djboss.messaging.ServerPeerID=1"
>>              as uid 349 and as gid 349 with timeout 180 seconds
>>      stop program = "/bin/bash -c 'kill -9 `cat /var/run/jboss/jboss-sspnode.pid`'"
>>              as uid 349 and as gid 349
>>      if failed host 10.91.51.141 port 8080 for 5 times within 5 cycles then alert
>>      if failed host 10.91.51.141 port 8080 for 5 times within 5 cycles then restart
>> 
>> Here is my monitrc:
>> 
>> set daemon  60            # check services at 1-minute intervals
>>    with start delay 60  # optional: delay the first check by 1-minute
>> set logfile syslog facility log_daemon                       
>> set idfile /var/run/monit.id
>> set statefile /var/run/monit.state
>> set mailserver smartmail.mydomain.com,               # primary mailserver
>> set eventqueue
>>    basedir /opt/monit/eventqueue  # set the base directory where events will be stored
>>    slots 100           # optionally limit the queue size
>> set alert address@hidden                # receive all alerts
>> set httpd port 2812 and
>>   use address localhost  # only accept connection from localhost
>>   allow localhost        # allow localhost to connect to the server and
>> include /opt/monit/monit.d/*
>> 
>> The syslog messages that show Monit's behavior:
>> 
>> Aug  7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a connection to INET[10.91.51.141:8080] via TCP
>> Aug  7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
>> Aug  7 04:02:26 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
>> Aug  7 04:02:27 stdeciovag1 monit[4111]: 'jboss-ssp' start: /opt/jboss/bin/monit_run.sh
>> Aug  7 04:02:27 stdeciovag1 logger: Running /opt/jboss/bin/run.sh
>> Aug  7 04:02:28 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a connection to INET[10.91.51.141:8080] via TCP
>> Aug  7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a connection to INET[10.91.51.141:8080] via TCP
>> Aug  7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
>> Aug  7 04:03:28 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
>> Aug  7 04:03:29 stdeciovag1 monit[4111]: 'jboss-ssp' start: /opt/jboss/bin/monit_run.sh
>> Aug  7 04:03:29 stdeciovag1 logger: Running /opt/DECE_jboss/bin/run.sh
>> Aug  7 04:03:30 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a connection to INET[10.91.51.141:8080] via TCP
>> Aug  7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' failed, cannot open a connection to INET[10.91.51.141:8080] via TCP
>> Aug  7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' trying to restart
>> Aug  7 04:04:30 stdeciovag1 monit[4111]: 'jboss-ssp' stop: /bin/bash
>> Aug  7 04:04:31 stdeciovag1 monit[4111]: 'jboss-ssp' start: /opt/jboss/bin/monit_run.sh
>> ….
>> 
>> This goes on until someone intervenes and manually stops and starts the 
>> Monit job.
>> 
>> Any help/guidance would be appreciated.
>> 
>> Thanks,
>> 
>> -dave
>> 
>> 
>> 
>> 
>> 
>> --
>> To unsubscribe:
>> https://lists.nongnu.org/mailman/listinfo/monit-general
> 
> 

--
Dave Paper                          address@hidden

"The trouble with quotes on the Internet is you never know if they are 
genuine." -- Abraham Lincoln



