monit-general
Re: [monit] Monit fails to create PID file on restart


From: Jonathan Maddox
Subject: Re: [monit] Monit fails to create PID file on restart
Date: Tue, 13 Jul 2010 14:34:35 +1000

Hello,

We have seen the same issue with in-house daemons running on CentOS (with init scripts derived from those in stock CentOS packages) and monitored by monit. I am sure that the same problem can occur with stock daemons such as Apache under certain workloads. It is the result of a race condition between the init script and the monit 'restart' action, and it is triggered when a daemon takes several seconds to shut down after being signalled.

What happens is this: Monit's 'restart' action first launches the 'stop' command in a background process, while the main process polls once per second until the daemon no longer exists. It decides this by reading the pidfile and checking for a process with the pid found there. As soon as the pidfile and a running process no longer match, monit runs the 'start' command.
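
As a rough illustration of that once-per-second check, here is a minimal shell sketch. It is not monit's actual code; the function name pidfile_process_alive is hypothetical, and it assumes the pidfile holds a single decimal pid.

```shell
# Hypothetical helper, not part of monit: succeeds (exit 0) only if the
# pid recorded in the given pidfile belongs to a process that exists.
pidfile_process_alive() {
    local pidfile="$1"
    local pid

    # No readable pidfile means there is nothing to match against.
    [ -r "$pidfile" ] || return 1

    pid=$(cat "$pidfile" 2>/dev/null)

    # Reject empty or non-numeric contents.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Signal 0 sends nothing; it only tests whether the pid can be signalled.
    # Note that it also fails if the caller lacks permission to signal the pid.
    kill -0 "$pid" 2>/dev/null
}
```

A polling loop built on this would simply call the function once a second and proceed to 'start' as soon as it fails, which is exactly the window the race needs.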

The 'stop' command is often an init script. The stock CentOS and RedHat init scripts read the pidfile for the daemon and send several signals to that pid with sleeps in between, as follows (this code is in the function killproc() in /etc/init.d/functions):

delay=3

...

if checkpid $pid 2>&1; then
  # TERM first, then KILL if not dead
  kill -TERM $pid >/dev/null 2>&1
  usleep 100000
  if checkpid $pid && sleep 1 &&
    checkpid $pid && sleep $delay &&
    checkpid $pid ; then
      kill -KILL $pid >/dev/null 2>&1
      usleep 100000
  fi
fi

...

rm -f "${pid_file:-/var/run/$base.pid}"

(I've elided irrelevant bits that depend on special options passed to this function. This is the default behaviour.)

The race condition is that monit's polling can notice that the daemon is gone while the init script is still in one of its sleeps, so monit calls the 'start' command well before the 'stop' command has completed. By then the pidfile for the *new* invocation of the daemon has been created, and so the 'stop' command, when it wakes up, removes the new pidfile.
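
The ordering described above can be reproduced with a small simulation. This is only a sketch: no daemon is involved, the pids 1111 and 2222 and the sleep lengths are made up, and simulate_race is a hypothetical name.

```shell
# Simulated race between a slow 'stop' and an early 'start'.
# Nothing here comes from monit or CentOS; pids and timings are illustrative.
simulate_race() {
    local pidfile
    pidfile=$(mktemp)
    echo 1111 > "$pidfile"            # pidfile written by the old daemon

    # 'stop' runs in the background: the old daemon is already dead, but
    # the script is still sleeping before its final unconditional rm -f.
    ( sleep 2; rm -f "$pidfile" ) &
    local stop_job=$!

    # About a second later the poll sees the daemon gone and runs 'start',
    # which writes the new daemon's pidfile.
    sleep 1
    echo 2222 > "$pidfile"

    # 'stop' wakes up and removes what is now the NEW pidfile.
    wait "$stop_job"

    if [ -e "$pidfile" ]; then
        echo "pidfile intact"
    else
        echo "pidfile missing"
    fi
}
```

Calling simulate_race prints "pidfile missing": the new daemon would be running with no pidfile, so the next poll would treat it as not running.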

There are several ways to fix this.

One way would be simply to remove the line in the init script which deletes the pidfile. Since stale pidfiles are commonplace after many error cases (e.g. daemon crashes and hardware failure) and scripts all seem to be written to cope with them, removing the pidfile gains nothing.

Another, more complete fix would be for monit not to run its 'start' command until after 'stop' has returned. It would then work even with 'broken' init scripts.

We have dealt with the problem locally by replacing the relevant parts of the init scripts, only for those daemons which are watched by monit. Our scripts no longer unconditionally remove the pidfile after 'stop', but delete it only if it still contains the pid which has just been killed. This is actually still open to a race, but not a race across a sleep of one or more seconds. It also assumes that the kernel will not hand the restarted daemon the same pid as the old one ... probably not completely valid, but it seems reliable enough to us for now.
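
A minimal sketch of that conditional removal follows. It is not our exact script; the function name remove_pidfile_if_ours and its two arguments are hypothetical, and it would stand in for the unconditional rm -f at the end of killproc().

```shell
# Hypothetical replacement for the unconditional "rm -f" after 'stop'.
#   $1 = path to the pidfile
#   $2 = pid that was just signalled
remove_pidfile_if_ours() {
    local pidfile="$1"
    local killed_pid="$2"
    local current

    # Nothing to do if the pidfile is already gone.
    [ -f "$pidfile" ] || return 0

    current=$(cat "$pidfile" 2>/dev/null)

    # Delete only if the file still names the process we killed. If a new
    # daemon has already written its own pid, leave the file alone.
    # A small window remains between the read above and the rm below.
    if [ "$current" = "$killed_pid" ]; then
        rm -f "$pidfile"
    fi
    return 0
}
```

For example, remove_pidfile_if_ours /var/run/mydaemon.pid "$pid" (with mydaemon as a placeholder name) leaves the file in place if a restarted daemon has already recorded a different pid.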

I should now raise bugs with monit and CentOS and/or RedHat upstreams to let them know about the issue, now that I know that other people have found it to be a problem in the wild.

Looking superficially at the init scripts provided by a couple of other distributions, I see that some never seem to remove the pidfile at all, so they are immune to this particular race.

I hope this helps.

regards,

Jonathan Maddox


