freeipmi-users

Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04


From: Robert Hardy
Subject: Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04
Date: Tue, 01 Feb 2011 14:40:30 -0500
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101207 Thunderbird/3.1.7

It is possible that a BIOS option which starts the watchdog is enabled.
Once I get a chance, I will dig around in the BIOS and see.

I would think it would be much better behaviour on startup to do the equivalent of bmc-watchdog -y and then start the watchdog.

Failing to start simply because the BIOS started the countdown seems very, very bad to me, especially without logging anything. You're left in a state where the watchdog dies quietly and the server hard reboots every couple of minutes.

I'm willing to test anything you send my way. The server isn't really in production yet but will be soon.

Ultimately I'm trying to package some better .debs for use on Ubuntu. The current ones are badly packaged, to the point of being unusable. I've rewritten the init script for Ubuntu but I'd really like to see an Upstart-based one....

Rob

On 2011-02-01 12:54 PM, Albert Chu wrote:
Hey Robert,

I think I see the problem(s).  I call _err_exit(), which writes to
stderr, instead of _daemon_error_exit() which writes to the log.  That's
the error logging issue, which is secondary to the real one.
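
To illustrate the logging point, here is a minimal sketch of routing a fatal message to the log descriptor when running as a daemon. It is not the actual bmc-watchdog code: error_fd() and report_fatal() are hypothetical names, and the real _err_exit() and _daemon_error_exit() may work differently.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: choose where a fatal message should go.
 * In daemon mode stderr may already be closed, so prefer the log fd. */
int error_fd(int daemon_mode, int log_fd)
{
    return (daemon_mode && log_fd >= 0) ? log_fd : STDERR_FILENO;
}

/* Format "prog: msg\n" and write it to the chosen descriptor.
 * Returns the number of bytes written, or -1 on failure.
 * The caller is expected to exit afterwards. */
long report_fatal(int daemon_mode, int log_fd, const char *prog, const char *msg)
{
    char buf[256];
    int n;

    assert(prog != NULL && msg != NULL);
    n = snprintf(buf, sizeof(buf), "%s: %s\n", prog, msg);
    if (n < 0)
        return -1;
    if ((size_t)n >= sizeof(buf))
        n = (int)sizeof(buf) - 1;   /* message was truncated to fit buf */
    return (long)write(error_fd(daemon_mode, log_fd), buf, (size_t)n);
}
```

With something along these lines, a failed startup check in daemon mode would leave a line in the log file instead of being lost on a closed stderr.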

As for the real issue, I think this is being hit:

   if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
     _err_exit ("watchdog timer must be stopped before running daemon");

For some reason, your BMC thinks the watchdog is running from the
start.  You could verify w/ bmc-watchdog --get if you don't start the
timer.  Perhaps it's a hardware bug?

As an experiment, would you be willing to try a beta that removed this
check?  The issue is, I have no idea what the consequences of removing
this check will be on your motherboard if there's a bug in it.

Al

On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:
That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
logged at startup (or after the unexpected exit) during bootup.

I've put all sorts of debugging lines in my init script for bmc-watchdog.

I finally ended up doing this as root:
mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real

and then putting this in /usr/sbin/bmc-watchdog:
#!/bin/bash
strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"

At bootup the bmc-watchdog initscript does launch a process with a new
PID but it does NOT log the regular "starting bmc-watchdog daemon". It
in fact logs nothing at all to /var/log/freeipmi/bmc-watchdog.log DURING
BOOT UP.

The strace above captured bmc-watchdog running at bootup and the same
process exiting here at the last few lines:

1584  semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
1584  nanosleep({0, 1000}, NULL)        = 0
1584  write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
1584  exit_group(1)                     = ?

I've posted the entire strace here:
http://webcon.ca/~rhardy/bmcdrop/

Can you parse that and make any suggestions as to why it would exit
uncleanly and only on boot up?

I'm not quite sure what is going on, but it seems to be trying to write
to a bad file descriptor, getting an error and then exiting.
From the strace, file descriptor 2 is in fact closed so that error
makes sense to me. The real question is: why is it trying to write to FD 2?

When I restart bmc-watchdog and it gets to the same place, it properly
writes the startup message on file descriptor 0, which is the log file
that was opened earlier...

2466  write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
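
The EBADF seen in the strace can be reproduced in isolation. The following is a minimal sketch, not code from bmc-watchdog: try_write() and write_after_close() are hypothetical names, and a pipe descriptor stands in for stderr so that the demonstration does not close the real fd 2.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write msg to fd; return 0 on success or the errno value on failure. */
int try_write(int fd, const char *msg)
{
    assert(msg != NULL);
    if (write(fd, msg, strlen(msg)) < 0)
        return errno;
    return 0;
}

/* Open a pipe, close both ends, then write to the closed write end.
 * This mimics a daemon that writes to stderr after closing it.
 * Returns the errno from the failed write, or -1 if pipe() fails. */
int write_after_close(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);
    close(fds[1]);
    return try_write(fds[1], "watchdog timer must be stopped\n");
}
```

On a POSIX system write_after_close() should return EBADF, which matches the failed write(2, ...) in the boot-time trace: the message is lost because the descriptor was closed before the error path ran.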

I'm open to debugging suggestions too... Ideas?

Thanks for your help,
Rob

On 2011-01-28 5:37 PM, Albert Chu wrote:
Hey Robert,

That is indeed strange.  Does the bmc-watchdog log say anything? (I
can't remember the exact location, but I think it's /var/log/freeipmi/
something).

Al

On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:
I'm running bmc-watchdog 0.7.15-2 under a current 64-bit Ubuntu 10.04 on
several fairly new unloaded Supermicro servers.

On only one of the four servers (always the same one), the bmc-watchdog
process quietly exits shortly after startup, leaving the system set up for a
hard reset shortly after bootup.

The options and builds are identical on all of the servers. These are my
options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"

Through debugging I've confirmed on boot up:

- The init script gets run

- It launches bmc-watchdog and saves a new PID correctly in
/var/run/bmc-watchdog.pid.

- Checking for a bmc-watchdog process in rc.local shows it isn't running and
     the timer is counting down.

- There is no shutdown message logged when the process disappears during bootup.

- There are no messages suggesting the process was killed

On shutdown the init script gets as far as removing
/var/run/bmc-watchdog.pid and seems to work fine.

If I stuff this in rc.local the bmc-watchdog starts up properly and never
seems to die again until the next reboot:
/usr/sbin/service bmc-watchdog stop
/usr/sbin/service bmc-watchdog start

All in all this is very weird behaviour. Is it possible a newer version of
bmc-watchdog would address this? i.e. is this a known bug?

Any other ideas why this is happening (or how I can debug further)?

Regards,
Rob
