freeipmi-users

Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04


From: Albert Chu
Subject: Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04
Date: Tue, 01 Feb 2011 17:48:03 -0800

Hey Robert,

The following beta release has a bmc-watchdog that has (hopefully) fixed
logging.

http://download.gluster.com/pub/freeipmi/qa-release/freeipmi-1.0.2.beta2.tar.gz

If you could check it out, that'd be great.

Al

On Tue, 2011-02-01 at 17:20 -0800, Albert Chu wrote:
> Hi Robert,
> 
> On Tue, 2011-02-01 at 11:40 -0800, Robert Hardy wrote:
> > It is possible that there is a bios option which starts the watchdog 
> > which is enabled.
> > Once I get a chance, I will dig around in the BIOS and see.
> 
> I think a more likely scenario would be the IPMI kernel driver is
> starting up the watchdog and racing w/ the FreeIPMI one.  Are you
> loading the IPMI kernel driver?
> 
> > I would think it would be much better behaviour on startup to do the 
> > equivalent of bmc-watchdog -y and then start the watchdog.
> 
> I had to look this up (b/c I couldn't remember, but was fairly certain):
> the IPMI spec indicates that the watchdog timer is required to be turned
> off when a node is rebooted (27.1).
> 
> > Failing to start simply because the BIOS started the countdown seems 
> > very very bad to me especially without logging anything.
> 
> The logging portion of this issue should be fixed w/ the next release.
> 
> > You're left in 
> > a state where the watchdog dies quietly and the server hard reboots 
> > every couple of minutes.
> 
> If the BIOS happens to be starting the countdown, that's *REALLY* bad on
> the part of the BIOS programmers.  Whoever starts the countdown needs to
> manage it.  Some other random piece of software can't be trusted to
> handle it.
> 
> So just so I understand the situation correctly, when you disable the
> bmc-watchdog daemon, does the problem go away?  The FreeIPMI
> bmc-watchdog does not start any timer until it determines the timer is
> stopped.  Since the timer is already running, it never starts it.
> 
> Al
>   
> 
> > I'm willing to test anything you send my way. The server isn't really in 
> > production yet but will be soon.
> > 
> > Ultimately I'm trying to package some better .debs for use on Ubuntu. 
> > The current ones are badly packaged, to the point of being unusable.  
> > I've re-written the init script for Ubuntu but I'd really like to see an 
> > upstart-based one....
> > 
> > Rob
> > 
> > On 2011-02-01 12:54 PM, Albert Chu wrote:
> > > Hey Robert,
> > >
> > > I think I see the problem(s).  I call _err_exit(), which writes to
> > > stderr, instead of _daemon_error_exit() which writes to the log.  That's
> > > the error logging issue, which is secondary to the real one.
> > >
> > > As for the real issue, I think this is being hit:
> > >
> > >    if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
> > >      _err_exit ("watchdog timer must be stopped before running daemon");
> > >
> > > > For some reason, your BMC thinks the watchdog is running from the
> > > > start.  You could verify w/ bmc-watchdog --get if you don't start the
> > > > timer.  Perhaps it's a hardware bug?
> > >
> > > As an experiment, would you be willing to try a beta that removed this
> > > check?  The issue is, I have no idea what the consequences of removing
> > > this check will be on your motherboard if there's a bug in it.
> > >
> > > Al
> > >
> > > On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:
> > >> That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
> > >> logged at startup (or after the unexpected exit) during bootup.
> > >>
> > >> I've put all sorts of debugging lines in my init script for bmc-watchdog.
> > >>
> > > >> I finally ended up doing this as root:
> > > >> mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real
> > >>
> > >> and then putting this in /usr/sbin/bmc-watchdog:
> > >> #!/bin/bash
> > > >> strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"
> > >>
> > >> At bootup the bmc-watchdog initscript does launch a process with a new
> > >> PID but it does NOT log the regular "starting bmc-watchdog daemon". It
> > >> in fact logs nothing at all to /var/log/freeipmi/bmc-watchdog.log DURING
> > >> BOOT UP.
> > >>
> > >> The strace above captured bmc-watchdog running at bootup and the same
> > >> process exiting here at the last few lines:
> > >>
> > >> 1584  semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
> > >> 1584  nanosleep({0, 1000}, NULL)        = 0
> > > >> 1584  write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
> > >> 1584  exit_group(1)                     = ?
> > >>
> > >> I've posted the entire strace here:
> > >> http://webcon.ca/~rhardy/bmcdrop/
> > >>
> > >> Can you parse that and make any suggestions as to why it would exit
> > >> uncleanly and only on boot up?
> > >>
> > > >> I'm not quite sure what is going on, but it seems to be trying to write
> > > >> on a bad file descriptor, getting an error and then exiting.
> > > >> From the strace, file descriptor 2 is in fact closed, so that error
> > > >> makes sense to me. The real question is: why is it trying to write to FD 2?
> > >>
> > > >> When I restart bmc-watchdog and it gets to the same place, it properly
> > > >> writes the startup message on file descriptor 0, which is the log file
> > > >> that was opened earlier...
> > >>
> > >> 2466  write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
> > >>
> > >> I'm open to debugging suggestions too... Ideas?
> > >>
> > >> Thanks for your help,
> > >> Rob
> > >>
> > >> On 2011-01-28 5:37 PM, Albert Chu wrote:
> > >>> Hey Robert,
> > >>>
> > >>> That is indeed strange.  Does the bmc-watchdog log say anything? (I
> > >>> can't remember the exact location, but I think it's /var/log/freeipmi/
> > >>> something).
> > >>>
> > >>> Al
> > >>>
> > >>> On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:
> > > >>>> I'm running bmc-watchdog 0.7.15-2 under a current Ubuntu 10.04 64-bit on
> > > >>>> several fairly new unloaded Supermicro servers.
> > >>>>
> > > >>>> On only one (always the same server) of four servers the bmc-watchdog
> > > >>>> process quietly exits shortly after start up, leaving the system set up
> > > >>>> for a hard reset shortly after bootup.
> > >>>>
> > > >>>> The options and builds are identical on all of the servers. These are my
> > > >>>> options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"
> > >>>>
> > >>>> Through debugging I've confirmed on boot up:
> > >>>>
> > >>>> - The init script gets run
> > >>>>
> > > >>>> - It launches bmc-watchdog and saves a new PID correctly in
> > > >>>>   /var/run/bmc-watchdog.pid.
> > > >>>>
> > > >>>> - Checking for a bmc-watchdog process in rc.local shows it isn't running
> > > >>>>   and the timer is counting down.
> > > >>>>
> > > >>>> - There is no shutdown message logged when the process disappears during
> > > >>>>   bootup.
> > >>>>
> > >>>> - There are no messages suggesting the process was killed
> > >>>>
> > >>>> On shutdown the init script gets as far as removing
> > >>>> /var/run/bmc-watchdog.pid and seems to work fine.
> > >>>>
> > > >>>> If I stuff this in rc.local the bmc-watchdog starts up properly and never
> > > >>>> seems to die again until the next reboot:
> > >>>> /usr/sbin/service bmc-watchdog stop
> > >>>> /usr/sbin/service bmc-watchdog start
> > >>>>
> > > >>>> All in all this is very weird behaviour. Is it possible a newer version of
> > > >>>> bmc-watchdog would address this? i.e. is this a known bug?
> > >>>>
> > >>>> Any other ideas why this is happening (or how I can debug further)?
> > >>>>
> > >>>> Regards,
> > >>>> Rob
> > >>>>
> > >>>> _______________________________________________
> > >>>> Freeipmi-users mailing list
> > >>>> address@hidden
> > >>>> http://lists.gnu.org/mailman/listinfo/freeipmi-users
> > 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



