freeipmi-users

Re: [Freeipmi-users] problems with bmc-watchdog


From: Al Chu
Subject: Re: [Freeipmi-users] problems with bmc-watchdog
Date: Wed, 05 May 2010 17:26:07 -0700

Hey Dave,

Inlined answers below:

On Wed, 2010-05-05 at 16:00 -0700, Dave Love wrote:
> Al Chu <address@hidden> writes:
> 
> > Let's try some tests.  Could you run bmc-watchdog "by hand" to make sure
> > things look like they're working right?  By "by hand", I mean run
> > something like:
> >
> > bmc-watchdog --get (see what the current watchdog settings are)
> > bmc-watchdog --set ... (with the same options as the daemon, except not the
> > reset interval '-e 60')
> > bmc-watchdog --get (see that things are set)
> > bmc-watchdog --start
> > bmc-watchdog --get (make sure things changed, timer is running)
> > bmc-watchdog --get (make sure timer is counting down)
> > bmc-watchdog --reset
> > bmc-watchdog --get (make sure timer has reset)
> >
> > (and you probably want to do bmc-watchdog --stop at the end)
> 
> I should have said I was puzzled by when it says Stopped.  This is a
> RH5, Sun ILOM 2 system (not ELOM as I thinko'd before).
> 
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Set
>   Timer Use BIOS POST Flag:    Set
>   Timer Use BIOS OS Load Flag: Set
>   Timer Use BIOS SMS/OS Flag:  Set
>   Timer Use BIOS OEM Flag:     Set
>   Initial Countdown:           900 seconds
>   Current Countdown:           900 seconds
>   # bmc-watchdog --set -u 4 -p 0 -a 1 -i 900
>   # bmc-watchdog --get 
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Set
>   Timer Use BIOS POST Flag:    Set
>   Timer Use BIOS OS Load Flag: Set
>   Timer Use BIOS SMS/OS Flag:  Set
>   Timer Use BIOS OEM Flag:     Set
>   Initial Countdown:           900 seconds
>   Current Countdown:           900 seconds
>   # bmc-watchdog --start
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           900 seconds
>   # sleep 2
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           898 seconds

Well, this explains why you're getting "timer stopped by another process".
For whatever reason, the timer flag never goes from "Stopped" to "Running",
even after the timer has clearly been started and is counting down.
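
To make the mismatch concrete, here is a rough Python sketch that parses
two `--get` snapshots and flags the contradictory state.  The field names
and values are copied from the output quoted above; `parse_watchdog_get`
and `countdown_seconds` are hypothetical helpers written for this
illustration and are not part of FreeIPMI.

```python
import re

def parse_watchdog_get(text):
    """Turn 'Field: value' lines of `bmc-watchdog --get` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") and indentation in pasted output.
        line = line.lstrip("> ").strip()
        if ":" not in line:
            continue  # skips shell prompts such as "# sleep 2"
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def countdown_seconds(fields):
    """Extract the integer from a value like '898 seconds'."""
    match = re.match(r"(\d+) seconds", fields["Current Countdown"])
    return int(match.group(1))

# Two snapshots taken from the ILOM transcript, before and after `sleep 2`.
BEFORE = """\
Timer:                       Stopped
Current Countdown:           900 seconds
"""
AFTER = """\
Timer:                       Stopped
Current Countdown:           898 seconds
"""

before = parse_watchdog_get(BEFORE)
after = parse_watchdog_get(AFTER)

# The flag claims "Stopped", yet the countdown has decreased.
inconsistent = (after["Timer"] == "Stopped"
                and countdown_seconds(after) < countdown_seconds(before))
print(inconsistent)
```

Run against the x2200M2/ELOM output further down, the same comparison
would not flag anything, since there the flag reads "Running" while the
countdown decreases.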

>   # bmc-watchdog --reset
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           900 seconds
>   
> > This can help us isolate things.  If the above works, then maybe there
> > is a timing issue within your BMC that we need to get around.  I'm a
> > little perplexed as to why it would work with the openipmi driver.  It's
> > possible it's more generous on some timeouts of packets and such.  Or
> > maybe the openipmi driver's own watchdog implementation/code has done
> > something to massage the BMC that I'm unaware of.
> 
> I probably wasn't clear.  What I meant was:
> 
>   # bmc-watchdog -g --config-file /dev/null
>   ipmi-kcs-driver.c: 749: ipmi_kcs_write: error 'BMC busy' (7)
>   ipmi-kcs-driver.c: 749: ipmi_kcs_write: error 'BMC busy' (7)
>   ipmi-kcs-driver.c: 858: ipmi_kcs_read: error 'BMC busy' (7)
>   ipmi-kcs-driver.c: 749: ipmi_kcs_write: error 'BMC busy' (7)
>   ipmi-kcs-driver.c: 749: ipmi_kcs_write: error 'BMC busy' (7)
>   ipmi-kcs-driver.c: 858: ipmi_kcs_read: error 'BMC busy' (7)
>   bmc-watchdog: Get Watchdog Timer Error: BMC Busy
> 
> in contrast to:
> 
>   # bmc-watchdog -g --config-file /dev/null -D OPENIPMI|head -1
>   Timer Use:                   SMS/OS
>   ...
> 
> and
> 
>   # bmc-info --config-file /dev/null
>   Device ID             : 32
>   ...
> 
> Actually now it's obvious there's something wrong with the ILOM, thanks.
> I've now tried on an x2200M2 with ELOM with the results below (and I
> don't have to specify the openipmi driver).  I guess I won't get
> anywhere with a service request on this -- especially as I'm only doing
> it because Sun couldn't fix the hangups on the Thumper -- but perhaps
> you have a simple idea for a fix?
> 
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           0 seconds
>   # bmc-watchdog --set -u 4 -p 0 -a 1 -i 900
>   # bmc-watchdog --get 
>   Timer Use:                   SMS/OS
>   Timer:                       Stopped
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           0 seconds
>   # bmc-watchdog --start
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Running
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           900 seconds
>   # sleep 2
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Running
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           898 seconds

In this particular case, it seems that the Timer turns on properly and
goes to "Running".  Does the daemon work properly on this node?

I'm trying to think of some way to deal with your problem on the other
nodes.  I think ignoring the flag entirely is a bad idea in general,
although it would probably get around the problem.

Perhaps I can edit the code to add a workaround check that looks at
whether the countdown is changing.  If it is, assume the timer is running
regardless of what the flag says.  I'm willing to give it a shot if
you're interested.
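
A minimal sketch of that check, in Python for brevity.  This is not the
actual change to bmc-watchdog; the function name and its arguments are
assumptions made for illustration, and the sample values come from the
transcripts in this message.

```python
def timer_effectively_running(flag, countdown_before, countdown_after):
    """Decide whether the watchdog timer should be treated as running.

    flag             -- the "Timer:" field as reported by the BMC
    countdown_before -- "Current Countdown" in seconds from a first read
    countdown_after  -- the same field from a second, later read
    """
    if flag == "Running":
        return True
    # Workaround: the flag says "Stopped", but a countdown that changed
    # between the two reads means the timer is in fact active.
    return countdown_before != countdown_after

# ILOM node: flag stuck at "Stopped" while the countdown moves.
print(timer_effectively_running("Stopped", 900, 898))
# ELOM node before --start: flag "Stopped" and the countdown sits at 0.
print(timer_effectively_running("Stopped", 0, 0))
# ELOM node after --start: flag and countdown agree.
print(timer_effectively_running("Running", 900, 898))
```

One caveat with this approach: the two reads need to be far enough apart
for the countdown to move, otherwise a running timer could still look
stopped.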

Al

>   # bmc-watchdog --reset
>   # bmc-watchdog --get
>   Timer Use:                   SMS/OS
>   Timer:                       Running
>   Logging:                     Enabled
>   Timeout Action:              Hard Reset
>   Pre-Timeout Interrupt:       None
>   Pre-Timeout Interval:        0 seconds
>   Timer Use BIOS FRB2 Flag:    Clear
>   Timer Use BIOS POST Flag:    Clear
>   Timer Use BIOS OS Load Flag: Clear
>   Timer Use BIOS SMS/OS Flag:  Clear
>   Timer Use BIOS OEM Flag:     Clear
>   Initial Countdown:           900 seconds
>   Current Countdown:           899 seconds
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




