freeipmi-devel



From: Albert Chu
Subject: [Freeipmi-devel] Re: Another FreeIPMI beta w/ BMC watchdog workaround for Sun machines
Date: Tue, 06 Jul 2010 09:40:03 -0700

Hey Frank,

On Sun, 2010-07-04 at 23:52 -0700, Frank Steiner wrote:
> Hi Al,
> 
> Albert Chu wrote
> 
> > Hey Dave, Frank,
> > 
> > As discussed in the previous thread, there was a corner case in the
> > bmc-watchdog workaround I previously did.  I then discovered another
> > corner case w/ the workaround.
> > 
> > There is a new beta here.
> 
> Sorry, I was away, but I'm going to test the new beta now. During my
> absence the Sun X4100M2 produced two strange things:
> 
> 1) bmc-watchdog: Get Watchdog Timer Error: No error message found for 
>    command 25h, network function 06h, and completion code 80h.  Please 
>    report to <address@hidden>

Congrats, you're the first person to ever report this! :-)  The above
maps to the "Get Watchdog Timer" command, and 80h is a completion code
in the range that is supposed to be defined by the IPMI spec, but
currently is not.  So Sun clearly made up an error code number for
something.  I'll ping some people at Sun and see if they can get the
error message for me.
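
For reference, here is a rough sketch of how a completion code such as 80h can be sorted into the ranges laid out in the IPMI spec's completion code table: 00h for success, 01h-7Eh for device-specific (OEM) codes, 80h-BEh for command-specific codes, and C0h-FFh for generic codes. The function name is made up for this example and is not part of FreeIPMI.

```shell
#!/bin/sh
# Classify an IPMI completion code by the range it falls in.
# classify_completion_code is a hypothetical helper, not a FreeIPMI tool.
classify_completion_code() {
    code=$(( $1 ))    # accepts decimal or 0x-prefixed hex, e.g. 0x80
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -ge 1 ] && [ "$code" -le 126 ]; then
        echo "device-specific (OEM)"      # 01h-7Eh
    elif [ "$code" -ge 128 ] && [ "$code" -le 190 ]; then
        echo "command-specific"           # 80h-BEh
    elif [ "$code" -ge 192 ] && [ "$code" -le 255 ]; then
        echo "generic"                    # C0h-FFh
    else
        echo "reserved"                   # 7Fh, BFh, or out of range
    fi
}

classify_completion_code 0x80
```

Run as written, this reports 80h as command-specific, which is why the meaning has to come from the definition of the command itself (or, here, from the vendor).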

> 2) The really bad thing was three of the X4100M2 being rebooted by the
>    watchdog in reaction to a "bmc-watchdog -s -k" call, I guess. The
>    timer runs 15 minutes and I reset the watchdog with two independent
>    instances every 3 minutes. On all three machines I found this in
>    the logs:

I'll respond to this in your other post.

Al

>    Jul  3 21:03:01 sunserver8 /usr/sbin/cron[11808]: (root) CMD (/usr/bin/bmc-reset)
>    Jul  3 21:03:04 sunserver8 pm-profiler: Power Button pressed, executing /sbin/shutdown -h now
>    Jul  3 21:03:04 sunserver8 shutdown[11853]: shutting down for system halt
> 
>    The bmc-reset script just does this:
>    for name in `seq 1 15`
>    do
>      # -s -k means: reset if running. Could be that the timer was
>      # stopped because the init script failed to set it up. We should
>      # not start it then.
>      output=`/usr/sbin/bmc-watchdog -s -k 2>&1`
>      exitstatus=$?
>      if [ "$exitstatus" != "0" ]
>      then
>        sleep 3
>      else
>        exit 0
>      fi
>    done
> 
>    There were always 2-3 seconds between the cron entry and the shutdown,
>    so I guess the ILOM of the Sun initiated the shutdown due to the
>    bmc-watchdog -s -k command. The timer cannot have run down, because
>    I get an email for every failed attempt to reset the watchdog and should
>    have gotten 3-4 of them in the 15 minutes the timer runs.
> 
>    Has anything like this been reported before?
> 
> Btw, Sun first refused to develop a firmware update for the X4100M2 because
> it is EOL, but due to our 5-year-support warranty they are forced to do so ;-)
> Now they are developing a patch for a newer machine, because they stated that
> the error exists in many of the SunFire machines, and will then backport it to
> the 4100.
> 
> cu,
> Frank
> 
> 
> 
> 
> 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



