freeipmi-devel

From: Al Chu
Subject: Re: Fw: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable to get SEL record
Date: Tue, 27 Jan 2009 09:36:56 -0800

Hey Won,

On Mon, 2009-01-26 at 18:53 -0800, Won De Erick wrote:
> ----- Original Message ----
> 
> > From: Al Chu <address@hidden>
> > 
> > Hey Won,
> > 
> > On Sun, 2009-01-25 at 23:00 -0800, Won De Erick wrote:
> > > I am forwarding this to the FreeIPMI users mailing list. Hope, I can get
> > > hints from you all.
> > > Thank you.
> > > 
> > > 
> > > 
> > > ----- Forwarded Message ----
> > > From: Won De Erick 
> > > To: Albert Chu 
> > > Cc: address@hidden
> > > Sent: Saturday, January 24, 2009 11:55:24 AM
> > > Subject: Re: [Freeipmi-devel] ibmx3650 reboots after ipmi-sel is unable
> > > to get SEL record
> > > 
> > > Pls disregard previous email. I forgot to attach the file. :)
> > 
> > Did you send me the wrong debug file?  I see debug output from
> > ipmi-sensors??
> > 
> 
> I'm sorry, attached is the correct one.

This one appears to contain a successful ipmi-sel run, so there isn't
much I can debug with :-(

> 
> > > Hi Al,
> > > 
> > > With IBM x3650, I  noticed that ipmi-sel is unable to get the SEL record.
> > > 
> > > # ipmi-sel --version
> > > IPMI Sensors [ipmi-sel-0.6.10]
> > > 
> > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > > ipmi_cmd_get_sel_entry: BMC busy
> > > ipmi-sel: unable to get SEL record
> > > 
> > > After the above, the box automatically rebooted. Is this normal?
> > 
> > I have never seen this behavior before, and I wouldn't consider it
> > "good" in any definition.  This is likely a bug in the IBM
> > implementation.  The "BMC busy" means exactly what it says, the BMC is
> > busy and cannot respond to IPMI requests.  It by itself is not a
> > problem.  For example, some other IPMI tasks are hogging resources.  But
> > you should presumably be able to reach the card eventually.  Is it
> > possible you have other IPMI things running in the background?
> > 
> 
> bmc-watchdog (as daemon) was the only thing running in the background.

This shouldn't be enough to make the BMC *that* busy.  Here's a
thought.  Perhaps the SEL filled up, the BMC went busy, and
bmc-watchdog couldn't issue the IPMI commands that reset the watchdog
timer, so the timer expired and the box rebooted??  Obviously, it
depends on how you set up bmc-watchdog.
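
To make that guess a bit more concrete, here is a rough Python sketch
of the failure chain.  It is only a toy model, not what bmc-watchdog
actually does internally: the function name, the 60 second timeout,
the 10 second reset interval, and the busy periods are all made-up
values for illustration.

```python
def simulate_watchdog(timeout_s, reset_interval_s, busy_windows, duration_s):
    """Return the second at which the watchdog timer expires, or None.

    Toy model (not bmc-watchdog's real logic): a daemon tries to reset the
    timer every reset_interval_s seconds, and a reset fails whenever the BMC
    is busy. busy_windows is a list of (start, end) seconds, end exclusive.
    """
    def bmc_busy(t):
        return any(start <= t < end for start, end in busy_windows)

    last_reset = 0  # assume the timer was freshly armed at t = 0
    for t in range(1, duration_s + 1):
        if t % reset_interval_s == 0 and not bmc_busy(t):
            last_reset = t  # reset succeeded, the countdown starts over
        if t - last_reset >= timeout_s:
            return t  # countdown reached zero: the host would be reset
    return None


# A short busy period is survivable...
print(simulate_watchdog(60, 10, [(100, 130)], 300))  # prints None
# ...but a BMC that stays busy longer than the timeout is not.
print(simulate_watchdog(60, 10, [(100, 400)], 300))  # prints 150
```

If something like this is what happened, a longer watchdog timeout
would make a temporarily busy BMC less likely to end in a reboot.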

> 
> > > I then cleared the SEL records, thinking that the reboot might have been
> > > triggered due to a full SEL.
> > 
> > I think this is a reasonable guess.  It could be anything really.  
> > 
> > > # ipmi-sel -c
> > > 
> > > # reboot
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 
> > > # reboot
> > > # ipmi-sel
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > > 
> > > Then retried the previous command that caused an error.
> > > 
> > > # ipmi-sel > ibm3650-dsc2075-sel.txt
> > > 
> > > # cat ibm3650-dsc2075-sel.txt
> > > 1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00
> > > 3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00
> > > 
> > > Then the problem didn't occur anymore.
> > > Besides, what is the meaning of this OEM defined? I can't see any log
> > > that is more specific, or something like
> > 
> > The system event log is allowed to store OEM defined information.  Since
> > the information is defined by (in this case) IBM, I have no way to
> > convert the hex into something like what you're used to :-(
> > 
> 
> I think this is cool. So, is it safe to assume that the system
> rebooted if I see similar OEM defined info (in this case OEM defined
> = 00 00 00 00 00 E3 25 86 80 00 00 FF 00)? Is there any possibility to
> integrate IBM's OEM defined info in the future too? :D

I'd be willing to integrate any vendor's OEM-defined
interpretation/parsing into FreeIPMI.  The problem is, I do not know how
to interpret/parse any of their information :-(
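
Until a vendor documents the format, about the most anyone can do is
treat the bytes as opaque and see which payloads repeat.  Here is a
small Python sketch along those lines.  It assumes the ipmi-sel line
format shown in your output above; the function names are made up for
this example and are not part of FreeIPMI, and it makes no attempt to
say what the bytes mean.

```python
import re
from collections import Counter

# Matches lines in the format shown earlier in this thread, e.g.
# "1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00"
OEM_LINE = re.compile(r"^(\d+):OEM defined = ((?:[0-9A-Fa-f]{2} ?)+)$")


def parse_oem_line(line):
    """Return (record_id, payload_bytes), or None if not an OEM record line."""
    match = OEM_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), bytes.fromhex(match.group(2))


def count_oem_payloads(lines):
    """Count how often each distinct OEM payload appears (bytes stay opaque)."""
    parsed = (parse_oem_line(line) for line in lines)
    return Counter(payload for _, payload in filter(None, parsed))


sel_output = [
    "1:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00",
    "2:OEM defined = 00 00 00 00 00 E3 25 86 80 00 00 FF 00",
    "3:OEM defined = 02 00 00 FF 00 00 00 00 20 00 00 00 00",
    "54:23-Jan-2009 11:28:55:System Event #0:System Reconfigured",
]
for payload, count in count_oem_payloads(sel_output).items():
    print(count, payload.hex(" ").upper())
```

Comparing those counts against the times you know the box rebooted
might tell you whether a given record lines up with a reboot, but that
would only be a correlation.  Only IBM can say what it really means.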

As a customer, you should tell your vendor support about this.  Each
user who asks makes it more likely that they will release the
information.

Al

> > > 220:19-Sep-2008 14:24:56:Power Unit Sys pwr monitor:Power Off/Power Down
> > > 221:19-Sep-2008 14:25:16:Power Unit Sys pwr monitor:Power Off/Power Down
> > > 
> > > I've attached here the ipmi-sel debug output.
> > > 
> > > Then one side question: I want to ask about the possible reasons for
> > > the following log entry, obtained prior to clearing. I didn't change
> > > anything in the system. I just noticed that the system stopped serving
> > > and came back after 4-5 minutes, without any other records in the SEL
> > > saying that the box hung and rebooted.
> > >
> > > 54:23-Jan-2009 11:28:55:System Event #0:System Reconfigured
> > 
> > I'm not quite sure what you're asking.  Are you asking why the above log
> > message occurs?  I'm not too sure.  It could really be for one of many
> > reasons.  Maybe a BIOS change or a firmware change?  The IPMI spec
> > doesn't really define when a "System Reconfigured" event must be
> > reported.  It only defines that a "System Reconfigured" event can occur
> > and that manufacturers are free to determine what events will make that
> > information output to the event log.
> > 
> 
> You got exactly what I meant. But aside from changes to the BIOS or
> BMC firmware, I also want to know whether the event could be reported
> because of changes at the OS level. I just wondered why the "System
> Reconfigured" event was logged when in fact no changes were made to
> the BIOS firmware, the BMC firmware, or the OS. Sorry, this question
> may no longer be related to FreeIPMI, but I just want to elicit some
> ideas from you.
> 
> > Hope I was helpful,
> > 
> > Al
> > 
> > > Thanks,
> > > 
> > > Won
> > > 
> > > 
> > >      
> > -- 
> > Albert Chu
> > address@hidden
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> 
> I am receiving mail delivery error(s) when sending mails to address@hidden; 
> address@hidden
> 
> Thanks for the usual support and help,
> 
> Won
> 
> 
> 
>       
> 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




