freeipmi-users

From: Albert Chu
Subject: Re: [Freeipmi-users] Temperature sensors disabled or abnormally cold: Can alerts be sent?
Date: Thu, 30 Sep 2010 14:05:07 -0700

Hi Ryan,

Admittedly, this is complex.  The tool was written (to some extent) for
expert IPMI users.  Something more user-friendly written on top of it
would be ideal, but given all the different permutations of
motherboards out there (and, quite frankly, a relative lack of demand),
I've never gotten to it.  Hopefully I can point you in the right
direction.  The following is something I wrote to a user a while ago [1].
Perhaps you can use it as a start and then ask follow-up questions
from there?

----
> In the section PEF_Conf, turn PEF on (that part should be simple).
> 
> In Community_String and Lan_Alert_Destination_1, set up your SNMP
> configuration and destination IP appropriately.  Hopefully this part
> is easy too.
> 
> In Alert_Policy_1, the defaults your motherboard set are probably
> fine, but you probably want to set "Always_Send_To_This_Destination",
> enable the policy, and configure the channel for LAN.  This is
> probably of medium difficulty.
> 
> In Event_Filter_1, this is probably the hard part.  The Filter should
> be set to "Software_Configurable".  Set the Policy Number to the one
> previously configured ('1' if you're following my instructions).  For
> "Generator ID", "sensor number", and "event trigger", you probably
> want to set 0xFF for "any".  For "Sensor Type", set the sensor type
> appropriately, like "Fan", "Temperature", etc.
> 
> Then it gets nasty.  You'll have to read chapter 17 of the IPMI spec
> for the full answer.  But basically you then need to set bits to
> indicate how the event gets triggered.  It depends on each sensor
> type.  I think the easiest thing to do is set the AND and Compare
> masks to 0, but set the Event_Data1_Offset_Mask to the bitmask of
> bits you are interested in (see chapter 42 of the spec).
> 
> You may need to adjust some of what I said above for your motherboard.
> Perhaps there are some manufacturer configurations that cannot be
> modified, so you need to use policy 2 instead of 1, event filter 3,
> etc.
> 
----
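
To make the Event_Data1_Offset_Mask part a bit more concrete, here is a
rough sketch of how the bitmask relates to event offsets: bit N of the
mask selects event offset N.  The offset names below are the ones for
threshold-based sensors from the event/reading type code table in the
spec; the function names are made up for illustration and are not part
of pef-config or FreeIPMI.  Treat it as a way to reason about values,
not as a statement of what your particular motherboard will honor.

```python
# Event offsets for threshold-based sensors (event/reading type 0x01),
# as listed in the IPMI spec's generic event/reading type code table.
THRESHOLD_OFFSETS = {
    0x0: "Lower Non-critical - going low",
    0x1: "Lower Non-critical - going high",
    0x2: "Lower Critical - going low",
    0x3: "Lower Critical - going high",
    0x4: "Lower Non-recoverable - going low",
    0x5: "Lower Non-recoverable - going high",
    0x6: "Upper Non-critical - going low",
    0x7: "Upper Non-critical - going high",
    0x8: "Upper Critical - going low",
    0x9: "Upper Critical - going high",
    0xA: "Upper Non-recoverable - going low",
    0xB: "Upper Non-recoverable - going high",
}


def offsets_to_mask(offsets):
    """Build a 16-bit offset mask with bit N set for each offset N (hypothetical helper)."""
    mask = 0
    for off in offsets:
        if not 0 <= off <= 15:
            raise ValueError("offset out of range for a 16-bit mask: %r" % off)
        mask |= 1 << off
    return mask


def mask_to_offsets(mask):
    """Return the sorted list of offsets whose bits are set in the mask (hypothetical helper)."""
    return [off for off in range(16) if mask & (1 << off)]


if __name__ == "__main__":
    # Offsets 0x2 and 0x9 together give the value used in the quoted config.
    print(hex(offsets_to_mask([0x2, 0x9])))  # 0x204
    for off in mask_to_offsets(0x204):
        print(off, THRESHOLD_OFFSETS.get(off, "unknown for this event type"))
```

Read this way, the 0x204 in the config quoted below selects "Lower
Critical - going low" and "Upper Critical - going high" when the sensor
is a threshold-based one.  Other event/reading types use different
offset tables, so the same mask means something else for them.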

Al

[1] - thought it was on the mailing list, but alas, 

On Thu, 2010-09-30 at 10:46 -0700, Ryan Cox wrote:
> I'm trying to use pef-config (or any tool that will work) to have a Dell 
> PowerEdge M610 send alerts when a temperature sensor is disabled or has 
> seriously erroneous data, like reporting a processor to be at 5 degrees 
> C.  Here is an example:
> 
> # ipmitool sdr type Temperature
> Temp             | 01h | ns  |  3.1 | Disabled
> Temp             | 02h | ok  |  3.2 | 5 degrees C
> Ambient Temp     | 08h | ok  |  7.1 | 26 degrees C
> IOH ThermTrip    | 35h | ns  |  7.1 | Disabled
> 
> From a good server:
> # ipmitool sdr type Temperature
> Temp             | 01h | ok  |  3.1 | 26 degrees C
> Temp             | 02h | ok  |  3.2 | 34 degrees C
> Ambient Temp     | 08h | ok  |  7.1 | 24 degrees C
> IOH ThermTrip    | 35h | ns  |  7.1 | Disabled
> 
> 
> 
> The processor sensors are the ones I care about (3.1 and 3.2).  A server 
> was affected by a power event (surge or sag... not sure) and the 
> processor temperature sensors are having issues for some reason.  The 
> CPUs are throttled as a result and that is logged via syslog.  We can 
> take care of the hardware issues just fine, but I am hoping to have our 
> servers notify us of problems in a way like alerts for an ECC threshold 
> error (entry in event log, snmp trap sent, amber light).  I played 
> around with pef-config for a while and can't figure out how to make it 
> alert when a sensor is disabled.  I'm also not sure if the alert would 
> only happen on a cold boot, etc, so I'm not sure if maybe I do have it 
> configured correctly but just can't test it.  The affected servers are 
> still in use until user jobs on them are finished, so I can't reboot 
> them until that time.
> 
> Here's an example of a config I was working with:
> Section Event_Filter_9
>          ## Possible values: Manufacturer_Pre_Configured/Software_Configurable/Reserved1/Reserved3
>          Filter_Type                                  Manufacturer_Pre_Configured
>          ## Possible values: Yes/No
>          Enable_Filter                                Yes
>          ## Possible values: Yes/No
>          Event_Filter_Action_Alert                    Yes
>          ## Possible values: Yes/No
>          Event_Filter_Action_Power_Off                No
>          ## Possible values: Yes/No
>          Event_Filter_Action_Reset                    No
>          ## Possible values: Yes/No
>          Event_Filter_Action_Power_Cycle              No
>          ## Possible values: Yes/No
>          Event_Filter_Action_Oem                      No
>          ## Possible values: Yes/No
>          Event_Filter_Action_Diagnostic_Interrupt     No
>          ## Possible values: Yes/No
>          Event_Filter_Action_Group_Control_Operation  No
>          ## Give a valid number
>          Alert_Policy_Number                          1
>          ## Give a valid number
>          Group_Control_Selector                       0
>          ## Possible values: Unspecified/Monitor/Information/OK/Non_Critical/Critical/Non_Recoverable
>          Event_Severity                               Critical
>          ## Specify a hex Slave Address or Software ID from Event Message or 0xFF to Match Any
>          Generator_Id_Byte_1                          0xFF
>          ## Specify a hex Channel Number or LUN to match or 0xFF to Match Any
>          Generator_Id_Byte_2                          0xFF
>          ## Specify a Sensor Type, For options see the MAN page
>          Sensor_Type                                  Temperature
>          ## Specify a Sensor Number or 0xFF to Match Any
>          Sensor_Number                                0xFF
>          ## Specify a Event/Reading Type Number or 0xFF to Match Any
>          Event_Trigger                                0xFF
>          ## Give a valid number
>          Event_Data1_Offset_Mask                      0x204
>          ## Give a valid number
>          Event_Data1_AND_Mask                         0x00
>          ## Give a valid number
>          Event_Data1_Compare1                         0xFF
>          ## Give a valid number
>          Event_Data1_Compare2                         0x00
>          ## Give a valid number
>          Event_Data2_AND_Mask                         0x00
>          ## Give a valid number
>          Event_Data2_Compare1                         0xFF
>          ## Give a valid number
>          Event_Data2_Compare2                         0x00
>          ## Give a valid number
>          Event_Data3_AND_Mask                         0x00
>          ## Give a valid number
>          Event_Data3_Compare1                         0xFF
>          ## Give a valid number
>          Event_Data3_Compare2                         0x00
> EndSection
> 
> 
> In this attempt, I was trying to have it essentially alert for 
> everything and then narrow it down from there.
> 
> A few things I'm unsure about: What is Event_Data1_Offset_Mask and is it 
> set appropriately for what I want to do (I used an existing temperature 
> policy from the blade as a template)?  This is my first time messing 
> with pef-config, so I'm a little confused by it to be honest.  I know 
> how to checkout, diff, commit, etc, but am having trouble figuring out 
> what to put for some of the values.
> 
> Any thoughts?  Am I going about this the wrong way?
> 
> Thanks
> 
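
If PEF turns out not to cover the "Disabled" or implausibly cold case on
a given board, a software-side check is one possible fallback.  The
following is only a sketch under assumptions: it expects text in the
same format as the `ipmitool sdr type Temperature` output quoted above,
and the function names, the entity list ("3.1", "3.2"), and the 10
degree C cutoff are hypothetical choices, not anything defined by
ipmitool or FreeIPMI.

```python
import re

# Matches readings such as "26 degrees C" as they appear in the quoted output.
READING_RE = re.compile(r"^(-?\d+(?:\.\d+)?) degrees C$")


def parse_sdr_line(line):
    """Split one 'name | id | status | entity | reading' line into a dict, or return None."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        return None
    name, sensor_id, status, entity, reading = fields
    return {"name": name, "id": sensor_id, "status": status,
            "entity": entity, "reading": reading}


def find_suspect_sensors(output, entities=("3.1", "3.2"), min_celsius=10.0):
    """Return (name, entity, reading) for watched sensors with no reading or a too-cold one."""
    suspects = []
    for line in output.splitlines():
        record = parse_sdr_line(line)
        if record is None or record["entity"] not in entities:
            continue
        match = READING_RE.match(record["reading"])
        # No numeric reading (for example "Disabled") or a value below the cutoff.
        if match is None or float(match.group(1)) < min_celsius:
            suspects.append((record["name"], record["entity"], record["reading"]))
    return suspects


if __name__ == "__main__":
    sample = """\
Temp             | 01h | ns  |  3.1 | Disabled
Temp             | 02h | ok  |  3.2 | 5 degrees C
Ambient Temp     | 08h | ok  |  7.1 | 26 degrees C
IOH ThermTrip    | 35h | ns  |  7.1 | Disabled
"""
    for name, entity, reading in find_suspect_sensors(sample):
        print("suspect sensor:", name, entity, reading)
```

Restricting the check to the processor entities keeps sensors such as
IOH ThermTrip, which is reported as Disabled even on the good server,
from triggering false alarms.
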
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
