From: Ryan Cox
Subject: [Freeipmi-users] Temperature sensors disabled or abnormally cold: Can alerts be sent?
Date: Thu, 30 Sep 2010 11:46:08 -0600

I'm trying to use pef-config (or any tool that will work) to have a Dell PowerEdge M610 send alerts when a temperature sensor is disabled or reports seriously erroneous data, such as a processor at 5 degrees C. Here is an example from an affected server:

# ipmitool sdr type Temperature
Temp             | 01h | ns  |  3.1 | Disabled
Temp             | 02h | ok  |  3.2 | 5 degrees C
Ambient Temp     | 08h | ok  |  7.1 | 26 degrees C
IOH ThermTrip    | 35h | ns  |  7.1 | Disabled

From a good server:
# ipmitool sdr type Temperature
Temp             | 01h | ok  |  3.1 | 26 degrees C
Temp             | 02h | ok  |  3.2 | 34 degrees C
Ambient Temp     | 08h | ok  |  7.1 | 24 degrees C
IOH ThermTrip    | 35h | ns  |  7.1 | Disabled



The processor sensors are the ones I care about (3.1 and 3.2). A server was affected by a power event (surge or sag, I'm not sure which) and the processor temperature sensors have been having issues since then. The CPUs are throttled as a result, and that is logged via syslog. We can take care of the hardware issues just fine, but I am hoping to have our servers notify us of such problems the same way they do for an ECC threshold error (entry in the event log, SNMP trap sent, amber light).

I played around with pef-config for a while and can't figure out how to make it alert when a sensor is disabled. I'm also not sure whether the alert would only fire on a cold boot, so I may have it configured correctly and just be unable to test it. The affected servers are still in use until the user jobs on them finish, so I can't reboot them until then.
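To make the condition concrete, here is a minimal sketch of the check in software, run over the `ipmitool sdr type Temperature` output shown above. It is only a polling illustration, not a PEF alert. The function name, the 10 degree C floor, and the choice of entities 3.1 and 3.2 are assumptions taken from this one example and would need adjusting for other hardware.

```python
import re

# Sample output copied from the affected server shown above.
SAMPLE = """\
Temp             | 01h | ns  |  3.1 | Disabled
Temp             | 02h | ok  |  3.2 | 5 degrees C
Ambient Temp     | 08h | ok  |  7.1 | 26 degrees C
IOH ThermTrip    | 35h | ns  |  7.1 | Disabled
"""

# Assumption: 3.1 and 3.2 are the processor entities, as in the example.
CPU_ENTITIES = {"3.1", "3.2"}
# Hypothetical floor: a running CPU reporting less than this is treated as bogus.
MIN_PLAUSIBLE_C = 10


def find_suspect_sensors(sdr_text, entities=CPU_ENTITIES, min_c=MIN_PLAUSIBLE_C):
    """Return (entity, sensor_id, reason) for disabled or implausibly cold sensors."""
    suspects = []
    for line in sdr_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank lines or anything that is not a sensor row
        _name, sensor_id, status, entity, reading = fields
        if entity not in entities:
            continue  # e.g. IOH ThermTrip is Disabled even on a good server
        if status == "ns" or reading == "Disabled":
            suspects.append((entity, sensor_id, "disabled"))
            continue
        match = re.match(r"(-?\d+(?:\.\d+)?) degrees C$", reading)
        if match and float(match.group(1)) < min_c:
            suspects.append((entity, sensor_id, "too cold: " + reading))
    return suspects


if __name__ == "__main__":
    for entity, sensor_id, reason in find_suspect_sensors(SAMPLE):
        print(entity, sensor_id, reason)
```

Filtering by entity matters here because IOH ThermTrip reports Disabled on the good server as well, so flagging every disabled sensor would produce false alarms.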

Here's an example of a config I was working with:
Section Event_Filter_9
        ## Possible values: Manufacturer_Pre_Configured/Software_Configurable/Reserved1/Reserved3
        Filter_Type                                  Manufacturer_Pre_Configured
        ## Possible values: Yes/No
        Enable_Filter                                Yes
        ## Possible values: Yes/No
        Event_Filter_Action_Alert                    Yes
        ## Possible values: Yes/No
        Event_Filter_Action_Power_Off                No
        ## Possible values: Yes/No
        Event_Filter_Action_Reset                    No
        ## Possible values: Yes/No
        Event_Filter_Action_Power_Cycle              No
        ## Possible values: Yes/No
        Event_Filter_Action_Oem                      No
        ## Possible values: Yes/No
        Event_Filter_Action_Diagnostic_Interrupt     No
        ## Possible values: Yes/No
        Event_Filter_Action_Group_Control_Operation  No
        ## Give a valid number
        Alert_Policy_Number                          1
        ## Give a valid number
        Group_Control_Selector                       0
        ## Possible values: Unspecified/Monitor/Information/OK/Non_Critical/Critical/Non_Recoverable
        Event_Severity                               Critical
        ## Specify a hex Slave Address or Software ID from Event Message or 0xFF to Match Any
        Generator_Id_Byte_1                          0xFF
        ## Specify a hex Channel Number or LUN to match or 0xFF to Match Any
        Generator_Id_Byte_2                          0xFF
        ## Specify a Sensor Type, For options see the MAN page
        Sensor_Type                                  Temperature
        ## Specify a Sensor Number or 0xFF to Match Any
        Sensor_Number                                0xFF
        ## Specify a Event/Reading Type Number or 0xFF to Match Any
        Event_Trigger                                0xFF
        ## Give a valid number
        Event_Data1_Offset_Mask                      0x204
        ## Give a valid number
        Event_Data1_AND_Mask                         0x00
        ## Give a valid number
        Event_Data1_Compare1                         0xFF
        ## Give a valid number
        Event_Data1_Compare2                         0x00
        ## Give a valid number
        Event_Data2_AND_Mask                         0x00
        ## Give a valid number
        Event_Data2_Compare1                         0xFF
        ## Give a valid number
        Event_Data2_Compare2                         0x00
        ## Give a valid number
        Event_Data3_AND_Mask                         0x00
        ## Give a valid number
        Event_Data3_Compare1                         0xFF
        ## Give a valid number
        Event_Data3_Compare2                         0x00
EndSection


In this attempt, I was trying to have it essentially alert for everything and then narrow it down from there.

A few things I'm unsure about: what is Event_Data1_Offset_Mask, and is it set appropriately for what I want to do? (I used an existing temperature policy from the blade as a template.) This is my first time using pef-config, so I'm a little confused by it, to be honest. I know how to checkout, diff, commit, etc., but I'm having trouble figuring out what to put for some of the values.
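For reference, here is a sketch of how the 0x204 value might decode, assuming the IPMI specification's definition of the event offset mask as a bit field in which bit N selects event offset N. The offset names below are the threshold-type (Event/Reading Type 0x01) offsets from the specification, so they only apply if the matched events are threshold events. The filter above uses Event_Trigger 0xFF (match any), so other event types would give the same bits different meanings. The function name is hypothetical.

```python
# Threshold event offsets (Event/Reading Type 0x01), indexed by offset value.
# Assumption: the filter is matching threshold-type events.
THRESHOLD_OFFSETS = [
    "Lower Non-critical going low",      # offset 0
    "Lower Non-critical going high",     # offset 1
    "Lower Critical going low",          # offset 2
    "Lower Critical going high",         # offset 3
    "Lower Non-recoverable going low",   # offset 4
    "Lower Non-recoverable going high",  # offset 5
    "Upper Non-critical going low",      # offset 6
    "Upper Non-critical going high",     # offset 7
    "Upper Critical going low",          # offset 8
    "Upper Critical going high",         # offset 9
    "Upper Non-recoverable going low",   # offset 10
    "Upper Non-recoverable going high",  # offset 11
]


def decode_offset_mask(mask):
    """List the (offset, name) pairs whose bits are set in the mask."""
    return [
        (bit, THRESHOLD_OFFSETS[bit])
        for bit in range(len(THRESHOLD_OFFSETS))
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    for bit, name in decode_offset_mask(0x204):
        print(bit, name)
```

Under those assumptions, 0x204 has bits 2 and 9 set, which would correspond to "Lower Critical going low" and "Upper Critical going high". If that reading is right, the template mask matches threshold crossings only, and it is unclear whether a sensor going to Disabled generates any event such a filter could match.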

Any thoughts?  Am I going about this the wrong way?

Thanks

--
Ryan Cox
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
