Re: [Freeipmi-users] request: status info for discrete sensors for monitoring purposes


From: Albert Chu
Subject: Re: [Freeipmi-users] request: status info for discrete sensors for monitoring purposes
Date: Tue, 22 Jun 2010 10:44:15 -0700

Hi Werner,

Thanks.  You are using a slightly older version of FreeIPMI (I can tell
from the output format), so some of my comments below apply only to
newer versions.

On Tue, 2010-06-22 at 04:16 -0700, Werner Fischer wrote:
> Hi Al,
> 
> ipmimonitoring seems to be very useful for my needs. I gave it a try
> with an Intel SR2500 server. I unplugged one power cord from Power
> Supply 1 (PS1) and removed the cover of the chassis:
> 
> ipmimonitoring reports "Critical" in the fourth column, which is great:
>         address@hidden:~$ ipmimonitoring -h 192.168.1.211 -u monitor -p relation -l user | grep "| Critical |"
>         33 | Power Redundancy | Power Unit | Critical | N/A | 'Redundancy Lost' 'Non-redundant:Sufficient Resources from Redundant'
>         36 | Physical Scrty | Physical Security | Critical | N/A | 'General Chassis Intrusion'
>         49 | PS1 Status | Power Supply | Critical | N/A | 'Presence detected' 'Power Supply input lost (AC/DC)'
>         address@hidden:~$
> 
> With ipmitool I got an "ok" for these sensors:
>         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U monitor -P relation -L user sdr elist
>         [...]
>         PS1 AC Current   | 78h | ok  | 10.1 | 0.12 Amps
>         PS2 AC Current   | 79h | ok  | 10.2 | 0.93 Amps
>         PS1 +12V Current | 7Ah | ok  | 10.1 | 0 Amps
>         PS2 +12V Current | 7Bh | ok  | 10.2 | 16 Amps
>         PS1 +12V Power   | 7Ch | ok  | 10.1 | 0 Watts
>         PS2 +12V Power   | 7Dh | ok  | 10.2 | 192 Watts
>         P1 Therm Margin  | 99h | ok  |  3.1 | -49 degrees C
>         P2 Therm Margin  | 9Bh | ok  |  3.2 | -54 degrees C
>         P1 Therm Ctrl %  | C0h | ok  |  3.1 | 0 unspecified
>         P2 Therm Ctrl %  | C1h | ok  |  3.2 | 0 unspecified
>         Proc 1 Vccp      | D0h | ok  |  3.1 | 1.23 Volts
>         Proc 2 Vccp      | D1h | ok  |  3.2 | 1.23 Volts
>         Mem Therm Margin | 48h | ns  |  3.2 | No Reading
>         Pwr Unit Stat    | 01h | ok  | 21.1 |
>         Power Redundancy | 02h | ok  | 21.1 | Redundancy Lost, Non-Redundant: Sufficient from Redundant
>         BMC Watchdog     | 03h | ok  |  7.1 |
>         Platform Secu V  | 04h | ok  |  7.1 |
>         Physical Scrty   | 05h | ok  | 23.1 | General Chassis intrusion
>         [...]
> 
> Another test with ipmimonitoring, when PS1 is completely removed:
>         address@hidden:~$ ipmimonitoring -h 192.168.1.211 -u monitor -p relation -l user | grep "| Critical |"
>         32 | Pwr Unit Stat | Power Unit | Nominal | N/A | 'OK'
>         33 | Power Redundancy | Power Unit | Critical | N/A | 'Redundancy Lost' 'Non-redundant:Sufficient Resources from Redundant'
>         [...]
>         49 | PS1 Status | Power Supply | Nominal | N/A | 'OK'
>         50 | PS2 Status | Power Supply | Nominal | N/A | 'Presence detected'
> 
>         (Here ipmimonitoring says 'OK' in the last column, whereas VMware
>         says "Unknown" when a power supply is not installed - see
>         http://www.wefi.net/shared/sr2500-example-1.png)

It does depend on how the sensor is implemented.  Here's a layman's idea
of what a power supply sensor can report:

A) sensor reading not available
B) sensor reading available, reports nothing
C) sensor reading available, reports presence detected
D) sensor reading available, reports something wrong (e.g. AC lost)

A, C, and D map to obvious outputs (N/A, "presence detected", and "AC
input lost").  B is the one that's hard to deal with.  On some
motherboards, "reports nothing" means the same as "presence
detected" (the sensor reports A, B, or D, but never C).  On other
motherboards, "reports nothing" means the same as "N/A" (the sensor
reports B, C, or D, but never A).  Since the reading alone doesn't say
which case applies, I currently map "reports nothing" to "OK", which is
the same output as many other sensors.
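
As a rough illustration of the four cases above, here is a minimal
Python sketch.  It is not FreeIPMI code: the function name and the
`empty_means` parameter are hypothetical, chosen only to show why case
B needs a per-board decision.

```python
from typing import Optional, Sequence


def describe_power_supply(states: Optional[Sequence[str]],
                          empty_means: str = "OK") -> str:
    """Map a power supply sensor reading to a display string.

    states is None when no reading is available (case A), an empty
    sequence when a reading is available but asserts nothing (case B),
    and otherwise the asserted state names (cases C and D).
    empty_means is a hypothetical per-board override for case B.
    """
    if states is None:
        return "N/A"                # case A
    if len(states) == 0:
        return empty_means          # case B: ambiguous, board-dependent
    # cases C and D: show every asserted state, quoted
    return " ".join(f"'{s}'" for s in states)


if __name__ == "__main__":
    print(describe_power_supply(None))
    print(describe_power_supply([]))
    print(describe_power_supply([], empty_means="N/A"))
    print(describe_power_supply(["Presence detected",
                                 "Power Supply input lost (AC/DC)"]))
```

On a board where "reports nothing" really means the supply is absent,
passing empty_means="N/A" would be the closer fit.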

I don't know much about the sensor software you're using, but I would
bet that VMware knows how the hardware it supports behaves and has
programmed something specific for it.

> My question: how do you distinguish in ipmimonitoring which of the
> assertion states are ok ("Nominal") and which are not ("Critical")?

You should find a config file /etc/ipmi_monitoring_sensors.conf which
lists the defaults.  You can then tweak as appropriate for your system.
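
Conceptually, the interpretation works like a lookup table from
asserted states to a status, with the worst status winning.  The Python
sketch below illustrates the idea only.  The state strings are the ones
printed in this thread, but the table, the names, and the handling of
unlisted states are illustrative assumptions, not the syntax or the
contents of /etc/ipmi_monitoring_sensors.conf.

```python
# Ordering used to pick the worst status among several asserted states.
SEVERITY = {"Nominal": 0, "Warning": 1, "Critical": 2}

# Hypothetical table keyed by the state strings shown in this thread.
STATE_TABLE = {
    "Presence detected": "Nominal",
    "Power Supply input lost (AC/DC)": "Critical",
    "General Chassis Intrusion": "Critical",
    "Fully Redundant": "Nominal",
    "Redundancy Lost": "Critical",
}


def interpret(states, table=STATE_TABLE, default="Nominal"):
    """Return the worst status among the asserted states.

    States missing from the table fall back to `default`; that choice
    is an assumption of this sketch.
    """
    status = default
    for state in states:
        candidate = table.get(state, default)
        if SEVERITY[candidate] > SEVERITY[status]:
            status = candidate
    return status


if __name__ == "__main__":
    print(interpret(["Presence detected"]))
    print(interpret(["Presence detected",
                     "Power Supply input lost (AC/DC)"]))
    # "Tweaking" the defaults: treat chassis intrusion as a warning only.
    relaxed = dict(STATE_TABLE, **{"General Chassis Intrusion": "Warning"})
    print(interpret(["General Chassis Intrusion"], table=relaxed))
```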

Side note: once I release FreeIPMI 0.9.1, the ipmimonitoring tool will
go away and become a symlink to 'ipmi-sensors --output-sensor-state',
and /etc/ipmi_monitoring_sensors.conf will become
/etc/freeipmi_interpret_sensor.conf.
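
For a monitoring check, filtering on the status column is more robust
than grepping whole lines.  The Python sketch below assumes the
six-column, pipe-separated layout of the older ipmimonitoring output
quoted in this thread, with one record per line (the wrapping seen in
the quoted mail comes from the mail client).  The function names are
hypothetical, and the column layout differs in newer versions.

```python
SAMPLE = """\
Record_ID | Sensor Name | Sensor Group | Monitoring Status| Sensor Units | Sensor Reading
32 | Pwr Unit Stat | Power Unit | Nominal | N/A | 'OK'
36 | Physical Scrty | Physical Security | Critical | N/A | 'General Chassis Intrusion'
49 | PS1 Status | Power Supply | Critical | N/A | 'Presence detected' 'Power Supply input lost (AC/DC)'
"""


def parse_records(text):
    """Yield (record_id, name, status, reading) for each sensor line."""
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header line and anything that is not a six-column record.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        record_id, name, _group, status, _units, reading = fields
        yield int(record_id), name, status, reading


def critical_sensors(text):
    """Return (record_id, name) for every sensor whose status is Critical."""
    return [(rid, name)
            for rid, name, status, _reading in parse_records(text)
            if status == "Critical"]


if __name__ == "__main__":
    for rid, name in critical_sensors(SAMPLE):
        print(rid, name)
```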

Al

> Thanks a lot for your great help,
> best regards,
> Werner
> 
> PS: here is the full output of ipmimonitoring from my first test:
> address@hidden:~$ ipmimonitoring -h 192.168.1.211 -u monitor -p relation -l user
> Record_ID | Sensor Name | Sensor Group | Monitoring Status| Sensor Units | Sensor Reading
> 1 | BB +1.2V Vtt | Voltage | Nominal | V | 1.197000
> 2 | BB +1.5V AUX | Voltage | Nominal | V | 1.466400
> 3 | BB +1.5V | Voltage | Nominal | V | 1.482000
> 4 | BB +1.8V | Voltage | Nominal | V | 1.785000
> 5 | BB +3.3V | Voltage | Nominal | V | 3.354000
> 6 | BB +3.3V STB | Voltage | Nominal | V | 3.354000
> 7 | BB +1.5V ESB | Voltage | Nominal | V | 1.505400
> 8 | BB +5V | Voltage | Nominal | V | 5.070000
> 9 | BB +12V AUX | Voltage | Nominal | V | 11.904000
> 10 | BB +0.9V | Voltage | Nominal | V | 0.897600
> 11 | Serverboard Temp | Temperature | Nominal | C | 29.000000
> 12 | Ctrl Panel Temp | Temperature | Nominal | C | 25.000000
> 13 | Fan 1 | Fan | Nominal | RPM | 5891.000000
> 14 | Fan 2 | Fan | Nominal | RPM | 6278.000000
> 15 | Fan 3 | Fan | Nominal | RPM | 5805.000000
> 16 | Fan 4 | Fan | Nominal | RPM | 6321.000000
> 17 | Fan 5 | Fan | Nominal | RPM | 9052.000000
> 18 | Fan 6 | Fan | Nominal | RPM | 8060.000000
> 19 | PS1 AC Current | Current | Nominal | A | 0.124000
> 20 | PS2 AC Current | Current | Nominal | A | 0.992000
> 21 | PS1 +12V Current | Current | Nominal | A | 0.000000
> 22 | PS2 +12V Current | Current | Nominal | A | 15.000000
> 23 | PS1 +12V Power | N/A | Nominal | W | 0.000000
> 24 | PS2 +12V Power | N/A | Nominal | W | 192.000000
> 25 | P1 Therm Margin | Temperature | Nominal | C | -49.000000
> 26 | P2 Therm Margin | Temperature | Nominal | C | -53.000000
> 27 | P1 Therm Ctrl % | Temperature | Nominal | N/A | 0.000000
> 28 | P2 Therm Ctrl % | Temperature | Nominal | N/A | 0.000000
> 29 | Proc 1 Vccp | Voltage | Nominal | V | 1.227600
> 30 | Proc 2 Vccp | Voltage | Nominal | V | 1.233800
> 32 | Pwr Unit Stat | Power Unit | Nominal | N/A | 'OK'
> 33 | Power Redundancy | Power Unit | Critical | N/A | 'Redundancy Lost' 'Non-redundant:Sufficient Resources from Redundant'
> 34 | BMC Watchdog | Watchdog 2 | Nominal | N/A | 'OK'
> 35 | Platform Secu V | Platform Security Violation Attempt | Nominal | N/A | 'OK'
> 36 | Physical Scrty | Physical Security | Critical | N/A | 'General Chassis Intrusion'
> 37 | FP Interrupt | Critical Interrupt | Nominal | N/A | 'OK'
> 38 | Event Log Disabl | Event Logging Disabled | Nominal | N/A | 'OK'
> 40 | System Event | System Event | Nominal | N/A | 'OK'
> 41 | BB Vbat | Battery | Nominal | N/A | 'OK'
> 42 | Fan 1 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 43 | Fan 2 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 44 | Fan 3 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 45 | Fan 4 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 46 | Fan 5 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 47 | Fan 6 Present | Fan | Nominal | N/A | 'Device Inserted/Device Present'
> 48 | Fan Redundancy | Fan | Nominal | N/A | 'Fully Redundant'
> 49 | PS1 Status | Power Supply | Critical | N/A | 'Presence detected' 'Power Supply input lost (AC/DC)'
> 50 | PS2 Status | Power Supply | Nominal | N/A | 'Presence detected'
> 51 | ACPI State | System ACPI Power State | Nominal | N/A | 'S0/G0'
> 52 | Button | Button/Switch | Nominal | N/A | 'OK'
> 56 | Processor 1 Stat | Processor | Nominal | N/A | 'Processor Presence detected'
> 57 | Processor 2 Stat | Processor | Nominal | N/A | 'Processor Presence detected'
> 58 | PCIe Link0 | Critical Interrupt | Nominal | N/A | 'OK'
> 59 | PCIe Link1 | Critical Interrupt | Nominal | N/A | 'OK'
> 60 | PCIe Link2 | Critical Interrupt | Nominal | N/A | 'OK'
> 61 | PCIe Link3 | Critical Interrupt | Nominal | N/A | 'OK'
> 62 | PCIe Link4 | Critical Interrupt | Nominal | N/A | 'OK'
> 63 | PCIe Link5 | Critical Interrupt | Nominal | N/A | 'OK'
> 64 | PCIe Link6 | Critical Interrupt | Nominal | N/A | 'OK'
> 65 | PCIe Link7 | Critical Interrupt | Nominal | N/A | 'OK'
> 66 | PCIe Link8 | Critical Interrupt | Nominal | N/A | 'OK'
> 67 | PCIe Link9 | Critical Interrupt | Nominal | N/A | 'OK'
> 68 | PCIe Link10 | Critical Interrupt | Nominal | N/A | 'OK'
> 69 | PCIe Link11 | Critical Interrupt | Nominal | N/A | 'OK'
> 70 | PCIe Link12 | Critical Interrupt | Nominal | N/A | 'OK'
> 71 | PCIe Link13 | Critical Interrupt | Nominal | N/A | 'OK'
> 76 | CPU Popul Error | Processor | Nominal | N/A | 'OK'
> 77 | DIMM 1A | Slot/Connector | Nominal | N/A | 'Slot/Connector Device installed/attached'
> 79 | DIMM 1B | Slot/Connector | Nominal | N/A | 'Slot/Connector Device installed/attached'
> 81 | DIMM 1C | Slot/Connector | Nominal | N/A | 'Slot/Connector Device installed/attached'
> 83 | DIMM 1D | Slot/Connector | Nominal | N/A | 'Slot/Connector Device installed/attached'
> address@hidden:~$
> 
> 
> On Mon, 2010-06-21 at 09:32 -0700, Al Chu wrote:
> > Hi Werner,
> >
> > > Does anybody know whether one of the other tools like freeipmi or
> > > ipmiutil has some functionality like this?
> >
> > In FreeIPMI, there is a tool called ipmimonitoring that I believe does
> > what you're asking for (output condensed for readability below).
> >
> > 18 | Fan1            | Nominal  | 14500.00   | RPM   | 'OK'
> > 19 | Fan2            | Nominal  | 14300.00   | RPM   | 'OK'
> > 20 | Fan3/CPU2       | Nominal  | 14300.00   | RPM   | 'OK'
> > 21 | Fan4/CPU1       | Nominal  | 13900.00   | RPM   | 'OK'
> > 22 | Fan5            | Nominal  | 14000.00   | RPM   | 'OK'
> > 23 | Fan6            | Nominal  | 14000.00   | RPM   | 'OK'
> > 24 | Fan7/CPU3       | Critical | 0.00       | RPM   | 'At or Below (<=) Lower Non-Recoverable Threshold'
> > 25 | Fan8/CPU4       | Critical | 0.00       | RPM   | 'At or Below (<=) Lower Non-Recoverable Threshold'
> > 26 | Fan9            | Critical | 0.00       | RPM   | 'At or Below (<=) Lower Non-Recoverable Threshold'
> > 27 | Power Supply 1  | Nominal  | N/A        | N/A   | 'Presence detected'
> > 28 | Power Supply 2  | N/A      | N/A        | N/A   | N/A
> >
> > So for this example, fans with normal RPM are "Nominal", out of range is
> > "Critical", and the power supply that doesn't exist is "N/A".  There is
> > also a "Warning" output when the situation is appropriate.
> >
> > I can say more about it, but this is probably not the best mailing
> > list for it.  Feel free to ping me on the FreeIPMI mailing list.
> >
> > Al
> >
> > On Mon, 2010-06-21 at 06:08 -0700, Werner Fischer wrote:
> > > Hi ipmitool developers,
> > >
> > > I thought about the problem regarding monitoring discrete IPMI sensors,
> > > that Brian reported back in April:
> > > http://www.mail-archive.com/address@hidden/msg01472.html
> > >
> > > I did some in-depth testing and looked how the current VMware ESXi 4.0
> > > reports different states of discrete IPMI sensors.
> > >
> > > I tested two example scenarios with an Intel SR2500 server:
> > >
> > > Test case 1:
> > >   * Power Supply 2 removed
> > >   * Chassis cover removed
> > >   * VMware reports: http://www.wefi.net/shared/sr2500-example-1.png
> > >
> > > Test case 2:
> > >   * Power Supply 2 present, but power cable removed
> > >   * VMware reports: http://www.wefi.net/shared/sr2500-example-2.png
> > >
> > > (Below you find some example ipmitool outputs for these two cases).
> > >
> > > The current IPMI specification lists possible sensor-specific-offsets
> > > for each sensor type in table 42-3, Sensor Type Codes.
> > >
> > > To me it seems that VMware uses some mapping, which defines which
> > > offsets (assertions/deassertions) cause a warning or an alarm,
> > > e.g. an offset for the event "General Chassis Intrusion" for a Physical
> > > Security sensor (sensor type code 05h) leads to status "Warning".
> > >
> > > So my request:
> > >       * introduce some new option for ipmitool (something like "ipmitool
> > >         get-server-status") where ipmitool uses such kind of mapping,
> > >         too. We could define which offsets/assertions should cause a
> > >         warning. In this way an end-user would have an easy way to
> > >         quickly find out whether or not everything is ok with his
> > >         hardware...
> > >
> > > Currently using e.g. "ipmitool sdr elist all" returns "ok" for sensor
> > > states like "General Chassis Intrusion" (see below)
> > >
> > > What do you think?
> > > Any other ideas how we could accomplish that?
> > > Does anybody know whether one of the other tools like freeipmi or
> > > ipmiutil has some functionality like this?
> > >
> > > best regards,
> > > Werner
> > >
> > > PS: Here are the outputs of ipmitool for this:
> > >
> > > Test case 1:
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U monitor -L user sdr elist all | grep -i "PS"
> > >         Password:
> > >         PS1 AC Current   | 78h | ok  | 10.1 | 0.93 Amps
> > >         PS2 AC Current   | 79h | ns  | 10.2 | No Reading
> > >         PS1 +12V Current | 7Ah | ok  | 10.1 | 16 Amps
> > >         PS2 +12V Current | 7Bh | ns  | 10.2 | No Reading
> > >         PS1 +12V Power   | 7Ch | ok  | 10.1 | 192 Watts
> > >         PS2 +12V Power   | 7Dh | ns  | 10.2 | No Reading
> > >         PS1 Status       | 70h | ok  | 10.1 | Presence detected
> > >         PS2 Status       | 71h | ok  | 10.2 |
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U monitor -L user sdr elist all | grep -i "Physical Scrty"
> > >         Password:
> > >         Physical Scrty   | 05h | ok  | 23.1 | General Chassis intrusion
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U admin raw 0x04 0x2d 0x70
> > >         Password:
> > >         Data length = 1
> > >          00 c0 01 00
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U admin raw 0x04 0x2d 0x71
> > >         Password:
> > >         Data length = 1
> > >          00 c0 00 00
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U admin -P relation sdr get "Physical Scrty"
> > >         Sensor ID              : Physical Scrty (0x5)
> > >          Entity ID             : 23.1 (System Chassis)
> > >          Sensor Type (Discrete): Physical Security
> > >          States Asserted       : Physical Security
> > >                                  [General Chassis intrusion]
> > >          Assertion Events      : Physical Security
> > >                                  [General Chassis intrusion]
> > >          Assertions Enabled    : Physical Security
> > >                                  [General Chassis intrusion]
> > >                                  [System unplugged from LAN]
> > >          Deassertions Enabled  : Physical Security
> > >                                  [General Chassis intrusion]
> > >                                  [System unplugged from LAN]
> > >
> > > Test case 2:
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U monitor -L user sdr get "PS2 Status"
> > >         Password:
> > >         Sensor ID              : PS2 Status (0x71)
> > >          Entity ID             : 10.2 (Power Supply)
> > >          Sensor Type (Discrete): Power Supply
> > >          States Asserted       : Power Supply
> > >                                  [Presence detected]
> > >                                  [Power Supply AC lost]
> > >          Assertion Events      : Power Supply
> > >                                  [Presence detected]
> > >                                  [Power Supply AC lost]
> > >          Assertions Enabled    : Power Supply
> > >                                  [Presence detected]
> > >                                  [Failure detected]
> > >                                  [Predictive failure]
> > >                                  [Power Supply AC lost]
> > >                                  [Config Error: Vendor Mismatch]
> > >                                  [Config Error: Revision Mismatch]
> > >                                  [Config Error: Processor Missing]
> > >                                  [Config Error]
> > >          Deassertions Enabled  : Power Supply
> > >                                  [Presence detected]
> > >                                  [Failure detected]
> > >                                  [Predictive failure]
> > >                                  [Power Supply AC lost]
> > >                                  [Config Error: Vendor Mismatch]
> > >                                  [Config Error: Revision Mismatch]
> > >                                  [Config Error: Processor Missing]
> > >                                  [Config Error]
> > >
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U monitor -L user sdr elist all | grep -i "PS"
> > >         Password:
> > >         PS1 AC Current   | 78h | ok  | 10.1 | 0.93 Amps
> > >         PS2 AC Current   | 79h | ok  | 10.2 | 0.12 Amps
> > >         PS1 +12V Current | 7Ah | ok  | 10.1 | 16 Amps
> > >         PS2 +12V Current | 7Bh | ok  | 10.2 | 0 Amps
> > >         PS1 +12V Power   | 7Ch | ok  | 10.1 | 192 Watts
> > >         PS2 +12V Power   | 7Dh | ok  | 10.2 | 0 Watts
> > >         PS1 Status       | 70h | ok  | 10.1 | Presence detected
> > >         PS2 Status       | 71h | ok  | 10.2 | Presence detected, Power Supply AC lost
> > >         address@hidden:~$ ipmitool -I lan -H 192.168.1.211 -U admin raw 0x04 0x2d 0x71
> > >         Password:
> > >         Data length = 1
> > >          00 c0 09 00
> > >         address@hidden:~$
> > >
> > >
> > --
> > Albert Chu
> > address@hidden
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> >
> 
> 
> 
> 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



