From: Al Chu
Subject: Re: [Freeipmi-devel] Trouble w/ HP ProLiant and FreeIPMI (ipmi-sensors)
Date: Wed, 10 Oct 2007 09:26:01 -0700

Hey Gregor,

There is a subtlety here that I added extra documentation for in the
FreeIPMI 0.5.0 manpage (I didn't backport it to 0.4.X b/c I didn't think
it was that important, but maybe I should have).  The numbers that
ipmi-sensors lists on the left are "record ids", not sensor numbers.  If
you use the verbose options on ipmi-sensors (-v or -vv), you can find the
sensor numbers.  As an example on my system:

Record ID: 22
Sensor Name: Fan5
Group Name: Fan
Sensor Number: 18
Event/Reading Type Code: 1h

You can see the sensor number and record id don't match up.  Sensor
number 18 is 12h in hex, which is the value ipmitool shows for Fan5 in
your second example.
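
If you want to script this, here is a minimal Python sketch (not part of FreeIPMI; the names `parse_verbose_records` and `_store` are made up for illustration) that pulls record id / sensor number pairs out of verbose output. It assumes each record is printed as `Key: Value` lines in the layout shown above, which may well differ between FreeIPMI versions:

```python
import re


def parse_verbose_records(text):
    """Map record id -> (sensor name, sensor number).

    Assumes 'Key: Value' lines where each record starts with 'Record ID'.
    This layout is an assumption taken from the sample above.
    """
    records = {}
    current = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s*(.*)$", line)
        if not match:
            continue
        key, value = match.group(1).strip(), match.group(2).strip()
        if key == "Record ID":
            # A new record starts; keep the previous one if it was complete.
            _store(records, current)
            current = {}
        current[key] = value
    _store(records, current)
    return records


def _store(records, fields):
    # Only keep records that carry both identifiers.
    if "Record ID" in fields and "Sensor Number" in fields:
        records[int(fields["Record ID"])] = (
            fields.get("Sensor Name", ""),
            int(fields["Sensor Number"]),
        )


sample = """\
Record ID: 22
Sensor Name: Fan5
Group Name: Fan
Sensor Number: 18
Event/Reading Type Code: 1h
"""

for record_id, (name, number) in parse_verbose_records(sample).items():
    # ipmitool prints sensor numbers in hex with an 'h' suffix.
    print(f"record id {record_id}: {name}, sensor number {number} ({number:02X}h)")
```

The hex form printed at the end is only there to make the comparison with ipmitool output easier.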

I'm not 100% sure why record ids were chosen for input/output over sensor
numbers in ipmi-sensors (the tool was originally created by others), but
if I had to guess at some reasons:

- some sensors don't have sensor numbers.  I notice multiple sensors w/
sensor number 0x00 in the ipmitool output below.  I would guess those
sensors don't have a number, so ipmitool just outputs 0x00.

- record ids increase in value, while sensor numbers need not, so
outputting record ids looks nicer, maybe? The output order in ipmitool
also seems to be record id based, but it just outputs the sensor number
instead of the record id.
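
To line the two tools up, one option is to join their output on the sensor name. A rough Python sketch, assuming the `ipmitool sdr elist` column layout and the `record id: name (group): ...` layout seen in the output quoted below; the helper names are hypothetical and nothing here is part of either tool:

```python
import re


def parse_elist(text):
    """Sensor name -> sensor number, from 'name | NNh | ...' rows."""
    numbers = {}
    for line in text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 2 and cols[1].endswith("h"):
            try:
                numbers[cols[0]] = int(cols[1][:-1], 16)
            except ValueError:
                continue  # second column was not a hex number
    return numbers


def parse_ipmi_sensors(text):
    """Sensor name -> record id, from 'ID: Name (Group): ...' lines."""
    ids = {}
    for line in text.splitlines():
        match = re.match(r"^(\d+): (.+?) \(", line)
        if match:
            ids[match.group(2)] = int(match.group(1))
    return ids


elist = """\
Fan1             | 0Eh | ok  |  7.0 | 4400 RPM
Fan5             | 12h | lnr |  7.0 | 0 RPM
"""

sensors = """\
18: Fan1 (Fan): 4400.00 RPM (300.00/NA): [OK]
22: Fan5 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
"""

numbers = parse_elist(elist)
record_ids = parse_ipmi_sensors(sensors)
for name in sorted(numbers.keys() & record_ids.keys()):
    number = numbers[name]
    print(f"{name}: record id {record_ids[name]}, sensor number {number} ({number:02X}h)")
```

Joining on the name only works if names are unique on the system, and entries that ipmitool lists but the default ipmi-sensors output omits will simply be dropped from the join.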

As an FYI, in case you were wondering why some sensors seem to be missing
from ipmi-sensors: the default output does not list every sensor.  Some
are only retrievable via the verbose options.

Hope that helps clarify things.

Al

On Wed, 2007-10-10 at 11:06 +0200, Gregor Dschung wrote:
> Hey Al,
> 
> mmmh... now I'm really confused. I thought the sensor-id has to be 8
> bits long?
> 
> I'm also confused about the different sensor-ids I'm getting with
> ipmi-sensors (0.4.6.beta2) and `ipmitool sdr elist` (1.8.6). Sure,
> ipmitool is giving me the sensor id in hex and ipmi-sensors as a decimal
> number... but the converted value should be the same?
> I would like to set up a PEF table, but for that I'll need the right
> sensor-ids :-/
> 
> Example 1:
> 
> p300slg01:/usr/local/src # ipmitool -H gtseval-ipmi -U ADMIN -a sdr elist all
> Password:
> Hewlett-Packard  | 00h | ok  |  0.0 | Dynamic MC @ 20h
> ACPI State       | 20h | ok  |  0.0 | S0/G0: working
> System Reset     | 21h | ok  |  0.0 |
> POST Error       | 01h | ns  |  0.0 | Disabled
> Memory ECC       | 02h | ns  |  0.0 | Disabled
> PCI Error        | 03h | ns  |  0.0 | Disabled
> Fan Error        | 04h | ns  |  0.0 | Disabled
> Watchdog         | FEh | ns  |  0.0 | Disabled
> CPU Fan 1        | 31h | ok  |  0.0 | 9592.33 RPM
> CPU Fan 2        | 32h | ok  |  0.0 | 10426.44 RPM
> CPU Fan 3        | 33h | ok  |  0.0 | 9992.01 RPM
> CPU Fan 4        | 34h | ok  |  0.0 | 10900.37 RPM
> CPU Fan 5        | 35h | ok  |  0.0 | 9592.33 RPM
> CPU Fan 6        | 3Ch | ok  |  0.0 | 10900.37 RPM
> CPU Fan 7        | 3Dh | ok  |  0.0 | 9992.01 RPM
> CPU Fan 8        | 3Eh | ok  |  0.0 | 10426.44 RPM
> CPU Fan 9        | 3Fh | ok  |  0.0 | 9592.33 RPM
> CPU Fan 10       | 40h | ok  |  0.0 | 10426.44 RPM
> System Fan 1     | 41h | ok  |  0.0 | 9992.01 RPM
> System Fan 2     | 42h | ok  |  0.0 | 10900.37 RPM
> CPU0 Vcore       | 3Ah | ok  |  3.0 | 1.10 Volts
> CPU1 Vcore       | 3Bh | ns  |  3.1 | No Reading
> Standby 5V       | 37h | ok  |  0.0 | 4.97 Volts
> System 5V        | 36h | ok  |  0.0 | 4.85 Volts
> System 3.3V      | 38h | ok  |  0.0 | 3.23 Volts
> 3V CMOS Sense    | 39h | ok  |  0.0 | 3.03 Volts
> CPU0 Therm Diode | 43h | ns  |  3.0 | Disabled
> CPU1 Therm Diode | 44h | ns  |  3.1 | Disabled
> CPU0 ThermDiode2 | 52h | ns  |  3.0 | Disabled
> CPU1 ThermDiode2 | 53h | ns  |  3.1 | Disabled
> AMB Temp         | 48h | ok  |  0.0 | 29 degrees C
> MultiBit ECC ER  | 4Ah | ok  |  0.0 | State Deasserted
> VDD Power Fail   | 4Ch | ok  |  0.0 | State Deasserted
> Reset            | 4Dh | ok  |  0.0 | State Deasserted
> Identify         | 4Eh | ok  |  0.0 | State Deasserted
> NMI              | 50h | ok  |  0.0 | State Deasserted
> CPU0 Therm-Trip  | 55h | ok  |  3.0 | State Deasserted
> CPU1 Therm-Trip  | 56h | ns  |  3.1 | No Reading
> CPU0 IERR        | 57h | ok  |  3.0 | State Deasserted
> CPU1 IERR        | 58h | ns  |  3.1 | No Reading
> CPU0 Prochot     | 59h | ok  |  3.0 | Limit Not Exceeded
> CPU1 Prochot     | 5Ah | ns  |  3.1 | No Reading
> CPU0 SocketOcc   | 5Bh | ok  |  3.0 | Device Present
> CPU1 SocketOcc   | 5Ch | ok  |  3.1 | Device Absent
> CPU0 Dmn 0 Temp  | 86h | ok  |  3.0 | 45 degrees C
> CPU1 Dmn 0 Temp  | 89h | ns  |  3.1 | No Reading
> CPU0 Dmn 1 Temp  | 8Ch | ok  |  3.0 | 45 degrees C
> CPU1 Dmn 1 Temp  | 8Fh | ns  |  3.1 | No Reading
> FRU0             | 00h | ns  |  0.0 | Logical FRU @00h
> ----------
> p300slg01:/usr/local/src # ipmi-sensors -h gtseval-ipmi -u ADMIN -P
> Password:
> 64: ACPI State (ACPI Power State): [S0/G0 "working"]
> 112: System Reset (Module/Board): [OK]
> 160: POST Error (System Firmware): [Unknown]
> 208: Memory ECC (Memory): [Unknown]
> 256: PCI Error (Critical Interrupt): [Unknown]
> 304: Fan Error (Cooling Device): [Unknown]
> 352: Watchdog (Watchdog 2): [Unknown]
> 400: CPU Fan 1 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> 464: CPU Fan 2 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> 528: CPU Fan 3 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> 592: CPU Fan 4 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> 656: CPU Fan 5 (Fan): 9592.33 RPM (NA/3475.48): [OK]
> 720: CPU Fan 6 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> 784: CPU Fan 7 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> 848: CPU Fan 8 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> 912: CPU Fan 9 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> 976: CPU Fan 10 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> 1040: System Fan 1 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> 1104: System Fan 2 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> 1168: CPU0 Vcore (Voltage): 1.10 V (0.40/1.70): [OK]
> 1232: CPU1 Vcore (Voltage): 0.80 V (0.40/1.70): [OK]
> 1296: Standby 5V (Voltage): 4.97 V (4.26/5.79): [OK]
> 1360: System 5V (Voltage): 4.85 V (4.26/5.79): [OK]
> 1424: System 3.3V (Voltage): 3.23 V (2.82/3.85): [OK]
> 1488: 3V CMOS Sense (Voltage): 3.03 V (2.62/NA): [OK]
> 1680: CPU0 Therm Diode (Temperature): 42.00 C (10.00/80.00): [OK]
> 1744: CPU1 Therm Diode (Temperature): 42.00 C (10.00/80.00): [OK]
> 1808: CPU0 ThermDiode2 (Temperature): 42.00 C (10.00/80.00): [OK]
> 1872: CPU1 ThermDiode2 (Temperature): 42.00 C (10.00/80.00): [OK]
> 1936: AMB Temp (Temperature): 29.00 C (10.00/50.00): [OK]
> 2064: MultiBit ECC ER (Module/Board): [State Deasserted]
> 2112: VDD Power Fail (Power Supply): [State Deasserted]
> 2160: Reset (Button): [State Deasserted]
> 2208: Identify (Button): [State Deasserted]
> 2304: NMI (Button): [State Deasserted]
> 2352: CPU0 Therm-Trip (Processor): [State Deasserted]
> 2400: CPU1 Therm-Trip (Processor): [State Deasserted]
> 2448: CPU0 IERR (Processor): [State Deasserted]
> 2496: CPU1 IERR (Processor): [State Deasserted]
> 2544: CPU0 Prochot (Temperature): [Limit Not Exceeded]
> 2592: CPU1 Prochot (Temperature): [Limit Not Exceeded]
> 2640: CPU0 SocketOcc (Processor): [Device Inserted/Device Present]
> 2688: CPU1 SocketOcc (Processor): [Device Removed/Device Absent]
> 2736: CPU0 Dmn 0 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> 2864: CPU1 Dmn 0 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> 3248: CPU0 Dmn 1 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> 3440: CPU1 Dmn 1 Temp (Temperature): 45.00 C (NA/85.00): [OK]
>
> Example 2:
> p300slg01:/usr/local/src # ipmitool -H gts00-ipmi -U ADMIN -a sdr elist all
> Password:
> pef              | FDh | ns  | 46.1 | Event-Only
> watchdog         | FEh | ns  | 46.1 | Event-Only
> KIM BMC          | 00h | ok  |  0.0 | Dynamic MC @ 20h
> PLTFRM SECURITY  | FCh | ns  |  0.0 | Event-Only
> CPU Temp 1       | 00h | ok  |  3.0 | 22 degrees C
> CPU Temp 2       | 01h | ok  |  3.0 | 21 degrees C
> CPU Temp 3       | 02h | ns  |  3.1 | No Reading
> CPU Temp 4       | 03h | ns  |  3.1 | No Reading
> Sys Temp         | 04h | ok  |  7.0 | 36 degrees C
> CPU1 Vcore       | 05h | ok  |  3.0 | 1.19 Volts
> CPU2 Vcore       | 06h | ok  |  3.1 | 1.21 Volts
> 3.3V             | 07h | ok  |  7.0 | 3.34 Volts
> 5V               | 08h | ok  |  7.0 | 4.99 Volts
> 12V              | 09h | ok  |  7.0 | 11.52 Volts
> -12V             | 0Ah | ok  |  7.0 | -12.30 Volts
> 1.5V             | 0Bh | ok  |  7.0 | 1.47 Volts
> 5VSB             | 0Ch | ok  |  7.0 | 4.92 Volts
> VBAT             | 0Dh | ok  |  7.0 | 3.31 Volts
> Fan1             | 0Eh | ok  |  7.0 | 4400 RPM
> Fan2             | 0Fh | lnr |  7.0 | 0 RPM
> Fan3             | 10h | ok  |  7.0 | 4400 RPM
> Fan4             | 11h | lnr |  7.0 | 0 RPM
> Fan5             | 12h | lnr |  7.0 | 0 RPM
> Fan6             | 13h | lnr |  7.0 | 0 RPM
> Fan7/CPU1        | 14h | lnr |  3.0 | 0 RPM
> Fan8/CPU2        | 15h | lnr |  3.0 | 0 RPM
> Intrusion        | 44h | lnc | 23.1 | 0 unspecified
> Power Supply     | 16h | ok  | 10.0 | 0 unspecified
> CPU0 Internal E  | 17h | ok  |  3.0 | 0 unspecified
> CPU1 Internal E  | 18h | ok  |  3.1 | 0 unspecified
> CPU Overheat     | 19h | ok  |  3.0 | 0 unspecified
> Thermal Trip0    | 1Ah | ok  |  3.0 | 0 unspecified
> Thermal Trip1    | 1Bh | ok  |  3.1 | 0 unspecified
> BIOS             | 00h | ok  |  0.0 |
> --------
> p300slg01:/usr/local/src # ipmi-sensors -h gts00-ipmi -u ADMIN -P
> Password:
> 4: CPU Temp 1 (Temperature): 22.00 C (NA/78.00): [OK]
> 5: CPU Temp 2 (Temperature): 21.00 C (NA/78.00): [OK]
> 6: CPU Temp 3 (Temperature): 0.00 C (NA/78.00): [OK]
> 7: CPU Temp 4 (Temperature): 0.00 C (NA/78.00): [OK]
> 8: Sys Temp (Temperature): 36.00 C (NA/78.00): [OK]
> 9: CPU1 Vcore (Voltage): 1.20 V (1.06/1.63): [OK]
> 10: CPU2 Vcore (Voltage): 1.21 V (1.06/1.63): [OK]
> 11: 3.3V (Voltage): 3.34 V (2.93/3.66): [OK]
> 12: 5V (Voltage): 4.99 V (4.44/5.54): [OK]
> 13: 12V (Voltage): 11.52 V (10.56/13.44): [OK]
> 14: -12V (Voltage): -12.30 V (-10.59/-13.40): [OK]
> 15: 1.5V (Voltage): 1.47 V (1.31/1.68): [OK]
> 16: 5VSB (Voltage): 4.92 V (4.44/5.54): [OK]
> 17: VBAT (Voltage): 3.31 V (2.93/3.66): [OK]
> 18: Fan1 (Fan): 4400.00 RPM (300.00/NA): [OK]
> 19: Fan2 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 20: Fan3 (Fan): 4300.00 RPM (300.00/NA): [OK]
> 21: Fan4 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 22: Fan5 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 23: Fan6 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 24: Fan7/CPU1 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 25: Fan8/CPU2 (Fan): 0.00 RPM (300.00/NA): [At or Below (<=) Lower Non-Recoverable Threshold]
> 26: Intrusion (Platform Chassis Intrusion): [General Chassis Intrusion]
> 27: Power Supply (Power Supply): [OK]
> 28: CPU0 Internal E (Module/Board): [OK]
> 29: CPU1 Internal E (Module/Board): [OK]
> 30: CPU Overheat (Module/Board): [OK]
> 31: Thermal Trip0 (Module/Board): [OK]
> 32: Thermal Trip1 (Module/Board): [OK]
> 33: BIOS (System Firmware): [Unknown]
> 
> 
> I hope I've only forgotten something and that this is not a new bug.
> 
> Regards,
> Gregor
> 
> 
> Gregor Dschung wrote:
> > Hey Al,
> >
> > whoa!!!
> >
> > THAT is OpenSource :). We've been mailing for perhaps a week (I guess it
> > would have taken only about three days if we had both worked in the same
> > timezone ;) ). And now, the issue seems to be solved:
> > -----------
> > p300slg01:/usr/local/src # ipmi-sensors -h gtseval-ipmi -u admin -P
> > Password:
> > 64: ACPI State (ACPI Power State): [S0/G0 "working"]
> > 112: System Reset (Module/Board): [OK]
> > 160: POST Error (System Firmware): [Unknown]
> > 208: Memory ECC (Memory): [Unknown]
> > 256: PCI Error (Critical Interrupt): [Unknown]
> > 304: Fan Error (Cooling Device): [Unknown]
> > 352: Watchdog (Watchdog 2): [Unknown]
> > 400: CPU Fan 1 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> > 464: CPU Fan 2 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> > 528: CPU Fan 3 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> > 592: CPU Fan 4 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> > 656: CPU Fan 5 (Fan): 9592.33 RPM (NA/3475.48): [OK]
> > 720: CPU Fan 6 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> > 784: CPU Fan 7 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> > 848: CPU Fan 8 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> > 912: CPU Fan 9 (Fan): 9992.01 RPM (NA/3475.48): [OK]
> > 976: CPU Fan 10 (Fan): 10426.44 RPM (NA/3475.48): [OK]
> > 1040: System Fan 1 (Fan): 9592.33 RPM (NA/3475.48): [OK]
> > 1104: System Fan 2 (Fan): 10900.37 RPM (NA/3475.48): [OK]
> > 1168: CPU0 Vcore (Voltage): 1.11 V (0.40/1.70): [OK]
> > 1232: CPU1 Vcore (Voltage): 0.80 V (0.40/1.70): [OK]
> > 1296: Standby 5V (Voltage): 4.97 V (4.26/5.79): [OK]
> > 1360: System 5V (Voltage): 4.85 V (4.26/5.79): [OK]
> > 1424: System 3.3V (Voltage): 3.23 V (2.82/3.85): [OK]
> > 1488: 3V CMOS Sense (Voltage): 3.03 V (2.62/NA): [OK]
> > 1680: CPU0 Therm Diode (Temperature): 42.00 C (10.00/80.00): [OK]
> > 1744: CPU1 Therm Diode (Temperature): 42.00 C (10.00/80.00): [OK]
> > 1808: CPU0 ThermDiode2 (Temperature): 42.00 C (10.00/80.00): [OK]
> > 1872: CPU1 ThermDiode2 (Temperature): 42.00 C (10.00/80.00): [OK]
> > 1936: AMB Temp (Temperature): 29.00 C (10.00/50.00): [OK]
> > 2064: MultiBit ECC ER (Module/Board): [State Deasserted]
> > 2112: VDD Power Fail (Power Supply): [State Deasserted]
> > 2160: Reset (Button): [State Deasserted]
> > 2208: Identify (Button): [State Deasserted]
> > 2304: NMI (Button): [State Deasserted]
> > 2352: CPU0 Therm-Trip (Processor): [State Deasserted]
> > 2400: CPU1 Therm-Trip (Processor): [State Deasserted]
> > 2448: CPU0 IERR (Processor): [State Deasserted]
> > 2496: CPU1 IERR (Processor): [State Deasserted]
> > 2544: CPU0 Prochot (Temperature): [Limit Not Exceeded]
> > 2592: CPU1 Prochot (Temperature): [Limit Not Exceeded]
> > 2640: CPU0 SocketOcc (Processor): [Device Inserted/Device Present]
> > 2688: CPU1 SocketOcc (Processor): [Device Removed/Device Absent]
> > 2736: CPU0 Dmn 0 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> > 2864: CPU1 Dmn 0 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> > 3248: CPU0 Dmn 1 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> > 3440: CPU1 Dmn 1 Temp (Temperature): 45.00 C (NA/85.00): [OK]
> > -------------
> >
> > Thanks a lot for your help.
> >
> > Regards,
> > Gregor
> >
> >
> > Albert Chu wrote:
> >> Hey Gregor,
> >>
> >> Doh!  I forgot a patch.  Here's the next likely FreeIPMI 0.4.6 release :-)
> >>
> >> PLMK if it works.
> >>
> >> Thanks,
> >> Al
> >>
> >>> Hey Gregor,
> >>>
> >>> Attached are two tar.gz files.  One is a likely candidate for the
> >>> FreeIPMI 0.4.6 release and the other is a test tar.gz for debug info
> >>> if something new goes wrong :-)
> >>>
> >>> PLMK how it works out.  Thanks for all the debug help.
> >>>
> >>> Al
> >>>
> >>> On Tue, 2007-10-09 at 17:25 +0200, Gregor Dschung wrote:
> >>>> Hey Al,
> >>>>
> >>>> here is the sdr-cache. 'sdr-cache-p300slg01.10.136.17.128' is the file
> >>>> for gtseval-ipmi, 'sdr-cache-p300slg01.10.136.17.170' is another cache
> >>>> file from a call of ipmi-sensors which works fine.
> >>>>
> >>>> I'm using FreeIPMI on a system with SUSE 10.1.
> >>>> ---------
> >>>> p300slg01:/usr/local/src # uname -a
> >>>> Linux p300slg01 2.6.16.27-0.9-smp #1 SMP Tue Feb 13 09:35:18 UTC 2007 i686 i686 i386 GNU/Linux
> >>>> ---------
> >>>>
> >>>> In your test4-code, I had to change the following lines to compile w/o
> >>>> errors:
> >>>> common/src/pstdout.c
> >>>> -243: fprintf(stderr, "Default stack size = %li bytes \n", mystacksize);
> >>>> +243: fprintf(stderr, "Default stack size = %li bytes \n", (long)mystacksize);
> >>>> +501: va_list vacpy;
> >>>>
> >>>> ---------
> >>>>
> >>>> I've tested FreeIPMI locally again. I was wrong, it crashes, too. I
> >>>> guess, I was confused with IPMItool, which runs fine locally but gives
> >>>> warnings over the network. Don't know whether it helps you:
> >>>> Locally:
> >>>> address@hidden:~/ipmi/usr/bin> ./ipmitool -I open sensor
> >>>> ACPI State       | 0x1        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> System Reset     | 0x0        | discrete   | 0x0080| na        | na        | na        | na        | na        | na
> >>>> POST Error       | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> Memory ECC       | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> PCI Error        | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> Fan Error        | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> Watchdog         | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> CPU Fan 1        | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 2        | 10426.441  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 3        | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 4        | 10426.441  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 5        | 9223.391   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 6        | 10900.371  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 7        | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 8        | 10900.371  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 9        | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU Fan 10       | 10426.441  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> System Fan 1     | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> System Fan 2     | 10900.371  | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> CPU0 Vcore       | 1.107      | Volts      | ok    | na        | 0.402     | 0.500     | 1.597     | 1.695     | na
> >>>> CPU1 Vcore       | na         | Volts      | na    | na        | 0.402     | 0.500     | 1.597     | 1.695     | na
> >>>> Standby 5V       | 4.969      | Volts      | ok    | na        | 4.263     | 4.528     | 5.527     | 5.792     | na
> >>>> System 5V        | 4.851      | Volts      | ok    | na        | 4.263     | 4.528     | 5.527     | 5.792     | na
> >>>> System 3.3V      | 3.234      | Volts      | ok    | na        | 2.822     | 2.999     | 3.675     | 3.851     | na
> >>>> 3V CMOS Sense    | 3.028      | Volts      | ok    | na        | 2.617     | 2.781     | na        | na        | na
> >>>> CPU0 Therm Diode | na         | degrees C  | na    | na        | 10.000    | na        | 68.000    | 80.000    | 95.000
> >>>> CPU1 Therm Diode | na         | degrees C  | na    | na        | 10.000    | na        | 68.000    | 80.000    | 95.000
> >>>> CPU0 ThermDiode2 | na         | degrees C  | na    | na        | 10.000    | na        | 68.000    | 80.000    | 95.000
> >>>> CPU1 ThermDiode2 | na         | degrees C  | na    | na        | 10.000    | na        | 68.000    | 80.000    | 95.000
> >>>> AMB Temp         | 29.000     | degrees C  | ok    | na        | 10.000    | na        | 30.000    | 45.000    | na
> >>>> MultiBit ECC ER  | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> VDD Power Fail   | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> Reset            | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> Identify         | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> NMI              | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> CPU0 Therm-Trip  | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> CPU1 Therm-Trip  | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> CPU0 IERR        | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> CPU1 IERR        | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> CPU0 Prochot     | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> CPU1 Prochot     | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> CPU0 SocketOcc   | 0x1        | discrete   | 0x0280| na        | na        | na        | na        | na        | na
> >>>> CPU1 SocketOcc   | 0x0        | discrete   | 0x0180| na        | na        | na        | na        | na        | na
> >>>> CPU0 Dmn 0 Temp  | 45.000     | degrees C  | ok    | na        | na        | na        | na        | 85.000    | 95.000
> >>>> CPU1 Dmn 0 Temp  | na         | degrees C  | na    | na        | na        | na        | na        | 85.000    | 95.000
> >>>> CPU0 Dmn 1 Temp  | 46.000     | degrees C  | ok    | na        | na        | na        | na        | 85.000    | 95.000
> >>>> CPU1 Dmn 1 Temp  | na         | degrees C  | na    | na        | na        | na        | na        | 85.000    | 95.000
> >>>>
> >>>> Over an RMCP+ session:
> >>>> [...]
> >>>> System Reset     | 0x0        | discrete   | 0x0080| na        | na        | na        | na        | na        | na
> >>>> Error reading sensor POST Error (#01)
> >>>> Error reading sensor Memory ECC (#02)
> >>>> Error reading sensor PCI Error (#03)
> >>>> Error reading sensor Fan Error (#04)
> >>>> Watchdog         | na         | discrete   | na    | na        | na        | na        | na        | na        | na
> >>>> CPU Fan 1        | 9992.006   | RPM        | ok    | na        | na        | na        | 3996.803  | 3475.480  | na
> >>>> [...]
> >>>>
> >>>> The omitted lines are identical.
> >>>> -----------
> >>>>
> >>>> I've called ipmi-sensors from an x86_64 to reach gtseval-ipmi, too. And
> >>>> it crashes with the same error (second attachment).
> >>>>
> >>>> So... Enough debugging for today.
> >>>>
> >>>> Have a nice day,
> >>>> Gregor
> >>>>
> >>>> Al Chu wrote:
> >>>>> Hey Gregor,
> >>>>>
> >>>>> Although it's unlikely your problem, I saw one other potential issue.
> >>>>> So I added a fix in this slightly newer tar.gz.
> >>>>>
> >>>>> Thanks,
> >>>>> Al
> >>>>>
> >>>>> On Mon, 2007-10-08 at 11:51 -0700, Al Chu wrote:
> >>>>>> Hey Gregor,
> >>>>>>
> >>>>>> Here's another tar.gz.  Could you run ./configure with --enable-debug
> >>>>>> and run with --debug again?  The gdb output confirms the line I
> >>>>>> believed was causing the problem, but I still can't quite figure out
> >>>>>> how the corruption is happening.  So I put in a lot more printfs.
> >>>>>>
> >>>>>> I do have at least two other suspicions that depend on your system.
> >>>>>> So do you think you could also send me the SDR from
> >>>>>> ~/.freeipmi/sdr-cache/ for me to analyze, and also could you tell me
> >>>>>> what linux you are running on the i386 box?  I'm wondering if you
> >>>>>> have some older distribution (b/c it's i386) and it has slightly
> >>>>>> different threads behavior that I'm not handling properly.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Al
> >>>>>>
> >>>>>>
> >>>>>> On Sun, 2007-10-07 at 12:12 +0200, Gregor Dschung wrote:
> >>>>>>> Hi Al,
> >>>>>>>
> >>>>>>> I attach again the output of the call with --debug and the
> >>>>>>> backtrace. It was the first time that I used gdb, so I hope I
> >>>>>>> understood the tutorials :)
> >>>>>>>
> >>>>>>> At the moment I'm not able to run ipmi-sensors locally, because I'm
> >>>>>>> not root on "gtseval" (the host of gtseval-ipmi) and I've to wait
> >>>>>>> until I get rw-rights for /dev/ipmi0 again. And we have week-end ;)
> >>>>>>>
> >>>>>>> You are right, I'm running the IPMItool and FreeIPMI on an i386. On
> >>>>>>> gtseval is a 64bit-System, so perhaps this is the reason for not
> >>>>>>> crashing locally.
> >>>>>>>
> >>>>>>> Have a nice Sunday,
> >>>>>>> Gregor
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hey Gregor,
> >>>>>>>>
> >>>>>>>> Can't see anything suspicious in the code.  Here's another tar.gz
> >>>>>>>> that I added a whole bunch of extra printfs to, to try and give me
> >>>>>>>> more information.  Could you run again (./configure --enable-debug
> >>>>>>>> and run ipmi-sensors with --debug again)?  Also, you mentioned that
> >>>>>>>> ipmi-sensors completes locally without issue.  Is the number of
> >>>>>>>> sensors listed below (ending w/ CPU1 Dmn 1 Temp) the same as the
> >>>>>>>> number of sensors listed when you run locally?
> >>>>>>>> Also, is a core dump being output by this crash?  Could you run gdb
> >>>>>>>> against the core and get a backtrace?  That'd be a lot of help too.
> >>>>>>>>
> >>>>>>>> Thanks for helping me look into this,
> >>>>>>>>
> >>>>>>>> Al
> >>>>>>>>
> >>>>>>>>> Hi Al,
> >>>>>>>>>
> >>>>>>>>> thanks for your fast answer.
> >>>>>>>>>
> >>>>>>>>> I've tested your test-version and it seems to be on the correct
> >>>>>>>>> way. It still crashes, but now I get sensor-data :) :
> >>>>>>>>>
> >>>>>>>>> [...]
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> Albert Chu
> >>>>>>>> address@hidden
> >>>>>>>> 925-422-5311
> >>>>>>>> Computer Scientist
> >>>>>>>> High Performance Systems Division
> >>>>>>>> Lawrence Livermore National Laboratory
> >>>>>>>>
> >>> --
> >>> Albert Chu
> >>> address@hidden
> >>> 925-422-5311
> >>> Computer Scientist
> >>> High Performance Systems Division
> >>> Lawrence Livermore National Laboratory
> >>>
> >
> 
> 
-- 
Albert Chu
address@hidden
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



