freeipmi-devel

Re: [Freeipmi-devel] clean exit of external process when conmand shuts d


From: Brian Lambert
Subject: Re: [Freeipmi-devel] clean exit of external process when conmand shuts down
Date: Fri, 13 Jan 2012 23:31:32 -0500 (EST)


Al,

Debug output from a clean disconnect is on its way to you in another email.

I'm seeing the SOL problem on both iDRAC5 and iDRAC6-based Dell BMCs. Specifically, I'm working with Dell M605 blades (iDRAC5), M610 and M915 blades (iDRAC6), running fairly recent, if not the absolute latest, firmware.

The Dell PowerEdge servers can have these DRACs in them too. We have an R805 with a DRAC5 in it, but I haven't tested the SOL capability on it.

thanks,
Brian


On Fri, 13 Jan 2012, Albert Chu wrote:

Hey Brian,

Doh!  We may have hit our limit on working around this issue.  From your
deactivate debug dump you sent me, the code is now properly noticing
that the SOL session is activated.  That's good.  Then libipmiconsole
attempts to deactivate the payload as expected.  Unfortunately we get:

n003-bmc: =====================================================
n003-bmc: IPMI 2.0 Deactivate Payload Response
n003-bmc: =====================================================
<snip>
n003-bmc: IPMI Command Data:
n003-bmc: ------------------
n003-bmc: [              49h] = cmd[ 8b]
n003-bmc: [              80h] = comp_code[ 8b]
<snip>

80h as the completion code means "payload already deactivated".  So
presumably the BMC doesn't recognize SOL is already activated and thus
will not deactivate it.

When you try to connect again, the code notices SOL is activated, tries
to deactivate, gets an error message that it is deactivated already,
tries to activate again, and goes in a loop until libipmiconsole gives
up.
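
The loop described above can be modelled with a short Python sketch. This is an illustration only, not libipmiconsole's code (which is C): the `FakeDellBmc` class, the `connect` function, and the attempt limit of 4 are all hypothetical. Only the 80h completion code and the "BMC Implementation" error name come from this thread.

```python
COMP_CODE_ALREADY_DEACTIVATED = 0x80  # completion code seen in the dump above


class FakeDellBmc:
    """Toy BMC (hypothetical): reports SOL active but will not deactivate it."""

    def sol_active(self):
        return True

    def deactivate_payload(self):
        # Claims the payload is already deactivated and changes nothing.
        return COMP_CODE_ALREADY_DEACTIVATED

    def activate_payload(self):
        # Activation fails because the old session is still held.
        return False


def connect(bmc, max_attempts=4):
    """Try to obtain the SOL session; max_attempts is an assumed limit."""
    for _ in range(max_attempts):
        if bmc.sol_active():
            bmc.deactivate_payload()
        if bmc.activate_payload():
            return "connected"
    # Every attempt failed, so give up instead of looping forever.
    return "BMC Implementation"


print(connect(FakeDellBmc()))
```

With the toy BMC above, every pass through the loop sees an active session, gets 80h back from the deactivate, and fails to activate, so the function falls through to the error.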

As a random side test, can you send me the --debug output when you
connect w/ SOL and disconnect (i.e. &. in ipmiconsole) cleanly?  I'm
wondering if maybe Dell has some stuff backwards and there will be a
different way to work around this.  But we may be running out of
workaround options.

Oh, and what motherboard/blade are you running against?  It'd be good to
get this documented.

Thanks,

Al

P.S. BTW, you did uncover a completely unrelated bug, where a
--deactivate does not close cleanly.  So I got that fixed too.  You
wouldn't have noticed it in ipmiconsole b/c it would have exited as
expected.

P.P.S.  If the "BMC Implementation" error code was confusing, it turns out
you hit quite the corner case in my code.  But the code was quite well
commented :-)

 /* achu:
  *
  * I've been going back and forth on what this error
  * code should actually be.  It is conceivable that
  * this occurs b/c two different libipmiconsole()
  * threads are attempting to get the same SOL
  * session going, and they are "blocking" each
  * other.
  *
  * For now, we will assume that the above Supermicro
  * issue or something similar is the real problem and it
  * is a flaw due to the implementation of the BMC.
  *
  */
 IPMICONSOLE_CTX_DEBUG (c, ("closing with excessive payload deactivations"));
 ipmiconsole_ctx_set_errnum (c, IPMICONSOLE_ERR_BMC_IMPLEMENTATION);


On Thu, 2012-01-12 at 20:06 -0800, Brian Lambert wrote:
Al,

Thanks for the quick response.  Unfortunately it still doesn't appear to
work.  As before, I started with an active SOL session.  I then
tried to deactivate the session with the --deactivate option.  As before,
no error was returned but the existing session remained active.  I then
tried connecting a second SOL session.  This time the error returned was:
[error received]: BMC Implementation.

I will send you the debug output in a separate email, so as not to clutter
the thread too much.

thanks,
Brian


On Wed, 11 Jan 2012, Albert Chu wrote:

Hey Brian,

Actually, I thought of a fix that was relatively tiny and easy.  It
won't require a workaround on the command line.  I got it in this
branch.

svn co svn://svn.sv.gnu.org/freeipmi/branches/dellsolinstancecapacity

after checking out, the normal

./autogen.sh; ./configure; make

then

ipmiconsole/ipmiconsole -h host -u user -p pass ...

as before.  PLMK how it works for you.

BTW, for documentation purposes, what motherboard are you seeing this
issue on?

Al

On Wed, 2012-01-11 at 10:54 -0800, Albert Chu wrote:
Hey Brian,

On Tue, 2012-01-10 at 20:14 -0800, Brian Lambert wrote:
I did another test, and have attached debug output.

First, I rebooted the BMC (Dell iDRAC6) to make sure there were no
sessions active.

I then established an initial SOL session, using the following command:
   ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive

So far, so good.

Instead of killing the first session, I left it active and tried to start
a second session using the same command.  That failed as expected, with a
"BMC Error" message.  Debug output from that first reconnect attempt is
attached in ipmiconsole-reconnect1.txt.

Ok, I see the problem.

n003-bmc: =====================================================
n003-bmc: IPMI 2.0 Get Payload Activation Status Response
n003-bmc: =====================================================
<snip>
n003-bmc: IPMI Command Data:
n003-bmc: ------------------
n003-bmc: [              4Ah] = cmd[ 8b]
n003-bmc: [               0h] = comp_code[ 8b]
n003-bmc: [               0h] = instance_capacity[ 4b]
n003-bmc: [               1h] = reserved[ 4b]
n003-bmc: [               1h] = instance_1[ 1b]
n003-bmc: [               0h] = instance_2[ 1b]
n003-bmc: [               0h] = instance_3[ 1b]
n003-bmc: [               0h] = instance_4[ 1b]

the bug in Dell's implementation is the "0h = instance_capacity".  This
indicates the number of SOL instances that can be done at the same time.
The fact that I ignore that it's 0 is a bug on my part (it should be > 0
always if SOL can be done).

This is then used to iterate on instance_1, instance_2, etc. to determine
if SOL is currently activated.  The 1h = instance_1 indicates that SOL
is active.  But because instance_capacity is 0, I never look at it, so
the calculation is that no SOL is currently active.  ipmiconsole
attempts to activate a SOL session as always, but b/c an SOL session is
already active, the activation fails, so it tries again (assuming someone
else raced with libipmiconsole and took SOL before it could).  It checks
again to see if SOL is active, notices it's not, tries to activate
again, fails, and now we have a loop.  Eventually there are too many
failed activation attempts and libipmiconsole errors out.
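
The activation-status check described above can be sketched in a few lines of Python. This is a model for illustration, not the libipmiconsole source and not necessarily the fix that went into the branch: the function name `sol_is_active` and the `trust_instance_bits` switch are hypothetical, and scanning all instance flags when the capacity is 0 is only one conceivable way to tolerate such a BMC.

```python
def sol_is_active(instance_capacity, instance_bits, trust_instance_bits=False):
    """Return True if any SOL payload instance is reported active.

    instance_capacity: the 4-bit capacity field from the response.
    instance_bits: 0/1 flags for instance_1, instance_2, ...
    trust_instance_bits: hypothetical workaround switch; when the BMC
        reports a capacity of 0, look at every instance flag anyway.
    """
    limit = instance_capacity
    if limit == 0 and trust_instance_bits:
        limit = len(instance_bits)
    # Only the first `limit` flags are examined, as in the strict reading.
    return any(instance_bits[:limit])


# Values from the iDRAC6 dump above: capacity 0h, instance_1 = 1h.
dell_capacity, dell_bits = 0, [1, 0, 0, 0]
print(sol_is_active(dell_capacity, dell_bits))                            # False
print(sol_is_active(dell_capacity, dell_bits, trust_instance_bits=True))  # True
```

The first call reproduces the behaviour described above: with a capacity of 0 no flag is inspected, so the session looks inactive even though instance_1 is set.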

I then tried to deactivate the existing session using the command:
   ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate

That command completed without error, but the original session was still
active and responding to keystrokes.  Debug output from that attempt is
attached in ipmiconsole-deactivate1.txt.

Now this one makes sense.  Given the above knowledge, libipmiconsole
calculates that the SOL session is already deactivated, so it never
attempts an actual SOL deactivation.

I think this is very workaroundable, although I need to think about how
to do it (via workaround option?  without?) and how I can be
careful/safe with it and not break other systems.  I'll let ya know when
I have something you can try and tell ya the branch it's on.

Al

I then tried to activate a new session a second time.  It failed with the
same error message as the first reconnect attempt.  Debug output from the
second attempt is in ipmiconsole-reconnect2.txt.

Thanks for your help.  Let me know if you need further details or want me
to try anything else.

thanks,
Brian


On Sun, 8 Jan 2012, Al Chu wrote:

Hi Brian,

I've moved the IPMI portion of this thread to freeipmi-devel, since it's
a bit more appropriate for this mailing list.

To start a session, I can use the following FreeIPMI command:
./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive

I can quit out of that session using the &. escape sequence, and
reconnect right away.  But if I 'kill -9' that process, I get a
"[error received]: BMC Error" message when I try to connect with
another ipmiconsole command.

This indicates an unexpected error code along the way.  ipmiconsole
probably noticed that the previous SOL session was activated and tried
to deactivate it, with some error occurring at some point.  Could you
send the --debug output of ipmiconsole when you try to reconnect?

This is the same error message I get
when trying to connect when another session is already active.  If I
then issue the command:
./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
This completes without error, but I still can't reconnect to the
serial console.

Can you give me the --debug output of the later connect attempt?  I'd
like to see why it can't connect again.

I get similar results when using ipmitool.  In that case, when I try
to reconnect, I get:
# ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
Info: SOL payload already active on another session

If I try to deactivate the existing session, I get:
# ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
Info: SOL payload already de-activated

I don't know the exact test situation you're trying, but you could be
racing a bit in some of these scenarios.  When you kill the previous
session with "kill -9", the server/BMC does not immediately end the
IPMI/SOL session.  It lasts for a while longer until the server/BMC
eventually times out.  So that can explain why your first activate
attempt indicates the session is already activated, but it's deactivated
by the time you try to deactivate.

Once it's in this state, the only thing I've been able to do to regain
access to the serial console is reboot the BMC or wait for the session
to time out.

I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
both running the latest firmware.  Al, if you'd like more information
or debug output from the freeipmi tools I'd be happy to provide it.

Would like to get to the bottom of this.

Al


On Sun, 2012-01-08 at 20:13 -0800, lambert wrote:
After some additional experimentation, it looks like a direct ssh to
the Dell blade iDRAC (BMC) followed by a command to activate the
serial connection may be the way to go with these.  I found that a
SIGKILL to the ssh session was sufficient to close the serial console
session, such that I could start another session without needing to
wait several minutes for the old session to time out.

I still need to do some more testing, but Chris you may want to wait
before you spend too much time implementing the external process
cleanup coding.  If I get this approach working robustly, a clean
shutdown of the external process will be less important.


As for the IPMI SOL issues:

To start a session, I can use the following FreeIPMI command:
./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive

I can quit out of that session using the &. escape sequence, and
reconnect right away.  But if I 'kill -9' that process, I get a
"[error received]: BMC Error" message when I try to connect with
another ipmiconsole command.  This is the same error message I get
when trying to connect when another session is already active.  If I
then issue the command:
./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
This completes without error, but I still can't reconnect to the
serial console.

I get similar results when using ipmitool.  In that case, when I try
to reconnect, I get:
# ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
Info: SOL payload already active on another session

If I try to deactivate the existing session, I get:
# ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
Info: SOL payload already de-activated

Once it's in this state, the only thing I've been able to do to regain
access to the serial console is reboot the BMC or wait for the session
to time out.

I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
both running the latest firmware.  Al, if you'd like more information
or debug output from the freeipmi tools I'd be happy to provide it.

thanks,
Brian

On Jan 7, 6:06 pm, Al Chu <address@hidden> wrote:
Thanks also for the FreeIPMI link.  That list confirms the issue
I've been seeing with the Dell iDRACs not responding to the sol
deactivate.  I've made Dell aware of the issue, but don't know if they
have any plans to fix it.

When you do a "sol deactivate" does the original ipmitool session just
hang forever?  I imagine you're hitting a scenario where the original
IPMI/SOL session cannot do SOL anymore, but can send/recv IPMI packets.
The IPMI session can send IPMI keepalive packets and stay happy all day
long, but no SOL traffic will ever be received.  The only way to get a
timeout is to send SOL data (i.e. type at prompt), so that the SOL data
transfer eventually times out.

I added a "serial keepalive" into ipmiconsole/libipmiconsole to try and
deal w/ this situation.  As the name suggests, you "keepalive" a session
using SOL data instead of IPMI data so that the original sessions will
eventually time out (and exit, which is the end goal).  In FreeIPMI's
ipmiconsole this is enabled w/ the "--serial-keepalive" option.

I do believe ipmitool has a similar option "usesolkeepalive" (or
something to that effect).  It may be worth trying too.

Al



On Fri, 2012-01-06 at 20:43 -0800, lambert wrote:
I stand corrected, my second example does appear to work in regards to
trapping the signal while in interact mode.  Not sure what I was doing
wrong the other day.

So I fleshed-out the code in the trap to have it log out of the cmc
and exit out of the expect script upon receiving a SIGHUP, and that
appears to work well.  It can't trap a SIGKILL so it will take a
modification to conman, as you suggested, to have an option for
sending different signal types.  Another approach would be to send a
SIGHUP to all external processes by default, followed by a short wait,
and then a SIGKILL to clean up any stragglers.  I can try playing with
that some, if you want to point me toward the relevant routine.
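
The "SIGHUP, short wait, then SIGKILL" idea above can be sketched in Python as follows. This is not conman code (conman is written in C); the function name `stop_external` and the two-second grace period are assumptions made for the example, and a plain sleeping child stands in for the expect script.

```python
import signal
import subprocess
import sys


def stop_external(proc, grace_seconds=2.0):
    """Send SIGHUP, wait briefly, then SIGKILL if the process is still alive.

    Returns the exit status; on POSIX a negative value means the process
    was terminated by that signal number.
    """
    proc.send_signal(signal.SIGHUP)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be trapped, so this cleans up stragglers
        return proc.wait()


# Stand-in for an external console process (hypothetical example child).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(stop_external(child))
```

A script that traps SIGHUP gets the grace period to log out cleanly; one that ignores the signal is still removed once the wait expires.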

Thanks also for the FreeIPMI link.  That list confirms the issue
I've been seeing with the Dell iDRACs not responding to the sol
deactivate.  I've made Dell aware of the issue, but don't know if they
have any plans to fix it.

Thanks.

On Jan 6, 3:13 am, Chris Dunlap <address@hidden> wrote:
As for IPMI SOL connections, ConMan uses FreeIPMI.  I know Al Chu
(FreeIPMI maintainer) has encountered bugs in several vendor
implementations, and has implemented various workarounds when possible:

http://www.gnu.org/software/freeipmi/freeipmi-bugs-issues-and-workaro...

You could try the internal IPMI support to see if FreeIPMI is better
able to cope with the Dell blades.

conmand connects to an external process via a fork/exec, duping the
ends of the child's socketpair onto stdin/stdout.  It disconnects
from the process by closing its side of the socketpair and sending
a sigkill to the associated pid.
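
As a rough illustration of the mechanism just described, here is a Python sketch that starts a child with one end of a socketpair as its stdin and stdout, then disconnects by closing the parent's end and sending SIGKILL. It is not conmand's implementation (which is C); the function names are hypothetical, and `cat` is used only as a convenient stand-in for an external console script.

```python
import socket
import subprocess


def spawn_on_socketpair(argv):
    """Start argv with one end of a socketpair as both stdin and stdout."""
    parent_end, child_end = socket.socketpair()
    proc = subprocess.Popen(argv, stdin=child_end, stdout=child_end)
    child_end.close()  # the child holds its own duplicates now
    return proc, parent_end


def disconnect(proc, parent_end):
    """Close our side of the socketpair, then SIGKILL the child."""
    parent_end.close()
    proc.kill()
    return proc.wait()


proc, sock = spawn_on_socketpair(["cat"])
sock.sendall(b"hello\n")
print(sock.recv(64))           # data echoed back by the child
print(disconnect(proc, sock))  # exit status of the child
```

Because the child receives no catchable signal in this scheme, it has no chance to run cleanup steps, which is the limitation this thread is about.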

The signal handler approach seems cleaner, but only if we're able
to handle signals within the interact block.  Just playing around at
the shell, this seems to work:

  #!/usr/bin/expect --
  spawn $env(SHELL)
  trap {send_user " SIG[trap -name] "} {USR1 USR2}
  interact

I'm not sure why your 2nd example doesn't work.  I'll try to look at
this some more in the next few days.

-Chris

On Thu, 2012-01-05 at 07:56am PST, lambert wrote:

What I'm trying to do in this case is issue the following commands to
connect to a virtual serial console, on a Dell blade, through the
chassis management controller.

ssh <cmc host>
connect -m server-<n>

At this point I would issue an interact command in the expect script.

Then, disconnecting requires sending a ^\ to close the
serial connection, followed by an 'exit' to exit out of the cmc ssh
connection.

Note that the Dell blades do support IPMI SOL.  I'm currently using an
external script to drive ipmitool (hadn't realized conman now supports
ipmi sol connections internally).  It's working for the most part, but
I'm hitting the same problem in that 1) I can't issue an 'sol
deactivate' to close the connection when conmand shuts down and 2) The
Dell BMCs don't appear to honor the 'sol deactivate' command anyway.

I'm having some general reliability issues with using IPMI SOL on the
Dell blades, so thought I'd try going through the above approach of
establishing a connection by way of the cmc.

I was thinking along the lines of a signal handler.  How does conman
currently execute the external process, is it just a 'system' call?
Just wondering if the external process is already receiving a SIGKILL
when conmand shuts down.

Just now I experimented with creating a 'trap' inside my expect
script.  It works, up until the interact block.  Once the interact
command is executed, the signal handler is no longer being run:

This works (I see 'Ouch!' printed with each SIGUSR1 signal):

set timeout -1
spawn /bin/sh
match_max 100000
send -- "ssh cmc1\r"
expect -exact "ssh cmc1\r
address@hidden's password: "
send -- "#####\r"
expect -gl "\$ "
trap {send_user "Ouch!"} SIGUSR1

But once I add the 'interact' command, the signal handler stops
working, and a SIGUSR1 just causes the expect script to exit:
set timeout -1
spawn /bin/sh
match_max 100000
send -- "ssh cmc1\r"
expect -exact "ssh cmc1\r
address@hidden's password: "
send -- "#####\r"
expect -gl "\$ "
trap {send_user "Ouch!"} SIGUSR1
interact

Thanks.

On Jan 5, 3:01 am, Chris Dunlap <address@hidden> wrote:
No, ConMan currently has no mechanism to trigger an external process
for cleanup before exiting.

One possibility would be to have config keywords to specify, say,
an ExecExitStr and ExecExitDelay.  On exit, conmand would write
the ExecExitStr string into the associated console byte stream,
after which it would wait ExecExitDelay seconds before terminating.
The expect script could specify this ExecExitStr pattern in its
interact block, and upon matching it, perform the necessary sends &
expects to prepare the remote console.  The ExecExitDelay would give
it time to run.  One downside to this approach is that there is no
way to prevent a connected user from typing the ExecExitStr pattern,
thereby triggering the interact block in the expect script.
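
The ExecExitStr / ExecExitDelay proposal above might look roughly like this in Python. ExecExitStr and ExecExitDelay are only proposed keywords at this point in the thread, so everything here is hypothetical: the `@@EXIT@@` marker, the five-second delay, the function name, and the small Python child that stands in for an expect script.

```python
import subprocess
import sys

# Hypothetical stand-in for an expect script: it exits cleanly once it
# sees the exit marker on its input.
CHILD_CODE = r"""
import sys
for line in sys.stdin:
    if line.strip() == "@@EXIT@@":
        sys.exit(0)   # the clean-logout steps would go here
"""


def shutdown_with_exit_str(proc, exit_str, exit_delay):
    """Write exit_str to the child, allow exit_delay seconds, then kill."""
    proc.stdin.write(exit_str.encode() + b"\n")
    proc.stdin.flush()
    try:
        return proc.wait(timeout=exit_delay)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


child = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                         stdin=subprocess.PIPE)
print(shutdown_with_exit_str(child, "@@EXIT@@", exit_delay=5.0))
```

The sketch also makes the stated downside visible: anything that can write to the child's input, including a connected user, can send the marker.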

Another possibility would be to specify a signal handler within
the expect script, and conmand could signal the associated pid
with an ExecExitSigNum signal before waiting ExecExitDelay seconds
to terminate.  But I'd have to do some experimentation to see if I
could craft an appropriate signal handler for an expect script.

Can you elaborate on what you would like to do in order to cleanly
close such a connection?

-Chris

On Wed, 2012-01-04 at 02:41pm PST, lambert wrote:

Is there a way to trigger a clean exit of an external console process,
when the conman daemon is shut down?  Say I'm using the ssh.exp
script, when the conman daemon is shut down (/etc/init.d/conman stop),
I'd like to have the ssh.exp script issue commands to cleanly close
the connection.

I'm trying to work around a problem with some Dell blades where if the
virtual serial console connection is not terminated cleanly, I have to
wait several minutes or reboot the BMC in order to regain access.

thanks.

--
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
_______________________________________________
Freeipmi-devel mailing list
address@hidden
https://lists.gnu.org/mailman/listinfo/freeipmi-devel