From: Albert Chu
Subject: Re: [Freeipmi-devel] clean exit of external process when conmand shuts down
Date: Wed, 11 Jan 2012 10:54:41 -0800

Hey Brian,

On Tue, 2012-01-10 at 20:14 -0800, Brian Lambert wrote:
> I did another test, and have attached debug output.
> 
> First, I rebooted the BMC (Dell iDRAC6) to make sure there were no
> sessions active.
> 
> I then established an initial SOL session, using the following command:
>    ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive
> 
> So far, so good.
> 
> Instead of killing the first session, I left it active and tried to start
> a second session using the same command.  That failed as expected, with a
> "BMC Error" message.  Debug output from that first reconnect attempt is
> attached in ipmiconsole-reconnect1.txt.

Ok, I see the problem.  

n003-bmc: =====================================================
n003-bmc: IPMI 2.0 Get Payload Activation Status Response
n003-bmc: =====================================================
<snip>
n003-bmc: IPMI Command Data:
n003-bmc: ------------------
n003-bmc: [              4Ah] = cmd[ 8b]
n003-bmc: [               0h] = comp_code[ 8b]
n003-bmc: [               0h] = instance_capacity[ 4b]
n003-bmc: [               1h] = reserved[ 4b]
n003-bmc: [               1h] = instance_1[ 1b]
n003-bmc: [               0h] = instance_2[ 1b]
n003-bmc: [               0h] = instance_3[ 1b]
n003-bmc: [               0h] = instance_4[ 1b]

The bug in Dell's implementation is the "0h = instance_capacity".  This
field indicates the number of SOL instances that can be active at the same
time.  The fact that I ignore that it's 0 is a bug on my part (it should
always be > 0 if SOL is supported).
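
To make the field layout concrete, here is a minimal Python sketch of decoding the bytes that follow comp_code in that response.  It is not libipmiconsole code: the names are made up for illustration, and the byte layout (capacity in the low nibble of the first byte, one flag bit per instance in the following bytes) is an assumption read off the field order in the dump above.

```python
from typing import NamedTuple


class PayloadActivationStatus(NamedTuple):
    instance_capacity: int   # low nibble of the capacity byte
    reserved: int            # high nibble of the capacity byte
    active_instances: list   # 1-based instance numbers whose flag is set


def parse_activation_status(data: bytes) -> PayloadActivationStatus:
    """Decode the bytes that follow comp_code in the response.

    Assumed layout: data[0] is the capacity byte, data[1] holds the
    flags for instances 1-8 (bit 0 = instance 1), and an optional
    data[2] holds the flags for instances 9-16.
    """
    if len(data) < 2:
        raise ValueError("need the capacity byte and at least one flag byte")
    capacity_byte = data[0]
    flags = data[1] | ((data[2] << 8) if len(data) > 2 else 0)
    active = [n + 1 for n in range(16) if flags & (1 << n)]
    return PayloadActivationStatus(capacity_byte & 0x0F, capacity_byte >> 4, active)


# Bytes reconstructed from the dump: capacity 0h, reserved 1h, instance_1 set.
status = parse_activation_status(bytes([0x10, 0x01, 0x00]))
print(status)
```

With those bytes the capacity decodes to 0 while instance 1 is flagged active, which is the contradiction described above.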

This value is then used to iterate over instance_1, instance_2, etc. to
determine if SOL is currently activated.  The 1h = instance_1 indicates
that SOL is active.  But because instance_capacity is 0, I never look at
it, so the calculation is that no SOL is currently active.  ipmiconsole
attempts to activate a SOL session as always, but because an SOL session
is already active, the activation fails, so it tries again (assuming
someone else raced with libipmiconsole and took SOL before it could).  It
checks again to see if SOL is active, concludes it's not, tries to
activate again, fails, and now we have a loop.  Eventually there are too
many failed activation attempts and libipmiconsole errors out.
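
The loop can be reproduced with a small Python simulation.  This is only a sketch of the logic described above, not the real libipmiconsole state machine: FakeBmc, the function names, and the attempt limit are all invented for illustration.

```python
MAX_ACTIVATION_ATTEMPTS = 4  # illustrative limit, not libipmiconsole's real value


class FakeBmc:
    """Stand-in for a BMC that reports capacity 0 while instance 1 is active."""

    def get_status(self):
        return {"instance_capacity": 0, "instances": [True, False, False, False]}

    def activate(self):
        return False  # an SOL session already holds the payload


def sol_is_active(status):
    # Only the first `instance_capacity` flags are inspected, so a
    # capacity of 0 means no flag is ever looked at.
    return any(status["instances"][: status["instance_capacity"]])


def connect(bmc):
    attempts = 0
    while attempts < MAX_ACTIVATION_ATTEMPTS:
        if sol_is_active(bmc.get_status()):
            return "deactivate existing session first"
        attempts += 1
        if bmc.activate():
            return "connected"
        # Activation failed: assume someone raced us and check again.
    return "BMC Error"


result = connect(FakeBmc())
print(result)
```

Because the status check never sees the active instance, every pass through the loop ends in a failed activation until the attempt limit is reached.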

> I then tried to deactivate the existing session using the command:
>    ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
> 
> That command completed without error, but the original session was still
> active and responding to keystrokes.  Debug output from that attempt is
> attached in ipmiconsole-deactivate1.txt.

Now this one makes sense.  Given the above knowledge, libipmiconsole
calculates that the SOL session is already deactivated, so it never
attempts an actual SOL deactivation.

I think this is very workaroundable, although I need to think about how
to do it (via workaround option?  without?) and how I can be
careful/safe with it and not break other systems.  I'll let ya know when
I have something you can try and tell ya the branch it's on.
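
Purely as a sketch of one shape an opt-in workaround could take (Python for brevity; `ignore_capacity` is a made-up name, not an existing libipmiconsole option, and nothing here is a committed design):

```python
def active_instances(capacity, flags, ignore_capacity=False):
    """Return the 1-based instance numbers considered active.

    By default only the first `capacity` flags are trusted.  With the
    hypothetical `ignore_capacity` workaround enabled, every flag is
    inspected regardless of what the BMC reports as its capacity.
    """
    limit = len(flags) if ignore_capacity else min(capacity, len(flags))
    return [n + 1 for n in range(limit) if flags[n]]


dell_flags = [True, False, False, False]  # instance_1 set, as in the dump
print(active_instances(0, dell_flags))                        # []
print(active_instances(0, dell_flags, ignore_capacity=True))  # [1]
```

Keeping the default behavior unchanged and making the extra check opt-in is one way to avoid breaking BMCs that report the capacity correctly.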

Al

> I then tried to activate a new session a second time.  It failed with the
> same error message as the first reconnect attempt.  Debug output from the
> second attempt is in ipmiconsole-reconnect2.txt.
> 
> Thanks for your help.  Let me know if you need further details or want me
> to try anything else.
> 
> thanks,
> Brian
> 
> 
> On Sun, 8 Jan 2012, Al Chu wrote:
> 
> > Hi Brian,
> >
> > I've moved the IPMI portion of this thread to freeipmi-devel, since it's
> > a bit more appropriate for this mailing list.
> >
> >> To start a session, I can use the following FreeIPMI command:
> >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive
> >>
> >> I can quit out of that session using the &. escape sequence, and
> >> reconnect right away.  But if I 'kill -9' that process, I get a
> >> "[error received]: BMC Error" message when I try to connect with
> >> another ipmiconsole command.
> >
> > This indicates an unexpected error code along the way.  ipmiconsole
> > probably noticed that the previous SOL session was activated and tried
> > to deactivate it, with some error occurring at some point.  Could you
> > send the --debug output of ipmiconsole when you try to reconnect?
> >
> >> This is the same error message I get
> >> when trying to connect when another session is already active.  If I
> >> then issue the command:
> >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
> >> This completes without error, but I still can't reconnect to the
> >> serial console.
> >
> > Can you give me the --debug output of the later connect attempt?  I'd
> > like to see why it can't connect again.
> >
> >> I get similar results when using ipmitool.  In that case, when I try
> >> to reconnect, I get:
> >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
> >> Info: SOL payload already active on another session
> >>
> >> If I try to deactivate the existing session, I get:
> >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
> >> Info: SOL payload already de-activated
> >
> > I don't know the exact test situation you're trying, but you could be
> > racing a bit in some of these scenarios.  When you kill the previous
> > session with "kill -9", the server/BMC does not immediately end the
> > IPMI/SOL session.  It lasts for a while longer until the server/BMC
> > eventually times out.  So that can explain why your first activate
> > attempt indicates the session is already activated, but it's deactivated
> > by the time you try to deactivate.
> >
> >> Once it's in this state, the only thing I've been able to do to regain
> >> access to the serial console is reboot the BMC or wait for the session
> >> to time out.
> >>
> >> I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
> >> both running the latest firmware.  Al, if you'd like more information
> >> or debug output from the freeipmi tools I'd be happy to provide it.
> >
> > Would like to get to the bottom of this.
> >
> > Al
> >
> >
> > On Sun, 2012-01-08 at 20:13 -0800, lambert wrote:
> >> After some additional experimentation, it looks like a direct ssh to
> >> the Dell blade iDRAC (BMC) followed by a command to activate the
> >> serial connection may be the way to go with these.  I found that a
> >> SIGKILL to the ssh session was sufficient to close the serial console
> >> session, such that I could start another session without needing to
> >> wait several minutes for the old session to time out.
> >>
> >> I still need to do some more testing, but Chris you may want to wait
> >> before you spend too much time implementing the external process
> >> cleanup coding.  If I get this approach working robustly, a clean
> >> shutdown of the external process will be less important.
> >>
> >>
> >> As for the IPMI SOL issues:
> >>
> >> To start a session, I can use the following FreeIPMI command:
> >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive
> >>
> >> I can quit out of that session using the &. escape sequence, and
> >> reconnect right away.  But if I 'kill -9' that process, I get a
> >> "[error received]: BMC Error" message when I try to connect with
> >> another ipmiconsole command.  This is the same error message I get
> >> when trying to connect when another session is already active.  If I
> >> then issue the command:
> >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
> >> This completes without error, but I still can't reconnect to the
> >> serial console.
> >>
> >> I get similar results when using ipmitool.  In that case, when I try
> >> to reconnect, I get:
> >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
> >> Info: SOL payload already active on another session
> >>
> >> If I try to deactivate the existing session, I get:
> >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
> >> Info: SOL payload already de-activated
> >>
> >> Once it's in this state, the only thing I've been able to do to regain
> >> access to the serial console is reboot the BMC or wait for the session
> >> to time out.
> >>
> >> I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
> >> both running the latest firmware.  Al, if you'd like more information
> >> or debug output from the freeipmi tools I'd be happy to provide it.
> >>
> >> thanks,
> >> Brian
> >>
> >> On Jan 7, 6:06 pm, Al Chu <address@hidden> wrote:
> >>>> Thanks also for the FreeIPMI link.  That list confirms the issue
> >>>> I've been seeing with the Dell iDRACs not responding to the sol
> >>>> deactivate.  I've made Dell aware of the issue, but don't know if they
> >>>> have any plans to fix it.
> >>>
> >>> When you do a "sol deactivate" does the original ipmitool session just
> >>> hang forever?  I imagine you're hitting a scenario where the original
> >>> IPMI/SOL session cannot do SOL anymore, but can send/recv IPMI packets.
> >>> The IPMI session can send IPMI keepalive packets and stay happy all day
> >>> long, but no SOL traffic will ever be received.  The only way to get a
> >>> timeout is to send SOL data (i.e. type at prompt), so that the SOL data
> >>> transfer eventually times out.
> >>>
> >>> I added a "serial keepalive" into ipmiconsole/libipmiconsole to try and
> >>> deal w/ this situation.  As the name suggests, you "keepalive" a session
> >>> using SOL data instead of IPMI data so that the original sessions will
> >>> eventually time out (and exit, which is the end goal).  In FreeIPMI's
> >>> ipmiconsole this is enabled w/ the "--serial-keepalive" option.
> >>>
> >>> I do believe ipmitool has a similar option "usesolkeepalive" (or
> >>> something to that effect).  It may be worth trying too.
> >>>
> >>> Al
> >>>
> >>>
> >>>
> >>> On Fri, 2012-01-06 at 20:43 -0800, lambert wrote:
> >>>> I stand corrected, my second example does appear to work in regards to
> >>>> trapping the signal while in interact mode.  Not sure what I was doing
> >>>> wrong the other day.
> >>>
> >>>> So I fleshed-out the code in the trap to have it log out of the cmc
> >>>> and exit out of the expect script upon receiving a SIGHUP, and that
> >>>> appears to work well.  It can't trap a SIGKILL so it will take a
> >>>> modification to conman, as you suggested, to have an option for
> >>>> sending different signal types.  Another approach would be to send a
> >>>> SIGHUP to all external processes by default, followed by a short wait,
> >>>> and then a SIGKILL to clean up any stragglers.  I can try playing with
> >>>> that some, if you want to point me toward the relevant routine.
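
For reference, the SIGHUP, short wait, then SIGKILL idea could look roughly like the following.  This is a Python sketch for illustration only, not conman code (conman is written in C); the function name, the grace period, and the stand-in child process are all assumptions.

```python
import signal
import subprocess
import sys


def stop_external_process(proc, grace_seconds=2.0):
    """Send SIGHUP, wait briefly, then SIGKILL if the child is still running."""
    proc.send_signal(signal.SIGHUP)
    try:
        proc.wait(timeout=grace_seconds)
        return "exited after SIGHUP"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be trapped, so this always ends the child
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"


# Stand-in for an external console process such as an expect script.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_external_process(child)
print(outcome, child.returncode)
```

A script that traps SIGHUP gets the grace period to log out cleanly; one that does not trap it simply exits on the first signal, and anything that hangs is still cleaned up by the SIGKILL.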
> >>>
> >>>> Thanks also for the FreeIPMI link.  That list confirms the issue
> >>>> I've been seeing with the Dell iDRACs not responding to the sol
> >>>> deactivate.  I've made Dell aware of the issue, but don't know if they
> >>>> have any plans to fix it.
> >>>
> >>>> Thanks.
> >>>
> >>>> On Jan 6, 3:13 am, Chris Dunlap <address@hidden> wrote:
> >>>>> As for IPMI SOL connections, ConMan uses FreeIPMI.  I know Al Chu
> >>>>> (FreeIPMI maintainer) has encountered bugs in several vendor
> >>>>> implementations, and has implemented various workarounds when possible:
> >>>
> >>>>> http://www.gnu.org/software/freeipmi/freeipmi-bugs-issues-and-workaro...
> >>>
> >>>>> You could try the internal IPMI support to see if FreeIPMI is better
> >>>>> able to cope with the Dell blades.
> >>>
> >>>>> conmand connects to an external process via a fork/exec, duping the
> >>>>> ends of the child's socketpair onto stdin/stdout.  It disconnects
> >>>>> from the process by closing its side of the socketpair and sending
> >>>>> a sigkill to the associated pid.
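
The connect/disconnect sequence described above can be sketched as follows.  This is a Python approximation for illustration, not conmand's C implementation; `spawn_console` and `disconnect` are hypothetical names, and `cat` stands in for the external console script.

```python
import socket
import subprocess


def spawn_console(argv):
    """Start argv with one end of a socketpair as its stdin and stdout."""
    parent_end, child_end = socket.socketpair()
    proc = subprocess.Popen(argv, stdin=child_end, stdout=child_end)
    child_end.close()  # the child holds its own copy of this descriptor
    return proc, parent_end


def disconnect(proc, parent_end):
    """Close our side of the socketpair and SIGKILL the child."""
    parent_end.close()
    proc.kill()
    proc.wait()


proc, sock = spawn_console(["cat"])
sock.sendall(b"hello\n")
echoed = sock.recv(64)  # cat copies its stdin back to its stdout
disconnect(proc, sock)
print(echoed)
```

Since the child is ended with SIGKILL, it gets no chance to run any cleanup of its own, which is the limitation discussed in this thread.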
> >>>
> >>>>> The signal handler approach seems cleaner, but only if we're able
> >>>>> to handle signals within the interact block.  Just playing around at
> >>>>> the shell, this seems to work:
> >>>
> >>>>>   #!/usr/bin/expect --
> >>>>>   spawn $env(SHELL)
> >>>>>   trap {send_user " SIG[trap -name] "} {USR1 USR2}
> >>>>>   interact
> >>>
> >>>>> I'm not sure why your 2nd example doesn't work.  I'll try to look at
> >>>>> this some more in the next few days.
> >>>
> >>>>> -Chris
> >>>
> >>>>> On Thu, 2012-01-05 at 07:56am PST, lambert wrote:
> >>>
> >>>>>> What I'm trying to do in this case is issue the following commands to
> >>>>>> connect to a virtual serial console, on a Dell blade, through the
> >>>>>> chassis management controller.
> >>>
> >>>>>> ssh <cmc host>
> >>>>>> connect -m server-<n>
> >>>
> >>>>>> At this point I would issue an interact command in the expect script.
> >>>
> >>>>>> Then, to close the connection requires sending a ^\ to close the
> >>>>>> serial connection, followed by an 'exit' to exit out of the cmc ssh
> >>>>>> connection.
> >>>
> >>>>>> Note that the Dell blades do support IPMI SOL.  I'm currently using an
> >>>>>> external script to drive ipmitool (hadn't realized conman now supports
> >>>>>> ipmi sol connections internally).  It's working for the most part, but
> >>>>>> I'm hitting the same problem in that 1) I can't issue an 'sol
> >>>>>> deactivate' to close the connection when conmand shuts down and 2) The
> >>>>>> Dell BMCs don't appear to honor the 'sol deactivate' command anyway.
> >>>
> >>>>>> I'm having some general reliability issues with using IPMI SOL on the
> >>>>>> Dell blades, so thought I'd try going through the above approach of
> >>>>>> establishing a connection by way of the cmc.
> >>>
> >>>>>> I was thinking along the lines of a signal handler.  How does conman
> >>>>>> currently execute the external process, is it just a 'system' call?
> >>>>>> Just wondering if the external process is already receiving a SIGKILL
> >>>>>> when conmand shuts down.
> >>>
> >>>>>> Just now I experimented with creating a 'trap' inside my expect
> >>>>>> script.  It works, up until the interact block.  Once the interact
> >>>>>> command is executed, the signal handler is no longer being run:
> >>>
> >>>>>> This works ( I see 'Ouch!' printed with each SIGUSR1 signal):
> >>>
> >>>>>> set timeout -1
> >>>>>> spawn /bin/sh
> >>>>>> match_max 100000
> >>>>>> send -- "ssh cmc1\r"
> >>>>>> expect -exact "ssh cmc1\r
> >>>>>> address@hidden's password: "
> >>>>>> send -- "#####\r"
> >>>>>> expect -gl "\$ "
> >>>>>> trap {send_user "Ouch!"} SIGUSR1
> >>>
> >>>>>> But once I add the 'interact' command, the signal handler stops
> >>>>>> working, and a SIGUSR1 just causes the expect script to exit:
> >>>>>> set timeout -1
> >>>>>> spawn /bin/sh
> >>>>>> match_max 100000
> >>>>>> send -- "ssh cmc1\r"
> >>>>>> expect -exact "ssh cmc1\r
> >>>>>> address@hidden's password: "
> >>>>>> send -- "#####\r"
> >>>>>> expect -gl "\$ "
> >>>>>> trap {send_user "Ouch!"} SIGUSR1
> >>>>>> interact
> >>>
> >>>>>> Thanks.
> >>>
> >>>>>> On Jan 5, 3:01 am, Chris Dunlap <address@hidden> wrote:
> >>>>>>> No, ConMan currently has no mechanism to trigger an external process
> >>>>>>> for cleanup before exiting.
> >>>
> >>>>>>> One possibility would be to have config keywords to specify, say,
> >>>>>>> an ExecExitStr and ExecExitDelay.  On exit, conmand would write
> >>>>>>> the ExecExitStr string into the associated console byte stream,
> >>>>>>> after which it would wait ExecExitDelay seconds before terminating.
> >>>>>>> The expect script could specify this ExecExitStr pattern in its
> >>>>>>> interact block, and upon matching it, perform the necessary sends &
> >>>>>>> expects to prepare the remote console.  The ExecExitDelay would give
> >>>>>>> it time to run.  One downside to this approach is that there is no
> >>>>>>> way to prevent a connected user from typing the ExecExitStr pattern,
> >>>>>>> thereby triggering the interact block in the expect script.
> >>>
> >>>>>>> Another possibility would be to specify a signal handler within
> >>>>>>> the expect script, and conmand could signal the associated pid
> >>>>>>> with an ExecExitSigNum signal before waiting ExecExitDelay seconds
> >>>>>>> to terminate.  But I'd have to do some experimentation to see if I
> >>>>>>> could craft an appropriate signal handler for an expect script.
> >>>
> >>>>>>> Can you elaborate on what you would like to do in order to cleanly
> >>>>>>> close such a connection?
> >>>
> >>>>>>> -Chris
> >>>
> >>>>>>> On Wed, 2012-01-04 at 02:41pm PST, lambert wrote:
> >>>
> >>>>>>>> Is there a way to trigger a clean exit of an external console process,
> >>>>>>>> when the conman daemon is shut down?  Say I'm using the ssh.exp
> >>>>>>>> script, when the conman daemon is shut down (/etc/init.d/conman stop),
> >>>>>>>> I'd like to have the ssh.exp script issue commands to cleanly close
> >>>>>>>> the connection.
> >>>
> >>>>>>>> I'm trying to work around a problem with some Dell blades where if the
> >>>>>>>> virtual serial console connection is not terminated cleanly, I have to
> >>>>>>>> wait several minutes or reboot the BMC in order to regain access.
> >>>
> >>>>>>>> thanks.
> >>>
> >>> --
> >>> Albert Chu
> >>> address@hidden
> >>> Computer Scientist
> >>> High Performance Systems Division
> >>> Lawrence Livermore National Laboratory
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




