freeipmi-devel

Re: [Freeipmi-devel] clean exit of external process when conmand shuts d


From: Albert Chu
Subject: Re: [Freeipmi-devel] clean exit of external process when conmand shuts down
Date: Wed, 11 Jan 2012 13:48:40 -0800

Hey Brian,

Actually, I thought of a fix that was relatively tiny and easy.  It
won't require a workaround on the command line.  I got it in this
branch.

svn co svn://svn.sv.gnu.org/freeipmi/branches/dellsolinstancecapacity

After checking out, run the usual

./autogen.sh; ./configure; make

then

ipmiconsole/ipmiconsole -h host -u user -p pass ...

as before.  PLMK how it works for you.

BTW, for documentation purposes, what motherboard are you seeing this
issue on?

Al

On Wed, 2012-01-11 at 10:54 -0800, Albert Chu wrote:
> Hey Brian,
> 
> On Tue, 2012-01-10 at 20:14 -0800, Brian Lambert wrote:
> > I did another test, and have attached debug output.
> >
> > First, I rebooted the BMC (Dell iDRAC6) to make sure there were no
> > sessions active.
> >
> > I then established an initial SOL session, using the following command:
> >    ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize 
> > --serial-keepalive
> >
> > So far, so good.
> >
> > Instead of killing the first session, I left it active and tried to start
> > a second session using the same command.  That failed as expected, with a
> > "BMC Error" message.  Debug output from that first reconnect attempt is
> > attached in ipmiconsole-reconnect1.txt.
> 
> Ok, I see the problem.
> 
> n003-bmc: =====================================================
> n003-bmc: IPMI 2.0 Get Payload Activation Status Response
> n003-bmc: =====================================================
> <snip>
> n003-bmc: IPMI Command Data:
> n003-bmc: ------------------
> n003-bmc: [              4Ah] = cmd[ 8b]
> n003-bmc: [               0h] = comp_code[ 8b]
> n003-bmc: [               0h] = instance_capacity[ 4b]
> n003-bmc: [               1h] = reserved[ 4b]
> n003-bmc: [               1h] = instance_1[ 1b]
> n003-bmc: [               0h] = instance_2[ 1b]
> n003-bmc: [               0h] = instance_3[ 1b]
> n003-bmc: [               0h] = instance_4[ 1b]
> 
> The bug in Dell's implementation is the "0h = instance_capacity".  This
> indicates the number of SOL instances that can be active at the same time.
> The fact that I ignore that it's 0 is a bug on my part (it should be > 0
> always if SOL can be done).
> 
> This is then used to iterate on instance_1, instance_2, etc. to determine
> if SOL is currently activated.  The 1h = instance_1 indicates that SOL
> is active.  But because instance_capacity is 0, I never look at it, so
> the calculation is that no SOL is currently active.  ipmiconsole
> attempts to activate a SOL session as always, but b/c an SOL session is
> already active, the activation fails, so it tries again (assuming someone
> else raced with libipmiconsole and took SOL before it could).  It checks
> again to see if SOL is active, notices it's not, tries to activate
> again, fails, and now we have a loop.  Eventually there are too many
> failed activation attempts and libipmiconsole errors out.
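
To make the effect of the zero capacity concrete, here is a minimal Python
sketch.  It is not the libipmiconsole code and not necessarily how the branch
fixes it; the function name, the tolerate_zero_capacity flag, and the idea of
scanning every flag when the capacity is 0 are assumptions made only for
illustration.  It works on the already decoded fields from the debug output
above.

```python
def active_instances(instance_capacity, instance_flags,
                     tolerate_zero_capacity=False):
    """Return the 1-based numbers of the SOL instances reported active.

    instance_flags holds the decoded instance_1, instance_2, ... bits.
    Only the first instance_capacity flags are inspected, which mirrors
    the behaviour described above.
    """
    limit = instance_capacity
    if limit == 0 and tolerate_zero_capacity:
        # Assumption: a capacity of 0 is bogus, so inspect every flag.
        limit = len(instance_flags)
    return [i + 1 for i, flag in enumerate(instance_flags[:limit]) if flag]


# Decoded fields from the iDRAC6 response quoted above.
flags = [1, 0, 0, 0]
strict = active_instances(0, flags)
tolerant = active_instances(0, flags, tolerate_zero_capacity=True)
print("strict:", strict, "tolerant:", tolerant)
```

With the strict reading no instance is ever inspected, so the active session
is missed; with the tolerant reading instance 1 is found.
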
> 
> > I then tried to deactivate the existing session using the command:
> >    ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize  
> > --serial-keepalive --deactivate
> >
> > That command completed without error, but the original session was still
> > active and responding to keystrokes.  Debug output from that attempt is
> > attached in ipmiconsole-deactivate1.txt.
> 
> Now this one makes sense.  Given the above knowledge, libipmiconsole
> calculates that the SOL session is already deactivated, so it never
> attempts an actual SOL deactivation.
> 
> I think this is very workaroundable, although I need to think about how
> to do it (via workaround option?  without?) and how I can be
> careful/safe with it and not break other systems.  I'll let ya know when
> I have something you can try and tell ya the branch it's on.
> 
> Al
> 
> > I then tried to activate a new session a second time.  It failed with the
> > same error message as the first reconnect attempt.  Debug output from the
> > second attempt is in ipmiconsole-reconnect2.txt.
> >
> > Thanks for your help.  Let me know if you need further details or want me
> > to try anything else.
> >
> > thanks,
> > Brian
> >
> >
> > On Sun, 8 Jan 2012, Al Chu wrote:
> >
> > > Hi Brian,
> > >
> > > I've moved the IPMI portion of this thread to freeipmi-devel, since it's
> > > a bit more appropriate for this mailing list.
> > >
> > >> To start a session, I can use the following FreeIPMI command:
> > > >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive
> > >>
> > >> I can quit out of that session using the &. escape sequence, and
> > >> reconnect right away.  But if I 'kill -9' that process, I get a
> > >> "[error received]: BMC Error" message when I try to connect with
> > >> another ipmiconsole command.
> > >
> > > This indicates an unexpected error code along the way.  ipmiconsole
> > > probably noticed that the previous SOL session was activated and tried
> > > to deactivate it, with some error occurring at some point.  Could you
> > > > send the --debug output of ipmiconsole when you try to reconnect?
> > >
> > >> This is the same error message I get
> > > >> when trying to connect when another session is already active.  If I
> > >> then issue the command:
> > > >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
> > >> This completes without error, but I still can't reconnect to the
> > >> serial console.
> > >
> > > Can you give me the --debug output of the later connect attempt?  I'd
> > > like to see why it can't connect again.
> > >
> > >> I get similar results when using ipmitool.  In that case, when I try
> > >> to reconnect, I get:
> > >> #ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
> > >> Info: SOL payload already active on another session
> > >>
> > >> If I try to deactivate the existing session, I get:
> > >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
> > >> Info: SOL payload already de-activated
> > >
> > > I don't know the exact test situation you're trying, but you could be
> > > racing a bit in some of these scenarios.  When you kill the previous
> > > session with "kill -9", the server/BMC does not immediately end the
> > > > IPMI/SOL session.  It lasts for a while longer until the server/BMC
> > > eventually times out.  So that can explain why your first activate
> > > attempt indicates the session is already activated, but it's deactivated
> > > > by the time you try to deactivate.
> > >
> > >> Once it's in this state, the only thing I've been able to do to regain
> > >> access to the serial console is reboot the BMC or wait for the session
> > >> to time out.
> > >>
> > >> I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
> > >> both running the latest firmware.  Al, if you'd like more information
> > >> or debug output from the freeipmi tools I'd be happy to provide it.
> > >
> > > Would like to get to the bottom of this.
> > >
> > > Al
> > >
> > >
> > > On Sun, 2012-01-08 at 20:13 -0800, lambert wrote:
> > >> After some additional experimentation, it looks like a direct ssh to
> > >> the Dell blade iDRAC (BMC) followed by a command to activate the
> > >> serial connection may be the way to go with these.  I found that a
> > >> SIGKILL to the ssh session was sufficient to close the serial console
> > > >> session, such that I could start another session without needing to
> > >> wait several minutes for the old session to time out.
> > >>
> > >> I still need to do some more testing, but Chris you may want to wait
> > >> before you spend too much time implementing the external process
> > >> cleanup coding.  If I get this approach working robustly, a clean
> > >> shutdown of the external process will be less important.
> > >>
> > >>
> > >> As for the IPMI SOL issues:
> > >>
> > >> To start a session, I can use the following FreeIPMI command:
> > > >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive
> > >>
> > >> I can quit out of that session using the &. escape sequence, and
> > >> reconnect right away.  But if I 'kill -9' that process, I get a
> > >> "[error received]: BMC Error" message when I try to connect with
> > >> another ipmiconsole command.  This is the same error message I get
> > > >> when trying to connect when another session is already active.  If I
> > >> then issue the command:
> > > >> ./ipmiconsole -h n003-bmc -u root -p calvin -W solpayloadsize --serial-keepalive --deactivate
> > >> This completes without error, but I still can't reconnect to the
> > >> serial console.
> > >>
> > >> I get similar results when using ipmitool.  In that case, when I try
> > >> to reconnect, I get:
> > >> #ipmitool -U root -P calvin -H n003-bmc -I lanplus sol activate
> > >> Info: SOL payload already active on another session
> > >>
> > >> If I try to deactivate the existing session, I get:
> > >> # ipmitool -U root -P calvin -H n003-bmc -I lanplus sol deactivate
> > >> Info: SOL payload already de-activated
> > >>
> > >> Once it's in this state, the only thing I've been able to do to regain
> > >> access to the serial console is reboot the BMC or wait for the session
> > >> to time out.
> > >>
> > >> I have the same experience when connecting to Dell iDRAC5 and iDRAC6,
> > >> both running the latest firmware.  Al, if you'd like more information
> > >> or debug output from the freeipmi tools I'd be happy to provide it.
> > >>
> > >> thanks,
> > >> Brian
> > >>
> > >> On Jan 7, 6:06 pm, Al Chu <address@hidden> wrote:
> > > >>>> Thanks also for the FreeIPMI link.  That list confirms the issue
> > >>>> I've been seeing with the Dell iDRACs not responding to the sol
> > >>>> deactivate.  I've made Dell aware of the issue, but don't know if they
> > >>>> have any plans to fix it.
> > >>>
> > >>> When you do a "sol deactivate" does the original ipmitool session just
> > >>> hang forever?  I imagine you're hitting a scenario where the original
> > >>> IPMI/SOL session cannot do SOL anymore, but can send/recv IPMI packets.
> > >>> The IPMI session can send IPMI keepalive packets and stay happy all day
> > >>> long, but no SOL traffic will ever be received.  The only way to get a
> > >>> timeout is to send SOL data (i.e. type at prompt), so that the SOL data
> > >>> transfer eventually times out.
> > >>>
> > >>> I added a "serial keepalive" into ipmiconsole/libipmiconsole to try and
> > >>> deal w/ this situation.  As the name suggests, you "keepalive" a session
> > >>> using SOL data instead of IPMI data so that the original sessions will
> > >>> eventually time out (and exit, which is the end goal).  In FreeIPMI's
> > >>> ipmiconsole this is enabled w/ the "--serial-keepalive" option.
> > >>>
> > >>> I do believe ipmitool has a similar option "usesolkeepalive" (or
> > > >>> something to that effect).  It may be worth trying too.
> > >>>
> > >>> Al
> > >>>
> > >>>
> > >>>
> > >>> On Fri, 2012-01-06 at 20:43 -0800, lambert wrote:
> > >>>> I stand corrected, my second example does appear to work in regards to
> > >>>> trapping the signal while in interact mode.  Not sure what I was doing
> > >>>> wrong the other day.
> > >>>
> > >>>> So I fleshed-out the code in the trap to have it log out of the cmc
> > >>>> and exit out of the expect script upon receiving a SIGHUP, and that
> > >>>> appears to work well.  It can't trap a SIGKILL so it will take a
> > >>>> modification to conman, as you suggested, to have an option for
> > >>>> sending different signal types.  Another approach would be to send a
> > >>>> SIGHUP to all external processes by default, followed by a short wait,
> > >>>> and then a SIGKILL to clean up any stragglers.  I can try playing with
> > >>>> that some, if you want to point me toward the relevant routine.
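
The "SIGHUP, short wait, then SIGKILL" idea above could be sketched as follows.
This is a Python illustration only, not conmand code; stop_gracefully and
grace_seconds are hypothetical names, and it assumes a Unix system where
SIGHUP exists.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, grace_seconds=2.0):
    """Send SIGHUP, wait briefly, then SIGKILL if the child is still running.

    Returns "SIGHUP" if the child exited within the grace period,
    otherwise "SIGKILL".
    """
    proc.send_signal(signal.SIGHUP)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGHUP"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be trapped or ignored
        proc.wait()   # reap the child so no zombie is left behind
        return "SIGKILL"


def spawn(child_code):
    """Start a Python child and block until it prints its 'ready' line."""
    proc = subprocess.Popen([sys.executable, "-c", child_code],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()
    return proc


# A child that dies on SIGHUP, and one that ignores it (a "straggler").
POLITE = "import time; print('ready', flush=True); time.sleep(60)"
STUBBORN = ("import signal, time; "
            "signal.signal(signal.SIGHUP, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)")

if __name__ == "__main__":
    print(stop_gracefully(spawn(POLITE), grace_seconds=5.0))
    print(stop_gracefully(spawn(STUBBORN), grace_seconds=0.5))
```

An external script that traps SIGHUP gets the grace period to log out
cleanly; anything that has not exited by then is killed as before.
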
> > >>>
> > > >>>> Thanks also for the FreeIPMI link.  That list confirms the issue
> > >>>> I've been seeing with the Dell iDRACs not responding to the sol
> > >>>> deactivate.  I've made Dell aware of the issue, but don't know if they
> > >>>> have any plans to fix it.
> > >>>
> > >>>> Thanks.
> > >>>
> > >>>> On Jan 6, 3:13 am, Chris Dunlap <address@hidden> wrote:
> > >>>>> As for IPMI SOL connections, ConMan uses FreeIPMI.  I know Al Chu
> > >>>>> (FreeIPMI maintainer) has encountered bugs in several vendor
> > > >>>>> implementations, and has implemented various workarounds when possible:
> > >>>
> > >>>>> http://www.gnu.org/software/freeipmi/freeipmi-bugs-issues-and-workaro...
> > >>>
> > >>>>> You could try the internal IPMI support to see if FreeIPMI is better
> > >>>>> able to cope with the Dell blades.
> > >>>
> > >>>>> conmand connects to an external process via a fork/exec, duping the
> > >>>>> ends of the child's socketpair onto stdin/stdout.  It disconnects
> > >>>>> from the process by closing its side of the socketpair and sending
> > >>>>> a sigkill to the associated pid.
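
For readers unfamiliar with the socketpair arrangement described above, here
is a rough Python analogue.  It is not conmand's C implementation; the helper
names and the upper-casing child are made up for illustration, and a Unix
system is assumed.

```python
import signal
import socket
import subprocess
import sys

# Stand-in for the external process: echoes each input line in upper case.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


def spawn_on_socketpair(argv):
    """Start argv with one end of a socketpair as its stdin and stdout."""
    parent_end, child_end = socket.socketpair()
    proc = subprocess.Popen(argv, stdin=child_end, stdout=child_end)
    child_end.close()  # the child holds its own duplicate now
    return proc, parent_end


def read_line(sock):
    """Read from the socket until a newline or EOF."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(1024)
        if not chunk:
            break
        data += chunk
    return data


def disconnect(proc, parent_end):
    """Close our side of the socketpair, then SIGKILL the child."""
    parent_end.close()
    proc.send_signal(signal.SIGKILL)
    proc.wait()
    return proc.returncode


if __name__ == "__main__":
    proc, sock = spawn_on_socketpair([sys.executable, "-c", CHILD])
    sock.sendall(b"hello\n")
    print(read_line(sock))
    disconnect(proc, sock)
```

Because the disconnect ends in SIGKILL, the child gets no chance to run any
cleanup of its own, which is the limitation this thread is about.
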
> > >>>
> > >>>>> The signal handler approach seems cleaner, but only if we're able
> > >>>>> to handle signals within the interact block.  Just playing around at
> > >>>>> the shell, this seems to work:
> > >>>
> > >>>>>   #!/usr/bin/expect --
> > >>>>>   spawn $env(SHELL)
> > >>>>>   trap {send_user " SIG[trap -name] "} {USR1 USR2}
> > >>>>>   interact
> > >>>
> > >>>>> I'm not sure why your 2nd example doesn't work.  I'll try to look at
> > >>>>> this some more in the next few days.
> > >>>
> > >>>>> -Chris
> > >>>
> > >>>>> On Thu, 2012-01-05 at 07:56am PST, lambert wrote:
> > >>>
> > >>>>>> What I'm trying to do in this case is issue the following commands to
> > >>>>>> connect to a virtual serial console, on a Dell blade, through the
> > >>>>>> chassis management controller.
> > >>>
> > >>>>>> ssh <cmc host>
> > >>>>>> connect -m server-<n>
> > >>>
> > >>>>>> At this point I would issue an interact command in the expect script.
> > >>>
> > >>>>>> Then, to close the connection requires sending a ^\ to close the
> > >>>>>> serial connection, followed by an 'exit' to exit out of the cmc ssh
> > >>>>>> connection.
> > >>>
> > > >>>>>> Note that the Dell blades do support IPMI SOL.  I'm currently using an
> > > >>>>>> external script to drive ipmitool (hadn't realized conman now supports
> > > >>>>>> ipmi sol connections internally).  It's working for the most part, but
> > > >>>>>> I'm hitting the same problem in that 1) I can't issue an 'sol
> > > >>>>>> deactivate' to close the connection when conmand shuts down and 2) The
> > > >>>>>> Dell BMCs don't appear to honor the 'sol deactivate' command anyway.
> > >>>
> > >>>>>> I'm having some general reliability issues with using IPMI SOL on the
> > >>>>>> Dell blades, so thought I'd try going through the above approach of
> > >>>>>> establishing a connection by way of the cmc.
> > >>>
> > >>>>>> I was thinking along the lines of a signal handler.  How does conman
> > >>>>>> currently execute the external process, is it just a 'system' call?
> > >>>>>> Just wondering if the external process is already receiving a SIGKILL
> > >>>>>> when conmand shuts down.
> > >>>
> > >>>>>> Just now I experimented with creating a 'trap' inside my expect
> > >>>>>> script.  It works, up until the interact block.  Once the interact
> > >>>>>> command is executed, the signal handler is no longer being run:
> > >>>
> > >>>>>> This works ( I see 'Ouch!' printed with each SIGUSR1 signal):
> > >>>
> > >>>>>> set timeout -1
> > >>>>>> spawn /bin/sh
> > >>>>>> match_max 100000
> > >>>>>> send -- "ssh cmc1\r"
> > >>>>>> expect -exact "ssh cmc1\r
> > >>>>>> address@hidden's password: "
> > >>>>>> send -- "#####\r"
> > >>>>>> expect -gl "\$ "
> > >>>>>> trap {send_user "Ouch!"} SIGUSR1
> > >>>
> > >>>>>> But once I add the 'interact' command, the signal handler stops
> > >>>>>> working, and a SIGUSR1 just causes the expect script to exit:
> > >>>>>> set timeout -1
> > >>>>>> spawn /bin/sh
> > >>>>>> match_max 100000
> > >>>>>> send -- "ssh cmc1\r"
> > >>>>>> expect -exact "ssh cmc1\r
> > >>>>>> address@hidden's password: "
> > >>>>>> send -- "#####\r"
> > >>>>>> expect -gl "\$ "
> > >>>>>> trap {send_user "Ouch!"} SIGUSR1
> > >>>>>> interact
> > >>>
> > >>>>>> Thanks.
> > >>>
> > > >>>>>> On Jan 5, 3:01 am, Chris Dunlap <address@hidden> wrote:
> > >>>>>>> No, ConMan currently has no mechanism to trigger an external process
> > >>>>>>> for cleanup before exiting.
> > >>>
> > >>>>>>> One possibility would be to have config keywords to specify, say,
> > > >>>>>>> an ExecExitStr and ExecExitDelay.  On exit, conmand would write
> > >>>>>>> the ExecExitStr string into the associated console byte stream,
> > >>>>>>> after which it would wait ExecExitDelay seconds before terminating.
> > >>>>>>> The expect script could specify this ExecExitStr pattern in its
> > >>>>>>> interact block, and upon matching it, perform the necessary sends &
> > > >>>>>>> expects to prepare the remote console.  The ExecExitDelay would give
> > > >>>>>>> it time to run.  One downside to this approach is that there is no
> > >>>>>>> way to prevent a connected user from typing the ExecExitStr pattern,
> > >>>>>>> thereby triggering the interact block in the expect script.
> > >>>
> > >>>>>>> Another possibility would be to specify a signal handler within
> > >>>>>>> the expect script, and conmand could signal the associated pid
> > >>>>>>> with an ExecExitSigNum signal before waiting ExecExitDelay seconds
> > > >>>>>>> to terminate.  But I'd have to do some experimentation to see if I
> > >>>>>>> could craft an appropriate signal handler for an expect script.
> > >>>
> > >>>>>>> Can you elaborate on what you would like to do in order to cleanly
> > >>>>>>> close such a connection?
> > >>>
> > >>>>>>> -Chris
> > >>>
> > >>>>>>> On Wed, 2012-01-04 at 02:41pm PST, lambert wrote:
> > >>>
> > > >>>>>>>> Is there a way to trigger a clean exit of an external console process,
> > > >>>>>>>> when the conman daemon is shut down?  Say I'm using the ssh.exp
> > > >>>>>>>> script, when the conman daemon is shut down (/etc/init.d/conman stop),
> > >>>>>>>> I'd like to have the ssh.exp script issue commands to cleanly close
> > >>>>>>>> the connection.
> > >>>
> > > >>>>>>>> I'm trying to work around a problem with some Dell blades where if the
> > > >>>>>>>> virtual serial console connection is not terminated cleanly, I have to
> > > >>>>>>>> wait several minutes or reboot the BMC in order to regain access.
> > >>>
> > >>>>>>>> thanks.
> > >>>
> > >>> --
> > >>> Albert Chu
> > >>> address@hidden
> > >>> Computer Scientist
> > >>> High Performance Systems Division
> > >>> Lawrence Livermore National Laboratory
> > > --
> > > Albert Chu
> > > address@hidden
> > > Computer Scientist
> > > High Performance Systems Division
> > > Lawrence Livermore National Laboratory
> > >
> --
> Albert Chu
> address@hidden
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> 
> 
> 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




