[PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang

dejagnu

From:	Maciej W. Rozycki
Subject:	[PATCH 2/2] remote: Fix a stuck remote call pipeline causing testing to hang
Date:	Wed, 20 May 2020 22:22:24 +0100 (BST)
User-agent:	Alpine 2.21 (LFD 202 2017-01-01)

Fix a stuck remote call pipeline comprised of multiple processes causing 
testing to hang and requiring a manual intervention to either terminate 
or proceed, like below (here with the GCC `c' testsuite invoked with 
`execute.exp=postmod-1.c' for 8 compilation and 8 execution tests on a 
remote QEMU target run in the system emulation mode):

PASS: gcc.c-torture/execute/postmod-1.c   -O0  (test for excess errors)
Executing on remote-localhost: .../gcc/testsuite/gcc/postmod-1.exe    (timeout = 15)
spawn [open ...]
WARNING: program timed out
got a INT signal, interrupted by user

                === gcc Summary ===

# of expected passes            1

by not killing the pending force-kills in `close_wait_program' and also 
by setting the channel associated with the pipeline to the nonblocking 
mode when it is about to be closed afterwards.

The situation here is as follows.  A connection to the remote target 
board is requested by `rsh_exec' with input redirection requested from 
`/dev/null'.  The request is handled by `local_exec' and the redirection 
causes a Tcl command pipeline channel to be opened.  The list of PIDs of 
the processes comprising the pipeline is determined and then the channel 
is assigned an Expect spawn ID.  Output produced by the remote target 
(here accessed with SSH) is then awaited on the spawn ID, followed 
ultimately by completion, marked by the end-of-file condition.
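
As an illustration of the pipeline setup only, here is a minimal sketch 
in plain Tcl (no Expect, so no spawn ID is assigned).  The `echo' and 
`cat' commands are stand-ins chosen for the example and are not what 
`local_exec' actually runs:

```tcl
# Open a command pipeline of two processes, with the standard input of
# the first one redirected from /dev/null.  `echo' and `cat' are
# stand-ins (an assumption of this sketch) for the real remote command.
set chan [open "|echo hello < /dev/null | cat" r]

# For a pipeline channel `pid' returns a list with one PID per process.
set pids [pid $chan]
puts "pipeline PIDs: $pids ([llength $pids] processes)"

# Read until end-of-file, then close.  A blocking `close' waits for
# every process of the pipeline to terminate.
set output [read $chan]
close $chan
puts -nonewline $output
```

The length of the list returned by `pid' is what tells a pipeline of 
several processes apart from a single process.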

As SSH gets stuck and does not complete, the timeout eventually fires 
and a kill sequence is initiated by calling `close_wait_program' with 
the list of PIDs obtained previously passed as one of the procedure's 
arguments.  Seeing a list of PIDs rather than -1, `close_wait_program' 
sends SIGINT to all the requested processes right away and schedules a 
delayed sequence for them, called "force-kills", which sends SIGTERM 
and then, after a further delay, SIGKILL.
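
The escalation described above can be sketched roughly as follows.  
This is not the code from `lib/remote.exp'; the procedure name 
`escalate_kill', the one-second delay and the `sleep' stand-in for a 
stuck process are all assumptions made for the example:

```tcl
# Hypothetical helper: send SIGINT to every PID in `pids' at once, then
# let a background shell deliver SIGTERM and finally SIGKILL after
# delays.  Returns the PID of that background shell.
proc escalate_kill {pids {delay 1}} {
    # First attempt: an interrupt, delivered immediately.
    catch {exec kill -INT {*}$pids}
    # Delayed escalation, run in the background so that the caller is
    # not held up.
    set script "sleep $delay; kill -TERM $pids; sleep $delay; kill -KILL $pids"
    return [exec sh -c $script >/dev/null 2>/dev/null &]
}

# Stand-in for a stuck process: ignore SIGINT and SIGTERM, then sleep.
set victim [exec sh -c {trap "" INT TERM; exec sleep 30} &]
after 200   ;# give the shell time to install the traps

set killer [escalate_kill $victim]
puts "victim PID $victim, force-kill shell PID $killer"
```

The PID returned is what a caller would need in order to cancel the 
escalation later on, which is the step this change makes conditional.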

Now `close_wait_program' calls `close' on the spawn ID associated with 
the pipeline, but this call doesn't affect the pipeline as its input has 
been redirected from `/dev/null'.  As the next step `wait' is called on 
the same spawn ID and returns successfully right away with a result like 
{0 exp8 0 0} in `wres', where no PID is indicated, consistently with the 
null PID result of the original `spawn' command that assigned the spawn 
ID (`exp8' here) to the pipeline.  The return from the `wait' command 
causes the code that kills the pending force-kills to be executed.

At this point the process situation is like below:

  PID TTY      STAT   TIME COMMAND
 6908 pts/3    Sl     0:00 expect -- .../share/dejagnu/runtest.exp --tool gcc --target_board remote-localhost execute.exp=postmod-1.c
 6976 pts/3    S      0:00  \_ ssh -p 2222 -l macro localhost sh -c '.../gcc/testsuite/gcc/postmod-1.exe ; echo XYZ${?}ZYX'
 6977 pts/3    Z      0:00  \_ [cat] <defunct>
 6991 pts/3    Z      0:00  \_ [sh] <defunct>

so `cat' and `sh' have already terminated, the former presumably due to 
the SIGINT sent previously and the latter being the force-kills process 
that has just been killed, and both only await being wait(2)ed for.  
However `ssh' is still live and in interruptible sleep, presumably 
awaiting communication with the remote end.

Since there is nothing else to do for `close_wait_program' it returns 
success to `local_exec', which then calls `close' on the pipeline to 
clean up after it.  But that in turn causes wait(2) to be called on the 
individual PIDs comprising the pipeline, and when it gets to the PID 
associated with `ssh' the call hangs indefinitely, preventing the whole 
testsuite from proceeding.

So the solution to the problem is twofold.  First, pending force-kills 
are not killed after `wait' if there is more than one PID in the list 
passed to `close_wait_program'.  This follows the observation that if 
there is only one PID in the list, then the process must have been 
created directly by `spawn' rather than by assigning a spawn ID to a 
pipeline, and the return from `wait' means the process associated with 
the PID must have already been cleaned up after.  It is only when there 
are more PIDs that any of the processes may have been left behind live.

Second, if a pipeline has been used, then the channel associated with 
the pipeline is set to the nonblocking mode in case any of the processes 
that may have been left live is stuck in the uninterruptible sleep (aka 
D) state.  Such a process would necessarily ignore even SIGKILL so long as 
it remains in that state and would cause wait(2) called by `close' to 
hang possibly indefinitely, and we want the testsuite to proceed rather 
than hang even in bad circumstances.

Finally it appears to be safe to leave pending force-kills to complete 
their job after `wait' has been called in `close_wait_program', because 
based on the observation made here the command does not actually call 
wait(2) if issued on a spawn ID associated with a pipeline created by 
`open' rather than a process created by `spawn'.  Instead the PIDs from 
a pipeline are supposed to be cleaned up after by calling wait(2) from 
the `close' command call made on the pipeline channel.  If on the other 
hand the channel is set to the nonblocking mode before `close', then 
even that command does not call wait(2) on the associated PIDs.
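
The difference between the two modes of `close' can be observed with a 
small timing sketch in plain Tcl.  The procedure name `timed_close' and 
the use of `sleep 2' as a stand-in for a stuck process are assumptions 
made for the example:

```tcl
# Hypothetical helper: open a pipeline whose only process runs for two
# seconds, close the channel in the requested mode, and return how long
# `close' took, in milliseconds.
proc timed_close {blocking} {
    set chan [open "|sleep 2" r]
    fconfigure $chan -blocking $blocking
    set start [clock milliseconds]
    catch {close $chan}
    return [expr {[clock milliseconds] - $start}]
}

# A blocking `close' waits for the process to terminate.
set t_block [timed_close true]
# A nonblocking `close' returns without waiting for it.
set t_nonblock [timed_close false]

puts "blocking close:    $t_block ms"
puts "nonblocking close: $t_nonblock ms"
```

The blocking call should take about as long as the process runs, while 
the nonblocking one should return almost at once and leave the process 
to be cleaned up by Tcl later.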

Therefore the PIDs on the list passed are not subject to PID reuse and 
the force-kills won't accidentally kill an unrelated process, as a PID 
cannot be allocated by the kernel for a new process until any previous 
process's status has been consumed from its PID by wait(2).  And then 
PIDs of any children that have actually terminated one way or another 
are wait(2)ed for by Tcl automatically, so no mess is left behind.

        * lib/remote.exp (close_wait_program): Only kill the pending 
        force-kills if the PID list has a single entry.
        (local_exec): If we didn't see an EOF, then set the channel 
        about to be closed to the nonblocking mode.

Signed-off-by: Maciej W. Rozycki <address@hidden>
---
Hi,

 I have observed the hang in actual testing, where I ran full GCC testing 
(including all language front-ends and all target libraries) remotely for 
the `riscv64-linux-gnu' target with QEMU in the system emulation mode.  I 
left it running over the UK long weekend of May 8th-10th and upon return 
found the testsuite stuck for several days.  The underlying reason was an 
unidentified issue with QEMU or the Linux installation within causing the 
emulator's virtual network interface to go down.  However the testsuite 
ought not to have got stuck indefinitely on the host system.  The process 
state on the host was as shown above, with `ssh' still live and both `cat' 
and `sh' in the zombie state.

 I have since reproduced the issue with a shell script substituting `ssh' 
with a command that reliably hangs, which let me track down the cause and 
diagnose it as above.  This fix then let GCC testing run to completion, 
despite intermittent issues with QEMU remaining and causing occasional 
time-outs.

 Therefore, please apply.  FAOD this has been formatted for `git am' use.

  Maciej
---
 lib/remote.exp |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

dejagnu-remote-close-wait-unblock.diff
Index: dejagnu/lib/remote.exp
===================================================================
--- dejagnu.orig/lib/remote.exp
+++ dejagnu/lib/remote.exp
@@ -109,10 +109,15 @@ proc close_wait_program { program_id pid
 
     # Reap it.
     set res [catch "wait -i $program_id" wres]
-    if {$exec_pid != -1} {
+    if { $exec_pid != -1 && [llength $pid] == 1 } {
        # We reaped the process, so cancel the pending force-kills, as
        # otherwise if the PID is reused for some other unrelated
        # process, we'd kill the wrong process.
+       #
+       # Do this if the PID list only has a single entry however, as
+       # otherwise `wait' will have returned once any single process
+       # of the pipeline has exited regardless of whether any other
+       # ones have remained.
        #
        # Use `catch' in case the force-kills have completed, so as not
        # to cause TCL to choke if `kill' returns a failure.
@@ -243,6 +248,12 @@ proc local_exec { commandline inp outp t
     }
     set r2 [close_wait_program $spawn_id $pid wres]
     if { $id > 0 } {
+       if { ! $got_eof } {
+           # If timed-out, don't wait for all the processes associated
+           # with the pipeline to terminate as a stuck one would cause
+           # us to hang.
+           set r2 [catch "fconfigure $id -blocking false" res]
+       }
        set r2 [catch "close $id" res]
     } else {
        verbose "waitres is $wres" 2
