GNU make jobserver redux


From: Paul D. Smith
Subject: GNU make jobserver redux
Date: Fri, 4 May 2001 01:58:24 -0400

Hi all.

You may (or may not) recall the long conversation we had a couple of
years ago about the implementation of the GNU make jobserver feature.
This feature is implemented using simple UNIX pipes and it seems to be
working well "in the wild".

Except for one thing.  The feature requires that we enable interruptible
system calls in order for it to work, and I'm seeing more and more
reports of people having stat(2), etc. in other parts of make
interrupted by SIGCHLD signals as processes die.  Currently hardly any
of the system calls in GNU make loop on EINTR.

I decided to look into this yesterday.

The first thing I've tried is the very straightforward approach: I
created a macro like this:

 # define EINTRLOOP(_v,_c)   while (((_v)=(_c))==-1 && errno==EINTR)

Then I started replacing invocations of system calls throughout make
with this macro, like this:

    EINTRLOOP (r, stat (name, &st));

    EINTRLOOP (fd, open (file->name, O_RDWR | O_TRUNC, 0666));

etc.  But there is a huge boatload of system calls in make: not just
stat() but read(), write(), open(), fstat(), etc., not to mention basic
stuff like the f*() functions, readdir(), etc.  I needed a second form of
the macro for calls that returned pointers, where the comparison of the
return value is against 0, not -1.
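
To make the two forms concrete, here is a minimal sketch of how they might sit side by side. The name EINTRLOOP_PTR and the two wrapper functions are hypothetical, invented for illustration; they are not taken from the make sources. Note the pointer form is only safe for calls that always set errno on failure (fopen() does); for something like readdir(), where NULL can also mean "end of directory", errno would have to be cleared before each call.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Integer form: retry while the call returns -1 with errno == EINTR. */
#define EINTRLOOP(_v, _c)     while (((_v) = (_c)) == -1 && errno == EINTR)

/* Hypothetical pointer form: retry while the call returns a null pointer
   with errno == EINTR.  Only valid for calls that set errno on failure. */
#define EINTRLOOP_PTR(_v, _c) while (((_v) = (_c)) == 0 && errno == EINTR)

/* Hypothetical wrapper: stat() that is retried when interrupted. */
static int stat_retry(const char *name, struct stat *st)
{
    int r;
    EINTRLOOP(r, stat(name, st));
    return r;
}

/* Hypothetical wrapper: fopen() that is retried when interrupted. */
static FILE *fopen_retry(const char *name, const char *mode)
{
    FILE *fp;
    EINTRLOOP_PTR(fp, fopen(name, mode));
    return fp;
}
```

Wrapping each call like this works, but it shows the problem: every call site (or every wrapper) has to be found and converted by hand.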

This seems like a very big pain, and also like I may well forget some
checks.

So, I'm wondering if we could change make's algorithm to only use
interruptible system calls for the small part where it's required to
avoid jobserver deadlocks.  If we could, then it seems this is a safer
(and less intrusive) way to solve the problem.

I will refresh your memories :).  I'm just going to give the high
points; there are other details not relevant to this (the "free" token,
etc.).

 0) Make starts.  We create (or inherit from our parent make) a
    jobserver token pipe.  We dup the read side, so now we have two FDs
    that can read from the pipe.  We install a SIGCHLD signal handler,
    and enable interruptible system calls.

 1) The SIGCHLD signal handler runs close() on the dup'd FD we created
    in #0.

We do a whole bunch of other stuff :).

 2) We want to start a job.  We need to get a token before we can do so.

 3) We do a blocking read() on the dup'd FD.

 4) If the read() returns with a token, we jump out and start the job.

 5) If the read() returns on an error (it could be either EINTR or
    EBADF), we first check to see if the FD is valid or not (see below);
    if not, we dup it again.

    Then we run reap_children() to handle any children who have died.

 6) We loop back to #2.

The trick is we need the read() to be interruptible on signals,
otherwise we'll never reap any children until we get a token... which
could deadlock us.

But there's a block of time between the last moment reap_children()
checks for dead kids, and when we start the read() again, where we need
to figure out whether more kids died--if we miss them we could deadlock.
We do that by having the signal handler close the FD, so the read will
fail with EBADF.

So, here's the question: can we change the code so that the SIGCHLD
handler is _only_ installed for the time it takes to go through this
loop?  Then, most of make would use restartable system calls, and we
could avoid checking for EINTR all over the place; the parts of make
that could be interrupted are well-defined and could easily be coded
for.

Are there efficiency/portability/other issues with
installing/deinstalling the SIGCHLD handler numerous times, as part of
the above algorithm?

For example, maybe we move the dup and SIGCHLD handler out of where make
is invoked, and put it right near where we get a token:

 2) We want to start a job.  We need to get a token before we can do it.

 3) We install the SIGCHLD handler.

 4) We dup the read FD of the pipe (if necessary--it might still be valid).

 5) We run reap_children() to handle any children who have died.

 6) We do a blocking read on the dup'd FD.

 7) If the read() returns with a token, we jump out and start the job.

 8) If the read() returns on an error (it could be either EINTR or
    EBADF), we loop back to #4.

After we get out of the loop, before we start the job, we reset the
SIGCHLD handler to SIG_DFL, to re-enable restartable system calls.

Thoughts?

-- 
-------------------------------------------------------------------------------
 Paul D. Smith <address@hidden>    HASMAT--HA Software Methods & Tools
 "Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
   These are my opinions---Nortel Networks takes no responsibility for them.


