monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] nvm.interrupts: beginnings of more robust signal ha


From: Nathaniel Smith
Subject: Re: [Monotone-devel] nvm.interrupts: beginnings of more robust signal handling
Date: Fri, 21 Jul 2006 21:21:53 -0700
User-agent: Mutt/1.5.11+cvs20060403

On Fri, Jul 21, 2006 at 10:38:30AM -0700, Zack Weinberg wrote:
> I just pushed to net.venge.monotone.interrupts a big hairy patch which
> is a work-in-progress implementation of (what I consider to be) saner
> handling of Unix signals.  Unfortunately it does a whole bunch of
> things at once, because I kept getting shinything syndrome.
> Highlights:
> 
> * I split up main.cc into Unix and Win32 variants.  I am not a Windows
> expert - win32/main.cc probably needs some care and attention from
> someone who is.  I'm pretty sure the MSVC #ifdefs are correct, but the
> Borland #ifdefs perhaps not so much, and Mingw is totally broken.

Hmm, I don't think anyone has ever compiled monotone using Borland
anyway, but Mingw has always been our supported win32 build platform.

> * The global sanity and UI objects' constructors no longer do any
> work.  This is because global constructors execute outside the
> protection of the signal handlers set up in main.cc.  Instead,
> monotone.cc::main has wrapper RAII-only objects that do the work.
> Similarly, what monotone.cc was doing with an atexit function is now
> done with RAII and a call from user_interface::fatal to
> sanity::dump_buffer (this last might not be ideal).

Getting rid of global constructors is a good thing; they always have
hard-to-find and hard-to-test-for ordering bugs.

> * informative_failure is now a proper std::exception.

Okay.

> * The debugging command 'mtn crash' can now crash in many more ways.
> The test suite uses this to put a lot more of the crash-handling code
> through its paces.

Cool.

> * The giant list of catch clauses is now in monotone.cc::cpp_main
> (which wants more refactoring, but this patch is already too big).

Hmm, I kind of like how the current design keeps all the crazy
signal/exception/etc. handling (that never changes) out in one file,
and then all the actual monotone-related logic is in a different file.

> * And the big one: "User interrupt" signals (SIGTERM, SIGINT, SIGHUP,
> SIGPIPE) no longer kill the process immediately.  They set a flag,
> which will be checked for by the Q() function at appropriate points.
> I have not put Q() anywhere yet.  This causes a genuine failure
> (database_is_closed_on_signal_exit) and a spurious success
> (netsync_is_not_interrupted_on_SIGPIPE).  [Incidentally, part of why
> 'mtn log | less' doesn't give you the prompt back immediately when you
> quit the pager, is that SIGPIPE was being ignored for *everything*,
> not just netsync.]

Oops, that SIGPIPE thing is a bug that I think we've regressed on a
few times; need to fix it and figure out how to properly test for it.

> I have to figure out where Q() calls should go.  Nathaniel's big
> concern with this approach was that we would miss a place and have an
> un-^C-able loop - the patch does contain some emergency logic to
> address that problem, in that if the signal handler is called twice in
> two separate seconds, we give up and die immediately.  Also, ^\
> (SIGQUIT) dies immediately, period.  (But, in neither of those cases
> will the database be cleaned up.)

To refine that slightly, my concern isn't so much that we'd have an
un-^C-able loop -- that'd just be a simple bug, trivially fixed.  My
concern is more that we might have an indefinite number of those,
being added over an indefinite period of time, and no way to detect
when that had happened.  (In the general case, detecting such bugs is
not computable; in practice, it's probably solvable in lots of real
cases and the real-time folks probably know methods, but we have no
tools to actually do such checking, so that doesn't help.)

With this approach, does a conscientious programmer have to add
another thing that they are constantly considering as they work,
and distracting them from the problem at hand?

Is there any way to avoid finding and fixing one of these bugs every n
months (with who knows how many people suffering from the bug in the
mean time without reporting it)?  (Note that in the general case
detecting such bugs is not computable, and while in practice the real
time folks probably know some good tricks, we don't have any tools to
do such analysis anyway.)

Are we confident that we have thought of all the funny corner cases?
(Here's another one: we have libraries that do blocking operations on
our behalf, and therefore need to call Q() before they retry such
operations -- how do we teach them to do this?)

Is the time spent thinking of answers to these questions worth the
gain we get?  AFAIK, the only reason we have these signal handlers is
to try to clean up unsightly -journal files when possible; they don't
actually provide any other changes in user-visible behavior?  I
actually think it'd be quite nice for ordinary users to effectively
never encounter -journal files, but I'm not sure it's worth these
costs :-).

-- Nathaniel

-- 
In mathematics, it's not enough to read the words
you have to hear the music




reply via email to

[Prev in Thread] Current Thread [Next in Thread]