
[MIT-Scheme-devel] primitives, interrupts, allocation, and robustness


From: Taylor R Campbell
Subject: [MIT-Scheme-devel] primitives, interrupts, allocation, and robustness
Date: Thu, 16 Jun 2011 05:39:58 +0000
User-agent: IMAIL/1.21; Edwin/3.116; MIT-Scheme/9.1

I think we have two pretty deep problems in the way microcode
primitives work.  Both fortunately and unfortunately, they manifest
very seldom in practice: fortunately because, well, most of the time
everything's OK; unfortunately because it makes them all the more
mysterious to diagnose and difficult to debug.

Problem 1 is that if a primitive is interrupted in a system call, it
may either wedge Scheme or quietly have bad effects such as leaking
memory, repeating side effects, &c.

Problem 2 is that primitives that allocate on the heap may send Scheme
into an infinite loop because the bookkeeping of how much space is
needed is wrong.  (And they might also leak memory, repeat side
effects, &c.)

I would like to fix these problems, but it's not straightforward.
Every fix I have come up with entails non-trivial changes throughout
the microcode.  Please let me know if you have any thoughts on the
subject.  Apologies for the long message.


* Problem 1: Interrupted primitives

In order to be responsive to ^G and to let threads make progress, the
system relies on handling interrupts in Scheme soon after the
operating system delivers them -- e.g., when you type ^G at the tty to
deliver SIGINT, or when the thread timer goes off and the OS delivers
SIGALRM.  If a signal arrives during a system call, the system call
will return EINTR.

Until 2009, most primitives making system calls would simply restart
the system call on EINTR.  This is wrong because it makes Scheme
unresponsive and doesn't let other threads run.  For example, if you
tried to open a fifo, or to work with a remote file system when the
network was down, your whole Scheme would freeze until someone opened
the other end of the fifo, or until the network stack timed out.
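
For concreteness, the pre-2009 shape of such a wrapper looked roughly
like this (a minimal sketch; `do_read' is an illustrative name, not the
actual microcode function):

    #include <errno.h>
    #include <unistd.h>

    /* Pre-2009 pattern: blindly restart the system call on EINTR.  While
       this loop spins (e.g. reading a fifo with no writer), no
       Scheme-level interrupt handler ever runs, so ^G and the thread
       timer are ignored. */
    static ssize_t
    do_read (int fd, void * buffer, size_t n_bytes)
    {
      ssize_t result;
      do
        result = (read (fd, buffer, n_bytes));
      while ((result < 0) && (errno == EINTR));
      return (result);
    }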

In 2009 (commit 6cf614a), I changed most primitives so that if a
system call fails with EINTR, and there is a pending interrupt to
handle in Scheme, Scheme will back out of the primitive, handle the
interrupt, and then restart the primitive.  (If there is no pending
interrupt, the primitive will just restart the system call as before.)
But this is wrong too because it is not safe to back out and restart
many primitives at this point -- they have already had permanent side
effects that cannot be rolled back.
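
A sketch of the post-2009 shape, with `pending_interrupts_p' and
`back_out_and_restart_primitive' standing in for the real microcode
machinery (they are illustrative names, not the actual entry points):

    #include <errno.h>
    #include <unistd.h>

    extern int pending_interrupts_p (void);
    extern void back_out_and_restart_primitive (void);  /* does not return */

    static ssize_t
    do_read (int fd, void * buffer, size_t n_bytes)
    {
      while (1)
        {
          ssize_t result = (read (fd, buffer, n_bytes));
          if ((result >= 0) || (errno != EINTR))
            return (result);
          if (pending_interrupts_p ())
            /* Unwind to Scheme, handle the interrupt, then re-apply the
               primitive from the top -- which is exactly what is unsafe
               once the primitive has had permanent side effects. */
            back_out_and_restart_primitive ();
          /* Otherwise, just retry the system call as before. */
        }
    }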


I have two ideas for policies to solve this problem.

(a) If a primitive performs a permanent side effect, it may not use
any system call afterward that could ever conceivably block.  This
means eliminating (or at least deprecating) complex primitives such as
FILE-TOUCH and replacing them by Scheme code.

(b) PCLSR: After each permanent side effect, a primitive must mark its
activation record so that if restarted after backing out, it will pick
up where it left off.  This might entail using bind_interpreter_state
after each effect and unbind_interpreter_state before the primitive
returns, or it might entail replacing the primitive apply frame on the
stack.  This may require essentially rewriting a number of primitives
in continuation-passing style.
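
A sketch of what (b) might look like for a two-step primitive, assuming
some per-activation resume marker (the `current_step' accessor here is
invented for illustration; the real mechanism would be
bind_interpreter_state or a replacement apply frame):

    #include <stdio.h>
    #include <sys/stat.h>

    extern int * current_step (void);  /* hypothetical per-activation marker */

    static int
    rename_then_chmod (const char * from, const char * to, mode_t mode)
    {
      int * step = (current_step ());
      if ((*step) < 1)
        {
          if ((rename (from, to)) < 0)  /* permanent side effect... */
            return (-1);
          (*step) = 1;                  /* ...record it before anything blocks */
        }
      /* If chmod blocks (say, on a wedged remote file system), is
         interrupted, and the primitive is restarted, step == 1 makes the
         restart skip the rename instead of repeating it. */
      return (chmod (to, mode));
    }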


* Problem 2: Heap allocation bookkeeping in primitives

Whenever a primitive wants to allocate n words on the heap, it checks
whether n words are available.  If not, it requests n words from the
garbage collector and backs out to let the garbage collector run.  If
the garbage collector frees n words, Scheme restarts the primitive;
otherwise, Scheme aborts to the nearest REPL and tells you that it's
out of memory.
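
In microcode terms the protocol is roughly the following sketch, where
Free, heap_alloc_limit, and Primitive_GC stand for the heap pointer,
allocation limit, and back-out request; the exact names and types are
illustrative:

    typedef long SCHEME_OBJECT;                  /* stand-in object type */
    extern SCHEME_OBJECT * Free;                 /* next free heap word */
    extern SCHEME_OBJECT * heap_alloc_limit;     /* soft end of the heap */
    extern void Primitive_GC (unsigned long n_words);  /* backs out; no return */

    static SCHEME_OBJECT *
    allocate_words (unsigned long n)
    {
      if ((Free + n) > heap_alloc_limit)
        Primitive_GC (n);      /* request n words; primitive restarts later */
      {
        SCHEME_OBJECT * result = Free;
        Free += n;
        return (result);
      }
    }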

I believe most primitives in the system back out for GC only when it
is actually safe to do so -- the only exceptions I know of arise from
the use of primitives as an FFI, such as prpgsql.c, which should be
rewritten anyway.  However, if a primitive allocates n words, and then
m words, in separate steps, the following may happen:

1. Suppose n + (m - k) words are free, where 0 < k <= min(n, m).
2. The primitive allocates n words, so now m - k are free.
3. The primitive tries to allocate m words, but only m - k are free.
4. The primitive asks the garbage collector to make m words free.
5. The garbage collector frees the n words the primitive allocated.
6. Since n + (m - k) >= m words are now free, the primitive restarts.
7. Return to step 1: Scheme is now wedged.
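
To make the loop concrete, take n = m = 10 and k = 5, so 15 words start
out free: the primitive takes 10, leaving 5; it then needs 10 more, so
it asks the GC for 10; the GC reclaims the 10 words the backed-out
primitive had just allocated, leaving 15 free; 15 >= 10, so the
primitive restarts and repeats the same cycle forever, even though the
heap never had room for the full 20 words.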


I have three ideas for policies to solve this problem.

(a) Every primitive must count up front how many words it plans to
allocate.  Scheme48 does this, and has a little abstraction for the
accounting (ENSURE-SPACE and allocation keys) that can either verify
the accounting at run-time or be compiled away to zero overhead.  This
requires going through every primitive that allocates, and calculating
an upper bound on how much it may allocate.
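
For the two-step primitive of the scenario above, (a) amounts to
something like the following sketch (same illustrative names as the
earlier one); the point is that the bound n + m is checked before any
of the words are taken:

    typedef long SCHEME_OBJECT;
    extern SCHEME_OBJECT * Free;
    extern SCHEME_OBJECT * heap_alloc_limit;
    extern void Primitive_GC (unsigned long n_words);  /* backs out; no return */

    static SCHEME_OBJECT *
    two_step_primitive (unsigned long n, unsigned long m)
    {
      /* One up-front check covering the primitive's total allocation,
         made while nothing is yet allocated, so steps 2-6 of the
         scenario cannot arise. */
      if ((Free + (n + m)) > heap_alloc_limit)
        Primitive_GC (n + m);
      {
        SCHEME_OBJECT * first = Free;            /* the n-word object */
        Free += n;
        SCHEME_OBJECT * second = Free;           /* the m-word object */
        Free += m;
        /* ... initialize the two objects ... */
        (void) second;
        return (first);
      }
    }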

(b) For every primitive invocation, the microcode must keep a record
of the number of words the primitive has allocated.  If there is too
little space free for some allocation, then the garbage collector must
free the total number of words requested so far by the primitive, not
just the number requested in that allocation, before
restarting the primitive.  This means a possibly significant run-time
overhead for the bookkeeping, but may be doable without changing many
primitives, by remembering Free on entry to each primitive in order to
find how much the primitive has advanced Free so far.
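
Concretely, (b) would change the failing-allocation path to something
like this sketch, where `free_on_primitive_entry' is an invented name
for the pointer the dispatcher would save when the primitive is
applied:

    typedef long SCHEME_OBJECT;
    extern SCHEME_OBJECT * Free;
    extern SCHEME_OBJECT * heap_alloc_limit;
    extern SCHEME_OBJECT * free_on_primitive_entry;  /* saved at entry */
    extern void Primitive_GC (unsigned long n_words);  /* backs out; no return */

    static SCHEME_OBJECT *
    allocate_words (unsigned long n)
    {
      if ((Free + n) > heap_alloc_limit)
        /* Request everything this primitive will have allocated by the
           time it gets back to this point after a restart, not just
           this step's n words. */
        Primitive_GC ((Free - free_on_primitive_entry) + n);
      {
        SCHEME_OBJECT * result = Free;
        Free += n;
        return (result);
      }
    }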

(c) PCLSR: After each allocation, a primitive must mark its activation
record so that the garbage collector won't free what it has just
allocated, and if restarted after backing out, it will pick up where
it left off.  The linker in cmpint.c does this, and it's very hairy.


P.S.  Compiled code also has a problem with allocation...  Every entry
to a procedure checks whether Free < heap_alloc_limit, and branches to
an interrupt if not.  What it needs to check is whether Free + n <
heap_alloc_limit, where n is the number of words it allocates in-line.
The stack checks have the same problem.  Usually this is all OK
because most procedures won't allocate more space than there is in the
buffer between heap_alloc_limit and heap_end, but `usually' is no
substitute for a guarantee.
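
Rendered in C for clarity (the real check is emitted as machine code at
each procedure entry; `compiled_code_interrupt' is a stand-in for the
interrupt/GC trap):

    typedef long SCHEME_OBJECT;
    extern SCHEME_OBJECT * Free;
    extern SCHEME_OBJECT * heap_alloc_limit;
    extern void compiled_code_interrupt (void);  /* stand-in for the trap */

    static void
    procedure_entry_check (unsigned long n_inline_words)
    {
      /* What the generated code effectively tests today:
           if (Free >= heap_alloc_limit) compiled_code_interrupt ();
         What the argument above calls for -- also counting the n words
         the procedure allocates in-line: */
      if ((Free + n_inline_words) >= heap_alloc_limit)
        compiled_code_interrupt ();
      /* Only now is the in-line allocation known not to run past the
         slack between heap_alloc_limit and heap_end. */
    }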


