Here is the scheduler assert patch, test cases, and integration with the test suite combined into a single patch!
After a bit more testing with code that used a lot of threads, I ended up looking at how make-thread uses ##sys#make-thread and added some missing code to the patch's usage of ##sys#make-thread.
On Jul 9, 2015, at 4:29 PM, Jonathan Chan <address@hidden> wrote:
... and here is hopefully the final version of the patch for review. In
the end I think hijacking a thread to run the signal handler is too much
prone to error. Although the previous patches try to cope with this
behavior, as soon as the signal handler tries to do something like lock
a mutex via srfi-18 or yield, very strange things will happen depending
on what thread was hijacked, which threads were blocking, etc. Blocking
##sys#schedule from running is a partial fix, but doesn't help when it
would have been expected for ##sys#schedule not to return immediately,
e.g. if a signal handler tries to block on a socket. For example, seems
to make the the tcp unit go into a constant loop while trying to recv.
In this newest patch, the ##sys#interrupt-handler in scheduler.scm will
instead save the old handler in ##sys#signal-vector, change the handler
to one that makes a thread with ##sys#make-thread to call the old
handler and then schedules that thread to run, wrap state so that the
original ##sys#interrupt-handler will reset the handler and then finally
call the old hook so everything starts going.
In addition to the new patch, I have attached a revised version of the
old test case that should be more likely to trigger the assert failure
in unpatched versions of the scheduler as well as a new test case that
1) shows problems with the previous patches 2) ensures that the signal
handler can properly call blocking code and 3) possibly demonstrates a
bug with multiple threads blocking on one fd that I am not certain
about.
Instructions for scheduler-fdset.scm:
1. Run with csi or csc.
2. Send SIGINT (e.g. with Ctrl-C)
3. The test should write "locking... unlocking... unlocked" and then
exit without error.
Instructions for scheduler-fdset-2.scm:
You will need to compile sink.c with gcc sink.c -o sink
1. Run the test
2. Wait a short bit for the child processes to start
3. Start netcat in a new terminal with `nc -l 8080`
4. Send SIGINT (e.g. with ^C)
5. Verify that the process is not using 100% CPU
6. Type `hi` (or anything) in the terminal with netcat and press Return
7. The test should exit without error
Sorry for the many versions and emails! This at least lets me get back
to the work that needed the scheduler in the first place.
I hope that this finally kills the bug...
All the best,
--
Jonathan Chan
address@hidden
On Thu, Jul 9, 2015, at 03:18 AM, Jonathan Chan wrote:
I realized that if I am able to set a flag about whether or not we are
in the interrupt hook, I might as well save the previous block object of
the thread and then restore it instead of possibly causing spurious
wakeups. As an aside, it seems as if force-primordial doesn't currently
do this, and relies on the code that asked the scheduler to block on i/o
to retry on its own. As I stated in the previous email, this seems to be
what the code that uses the scheduler's blocking facilities does anyway.
In any case, this patch also passes my tests (as the previous one did)
but now should more cleanly avoid "spurious wakeups" of threads blocking
on I/O.
Also, I realized that the test case lacked instructions; you should just
be able to start it, wait a second, then send SIGINT (e.g. with Ctrl-C)
and the unpatched scheduler should cause the program to exit with an
assert error while with the patched scheduler you should just see a
nice:
address@hidden tests]$ csi -s scheduler-fdset.scm
^Clocking...
unlocking...
unlocked...
done!
Best regards,
--
Jonathan Chan
address@hidden
On Wed, Jul 8, 2015, at 11:34 PM, Jonathan Chan wrote:
Hi again,
It turns out that changing the current thread's state to 'running
doesn't actually work in all cases because a) unblocking from condition
variables will skip a thread if its state is running b) the call to
change the state back was never called because the call to the old hook
would never return. I found this by doing some more testing with the
program that caused me to run into the original assert bug.
Here is a revised patch which instead sets a variable to indicate
whether or not we are in the interrupt hook and skips the scheduler if
so. I've also included an updated test case that catches the problem
with the previous patch.
Best regards,
--
Jonathan Chan
address@hidden
On Wed, Jul 8, 2015, at 08:12 PM, Jonathan Chan wrote:
Hi all,
I ended up having some free time before the weekend and went back to the
scheduler assert problem. I wrote a test case for the bug and revised
the scheduler assert patch after some further testing.
In the previous patch, the thread in which the interrupt handler ran
would just be removed from ##sys#fd-list if the block object was changed
from a file. This didn't seem to handle some cases where the interrupt
handler ended up unblocking. In one of these cases, when mutex-lock! and
then mutex-unlock! were called in succession, it also turned out that
the "current thread"'s state needed to be saved, set to 'running during
the interrupt handler, then restored in order to avoid mutex-unlock!
from calling the scheduler and causing the program to hang/do strange
things.
I've attached the the test case for the bug and the revised patch.
Again, if someone experienced with the scheduler could review it and
make sure it doesn't cause strange things to happen, it would be very
much appreciated.
Best regards,
--
Jonathan Chan
address@hidden
On Sat, Jul 4, 2015, at 07:49 PM, Jonathan Chan wrote:
Hello all,
I was in the IRC channel earlier trying to find out why my program was
failing due to a scheduler assertion about the fdset for poll. This
would happen consistently when I sent SIGINT.
The error looks like:
scheduler.c:51: C_fd_ready: Assertion `fd == C_fdset_set[pos].fd' failed
After looking at the scheduler code, it looks like this happened because
the SIGINT was causing the interrupt handler to be called, which ended
up unblocking some thread to run a signal handler, which caused a
condition variable broadcast. In the end, this made a thread that was
previously blocking on a file descriptor to now block on a mutex.
When create-fdset procedure iterated over the ##sys#fd-list, it skipped
threads in the list who were not blocking on fds while it added those
who were to the fdset using fdset-set. However, it did not remove these
threads from ##sys#fd-list, which resulted in the scheduler throwing an
error because it expected a fd to be in the fdset when it was not there.
I have attached a patch which modifies create-fdset to remove threads
from ##sys#fd-list who are not blocking on an fd, rather than only skip
them. This seems to fix my problem and has not introduced any bizarre
behvaior in the few tests I have run, but hopefully someone who is more
familiar with the scheduler could look it over and see if it fixes the
issue.
Best regards,
--
Jonathan Chan
address@hidden
_______________________________________________
Chicken-hackers mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-hackers
Email had 1 attachment:
+ scheduler-assert.diff
2k (text/plain)
_______________________________________________
Chicken-hackers mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-hackers
Email had 2 attachments:
+ scheduler-assert-fix.diff
5k (application/octet-stream)
+ scheduler-fdset.scm
2k (application/octet-stream)
_______________________________________________
Chicken-hackers mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-hackers
Email had 3 attachments:
+ scheduler-fdset.scm
2k (application/octet-stream)
+ sink.c
1k (text/plain)
+ scheduler-assert-fix.diff
14k (application/octet-stream)
_______________________________________________
Chicken-hackers mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-hackers
Email had 1 attachment:
+ scheduler-assert-fix.diff
14k (application/octet-stream)
<scheduler-assert-new.diff><scheduler-fdset-2.scm><scheduler-fdset.scm><sink.c>_______________________________________________
Chicken-hackers mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-hackers