[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
From: |
Carlo Zancanaro |
Subject: |
[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services |
Date: |
Mon, 05 Mar 2018 09:35:58 +1100 |
User-agent: |
mu4e 1.0; emacs 25.3.1 |
On Sun, Mar 04 2018, Ludovic Courtès wrote:
Good catch. We could add this in gnu-build-system.scm in
core-updates, though it’s no big deal anyway since these are
throw-away environments.
Thoughts?
The current forking-service.sh test fails in that environment, so
we won't be able to build shepherd on Hurd, or systems with Linux
pre 3.4. This is already the case without my third commit, though,
because the prctl fallback logic isn't in place yet.
I think we should add it in core-updates. It does affect the
behaviour of processes within the build environment, and can lead
to test failures if people rely on pid 1 to reap zombie processes
(which, from what I understand, they should be able to). This
could even be leading to test failures in other packages which we
have just disabled.
+ (match (select (list sock) (list) (list) 0.5)
+ (((sock) _ _)
+ (read-from sock))
+ (_
+ #f))
+ (poll-services)
Here everyone ends up paying some overhead (the 0.5 second
timeout),
which isn’t great.
How about something like:
(define poll-services
(and (not (= 1 (getpid)))
…))
(match (select (list sock) '() '() (if poll-services 0.5 0))
…)
The wait for 0.5 seconds is only an upper-bound for the timeout.
Changing it to a 0 would actually be worse, because it would spend
longer polling for running services. The `select` procedure waits
for `sock` to be ready to read from. When it's ready it returns
immediately, but if `sock` takes more than 0.5 seconds to be ready
then it will return anyway (and take the second branch in the
match, which does nothing).
This should incur no (or extremely minuscule) overhead in how long
it takes to respond to a socket, but provides an opportunity every
half a second (at most) for shepherd to poll the running services.
On reflection, we should also change the commit message for this
commit. I have attached a patch with a more accurate commit
message.
Carlo
From 5b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <address@hidden>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.
* modules/shepherd.scm (open-server-socket): Set socket to be
non-blocking.
(main): Use select with a timeout. If prctl failed when shepherd started
then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
signal handler.
(respawn-service): Separate logic for respawning services from handling
SIGCHLD.
(handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
being a process id.
---
doc/shepherd.texi | 4 ++-
modules/shepherd.scm | 47 ++++++++++++++++++-------
modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
tests/basic.sh | 4 +--
tests/status-sexp.sh | 4 +--
5 files changed, 95 insertions(+), 46 deletions(-)
diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way. The default value is @code{#f}, which indicates that
the service
is not running. When an attempt is made to start the service, it will
be set to the return value of the procedure in the @code{start} slot.
It will also be passed as an argument to the procedure in the
address@hidden slot. This slot can not be initialized with a keyword.
address@hidden slot. If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits. This slot can not be initialized with a keyword.
@item
@vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
(with-fluids ((%default-port-encoding "UTF-8"))
(let ((sock (socket PF_UNIX SOCK_STREAM 0))
(address (make-socket-address AF_UNIX file-name)))
+ (fcntl sock F_SETFL (logior O_NONBLOCK
+ (fcntl sock F_GETFL)))
(bind sock address)
(listen sock 10)
sock)))
@@ -49,14 +51,28 @@
;; Main program.
(define (main . args)
- (initialize-cli)
+ (define poll-services
+ (if (= 1 (getpid))
+ (lambda () #f) ;; If we're pid 1 then we don't need to set
+ ;; PR_SET_CHILD_SUBREAPER
+ (catch 'system-error
+ (lambda ()
+ ;; Register for orphaned processes to be reparented onto us when
+ ;; their original parent dies. This lets us handle SIGCHLD from
+ ;; daemon processes that would otherwise have been reparented
+ ;; under pid 1. This is unnecessary when we are pid 1.
+ (prctl PR_SET_CHILD_SUBREAPER 1)
+ (lambda () #f))
+ (lambda args
+ ;; We fall back to polling for services on systems that don't
+ ;; support prctl/PR_SET_CHILD_SUBREAPER
+ (let ((errno (system-error-errno args)))
+ (if (or (= ENOSYS errno) ;; prctl not available
+ (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+ check-for-dead-services ;; poll
+ (apply throw args)))))))
- (when (not (= 1 (getpid)))
- ;; Register for orphaned processes to be reparented onto us when their
- ;; original parent dies. This lets us handle SIGCHLD from daemon processes
- ;; that would otherwise have been reparented under pid 1. This is
- ;; unnecessary when we are pid 1.
- (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+ (initialize-cli)
(let ((config-file #f)
(socket-file default-socket-file)
@@ -225,11 +241,18 @@
(_ #t))
(let next-command ()
- (match (accept sock)
- ((command-source . client-address)
- (setvbuf command-source _IOFBF 1024)
- (process-connection command-source))
- (_ #f))
+ (define (read-from sock)
+ (match (accept sock)
+ ((command-source . client-address)
+ (setvbuf command-source _IOFBF 1024)
+ (process-connection command-source))
+ (_ #f)))
+ (match (select (list sock) (list) (list) 0.5)
+ (((sock) _ _)
+ (read-from sock))
+ (_
+ #f))
+ (poll-services)
(next-command))))))
(define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
;; Copyright (C) 2002, 2003 Wolfgang Järling <address@hidden>
;; Copyright (C) 2014 Alex Sassmannshausen <address@hidden>
;; Copyright (C) 2016 Alex Kost <address@hidden>
+;; Copyright (C) 2018 Carlo Zancanaro <address@hidden>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -64,6 +65,7 @@
for-each-service
lookup-services
respawn-service
+ handle-SIGCHLD
register-services
provided-by
required-by
@@ -77,6 +79,7 @@
make-system-destructor
make-init.d-service
+ check-for-dead-services
root-service
make-actions
@@ -800,7 +803,7 @@ false."
its PID."
;; Install the SIGCHLD handler if this is the first fork+exec-command call
(unless %sigchld-handler-installed?
- (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+ (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
(let ((pid (primitive-fork)))
(if (zero? pid)
@@ -991,7 +994,7 @@ child left."
what (strerror errno))
'(0 . #f)))))))
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
"Handle SIGCHLD, possibly by respawning the service that just died, or
otherwise by updating its state."
(let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
;; SERV can be #f for instance when this code runs just after a
;; service's 'stop' method killed its process and completed.
(when serv
- (slot-set! serv 'running #f)
- (if (and (respawn? serv)
- (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
- (car respawn-limit)
- (cdr respawn-limit))))
- (if (not (slot-ref serv 'waiting-for-termination?))
- (begin
- ;; Everything is okay, start it.
- (local-output "Respawning ~a."
- (canonical-name serv))
- (slot-set! serv 'last-respawns
- (cons (current-time)
- (slot-ref serv 'last-respawns)))
- (start serv))
- ;; We have just been waiting for the
- ;; termination. The `running' slot has already
- ;; been set to `#f' by `stop'.
- (begin
- (local-output "Service ~a terminated."
- (canonical-name serv))
- (slot-set! serv 'waiting-for-termination? #f)))
- (begin
- (local-output "Service ~a has been disabled."
- (canonical-name serv))
- (when (respawn? serv)
- (local-output " (Respawning too fast.)"))
- (slot-set! serv 'enabled? #f))))
+ (respawn-service serv))
;; As noted in libc's manual (info "(libc) Process Completion"),
;; loop so we don't miss any terminated child process.
(loop))))))
+(define (respawn-service serv)
+ "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+ (slot-set! serv 'running #f)
+ (if (and (respawn? serv)
+ (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+ (car respawn-limit)
+ (cdr respawn-limit))))
+ (if (not (slot-ref serv 'waiting-for-termination?))
+ (begin
+ ;; Everything is okay, start it.
+ (local-output "Respawning ~a."
+ (canonical-name serv))
+ (slot-set! serv 'last-respawns
+ (cons (current-time)
+ (slot-ref serv 'last-respawns)))
+ (start serv))
+ ;; We have just been waiting for the
+ ;; termination. The `running' slot has already
+ ;; been set to `#f' by `stop'.
+ (begin
+ (local-output "Service ~a terminated."
+ (canonical-name serv))
+ (slot-set! serv 'waiting-for-termination? #f)))
+ (begin
+ (local-output "Service ~a has been disabled."
+ (canonical-name serv))
+ (when (respawn? serv)
+ (local-output " (Respawning too fast.)"))
+ (slot-set! serv 'enabled? #f))))
+
;; Add NEW-SERVICES to the list of known services.
(define (register-services . new-services)
(define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
(lambda (p)
(format p "~{~a ~}~%" running-services))))))
+(define (check-for-dead-services)
+ "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+ (define (process-exists? pid)
+ (catch #t
+ (lambda () (kill pid 0) #t)
+ (lambda _ #f)))
+ (for-each-service (lambda (service)
+ (let ((running (slot-ref service 'running)))
+ (when (and (integer? running)
+ (not (process-exists? running)))
+ (local-output "PID ~a (~a) is dead!" running
(canonical-name service))
+ (respawn-service service))))))
+
(define root-service
(make <service>
#:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
(register-services
(make <service>
#:provides '(test-loaded)
- #:start (const 42)
+ #:start (const 'abc)
#:stop (const #f)))
EOF
@@ -166,7 +166,7 @@ $herd status test-loaded
$herd status test-loaded | grep stopped
$herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
$herd stop test-loaded
$herd unload root test-loaded
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
(register-services
(make <service>
#:provides '(foo)
- #:start (const 42)
+ #:start (const 'abc)
#:stop (const #f)
#:docstring "Foo!"
#:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
(service (version 0)
(provides (foo)) (requires ())
(respawn? #t) (docstring \"Foo!\")
- (enabled? #t) (running 42) (conflicts ())
+ (enabled? #t) (running abc) (conflicts ())
(last-respawns ()))
(service (version 0)
(provides (bar)) (requires (foo))
--
2.16.1
signature.asc
Description: PGP signature
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Carlo Zancanaro, 2018/03/01
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/02
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Carlo Zancanaro, 2018/03/02
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/02
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Carlo Zancanaro, 2018/03/03
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/03
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Carlo Zancanaro, 2018/03/03
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/04
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services,
Carlo Zancanaro <=
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/04
- [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services, Carlo Zancanaro, 2018/03/04
- bug#30637: [WIP] shepherd: Poll every 0.5s to find dead forked services, Ludovic Courtès, 2018/03/05