Re: [Help-bash] parallel processing in bash (to replace for loop)

From: Greg Wooledge
Subject: Re: [Help-bash] parallel processing in bash (to replace for loop)
Date: Mon, 30 Jan 2012 08:42:29 -0500
User-agent: Mutt/


> I assume that this is your wiki. I can not open it as it is too slow.
> Would you please post the solution here?

Someone posted a link to one of the pages to a very popular web site over
the weekend, so access to the whole web server was very slow yesterday,
and probably today as well.  Patience is a virtue.


   This is still a work in progress. Expect some rough edges.
    1. [21]The basics
    2. [22]Simple questions
         1. [23]How do I run a job in the background?
         2. [24]My script runs a job in the background. How do I get its
            PID?
         3. [25]OK, I have its PID. How do I check that it's still
            running?
         4. [26]I want to do something with a process I started earlier
         5. [27]How do I kill a process by name? I need to get the PID
            out of ps aux | grep ....
         6. [28]But I'm on some old legacy Unix system that doesn't have
            pgrep! What do I do?
         7. [29]I want to run something in the background and then log
            out.
         8. [30]I'm trying to kill -9 my job but blah blah blah...
         9. [31]Make SURE you have run and understood these commands:
    3. [32]Advanced questions
         1. [33]I want to run two jobs in the background, and then wait
            until they both finish.
         2. [34]How can I check to see if my game server is still
            running? I'll put a script in crontab, and if it's not
            running, I'll restart it...
         3. [35]How do I make sure only one copy of my script can run at
            a time?
         4. [36]I want to process a bunch of files in parallel, and when
            one finishes, I want to start the next. And I want to make
            sure there are exactly 5 jobs running at a time.
         5. [37]My script runs a pipeline. When the script is killed, I
            want the pipeline to die too.
    4. [38]How to work with processes
         1. [39]PIDs and parents
         2. [40]The risk of letting the parent die
         3. [41]The risk of parsing the process tree
         4. [42]Doing it right
              1. [43]Starting a process and remembering its PID
              2. [44]Checking up on your process or terminating it
              3. [45]Starting a "daemon" and checking whether it started
    5. [46]On processes, environments and inheritance
The basics

   A process is a running instance of a program in memory. Every process
   is identified by a number, called the PID, or Process IDentifier. Each
   process has its own privately allocated segment of memory, which is
   not accessible from any other process. This is where it stores its
   variables and other data.
   The kernel keeps track of all these processes, and stores a little bit
   of basic metadata about them in a process table. However, for the most
   part, each process is autonomous, within the privileges allowed to it
   by the kernel. Once a process has been started, it is difficult to do
   anything to it other than suspend (pause) it, or terminate it.
   The metadata stored by the kernel includes a process "name" and
   "command line". These are not reliable; the "name" of a process is
   whatever you said it is when you ran it, and may have no relationship
   whatsoever to the program's file name. (On some systems, running
   processes can also change their own names. For example, sendmail uses
   this to show its status.) Therefore, when working with a process, you
   must know its PID in order to be able to do anything with it. Looking
   for processes by name is extremely fallible.
Simple questions

How do I run a job in the background?

command &

   By the way, '&' is a command separator in bash and other Bourne
   shells. It can be used any place ';' can (but not in addition to ';'
   -- you have to choose one or the other). Thus, you can write this:
command one & command two & command three &

   which runs all three in the background simultaneously, and is
   equivalent to:
command one &
command two &
command three &

for i in 1 2 3; do some_command $i & done

   While both & and ; can be used to separate commands, & runs them in
   the background and ; runs them in sequence.
My script runs a job in the background. How do I get its PID?

   The $! special parameter holds the PID of the most recently executed
   background job. You can use that later on in your script to keep track
   of the job, terminate it, record it in a PID file (shudder), or
   whatever else you need to do with it:
myjob & mypid=$!
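As a sketch of the pattern (sleep stands in for a real job here):

```shell
# Capture $! immediately -- it is overwritten by the next background job.
sleep 2 & mypid=$!
echo "started background job $mypid"

# Later on, wait for that specific job and collect its exit status.
wait "$mypid"
echo "job $mypid finished with status $?"
```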

OK, I have its PID. How do I check that it's still running?

   kill -0 $PID will check to see whether a signal is deliverable (i.e.,
   the process still exists). If you need to check on a single child
   process asynchronously, that's the most portable solution. You might
   also be able to use the wait shell command to block until the child
   (or children) terminate -- it depends on what your program has to do.
   There is no shell scripting equivalent to the select(2) or poll(2)
   system calls. If you need to manage a complex suite of child processes
   and events, don't try to do it in a shell script. (That said, there
   are a few tricks in the [47]advanced section of this page.)
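A minimal sketch of the kill -0 check (a sleep stands in for the real child):

```shell
sleep 5 & pid=$!

# kill -0 sends no signal; it only tests whether we could send one.
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is still alive"
else
    echo "process $pid is gone"
fi

kill "$pid"    # clean up the demo job
```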
I want to do something with a process I started earlier

   Store the PID when you start the process and use that PID later on:
    # Bourne
    my_child_process & mypid=$!

   If you're still in the parent process that started the child process
   you want to do something with, that's perfect. You're guaranteed the
   PID is your child process (dead or alive), for the reasons
   [48]explained below. You can use kill to signal it, terminate it, or
   just check whether it's still running. You can use wait to wait for it
   to end or to get its exit code if it has ended.
   If you're NOT in the parent process that started the child process you
   want to do something with, that's a shame. Try restructuring your
   logic so that you can be. If that's not possible, the things you can
   do are a little more limited and a little more risky.
   The parent process that created the child process should've written
   its PID to some place where you can access it. A PID file is probably
   the best place. Read the PID in from wherever the parent stored it,
   and hope that no other process has accidentally taken control of the
   PID while you weren't looking. You can use kill to signal it,
   terminate it, or just check whether it's still running. You cannot use
   wait to wait for it or get its exit code; this is only possible from
   the child's parent process. If you really want to wait for the process
   to end, you can poll kill -0:
   while sleep 1; do kill -0 $pid || break; done.
   Everything in the preceding paragraph is risky. The PID in the file
   may have been recycled before you even got there. The PID could be
   recycled after you read it from the file but before you send the
   suicide order. The PID could be recycled in the middle of your polling
   loop, leaving you in a non-terminating wait.
   If you need to write programs that manage a process without
   maintaining a parent/child relationship, your best bet is to make sure
   that all of those programs run with the same User ID (UID) which is
   not used by any other programs on the system. That way, if the PID
   gets recycled, your attempt to query/kill it will fail. This is
   infinitely preferable to your sending SIGTERM to some innocent
   bystander process.
How do I kill a process by name? I need to get the PID out of ps aux | grep
....

   No, you don't. Firstly, you probably do NOT want to find a process by
   name AT ALL. Make sure you have the PID of the process and do what the
   above answer says. If you don't know how to get the PID: Only the
   process that created your process knows the real PID. It should have
   stored it in a file for you. If you are IN the parent process, that's
   even better. Put the PID in a variable (process & mypid=$!) and use
   "$mypid" later on.
   If for some silly reason you really want to get to a process purely by
   name, you understand that this is a broken method, you don't care that
   this may set your hair on fire, and you want to do it anyway, you
   should probably use a command called pkill. You might also take a look
   at the command killall if you're on a legacy GNU/Linux system, but be
   warned: killall on some systems kills every process on the entire
   system. It's best to avoid it unless you really need it.
   (Mac OS X comes with killall but not pkill; pkill can be installed
   from third-party package sources.)
   If you just wanted to check for the existence of a process by name,
   use pgrep.
   Please note that checking/killing processes by name is insecure,
   because processes can lie about their names, and there is nothing
   unique about the name of a process.
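For illustration only, a check by name with pgrep, with all the caveats above (we start our own sleep just to have something to find):

```shell
sleep 30 & pid=$!

# -x requires an exact match on the process name.
if pgrep -x sleep > /dev/null; then
    echo "found at least one process named sleep"
fi

kill "$pid"    # clean up the demo job
```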
But I'm on some old legacy Unix system that doesn't have pgrep! What do I
do?

   As stated above, checking or killing processes by name is an extremely
   bad idea in the first place. So rather than agonize about shortcut
   tools like pgrep that you don't have, you'd do better to implement
   some sort of robust process management using the techniques we'll talk
   about later. But people love shortcuts, so let me fill in some legacy
   Unix issues and tricks here, even though you should not be using such
   methods.
   A legacy Unix system typically has no tool besides ps for inspecting
   running processes as a human system administrator. People then think
   that this is an appropriate tool to use in a script, even though it
   isn't. They fall into the mental trap of thinking that since this is
   the only tool provided by the OS for troubleshooting runaway processes
   as a human being, it must be an appropriate tool for scripts to use as
   well. It isn't.
   There are two entirely different ps commands on legacy Unix systems:
   System V Unix style (ps -ef) and BSD Unix style (ps auxw). In some
   slightly-less-old Unix systems, the two different syntaxes are
   combined, and the presence or absence of a hyphen tells ps which set
   of option letters is being used. (If you ever see ps -auxw with a
   hyphen, throw the program away immediately.) POSIX uses the System V
   style, and adds a -o option to tell ps which fields you want, so you
   don't have to write things like ps ... | awk '{print $2}' any more.
   Now, the second real problem with ps -ef | grep foo (after the fact
   that process names are inherently unreliable) is that there is a
   [50]RaceCondition in the output. In this pipeline, both the ps and the
   grep are spawned either simultaneously or nearly simultaneously.
   Depending on just how nearly simultaneously they are spawned, the grep
   process might or might not show up in the ps output. And the grep foo
   command is going to match both processes -- the foo daemon or
   whatever, and the grep foo command as well. Assuming both of them show
   up. You might get just one.
   There are two workarounds for that issue. The first is to filter out
   the grep command. This is typically done by running
   ps -ef | grep -v grep | grep foo. Note that the grep -v is done first
   so that it is not the final command in the pipeline. This is so that
   the final command in the pipeline is the one whose exit status
   actually matters. This allows commands like the following to work:
  if ps -ef | grep -v grep | grep -q foo; then

   The second workaround involves crafting a grep command that will match
   the foo process but not the grep itself. There are many variants on
   this theme, but one of the most common is:
  if ps -ef | grep '[f]oo'; then

   You'll likely run into this a few times. The [51]RegularExpression
   [f]oo matches only the literal string foo; it does not match the
   literal string [f]oo, and therefore the grep command won't be matched
   either. This approach saves one forked process (the grep -v), and some
   people find it clever.
   I've seen one person try to do this:
  if ps -ef | grep -q -m 1 foo; then

   Not only does this use a nonstandard GNU extension (grep -m -- stop
   after M matches), but it completely fails to avoid the race condition.
   If the race condition produces both grep and foo lines, there's no
   guarantee the foo one will be first! So, this is even worse than what
   we started with.
   Anyway, these are just explanations of tricks you might see in other
   people's code, so that you can guess what they're attempting to do.
   You won't be writing such hacks, I hope.
I want to run something in the background and then log out.

   If you want to be able to reconnect to it later, use screen or tmux.
   Launch either, then run whatever you want to run in the foreground,
   and detach (screen with Ctrl-A d and tmux with Ctrl-B d). You can
   reattach (as long as you didn't reboot the server) with screen -x to
   screen and with tmux attach to tmux. You can even attach multiple
   times, and each attached terminal will see (and control) the same
   thing. This is also great for remote teaching situations.
   If you can't or don't want to do that, the traditional approach still
   works: nohup something &
   Bash also has a disown command, if you want to log out with a
   background job running, and you forgot to nohup it initially:
sleep 1000
^Z        # suspend it with Ctrl-Z
bg        # resume it in the background
disown    # remove it from the shell's job table so logout won't HUP it

   If you need to log out of an ssh session with background jobs still
   running, make sure their file descriptors have been redirected so they
   aren't holding the terminal open, or [52]the ssh client may hang.
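Putting those pieces together (sleep 60 stands in for the real job, and the log file name is just an example):

```shell
# Redirect stdin, stdout and stderr so the job holds no reference to
# the terminal; nohup additionally shields it from SIGHUP at logout.
nohup sleep 60 < /dev/null > myjob.log 2>&1 &
echo "detached job PID: $!"
```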
I'm trying to kill -9 my job but blah blah blah...

   Whoa! Stop right there! Do not use kill -9, ever. For any reason.
   Unless you wrote the program to which you're sending the SIGKILL, and
   know that you can clean up the mess it leaves. Because you're
   debugging it.
   If a process is not responding to normal signals, it's probably in
   "state D" (as shown on ps u), which means it's currently executing a
   system call. If that's the case, you're probably looking at a dead
   hard drive, or a dead NFS server, or a kernel bug, or something else
   along those lines. And you won't be able to kill the process anyway,
   SIGKILL or not.
   If the process is ignoring normal SIGTERMs, then get the source code
   and fix it!
   If you have an employee whose first instinct any time a job needs to
   be terminated is to break out the fucking howitzers, then fire him.
   If you don't understand why this is a case of slicing bread with a
   chain saw, read [53]Who's [sic] idea was this? and [54]The UUOK9 Form
   Letter.
Make SURE you have run and understood these commands:

     * help kill
     * help trap
     * man pkill
     * man pgrep
   OK, now let's move on to the interesting stuff....
Advanced questions

I want to run two jobs in the background, and then wait until they both
finish.

   By default, wait waits for all of your shell's children to exit.
job1 &
job2 &
wait

   You can specify one or more jobs (either by PID, or by jobspec -- see
   Job Control for that). The help wait page is misleading (implying that
   only one argument may be given); refer to the full Bash manual
   instead.
   There is no way to wait for "child process foo to end, OR something
   else to happen", other than [55]setting a trap, which will only help
   if "something else to happen" is a signal being sent to the script.
   There is also no way to wait for a process that is not your child. You
   can't hang around the schoolyard and pick up someone else's kids.
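To also collect each job's exit status, wait for the PIDs individually; the two functions below are stand-ins for real jobs:

```shell
job1() { sleep 1; return 0; }
job2() { sleep 1; return 3; }

job1 & pid1=$!
job2 & pid2=$!

# wait "$pid" returns that particular child's exit status.
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?
echo "job1 exited $status1, job2 exited $status2"
```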
How can I check to see if my game server is still running? I'll put a script
in crontab, and if it's not running, I'll restart it...

   We get that question (in various forms) way too often. A user has some
   daemon, and they want to restart it whenever it dies. Yes, one could
   probably write a bash script that would try to parse the output of ps
   (or preferably pgrep if your system has it), and try to guess which
   process ID belongs to the daemon we want, and try to guess whether
   it's not there any more. But that's haphazard and dangerous. There are
   much better ways.
   Most Unix systems already have a feature that allows you to respawn
   dead processes: init and inittab. If you want to make a new daemon
   instance pop up whenever the old one dies, typically all you need to
   do is put an appropriate line into /etc/inittab with the "respawn"
   action in column 3, and your process's invocation in column 4. Then
   run telinit q or your system's equivalent to make init re-read its
   configuration file.
   Some Unix systems don't have inittab, and some system administrators
   might want finer control over the daemons and their logging. Those
   people may want to look into [56]daemontools, or [57]runit.
   This leads into the issue of self-daemonizing programs. There was a
   trend during the 1980s for Unix daemons such as inetd to put
   themselves into the background automatically. It seems to be
   particularly common on BSD systems, although it's widespread across
   all flavors of Unix.
   The problem with this is that any sane method of managing a daemon
   requires that you keep track of it after starting it. If init is told
   to respawn a command, it simply launches that command as a child, then
   uses the wait() system call; so, when the child exits, the parent can
   spawn another one. Daemontools works the same way: a user-supplied run
   script establishes the environment, and then execs the process,
   thereby giving the daemontools supervisor direct parental authority
   over the process, including standard input and output, etc.
   If a process double-forks itself into the background (the way inetd
   and sendmail and named do), it breaks the connection to its parent --
   intentionally. This makes it unmanageable; the parent can no longer
   receive the child's output, and can no longer wait() for the child in
   order to be informed of its death. And the parent won't even know the
   new daemon's process ID. The child has run away from home without even
   leaving a note.
   So, the Unix/BSD people came up with workarounds... they created "PID
   files", in which a long-running daemon would write its process ID,
   since the parent had no other way to determine it. But PID files are
   not reliable. A daemon could have died, and then some other process
   could have taken over its PID, rendering the PID file useless. Or the
   PID file could simply get deleted, or corrupted. They came up with
   pgrep and pkill to attempt to track down processes by name instead of
   by number... but what if the process doesn't have a unique name? What
   if there's more than one of it at a time, like nfsd or Apache?
   These workarounds and tricks are only in place because of the original
   hack of self-backgrounding. Get rid of that, and everything else
   becomes easy! Init or daemontools or runit can just control the child
   process directly. And even the most raw beginner could write their own
   [58]wrapper script:
   while :; do
      /my/game/server -foo -bar -baz >> /var/log/mygameserver 2>&1
   done

   Then simply arrange for that to be executed at boot time, with a
   simple & to put it in the background, and voila! An instant one-shot
   process supervisor that respawns your server whenever it dies.
   Most modern software packages no longer require self-backgrounding;
   even for those where it's the default behavior (for compatibility with
   older versions), there's often a switch or a set of switches which
   allows one to control the process. For instance, Samba's smbd now has
   a -F switch specifically for use with daemontools and other such
   process supervisors.
   If all else fails, you can try using [59]fghack (from the daemontools
   package) to prevent the self-backgrounding.
How do I make sure only one copy of my script can run at a time?

   First, ask yourself why you think that restriction is necessary. Are
   you using a temporary file with a fixed name, rather than
   [60]generating a new temporary file in a secure manner each time? If
   so, correct that bug in your script. Are you using some system
   resource without locking it to prevent corruption if multiple
   processes use it simultaneously? In that case, you should probably use
   file locking, by rewriting your application in a language that
   supports it.
   The naive answer to this question, which is given all too frequently
   by well-meaning but inexperienced scripters, would be to run some
   variant of ps -ef | grep -v grep | grep "$(basename "$0")" | wc -l to
   count how many copies of the script are in existence at the moment. I
   won't even attempt to describe how horribly wrong that approach is...
   if you can't see it for yourself, you'll simply have to take my word
   for it.
   Unfortunately, bash has no facility for locking a file. [61]Bash FAQ
   #45 contains examples of using a directory, a symlink, etc. as a means
   of mutual exclusion; but you cannot lock a file directly.
     * I believe you can use (set -C; >lockfile) to atomically create a
       lockfile, please verify this. (see: [62]Bash FAQ #45) --Andy753421
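The noclobber trick does work for regular files: with set -C in effect, '>' opens the file with O_EXCL, so the create either atomically succeeds or fails (it is reportedly unreliable on some old NFS implementations). A sketch:

```shell
#!/bin/bash
lock=/tmp/myscript.lock    # example path

# With noclobber (set -C), '>' fails if the file already exists,
# giving an atomic create-if-absent inside the subshell.
if ( set -C; : > "$lock" ) 2>/dev/null; then
    trap 'rm -f "$lock"' EXIT    # release the lock when we exit
    echo "got the lock; doing work"
else
    echo "another instance holds the lock" >&2
    exit 1
fi
```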
   You could also run your program or shell script under the [63]setlock
   program from the daemontools package. Presuming that you use the same
   lockfile to prevent concurrent or simultaneous execution of your
   script(s), you have effectively made sure that your script will only
   run once. Here's an example where we want to make sure that only one
   "sleep" is running at a given time.
$ setlock -nX lockfile sleep 100 &
[1] 1169
$ setlock -nX lockfile sleep 100
setlock: fatal: unable to lock lockfile: temporary failure

   If environmental restrictions require the use of a shell script, then
   you may be stuck using that. Otherwise, you should seriously consider
   rewriting the functionality you require in a more powerful language.
I want to process a bunch of files in parallel, and when one finishes, I
want to start the next. And I want to make sure there are exactly 5 jobs
running at a time.

   Many xargs implementations allow running tasks in parallel, including
   those of FreeBSD, OpenBSD and GNU (the -P option is not in POSIX):
find . -print0 | xargs -0 -n 1 -P 4 command

   One may also choose to use GNU Parallel (if available) instead of
   xargs, as GNU Parallel makes sure the output from different jobs does
   not mix.
find . -print0 | parallel -0 command | use_output_if_needed

   A C program could fork 5 children and manage them closely using
   select() or similar, to assign the next file in line to whichever
   child is ready to handle it. But bash has nothing equivalent to select
   or poll.
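Newer bash versions (4.3 and later) do have wait -n, which blocks until any one background job exits; that is enough to keep a fixed number of job slots full. A sketch (gzipping *.log files is just an example workload):

```shell
#!/bin/bash
# Keep at most 5 jobs running at once; requires bash 4.3+ for 'wait -n'.
max=5
for f in ./*.log; do
    # If all slots are busy, block until some job finishes.
    while (( $(jobs -pr | wc -l) >= max )); do
        wait -n
    done
    gzip "$f" &
done
wait    # wait for the remaining jobs
```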
   In a script where the loop body is large, you can use sem from GNU
   Parallel. Here 10 jobs are run in parallel:
for i in *.log ; do
  echo "$i"
  # [ other needed stuff...]
  sem -j10 gzip "$i" ";" echo done
done
sem --wait

   If you do not have GNU Parallel installed you're reduced to lesser
   solutions. One way is to divide the job into 5 "equal" parts, and then
   just launch them all in parallel. Here's an example:
#!/usr/local/bin/bash
# Read all the files (from a text file, 1 per line) into an array.
IFS=$'\n' read -r -d '' -a files < inputlist

# Here's what we plan to do to them.
do_it() {
   for f; do [[ -f $f ]] && my_job "$f"; done
}

# Divide the list into 5 sub-lists.
i=0 n=0 a=() b=() c=() d=() e=()
while ((i < ${#files[*]})); do
    a[n]=${files[i]}
    b[n]=${files[i+1]}
    c[n]=${files[i+2]}
    d[n]=${files[i+3]}
    e[n]=${files[i+4]}
    ((i+=5, n++))
done

# Process the sub-lists in parallel
do_it "${a[@]}" > a.out 2>&1 &
do_it "${b[@]}" > b.out 2>&1 &
do_it "${c[@]}" > c.out 2>&1 &
do_it "${d[@]}" > d.out 2>&1 &
do_it "${e[@]}" > e.out 2>&1 &
wait

   See [91]reading a file line-by-line and [92]arrays and
   [93]ArithmeticExpression for explanations of the syntax used in this
   example.
   Even if the lists aren't quite identical in terms of the amount of
   work required, this approach is close enough for many purposes.
   Another approach involves using a [94]named pipe to tell a "manager"
   when a job is finished, so it can launch the next job. Here is an
   example of that approach:
#!/bin/bash

# FD 3 will be tied to a named pipe.
mkfifo pipe; exec 3<>pipe

# This is the job we're running.
s() {
  echo Sleeping $1
  sleep $1
}

# Start off with 3 instances of it.
# Each time an instance terminates, write a newline to the named pipe.
{ s 5; echo >&3; } &
{ s 7; echo >&3; } &
{ s 8; echo >&3; } &

# Each time we get a line from the named pipe, launch another job.
while read; do
  { s $((RANDOM%5+7)); echo >&3; } &
done <&3

   If you need something more sophisticated than these, you're probably
   looking at the wrong language.
My script runs a pipeline. When the script is killed, I want the pipeline to
die too.

   One approach is to set up a [116]signal handler (or an EXIT trap) to
   kill your child processes right before you die. Then, you need the
   PIDs of the children -- which, in the case of a pipeline, is not so
   easy. You can use a [117]named pipe instead of a pipeline, so that you
   can collect the PIDs yourself:
#!/bin/bash
unset kids
fifo=/tmp/foo$$
trap 'kill "${kids[@]}"; rm -f "$fifo"' EXIT
mkfifo "$fifo" || exit 1
command1 > "$fifo" & kids+=($!)
command2 < "$fifo" & kids+=($!)
wait

   This example sets up a FIFO with one writer and one reader, and stores
   their PIDs in an array named kids. The EXIT trap sends SIGTERM to them
   all, removes the FIFO, and exits. See [126]Bash FAQ #62 for notes on
   the use of temporary files.
   Another approach is to enable job control, which allows whole
   pipelines to be treated as units.
#!/bin/bash
set -m
trap 'kill %%' EXIT
command1 | command2 &
wait

   In this example, we enable job control with set -m. The %% in the EXIT
   trap refers to the current job (the most recently executed background
   pipeline qualifies for that). Telling bash to kill the current job
   takes out the entire pipeline, rather than just the last command in
   the pipeline (which is what we would get if we had stored and used $!
   instead of %%).
How to work with processes

   The best way to do process management in Bash is to start the managed
   process(es) from your script, remember its PID, and use that PID to do
   things with your process later on.
   If at ALL possible, AVOID ps, pgrep, killall, and any other process
   table parsing tools. These tools have no clue what process YOU WANT to
   talk to. They only guess at it based on filtering unreliable
   information. These tools may work fine in your little test
   environment, they may work fine in production for a while, but
   inevitably they WILL fail, because they ARE a broken approach to
   process management.
PIDs and parents

   In UNIX, processes are identified by a number called a PID (for
   Process IDentifier). Each running process has a unique identifier. You
   cannot reliably determine when or how a process was started purely
   from the identifier number: for all intents and purposes, it is
   random.
   Each UNIX process also has a parent process. This parent process is
   the process that started it, but can change to the init process if the
   parent process ends before the new process does. (That is, init will
   pick up orphaned processes.) Understanding this parent/child
   relationship is vital because it is the key to reliable process
   management in UNIX. A process's PID will NEVER be freed up for use
   after the process dies UNTIL the parent process waits for the PID to
   see whether it ended and retrieve its exit code. If the parent ends,
   the process is returned to init, which does this for you.
   This is important for one major reason: if the parent process manages
   its child process, it can be absolutely certain that, even if the
   child process dies, no other new process can accidentally recycle the
   child process's PID until the parent process has waited for that PID
   and noticed the child died. This gives the parent process the
   guarantee that the PID it has for the child process will ALWAYS point
   to that child process, whether it is alive or a "zombie". Nobody else
   has that guarantee.
The risk of letting the parent die

   Why is this all so important? Why should you care? Consider what
   happens if we use a "PID file". Assume the following sequence of
   events:
    1. You're a boot script (for example, one in /etc/init.d). You are
       told to start the foodaemon.
    2. You start a foodaemon child process in the background and grab its
       PID.
    3. You write this PID to a file.
    4. You exit, assuming you've done your job.
    5. Later, you're started up again and told to kill the foodaemon.
    6. You look for the child process's PID in a file.
    7. You send the SIGTERM signal to this PID, telling it to clean up
       and exit.
   There is absolutely no way you can be certain that the process you
   told to exit is actually the one you started. The process you wanted
   to check up on could have died and another random new process could
   have easily recycled its PID that was released by init.
The risk of parsing the process tree

   UNIX comes with a set of handy tools, among which is ps. This is a
   very helpful utility that you can use from the command line to get an
   overview of what processes are running on your box and what their
   status is.
   All too many people, however, assume that computers and humans work
   the same way. They think that "I can read ps and see if my process is
   in there, why shouldn't my script do the same?". Here's why: You are
   (hopefully) smarter than your script. You see ps output and you see
   all sorts of information in context. Your brain determines, "Is this
   the process I'm looking for?" and based on what you see it guesses
   "Yeah, it looks like it." Firstly, your script can't process context
   the way your brain can (no, awk'ing out column 4 and seeing if that
   contains your process's command name isn't good enough). Secondly,
   even if it could do a good job, your script shouldn't be doing any
   guessing whatsoever. It shouldn't need to.
   ps output is unpredictable, highly OS-dependent, and not built for
   parsing. It is next to impossible for your script to distinguish ping
   as a command name from another process's command line which may
   contain a similar word like piping, or a user named ping, etc.
   The same goes for almost any other tool that parses the process list.
   Some are worse than others, but in the end, they all do the wrong
   thing.
Doing it right

   As mentioned before, the right way to do something with your child
   process is by using its PID, preferably (if at all possible) from the
   parent process that created it.
   You may have come here hoping for a quick hint on how to finish your
   script only to find these recommendations don't apply to any of your
   existing code or setup. That's probably not because your code or setup
   is an exception and you should disregard this; but more likely because
   you need to take the time and re-evaluate your existing code or setup
   and rework it. This will require you to think for a moment. Take that
   moment and do it right.
Starting a process and remembering its PID

   To start a process asynchronously (so the main script can continue
   while the process runs in the "background"), use the & operator. To
   get the PID that was assigned to it, expand the ! parameter. You can,
   for example, save it in a variable:
    # Bourne shell
    myprocess -o myfile -i &
    mypid=$!

Checking up on your process or terminating it

   At a later time, you may be interested in whether your process is
   still running and if it is, you may decide it's time to terminate it.
   If it's not running anymore, you may be interested in its exit code to
   see whether it experienced a problem or ended successfully.
   To [132]send a process a signal, we use the kill command. Signals can
   be used to tell a process to do something, but kill can also be used
   to check if the process is still alive:
    # Bourne
    kill -0 $mypid && echo "My process is still alive."
    kill    $mypid ;  echo "I just asked my process to shut down."

   kill sends the SIGTERM signal by default. This tells a program it's
   time to terminate. You can use the -0 option to kill if you don't want
   to terminate the process but just check up on whether it's still
   running. In either case, the kill command will have a 0 exit code
   (success) if it managed to send the signal (or found the process to
   still be alive).
   Unless you intend to send a very specific signal to a process, do not
   use any other kill options; in particular, avoid using -9 or SIGKILL
   at all cost. The KILL signal is a very dangerous signal to send to a
   process and using it is almost always a bug. Send the default SIGTERM
   instead and have patience.
   To wait for a child process to finish or to read in the exit code of a
   process that you know has already finished (because you did a kill -0
   check, for example), use the wait built-in command:
    # Bash
    night() { sleep 10; }              # Define 'night' as a function that takes 10 seconds.
                                       # Adjust seconds according to current season and latitude
                                       # for a more realistic simulation.

    night & nightpid=$!
    while sleep 1; do
        kill -0 $nightpid || break     # Break the loop when we see the process has gone away.
        echo "$(( ++sheep )) sheep jumped over the fence."
    done

    wait $nightpid; nightexit=$?
    echo "The night ended with exit code $nightexit.  We counted $sheep sheep."

Starting a "daemon" and checking whether it started successfully

   This is a very common request. The problem is that there is no answer!
   There is no such thing as "the daemon started up successfully", and if
   your specific daemon were to have a relevant definition to that
   statement, it would be so completely daemon-specific, that there is no
   generic way for us to tell you how to check for that condition.
   What people generally resort to in an attempt to provide something
   "good enough", is: "Let's start the daemon, wait a few seconds, check
   whether the daemon process is still running, and if so, let's assume
   it's doing the right thing." Ignoring the fact that this is a totally
   lousy check which could easily be defeated by a stressed kernel,
   timing issues, latency or delay in the daemon's operations, and many
   other conditions, let's just see how we would implement this if we
   actually wanted to do this:
    # Bash
    mydaemon -i eth0 & daemonpid=$!
    sleep 2
    if kill -0 $daemonpid; then
        echo "Daemon started successfully.  I think."
    else
        wait $daemonpid; daemonexit=$?
        echo "Daemon process disappeared.  I suppose something may have gone wrong.  Its exit code was $daemonexit."
    fi

   To be honest, this problem is much better solved by doing a
   daemon-specific check. For example, say you're starting a web server
   called httpd. The sensible thing to check in order to determine
   whether the web server started successfully... is whether it's
   actually serving your web content! Who'd have thought!
    # Bourne(?)
    httpd -h & httpdpid=$!
    while sleep 1; do
        nc -z localhost 80 && break   # See if we can establish a TCP connection to port 80.
    done

    echo "httpd ready for duty."

   If something goes wrong, though, this will wait forever trying to
   connect to port 80. So let's check whether httpd died unexpectedly or
   whether a certain "timeout" time elapsed:
    # Bash
    httpd -h & httpdpid=$!
    time=0 timeout=60
    while sleep 1; do
        nc -z localhost 80 && break   # See if we can establish a TCP connection to port 80.

        # Connection not yet available.
        if ! kill -0 $httpdpid; then
            wait $httpdpid; httpdexit=$?
            echo "httpd died unexpectedly with exit code: $httpdexit"
            exit $httpdexit
        fi
        if (( ++time > timeout )); then
            echo "httpd hasn't gotten ready after $time seconds.  Something must've gone wrong.."
            # kill $httpdpid; wait $httpdpid    # You could terminate httpd here, if you like.
            exit 1
        fi
    done

    echo "httpd ready for duty."

On processes, environments and inheritance

   Every process on a Unix system (except init) has a parent process from
   which it inherits certain things. A process can change some of these
   things, and not others. You cannot change things inside another
   process other than by being its parent, or attaching (attacking?) it
   with a debugger.
   It is of paramount importance that you understand this model if you
   plan to use or administer a Unix system successfully. For example, a
   user with 10 windows open might wonder why he can't tell all of his
   shells to change the contents of their PATH variable, short of going
   to each one individually and running a command. And even then, the
   changed PATH variable won't be set in the user's window manager or
   desktop environment, which means any new windows he creates will still
   get the old variable.
   The solution, of course, is that the user needs to edit a shell
   [133]dot file, then logout and back in, so that his top-level
   processes will get the new variable, and can pass it along to their
   descendants.
   Likewise, a system administrator might want to tell her in.ftpd to use
   a default [134]umask of 002 instead of whatever it's currently using.
   Achieving that goal will require an understanding of how in.ftpd is
   launched on her system, either as a child of inetd or as a standalone
   daemon with some sort of [135]boot script; making the appropriate
   modifications; and restarting the appropriate daemons, if any.
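   The umask part of that story is easy to see from any shell. In this
   sketch, the "child" is simply a command substitution subshell: it
   starts with the parent's umask, and its change dies with it.

```shell
# The child inherits the parent's umask; changing it in the child
# does not leak back into the parent.
umask 022
child_umask=$(umask 002; umask)   # runs in a child process
parent_umask=$(umask)
echo "child had $child_umask, parent still has $parent_umask"
[ "$child_umask" != "$parent_umask" ] && isolated=yes
```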
   So, let's take a closer look at how processes are created.
   The Unix process creation model revolves around two system calls:
   fork() and exec(). (There is actually a family of related system calls
   that begin with exec which all behave in slightly different manners,
   but we'll treat them all equally for now.) fork() creates a child
   process which is a duplicate of the parent who called fork() (with a
   few exceptions). The parent receives the child process's PID (Process
   ID) number as the return value of the fork() function, while the child
   gets a "0" to tell it that it's the child. exec() replaces the current
   process with a different program.
   So, the usual sequence is:
     * A program calls fork() and checks the return value of the system
       call. If the status is greater than 0, then it's the parent
       process, so it calls wait() on the child process ID (unless we
        want it to continue running while the child runs in the
        background).
     * If the status is 0, then it's the child process, so it calls
       exec() to do whatever it's supposed to be doing.
     * But before that, the child might decide to close() some file
       descriptors, open() new ones, set environment variables, change
       resource limits, and so on. All of these changes will remain in
       effect after the exec() and will affect the task that is executed.
     * If the return value of fork() is negative, something bad happened
       (we ran out of memory, or the process table filled up, etc.).
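   The shell exposes the same sequence through its own primitives: &
   performs the fork, exec performs the replace, and wait plays the role
   of the parent's wait() call. A small sketch:

```shell
# "( ... ) &" forks a child; "exec" replaces that child with echo;
# "wait" is the parent collecting the child's exit status.
( exec echo "child here" ) & childpid=$!
wait "$childpid"; status=$?
echo "parent collected child $childpid, exit status $status"
```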
   Let's take an example of a shell command:
echo hello world 1>&2

   The process executing this is a shell, which reads commands and
   executes them. For external commands, it uses the standard
   fork()/exec() model to do so. Let's show it step by step:
     * The parent shell calls fork().
     * The parent gets the child's process ID as the return value of
       fork() and waits for it to terminate.
     * The child gets a 0 from fork() so it knows it's the child.
     * The child is supposed to redirect standard output to standard
       error (due to the 1>&2 directive). It does this now:
          + Close file descriptor 1.
          + Duplicate file descriptor 2, and make sure the duplicate is
            FD 1.
     * The child calls
       exec("echo", "echo", "hello", "world", (char *)NULL) or something
       similar to execute the command. (Here, we're assuming echo is an
       external command.)
     * Once the echo terminates, the parent's wait call also terminates,
       and the parent resumes normal operation.
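   You can watch the child's redirection at work from the shell itself.
   In this sketch, the outer group swaps the streams so that whatever
   echo sends to standard error is exactly what we capture:

```shell
# Inside the braces, "1>&2" points echo's stdout at stderr -- the dup
# step described above.  The outer "2>&1 1>/dev/null" redirects stderr
# into the command substitution and discards stdout, proving where the
# text actually went.
captured=$( { echo "hello world" 1>&2; } 2>&1 1>/dev/null )
echo "captured from stderr: $captured"
```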
   There are other things the child of the shell might do before
   executing the final command. For example, it might set environment
   variables:
http_proxy=http://tempproxy:3128/ lynx http://someURL/

   In this case, the child will put http_proxy=http://tempproxy:3128/
   into the environment before calling exec(). The parent's environment
   is unaffected.
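   The effect is easy to demonstrate with a throwaway variable (MYVAR is
   a made-up name for this sketch): the assignment prefix lands in the
   child's environment only, and the parent never sees it.

```shell
# The assignment prefix puts MYVAR into the child's environment only.
unset MYVAR
child_value=$(MYVAR=hello sh -c 'printf %s "$MYVAR"')
parent_value=${MYVAR-unset}
echo "child saw '$child_value'; parent sees '$parent_value'"
```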
   A child process inherits many things from its parent:
     * Open file descriptors. The child gets copies of these, referring
       to the same files.
     * Environment variables. The child gets its own copies of these, and
       [136]changes made by the child do not affect the parent's copy.
     * Current working directory. If the child changes its working
       directory, [137]the parent will never know about it.
     * User ID, group ID and supplementary groups. A child process is
       spawned with the same privileges as its parent. Unless the child
       process is running with superuser UID (UID 0), it cannot change
       these privileges.
     * System resource limits. The child inherits the limits of its
       parent. A process that runs as superuser UID can raise its
       resource limits (setrlimit(2)). A process running as non-superuser
       can only lower its resource limits; it can't raise them.
     * [138]umask.
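   The working-directory item, as a quick sketch: a subshell's cd
   affects only the subshell, never the parent that spawned it.

```shell
# A child (here, a subshell) changes directory; the parent's current
# working directory is untouched.
start_dir=$PWD
( cd / && echo "child is now in $PWD" )
end_dir=$PWD
[ "$start_dir" = "$end_dir" ] && unchanged=yes
```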
   An active Unix system may be perceived as a tree of processes, with
   parent/child relationships shown as vertical ("branch") connections
   between nodes. For example,
      (init)
         |
      (login)
         |
     bash .xinitrc
     /     |    \
 rxvt    rxvt   fvwm2
  |        |        \
 bash   screen       \____________________
       /   |  \              |      |     \
    bash bash  bash        xclock  xload  firefox ...
           |     |
         mutt  rtorrent

   This is a simplified version of an actual set of processes run by one
   user on a real system. I have omitted many, to keep things readable.
   The root of the tree, shown as (init), as well as the first child
   process (login), are running as root (superuser UID 0). Here is how
   this scenario came about:
     * The kernel (Linux in this case) is hard-coded to run /sbin/init as
       process number 1 when it has finished its startup. init never
       dies; it is the ultimate ancestor of every process on the system.
     * init reads /etc/inittab which tells it to spawn some getty
       processes on some of the Linux virtual terminal devices (among
       other things).
      * Each getty process presents a bit of information plus a login
        prompt.
     * After reading a username, getty exec()s /bin/login to read the
       password. (Thus, getty no longer appears in the tree; it has
       replaced itself.)
     * If the password is valid, login fork()s the user's login shell (in
       this case bash). Presumably, it hangs around (instead of using
       exec()) because it wants to do some clean-up after the user's
       shell has terminated.
     * The user types exec startx at the bash shell prompt. This causes
       bash to exec() startx (and therefore the login shell no longer
       appears in the tree).
     * startx is a wrapper that launches an X session, which includes an
       X server process (not shown -- it runs as root), and a whole slew
       of client programs. On this particular system, .xinitrc in the
       user's home directory is a script that tells which X client
       programs to run.
     * Two rxvt terminal emulators are launched from the .xinitrc file
       (in the background using &), and each of them runs a new copy of
       the user's shell, bash.
          + In one of them, the user has typed exec screen (or something
            similar) to replace bash with screen. Screen, in turn, has
            three bash child processes of its own, two of which have
            terminal-based programs running in them (mutt, rtorrent).
     * The user's window manager, fvwm2, is run in the foreground by the
       .xinitrc script. A window manager or desktop environment is
       usually the last thing run by the .xinitrc script; when the WM or
        DE terminates, the script terminates, and brings down the whole
        session.
     * The window manager runs several processes of its own (xclock,
       xload, firefox, ...). It typically has a menu, or icons, or a
       control panel, or some other means of launching new programs. We
       will not cover window manager configurations here.
   Other parts of a Unix system use similar process trees to accomplish
   their goals, although few of them are quite as deep or complex as an X
   session. For example, inetd runs as a daemon which listens on several
   UDP and TCP ports, and launches programs (ftpd, telnetd, etc.) when it
   receives network connections. lpd runs as a managing daemon for
   printer jobs, and will launch children to handle individual jobs when
   a printer is ready. sshd listens for incoming SSH connections, and
   launches children when it receives them. Some electronic mail systems
   (particularly [139]qmail) use relatively large numbers of small
   processes working together.
   Understanding the relationship among a set of processes is vital to
   administering a system. For example, suppose you would like to change
   the way your FTP service behaves. You've located a configuration file
   that it is known to read at startup time, and you've changed it. Now
   what? You could reboot the entire system to be sure your change takes
   effect, but most people consider that overkill. Generally, people
   prefer to restart only the minimal number of processes, thereby
   causing the least amount of disruption to the other services and the
   other users of the system.
   So, you need to understand how your FTP service starts up. Is it a
   standalone daemon? If so, you probably have some system-specific way
   of restarting it (either by running a [140]BootScript, or manually
   killing and restarting it, or perhaps by issuing some special service
   management command). More commonly, an FTP service runs under the
   control of inetd. If this is the case, you don't need to restart
   anything at all. inetd will launch a fresh FTP service daemon every
   time it receives a connection, and the fresh daemon will read the
   changed configuration file every time.
   On the other hand, suppose your FTP service doesn't have its own
   configuration file that lets you make the change you want (for
   example, changing its umask for the default [141]Permissions of
   uploaded files). In this case, you know that it inherits its umask
   from inetd, which in turn gets its umask from whatever boot script
   launched it. If you would like to change FTP's umask in this scenario,
   you would have to edit inetd's boot script, and then kill and restart
   inetd so that the FTP service daemons (inetd's children) will inherit
   the new value. And by doing this, you are also changing the default
   umask of every other service that inetd manages! Is that acceptable?
   Only you can answer that. If not, then you may have to change how your
   FTP service runs, possibly moving it to a standalone daemon. This is a
   system administrator's job.
   [142]CategoryShell [143]CategoryUnix
   ProcessManagement (last edited 2011-10-24 16:08:10 by [144]GreyCat)