cfservd under load - SplayTime quirks

From: Eric Sorenson
Subject: cfservd under load - SplayTime quirks
Date: Tue, 7 Dec 2004 13:25:04 -0800 (PST)

Hi, after upgrading to 2.1.11 last week, our RH9-based cfservd master server
started behaving oddly. cfservd crashed a lot (four or five times a day) and
when I tried to debug it, sometimes it would go into an unkillable state (like
a zombie but reported by linux as 'defunct').  New cfservd's wouldn't be able
to bind to *:tcp/5308 and rebooting the machine was our only recourse to bring
it back.  I investigated a bit further today, and here's a writeup of a few
things I've noticed.  This is more of a narrative than a concise bug report
but there are a couple of subtle bugs described herein.

(About our setup: we have about 1200 mostly linux machines, all recently
upgraded to 2.1.11, talking to a RH9 masterserver, also now running 2.1.11.
Cfengine runs from cron every hour, with a SplayTime of 30 minutes. The config
mostly does copy: actions with a few local shellcommands.)

The first thing I noticed was that despite the SplayTime, right at 25 min past
the hour (when the cron job is scheduled), the server gets absolutely pounded
with clients. cfservd crashes within a few seconds, sometimes with no corefile
and nothing logged, and sometimes with stuff like:

Dec  5 04:25:51 sinistar cfservd[20038]:  Unable to lookup hostname ( or cfengine service: Temporary failure in name resolution
Dec  5 04:25:51 sinistar cfservd[20038]: Couldn't open last-seen database /var/cfengine/cf_lastseen.db
Dec  5 04:25:51 sinistar cfservd[20038]: db_open: Too many open files

Ok, that seems straightforward, there are a ton of clients connecting, each
one eating up a few file descriptors, and at some point we run out. But
'ulimit -n' permits 1024 open files, and /proc/sys/fs/file-nr shows
"2668 1593 52422" (allocated, used, maximum). When I've been able to snatch
a 'lsof' from a busy cfservd, there's maybe 100 fds in use, so I don't think
either of these system limits is being hit. This led me in two directions:
first, to investigate splaying out the clients more, and second, to tune
cfservd to behave more nicely when it's getting pummeled with connections.
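One caveat on the numbers above: 'ulimit -n' reports the limit of the shell it
is typed into, and a daemon inherits its limit from whatever started it, so the
two can differ. As a rough illustration (not part of cfengine), here is a small
Python sketch that compares a process's open descriptors with its own
RLIMIT_NOFILE. The helper name count_open_fds is made up for this example, and
it assumes a Linux-style /proc. It inspects the script's own process; pass
cfservd's pid to count the daemon's descriptors instead.

```python
import os
import resource


def count_open_fds(pid):
    """Count open descriptors for pid by listing /proc/<pid>/fd.

    Hypothetical helper for illustration. Returns None when /proc is
    missing or the directory is not readable (wrong user, process gone).
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


# The limits reported here belong to THIS process, not to cfservd.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
used = count_open_fds(os.getpid())
print(f"soft limit={soft} hard limit={hard} open fds={used}")
```

Sampling this in a loop while the clients arrive would show whether the count
really stays near 100 or briefly spikes toward the soft limit.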

Well, right off I realized I'd made an error.  My cfagent.conf was set to a 30
minute splay, but update.conf was only set to 5 minutes.  The docs say (from
the Tutorial):

    Every machine will go to sleep for a different length of time, which is no
    longer than the time you specify in minutes. A hashing algorithm, based on
    the fully qualified name of the host, is used to compute a unique time for
    hosts. The shorter the interval, the more clustered the hosts will be.

However, if you use update.conf, the SplayTime in your cfagent.conf gets
ignored entirely -- something I really wasn't expecting!

 * (update context)
Sleeping for SplayTime 398 seconds
 * (main context)
Time splayed once already - not repeating

I guess this is the intended behavior (?); if so, it could stand to be documented
better.  The comment at the code example in the tutorial says:

    # Put this in update.conf, so that the updates are also splayed

But what it means is:

    # If you put this in update.conf, the whole run will splay to this value

FWIW I wrote a little program to understand the SplayTime hashing algorithm
better, the curious can see it here
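For anyone who wants to play with the idea without that program, here is a
minimal Python sketch of hostname-hashed splaying. It is neither the program
mentioned above nor cfengine's actual hash: SHA-256 is swapped in as a
stand-in, and the hostnames and the function names splay_seconds and
busiest_second are invented for illustration. It only shows the general
technique the Tutorial describes, where each host gets a stable delay inside
the splay window, and why a 5 minute window packs 1200 hosts much tighter than
a 30 minute one.

```python
import hashlib
from collections import Counter


def splay_seconds(fqdn, splay_minutes):
    """Map a hostname to a stable delay in [0, splay_minutes * 60) seconds.

    Assumption: SHA-256 stands in for whatever hash cfengine really uses.
    """
    digest = hashlib.sha256(fqdn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (splay_minutes * 60)


def busiest_second(hosts, splay_minutes):
    """Return how many hosts share the most crowded one-second slot."""
    delays = Counter(splay_seconds(h, splay_minutes) for h in hosts)
    return max(delays.values())


# Made-up hostnames, sized to match the roughly 1200 clients described above.
hosts = [f"node{i:04d}.example.com" for i in range(1200)]

for minutes in (5, 30):
    window = minutes * 60
    print(f"{minutes:2d} min splay: {len(hosts) / window:.2f} hosts/sec on average, "
          f"busiest second has {busiest_second(hosts, minutes)} hosts")
```

With 1200 hosts the average arrival rate is 4 per second over a 5 minute
window versus about 0.67 per second over 30 minutes, before counting any
unevenness in the hash.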

That is enough jabber for this post; in part two I'll cover cfservd tuning.


 - Eric Sorenson - Explosive Networking - -
