help-cfengine

cfservd under load - db crash, MaxConnections


From: Eric Sorenson
Subject: cfservd under load - db crash, MaxConnections
Date: Tue, 7 Dec 2004 15:05:05 -0800 (PST)

As I was working on the SplayTime curiosity described in my last post,
I was also investigating a couple of things on the server side.  The
first came from a couple of core dumps I got that both looked like this,
with different 'mipaddr' values:

(gdb) bt
#0  0x40057822 in ?? ()
#1  0x4004c9c2 in ?? ()
#2  0x4006f819 in ?? ()
#3  0x40069659 in ?? ()
#4  0x08050877 in IsWildKnownHost (oldkey=0x81f21b8, newkey=0x8344d50, mipaddr=0x83423b8 "10.10.18.187",
    username=0x8341fb4 "root") at cfservd.c:3154
#5  0x08050452 in CheckStoreKey (conn=0x8341b98, key=0x8344d50) at cfservd.c:3038
#6  0x0804eb5a in AuthenticationDialogue (conn=0x8341b98, recvbuffer=0x4cffdb7c "", recvlen=280) at cfservd.c:2315
#7  0x0804d021 in BusyWithConnection (conn=0x8341b98) at cfservd.c:1252
#8  0x0804c775 in HandleConnection (conn=0x8341b98) at cfservd.c:1133
#9  0x401d32b6 in ?? ()

I suspected that the innermost frames, which have no symbol data, were
calls into the Berkeley DB libraries; according to the code, cfservd was
manipulating the /var/cfengine/ppkeys/dynamic key database at that point.
I suspect that one of my earlier crashes corrupted an entry in it, so I
just rm'ed it and let it be re-created, and I haven't seen any more of
these problems. This might help others who are seeing cfservd crashes
and have DynamicAddresses turned on for some hosts.

Another pathology I saw was 'too many open files' errors from cfservd.
At the time our MaxConnections setting in cfservd.conf was 1000, which is
the maximum allowable value (cfservd.c:392), and clearly we were hitting
some wall below that.  So I set it to 100 on a lark and saw this:

Dec  6 17:25:01 sinistar cfservd[5518]:  Too many threads (>=100) -- increase MaxConnections?
Dec  6 17:25:02 sinistar last message repeated 64 times
Dec  6 17:25:02 sinistar cfservd[5518]:  Server seems to be paralyzed. DOS attack? Committing apoptosis...
Dec  6 17:25:02 sinistar cfservd[5518]:  Received signal 0 (NOSIG) while doing [cfservd]
Dec  6 17:25:02 sinistar cfservd[5518]:  Logical start time Mon Dec  6 17:16:01 2004
Dec  6 17:25:02 sinistar cfservd[5518]:  This sub-task started really at Mon Dec  6 17:16:01 2004

'Apoptosis' is apparently a biological term meaning 'programmed cell death'.
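
To see how tightly the rejections cluster, one could tally the "Too many
threads" messages per timestamp, expanding syslog's "last message repeated
N times" lines. The following Python is only a rough sketch: the function
name and the regular expressions are my own, and they assume the log layout
shown in the excerpt above.

```python
import re
from collections import Counter

# Assumed layout, as in the excerpt: "Mon DD HH:MM:SS host message"
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+(.*)$")
REPEAT = re.compile(r"last message repeated (\d+) times")

def rejections_per_second(lines):
    """Count 'Too many threads' messages per timestamp.

    Repeats are credited to the timestamp of the 'repeated' line, which is
    when syslog flushed them, not necessarily when each one occurred.
    """
    counts = Counter()
    last_was_rejection = False
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        stamp, msg = m.groups()
        rep = REPEAT.search(msg)
        if rep:
            # Only count repeats of a rejection message.
            if last_was_rejection:
                counts[stamp] += int(rep.group(1))
            continue
        last_was_rejection = "Too many threads" in msg
        if last_was_rejection:
            counts[stamp] += 1
    return counts

sample = [
    "Dec  6 17:25:01 sinistar cfservd[5518]:  Too many threads (>=100) -- increase MaxConnections?",
    "Dec  6 17:25:02 sinistar last message repeated 64 times",
    "Dec  6 17:25:02 sinistar cfservd[5518]:  Server seems to be paralyzed. DOS attack? Committing apoptosis...",
]
for stamp, n in rejections_per_second(sample).items():
    print(stamp, n)
```

Run against a full syslog file, this would show whether the rejections are
confined to the first second or two of each execution window or spread
across it.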

With MaxConnections at 100, cfservd survived long enough to push out
updated configurations with the increased SplayTime to some clients, easing
the load on itself. But I'm still seeing an onslaught right at one second
past the opening of the execution window. While these messages are being
logged, I can't tell whether the 100 running threads below the apoptosis
threshold are making forward progress, or whether there is a big set of
clients that will never receive their updated configs except by package
update, because the server will always be slammed when they try to connect.
I'll try raising the value to 500 to see what happens, but I'm wondering
if there's a more scientific (or at least less naive) way to tune
MaxConnections so that it fits inside OS limits but still handles lots of
client connections.
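
One starting point might be the per-process file descriptor limit
(RLIMIT_NOFILE), since 'too many open files' is the usual text for the
EMFILE error. The Python sketch below reads that limit and derives a rough
ceiling. The function name is made up, and the descriptors-per-connection
and reserve figures are pure guesses on my part, not values taken from
cfservd; they would need to be measured on a real server.

```python
import resource

def max_connections_ceiling(soft_limit, fds_per_connection=2, reserved_fds=32):
    """Largest connection count that fits within soft_limit descriptors.

    fds_per_connection and reserved_fds are illustrative assumptions
    (socket plus one open file per connection; some slack for logs,
    databases and the listening socket), not measured cfservd values.
    """
    usable = soft_limit - reserved_fds
    if usable <= 0:
        return 0
    return usable // fds_per_connection

# Limits of the current process; run this as the user that starts cfservd.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft == resource.RLIM_INFINITY:
    print("no per-process descriptor limit")
else:
    print("soft limit:", soft, "-> ceiling:", max_connections_ceiling(soft))
```

With a soft limit of 1024 and these guessed figures the ceiling comes out
at 496, which would at least be consistent with a setting of 1000 running
out of descriptors.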

--

 - Eric Sorenson - Explosive Networking - http://eric.explosive.net -



