help-cfengine

SUMMARY: Too many cfagents running. Was: Load problem with cfservd


From: Baker, Darryl
Subject: SUMMARY: Too many cfagents running. Was: Load problem with cfservd
Date: Thu, 17 Mar 2005 09:11:11 -0500

 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Finally got all the problems fixed. First I installed the latest
snapshot and the load dropped. We had some scripts using ssh that were
setting up X window tunnels for no reason. We fixed those scripts and
the load dropped. I switched the configuration to "schedule = (
Min00_05 Min30_35 )" but left the rules saying Q1 and Q3. Nothing
changed. Then I had an inspiration and removed /var/cfengine/*db. The
load and the contention for the mutex dropped like a rock. Why?

We believe that the main problem was that the kernel mutex we were
spinning on is part of /dev/random, and that between all the ssh
clients this machine spawns and the cfengine connections, it was
just not able to generate enough randomness and was blocking.
Then why would removing those files improve things?
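
For reference, the schedule arithmetic from the follow-up quoted below
can be sketched in a few lines of Python. This is only a rough model,
not cfengine code: it assumes cfexecd wakes every 5 minutes, that Q1
covers minutes 0-14 (Q3 covers 30-44), and that Min00_05 covers
minutes 0-4. The function names are made up for illustration.

```python
# Hypothetical model, not cfengine code. Assumptions: cfexecd wakes every
# 5 minutes and starts one cfagent whenever a scheduled time class is active.

def active_classes(minute):
    """Return the time classes assumed to be active at this minute."""
    quarter = "Q%d" % (minute // 15 + 1)      # Q1 = minutes 0-14, Q2 = 15-29, ...
    start = (minute // 5) * 5                 # start of the 5-minute interval
    interval = "Min%02d_%02d" % (start, (start + 5) % 60)
    return {quarter, interval}

def runs_per_hour(schedule, wake_every=5):
    """Minutes of the hour at which a wake-up coincides with the schedule."""
    wanted = set(schedule)
    return [m for m in range(0, 60, wake_every) if active_classes(m) & wanted]

print(runs_per_hour(["Q1", "Q3"]))               # [0, 5, 10, 30, 35, 40]
print(runs_per_hour(["Min00_05", "Min30_35"]))   # [0, 30]
```

Under those assumptions the quarter-hour classes give six agent starts
an hour and the five-minute classes give two. The model only counts
agent starts; it says nothing about the mutex contention or why
removing the db files helped.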

_____________________________________________________________________
Darryl Baker
gedas USA, Inc.
Operational Services Business Unit
3800 Hamlin Road
Auburn Hills, MI 48326
US
phone   +1-248-754-5341
fax     +1-248-754-6399
Darryl.Baker@gedas.com
http://www.gedasusa.com
_____________________________________________________________________

> -----Original Message-----
> From: help-cfengine-bounces+darryl.baker=gedas.com@gnu.org
> [mailto:help-cfengine-bounces+darryl.baker=gedas.com@gnu.org]On
> Behalf Of Baker, Darryl
> Sent: Wednesday, March 16, 2005 9:14 AM
> To: help-cfengine@gnu.org
> Subject: RE: Too many cfagents running. Was: Load problem with
> cfservd  
> 
> 
>  
> 
> *** PGP Signature Status: good
> *** Signer: Darryl Philip Baker <darryl.baker@gedas.com>
> *** Signed: 3/16/2005 9:14:32 AM
> *** Verified: 3/17/2005 9:01:32 AM
> *** BEGIN PGP VERIFIED MESSAGE ***
> 
> Follow-up:
>       What I found is cfexecd is spawning cfagents every 5 minutes
> during the scheduled quarter hour. So in Q1 it spawns one at 0,5,10
> and in Q3 it spawns one at 30,35,40. Therefore I get an increased
> load by a factor of 3 on the server rather than reducing the load as
> I was trying to do.
> 
> 
> > -----Original Message-----
> > From: help-cfengine-bounces+darryl.baker=gedas.com@gnu.org
> > [mailto:help-cfengine-bounces+darryl.baker=gedas.com@gnu.org]On
> > Behalf Of Baker, Darryl
> > Sent: Tuesday, March 15, 2005 12:31 PM
> > To: help-cfengine@gnu.org
> > Subject: Too many cfagents running. Was: Load problem with
> > cfservd  
> > 
> > 
> >  
> > 
> > *** PGP Signature Status: good
> > *** Signer: Darryl Philip Baker <darryl.baker@gedas.com>
> > *** Signed: 3/15/2005 12:31:14 PM
> > *** Verified: 3/16/2005 9:10:01 AM
> > *** BEGIN PGP VERIFIED MESSAGE ***
> > 
> >  
> > *** PGP Signature Status: good
> > *** Signer: Darryl Philip Baker <darryl.baker@gedas.com>
> > *** Signed: 3/15/2005 12:28:32 PM
> > *** Verified: 3/15/2005 12:30:08 PM
> > *** BEGIN PGP VERIFIED MESSAGE ***
> > 
> > Installing the latest snapshot has reduced the problem with
> > system loading on the master. 
> > 
> > Now I'm finding that cfexecd is starting one cfagent every 5
> > minutes even though I have the schedule set to only run in Q1 and
> > Q3: "schedule = ( Q1 Q3 )". Why?
> > 
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: help-cfengine-bounces+darryl.baker=gedas.com@gnu.org
> > > [mailto:help-cfengine-bounces+darryl.baker=gedas.com@gnu.org]On
> > > Behalf Of Baker, Darryl
> > > Sent: Monday, March 14, 2005 4:08 PM
> > > To: help-cfengine@gnu.org
> > > Subject: Load problem with cfservd
> > > 
> > > 
> > > 
> > > *** PGP Signature Status: good
> > > *** Signer: Darryl Philip Baker <darryl.baker@gedas.com>
> > > *** Signed: 3/14/2005 4:08:02 PM
> > > *** Verified: 3/15/2005 10:54:48 AM
> > > *** BEGIN PGP VERIFIED MESSAGE ***
> > > 
> > > My master machine is Solaris 9 and all systems are running
> > > Solaris 8 or 9 and cfengine 2.1.13.
> > > 
> > > The problem we have with cfservd manifests itself as a periodic
> > > clog that takes about a minute to resolve. This period is
> > > characterized by the following symptoms:
> > > 
> > > 1. Load average spike from ~3 (on a 4-processor system) to the
> > > 6-8 range. Occasionally the spike breaks into double digits. 
> > > 2. Increase in concurrent port 5308 (cfengine) sessions from a
> > > base level of 0-4 to peaks in the 12-30 range, with the number
> > > of LWPs in the cfservd processes tracking the number of
> > > connections linearly. (Client systems are set to connect twice
> > > an hour with a 25-minute splay time.)
> > > 3. Running lockstat shows severe contention for a single
> > > adaptive mutex:
> > > 
> > > root@sysadm05:proc# lockstat sleep 5
> > > 
> > > Adaptive mutex spin: 157416 events in 5.040 seconds (31233 events/sec)
> > > 
> > >  Count indv cuml rcnt     spin Lock        Caller
> > > ------------------------------------------------------------------
> > > 136805  87%  87% 1.00       75 0x152ec90   sfmmu_mlist_enter+0x84
> > > [...]
> > > Adaptive mutex block: 648 events in 5.040 seconds (129 events/sec)
> > > 
> > >  Count indv cuml rcnt     nsec Lock        Caller
> > > ------------------------------------------------------------------
> > >    547  84%  84% 1.00   391652 0x152ec90   sfmmu_mlist_enter+0x84
> > > 
> > > Both of those types of lock run about 2 orders of magnitude
> > > lower in total when the system is in its 'calm' state, with
> > > the specific lock running as much as 3 orders of magnitude
> > > lower (i.e. ~100 spins and no blocks).
> > > 
> > > 4. The cfservd process becomes by far the top cpu user, eating
> > > 10-25% of total cpu on a 4-processor system. 
> > > 5. The system retains some idle time (5-30%) but the time used
> > > by the kernel jumps to the 40-70% range. 
> > > 
> > > The history of troubleshooting this leads me to believe that
> > > the heavy ssh usage on this host is a significant compounding
> > > factor, i.e. that we are hitting some common bottleneck when we
> > > have cfservd accepting connections and are spawning batches of
> > > 30-100 outbound ssh connections at once. Reducing the herds of
> > > outbound ssh's has reduced the frequency and severity of these
> > > clog periods, but every time we change much of anything on the
> > > system, we end up getting back to a state where these clogs
> > > become common. 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > *** END PGP VERIFIED MESSAGE ***
> > > 
> > > 
> > > 
> > 
> > 
> > *** END PGP VERIFIED MESSAGE ***
> >  
> > 
> > 
> > *** END PGP VERIFIED MESSAGE ***
> >  
> > 
> > 
> 
> 
> *** END PGP VERIFIED MESSAGE ***
>  
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: PGP Personal Security 7.0.3

iQA/AwUBQjmQAle1Bhkj9lZeEQJwgACfSFPFHJFqDULpbTEdCvPSjoMdXh8An3qF
CjVV9g5YGWDkp5tvXVmRaoed
=LKdL
-----END PGP SIGNATURE-----
 
