Re: Bootstrapping

From: Luke A. Kanies
Subject: Re: Bootstrapping
Date: Wed, 18 Feb 2004 16:56:39 -0600 (CST)

On Wed, 18 Feb 2004, Eric Sorenson wrote:

> Cfengine has the same problem, except when the host key changes
> you have to track down why this one machine can't get updates and
> the users are complaining.

This is another problem that I consider unsolved.  How do you know all of
your hosts are correctly updating themselves?  How do you even define
'correctly updating'?

At my previous client I was reading all syslog messages from a pipe
written to by syslog-ng, and then storing those logs in a database.  I
tacked a small filter on that reader and had it start storing last-seen
records in LDAP for every host (with some throttling so I didn't spam the
LDAP server).  Then I defined 'recent' for my various services (cfengine
and ISconf, in that case), and had a script which could easily check
whether all of my hosts were 'recent'.  I never went so far as to connect
it to a tool like Nagios, but I would have liked to.
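
A minimal sketch of that last-seen scheme, assuming an in-memory dict as a
stand-in for the LDAP records.  The class name, the thresholds, and the
throttle interval are all hypothetical and chosen only for illustration;
the original reader script is not shown in this message.

```python
import time

# Hypothetical thresholds: how old a last-seen record may be, per service,
# before a host no longer counts as 'recent'.
RECENT_SECONDS = {"cfengine": 2 * 3600, "isconf": 24 * 3600}

# Hypothetical minimum gap between stored updates for one (host, service)
# pair, so the directory is not rewritten on every log line.
THROTTLE_SECONDS = 300


class LastSeenStore:
    """In-memory stand-in for the LDAP last-seen records."""

    def __init__(self):
        self.records = {}  # (host, service) -> time of last stored sighting
        self.writes = 0    # number of writes that were not throttled

    def observe(self, host, service, now=None):
        """Record a sighting; return True only if a write happened."""
        now = time.time() if now is None else now
        key = (host, service)
        last = self.records.get(key)
        if last is not None and now - last < THROTTLE_SECONDS:
            return False  # throttled: a recent record already exists
        self.records[key] = now
        self.writes += 1
        return True

    def stale_hosts(self, master_list, service, now=None):
        """Return hosts from the master list that are not 'recent'."""
        now = time.time() if now is None else now
        limit = RECENT_SECONDS[service]
        stale = []
        for host in master_list:
            last = self.records.get((host, service))
            # A host with no record at all is reported as stale too.
            if last is None or now - last > limit:
                stale.append(host)
        return stale
```

The log reader would call observe() for each message it attributes to a
service, and a periodic check would call stale_hosts() with the master
host list.  Because of the throttle, a stored timestamp can lag the true
last sighting by up to THROTTLE_SECONDS, so the 'recent' thresholds should
be comfortably larger than that.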

This was a pretty good method in that it used my master host list to tell
me the status of every host in the list.  However, it had a serious
failing:  It didn't have a good definition of correct.  Of course, it was
also subject to failures of the syslog system (syslog-ng dies, the reading
script dies, etc.), but that was solvable through other methods.

So, as to 'correctly updating':  If a client can successfully copy
_anything_, is it working?  What if it's just running cfagent at all?
What if it has some errors, such as being incapable of starting a process?

I don't believe it's possible to have cfagent collect the number of
errors, or to classify a portion of an update as 'critical' or 'optional',
but that would certainly be useful.  If I could collect that information
and then use it to have the client update my LDAP repository as the last
stage in any run, then I would believe I had a good definition of a
functional system.  Just a simple (No errors/Some Minor Errors/Some
Critical Errors/Total nonfunction/No Data) stat of some kind would be very
useful.
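
The five-way status could be sketched as follows.  This assumes a
hypothetical per-run report with error counts split into critical and
minor; as noted above, cfagent is not believed to provide such counts, so
the report format and every name here are illustrative only.

```python
from enum import Enum


class RunStatus(Enum):
    NO_ERRORS = "No errors"
    MINOR_ERRORS = "Some Minor Errors"
    CRITICAL_ERRORS = "Some Critical Errors"
    NONFUNCTIONAL = "Total nonfunction"
    NO_DATA = "No Data"


def classify_run(report):
    """Map a hypothetical run report to one of the five statuses.

    report is None when nothing was heard from the host; otherwise it is
    a dict with the keys 'completed' (bool), 'critical_errors' (int) and
    'minor_errors' (int).
    """
    if report is None:
        return RunStatus.NO_DATA
    if not report["completed"]:
        # The run never finished, so the error counts cannot be trusted.
        return RunStatus.NONFUNCTIONAL
    if report["critical_errors"] > 0:
        return RunStatus.CRITICAL_ERRORS
    if report["minor_errors"] > 0:
        return RunStatus.MINOR_ERRORS
    return RunStatus.NO_ERRORS
```

The checks run from worst to best, so a run with both critical and minor
errors is reported as critical.  The resulting value is what the client
would write to the repository as the last stage of a run.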

I'm working on it....


Health is merely the slowest possible rate at which one can die.
