Re: checksum woes

From: Frank Ranner
Subject: Re: checksum woes
Date: Sat, 31 Jan 2004 18:23:22 +1100
User-agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)

Tod Oace wrote:
This reminds me of a bug that was in an old version. Are you
up to date with upgrades?

Frank's problem seems different from the one I was experiencing. I was just seeing copies sporadically misfire. To follow up: I had reported that this was still happening even after I disabled the checksum databases (client and server). What I actually found was that after I killed and restarted all my cfservd processes, the problem completely disappeared.

So my problem was that sometimes the checksum database lookups would not find data when they should have. I've been meaning to try BerkeleyDB 4.2 and see if that helps. The change list between 4.1 and 4.2 looked pretty long. I was and still am using db-4.1.25 with Cfengine 2.1.0p1.

I have an objection to how cfservd reacts to the lookup failure. When the database lookup fails, cfservd tells cfagent that the checksum has changed and cfagent goes ahead with its copy, even though the file may be exactly the same. Ideally BerkeleyDB wouldn't ever fail, but if it does, or if you blow away your checksum database, then cfservd causes unnecessary copies because it's not comparing the local and remote checksums.

If cfservd can't do the database lookup it should compute and compare the checksum before stating that it is different. It looks like misc.c:ChecksumChanged already computes and stores a checksum on the cfservd side. ChecksumChanged could compute the checksum a bit earlier on and then use that result for a comparison. If the checksums are equal then it should stash the checksum in the database and report the checksums as equal.
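
A minimal sketch of the fallback described above, not the actual misc.c code. Everything here is hypothetical and for illustration only: the `sketch_` names, the in-memory table standing in for the Berkeley DB checksum database, and an FNV-1a hash standing in for MD5. When the lookup fails, it computes the checksum, stores it, and only then decides whether the file differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ENTRIES 16
#define SKETCH_MAX_PATH 256

/* Hypothetical in-memory stand-in for the Berkeley DB checksum database. */
struct sketch_entry {
    char path[SKETCH_MAX_PATH];
    uint32_t digest;
    int used;
};
static struct sketch_entry sketch_table[SKETCH_MAX_ENTRIES];

/* FNV-1a hash: a stand-in for MD5 so the sketch needs no extra library. */
uint32_t sketch_digest(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;
    size_t i;
    for (i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 and fills *out if path is in the table, 0 if the lookup fails. */
int sketch_db_lookup(const char *path, uint32_t *out)
{
    int i;
    for (i = 0; i < SKETCH_MAX_ENTRIES; i++) {
        if (sketch_table[i].used && strcmp(sketch_table[i].path, path) == 0) {
            *out = sketch_table[i].digest;
            return 1;
        }
    }
    return 0;
}

/* Stores or updates an entry; silently drops it if the table is full. */
void sketch_db_store(const char *path, uint32_t digest)
{
    int i, free_slot = -1;
    for (i = 0; i < SKETCH_MAX_ENTRIES; i++) {
        if (sketch_table[i].used && strcmp(sketch_table[i].path, path) == 0) {
            sketch_table[i].digest = digest;
            return;
        }
        if (!sketch_table[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        strncpy(sketch_table[free_slot].path, path, SKETCH_MAX_PATH - 1);
        sketch_table[free_slot].path[SKETCH_MAX_PATH - 1] = '\0';
        sketch_table[free_slot].digest = digest;
        sketch_table[free_slot].used = 1;
    }
}

/* Returns 1 if the server-side file differs from the client's digest, else 0. */
int sketch_checksum_changed(const char *path, const unsigned char *data,
                            size_t len, uint32_t client_digest)
{
    uint32_t stored, computed;

    assert(path != NULL && data != NULL);

    /* Normal case: the database already has a checksum for this path. */
    if (sketch_db_lookup(path, &stored))
        return stored != client_digest;

    /* Lookup failed: compute and compare instead of assuming "changed". */
    computed = sketch_digest(data, len);
    sketch_db_store(path, computed);
    return computed != client_digest;
}
```

With this shape, a missing or wiped database costs one extra checksum computation per file rather than one extra copy per file.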

Again, I'm looking at 2.1.0p1. My apologies if you've already reworked this in 2.1.1. See my 2003-Dec-24 post for more details, including debug output.

Hopefully Frank's problem can be solved with an upgrade.   -Tod


I created a mini-config that duplicated the problem and then ran cfservd and cfagent in debug/verbose mode. It was reporting an MD5 mismatch, even though the source and dest files were the same. I used a standalone md5 program to compute the checksum and verified that it matched what was reported in the trace. I also db_dumped the database and verified that the filepath was present and the checksum matched.

While looking at syslog I noticed a lot of cfenvd errors complaining about the database. This led me to the conclusion that I had mixed up db versions. The cf programs were linked with db-3.3, but somewhere along the way I had done a test version linked with db-4.2 (while trying to solve the database corruption/crash problem). Of course cfengine treated my test version as damage and replaced it with the old versions, which then didn't like the database.

I have since compiled and relinked the programs against db-4.2 and put them into the distribution. The extraneous copies appear to have stopped.

However, I still believe that the checksum database access needs work. Sleepycat documentation states that you need to set up an environment element and provide that environment to all instances of db_create if you want to use multi-reader/single-writer operation. That will be a bit of work to set up. In the meantime I may just put a big pthread lock around the call to ChecksumChanged.
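
A rough sketch of the "big lock" stopgap, assuming a single process-wide mutex is acceptable. The function names and the stub body are hypothetical stand-ins, not the real ChecksumChanged; the stub only bumps a shared counter so the effect of the lock is observable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_MAX_THREADS 8
#define SKETCH_CALLS_PER_THREAD 10000

/* One process-wide lock serialising all access to the checksum database. */
static pthread_mutex_t checksum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared state touched by the stub; stands in for the database handle. */
static long unlocked_calls = 0;

/* Hypothetical stand-in for the real ChecksumChanged: not thread-safe. */
static int checksum_changed_unlocked(const char *path)
{
    (void)path;
    unlocked_calls++;          /* unsafe without the lock */
    return 0;
}

/* Wrapper that every caller would use instead of the unlocked function. */
int checksum_changed_locked(const char *path)
{
    int result;

    pthread_mutex_lock(&checksum_lock);
    result = checksum_changed_unlocked(path);
    pthread_mutex_unlock(&checksum_lock);
    return result;
}

static void *worker(void *arg)
{
    const char *path = arg;
    int i;

    for (i = 0; i < SKETCH_CALLS_PER_THREAD; i++)
        checksum_changed_locked(path);
    return NULL;
}

/* Runs up to SKETCH_MAX_THREADS workers and returns the calls recorded. */
long run_workers(int nthreads)
{
    pthread_t tids[SKETCH_MAX_THREADS];
    int i, started = 0;

    assert(nthreads > 0);
    if (nthreads > SKETCH_MAX_THREADS)
        nthreads = SKETCH_MAX_THREADS;

    unlocked_calls = 0;
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker,
                           (void *)"/usr/local/bin/nedit") == 0)
            started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return unlocked_calls;
}
```

The trade-off is that every checksum query is serialised, including ones that would only read, which is why a shared Berkeley DB environment would be the better long-term fix.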

Frank Ranner

Tod Oace wrote:

A couple weeks ago I posted a message about trouble I'm having with
type=checksum network copies occasionally firing off when files have
not changed on the server.

I'd be VERY interested to hear if you solve this one. I'm having the
EXACT same issue on one of my servers. The difference in my case is
that I'm not using a checksum database of any kind. All the checksums
get computed in real-time (server-side AND client-side).

Well that's disturbing/interesting. Yesterday I tried disabling the
checksum database on the server side and have still been seeing the
problems. So earlier today I disabled it on the client side and have
seen a couple more cases of it since then. I'm not sure if it's slowed
down any, but I'll know for sure tomorrow. I've been tracking one
particular problem for the past couple weeks and have a good baseline.

I've been beating my head against the wall on this for a while.

I'm glad I'm not the only one. I guess.  :)

I'll try and capture and analyze more cfservd debug output soon.

I am having the same problem. However it is happening every time on some
copies. Not only that, it then tries to save the file in
/var/spool/cfengine, and finds an entry already there. It then
recursively moves the saved files, and after a while I get files with
multiple instances of _var_spool_cfengine at the beginning and umpteen
.cfsaved extensions on the end.

I haven't looked into the problem yet. I only found it because 'locate'
was segfaulting. Doing `locate '*' | tail` showed the segfault occurring
after printing some of the overlong cfengine spool files.

It is interesting that the extraneous copies occur regardless of the
checksum database. I suspected that the problem was related to the
unsafe concurrent access to the checksum DB. It appears not. One of the
files that gets copied every time is nedit. The destination is
/usr/local/bin. There is definitely an entry for nedit in the checksum
database. The database can be examined using db_dump with the -p option
to show human readable output instead of hexified text.

Since the problem is solid for me I will try and duplicate it with the
smallest config file I can manage. Then I should be able to do full
debugging, trussing, network snooping, etc.

Frank Ranner

Help-cfengine mailing list
