Fri, 28 Nov 2003 13:35:15 +0100
We are running nscd from glibc-2.2.5 from the Debian/woody distribution. What
I observe is that sometimes a single hostname starts failing. It won't be
resolved through the gethostbyXXX functions. Nslookup still works though.
The failure sometimes exists for a long time (a few hours).
Since it is all 'sometimes' on varying hostnames, it is very hard to get
a grip on. But with 500 servers running, about 5-10 servers a day are
exhibiting the problem.
I know that there are occasional network hiccups, and I suppose that is where
the initial problem starts. But since the negative TTL for the host entries
in nscd.conf is 20 seconds, one would not expect the failure to last for hours.
I observed that nscd was still returning "host not found" while at the same
time nslookup returned the correct answers. Also, some applications became
pretty upset with the resolver error and started retrying and logging,
thus generating a steady stream of lookups. It seems that the latter
situation is not handled very well by nscd. The prune_cache code depends
on a poll() with a timeout of 0 returning 0. But when there is also a pending
event on one of the descriptors, this might not happen. In the kernel code
that I read, the descriptor events were handled first and only thereafter
was the timeout used. AFAIK, the specification of the poll system call does
not define the behaviour for this case.
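To make the poll() behaviour described above concrete, here is a minimal
sketch. It is not taken from nscd; the helper name and the use of a pipe as
the descriptor are assumptions made purely for illustration. It polls with a
timeout of 0 while one byte is already waiting to be read.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper (not part of nscd): report what poll() returns for a
   zero timeout while data is already pending on the descriptor.
   Returns -1 if the pipe could not be set up. */
int poll_zero_timeout_with_pending_data(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Make the read end readable before polling. */
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct pollfd conn = { .fd = fds[0], .events = POLLIN, .revents = 0 };
    int nr = poll(&conn, 1, 0);   /* timeout of 0: return immediately */

    close(fds[0]);
    close(fds[1]);
    return nr;
}
```

On the Linux systems I would expect this to return 1 rather than 0, so code
that reads "nr == 0" as "the timeout has expired" never sees the expiry
while lookups keep arriving.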
I used the following patch:
--- connections.c.org Fri Nov 28 09:09:57 2003
+++ connections.c Fri Nov 28 09:11:52 2003
@@ -443,7 +443,12 @@
-      int nr = poll (&conn, 1, timeout);
+      int nr;
+      if (timeout == 0)
+        nr = 0;
+      else
+        nr = poll (&conn, 1, timeout);
       if (nr == 0)
This seems to help; I haven't seen the problem for about a week now.
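The same idea can be written as a standalone function, which makes it easy
to try outside of nscd. This is only a sketch: the wrapper name is
hypothetical, and it assumes that a timeout of 0 always means the prune
deadline has already passed.

```c
#include <poll.h>

/* Hypothetical wrapper (not the nscd source): treat a zero timeout as
   "already expired" and skip poll() entirely, so that a pending event on
   the descriptor cannot hide the expiry. */
int poll_unless_expired(struct pollfd *conn, int timeout)
{
    if (timeout == 0)
        return 0;                 /* caller takes its timeout path */
    return poll(conn, 1, timeout);
}
```

With a readable descriptor, a call with timeout 0 now returns 0, so the
caller's "nr == 0" branch runs; the pending request is picked up by the next
call that has a non-zero timeout.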
- nscd bug?, Leo Weppelman