bug-glibc

DNS resolver problems when one nameserver is down


From: James Pearson
Subject: DNS resolver problems when one nameserver is down
Date: Tue, 30 Sep 2003 15:13:38 +0100

I've recently had a major problem when one of my internal DNS servers
went down and I'm trying to work out a way of improving the situation.

I have a network of mainly Red Hat 7.2-based machines, each of which has
an /etc/resolv.conf like:

domain my.domain
nameserver 1.2.3.4
nameserver 1.2.3.5
options rotate

The 2nd listed nameserver above crashed and _all_ my Linux clients had
problems resolving hostnames, which had a massive knock-on effect and
ground everything to a halt.

I'm now trying to get a better understanding of how the resolver works
and how I can improve matters if this happens again.

According to the resolv.conf man page, 'options rotate' should spread
the load amongst the nameservers, but in my subsequent tests this
doesn't happen. All it does is force the resolver to query the 2nd
nameserver first for _every_ lookup, so when the 2nd nameserver
crashed, every lookup timed out after 5 seconds before falling back to
the 1st nameserver. It appears that without the rotate option I would
have been OK when the 2nd nameserver went down (but not if the 1st one
had!).

Should the rotate option work with RH7.2 (glibc 2.2.4)?

I can improve matters by reducing the timeout to 1 second, but the
resolver code doesn't appear to be intelligent enough to notice that it
keeps timing out on the same nameserver on subsequent lookups.

I guess I could use something like nscd, but that still queries the
same nameserver first for any hostnames that are not already cached.

Is there something analogous to NIS's 'ypbind' for DNS lookups? i.e.
something like nscd that, instead of caching hostnames, caches the good
nameserver to use?

I had a look at the libresolv code and hacked something into res_nsend()
that reorders the nsaddr_list[] array so that the last successfully
used nameserver is tried first on subsequent lookups. This appears to
work OK: if I make a non-existent nameserver the first one listed in
/etc/resolv.conf (without 'options rotate'), then something like
'rup' times out after 5 secs on the first lookup, and subsequent
lookups use the second listed nameserver and work fine.

Of course, this initial timeout happens with each new application, but
if I use this modified libresolv with nscd, then it appears I can quite
happily resolve hostnames etc. without problems.

Is doing this likely to cause other problems? Is there a better way to
do this? 

Sorry if this is in a FAQ somewhere, but as it has always appeared to
work OK, I've never really had to think about this before ...

Thanks

James Pearson



