bug-glibc

cached binding problem in nis/ypclnt.c


From: Chris Barrera
Subject: cached binding problem in nis/ypclnt.c
Date: Sat, 26 Jul 2003 15:48:45 -0500
User-agent: Mutt/1.4i

Glibc version   : 2.2.5 (appears to be also in 2.3.2)
OS Platform     : Red Hat Linux 7.3 x86
Source file     : nis/ypclnt.c
functions       : __ypbind and do_ypcall
Impact          : Severe. We use Linux as an enterprise solution in
                  a dynamic, high-performance, load-sharing cluster
                  environment. This bug creates significant risk and
                  exposure to any NIS service instability.

Overview
========
When a NIS server crashes and restarts, NIS client calls to that server
fail continuously. Specifically, when USE_BINDINGDIR is defined, the client
code has no way to tell the ypbind process to obtain a new binding.

Background
==========
When USE_BINDINGDIR is defined, the NIS client, via __ypbind() in nis/ypclnt.c,
attempts to get its binding information (server, port, RPC client) from the
file "/var/yp/binding/DOMAIN.YPBINDVERS". This file is managed by the ypbind
process.

The problem occurs when the current NIS server referenced by that binding
crashes and restarts. The port and rpc client information is no longer
valid. Any use by the NIS client code via do_ypcall() will fail and return
errors. do_ypcall() will make 2 such attempts on a failure to contact the
NIS server, but even on the second attempt, it calls __ypbind() which still
reads the same invalid information from the binding file. 

__ypbind() does have a provision for forcing a rebinding by making a local
RPC call to the ypbind rpc server process, but that code is never executed
in this situation, because the data read from the binding file is assumed
correct and not validated (it probably shouldn't have to be validated at
that specific code location). The end result is that the glibc nis does not
have the ability on its own to force a new binding.

This problem is mitigated to some extent by the behavior of some
implementations of ypbind (e.g. ypbind-mt), where the ypbind process
actively validates the binding against the ypserver every 20 sec and
rebinds if there is a problem. However, there is still that 20 sec window
where calls fail and processes can experience major problems. Additionally, at
our local site, we have so many clients in production that this bind testing
adversely affects the performance of our NIS servers and does not scale to
a large installation of Linux systems. We would like to turn off the 20
second checks so that ypbind behaves more like Sun's, where it only revalidates
the binding when there is a problem. However, doing so increases the risk
and exposure to ypserv crashes, since the NIS client does not recover well
on its own.

Describing the problem in more detail:

0. The ypbind process creates a binding and stores it in
   /var/yp/binding/domain.vers.

1. The ypserver referenced by the binding crashes and is restarted
   (via server host reboot or re-invoked after being killed).

2. Any NIS table lookup occurs, say for example, we do a "ypmatch".
   a. Eventually, do_ypcall in ypclnt.c is invoked.
   b. do_ypcall needs the binding, and invokes __ypbind().
   c. __ypbind() reads the file-cached binding from BINDINGDIR
      and gives that info back to do_ypcall().
   d. do_ypcall() makes the RPC call to the server.
   e. If the call fails, it calls __yp_unbind() or yp_unbind_locked()
      to undo the binding and calls __ypbind() one more time.
   f. Since the unbind routines do not affect the file-cached binding,
      __ypbind() reads the same information again and returns it to
      do_ypcall().
   g. do_ypcall() makes the RPC call to the server again. If it fails
      again, it returns an error back to the calling module.
   h. The calling process dies or misbehaves.

3. If the local ypbind is killed and restarted, it forces a rebind, and
   things work again, or when ypbind-mt does its 20 sec ping check, it
   rebinds, and all is well.

There are several potential ways to resolve this. I propose one solution
that is concise, only adds 3-5 lines of C code in one spot, and appears
safe. Other solutions may require changes to both modules and some
additional state saved between calls, but I leave it up to the code
maintainers to decide whether and how this should be addressed.

The following code, when added to do_ypcall() in nis/ypclnt.c, appears to
completely resolve the situation. It essentially does this:

   a. do_ypcall() makes the RPC call to the ypserver.

   b. If the call fails,
      i. it calls __yp_unbind() or yp_unbind_locked()
         to undo the binding
      ii. [NEW CODE HERE]
             do_ypcall() removes the
             /var/yp/binding/domain.vers cached binding file.
      iii. do_ypcall() calls __ypbind() one more time to get a new binding.

   c. __ypbind() attempts to read /var/yp/binding/domain.vers and fails.

   d. __ypbind() continues on and makes an RPC call to YPBINDPROC_DOMAIN on
      the local ypbind process.

   e. ypbind tests its bindings and rebinds, updating /var/yp/binding/...

   f. __ypbind() returns this binding back to do_ypcall().

   g. do_ypcall() makes its second attempt contacting the ypserver and
      succeeds.

   h. do_ypcall() returns the results to the calling program.
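
The difference between the current flow and the proposed flow can be
illustrated with a small toy model. This is only a sketch: every name in it
(sim_state, sim_ypbind, sim_clnt_call, sim_do_ypcall) is hypothetical and
none of it is glibc code. It assumes that ypbind, when asked over RPC,
revalidates and rewrites the binding file as described in step e.

```c
#include <stdbool.h>

/* Toy model of the retry logic. All names are hypothetical. */
struct sim_state
{
  int server_port;     /* port the restarted ypserv listens on now */
  int cached_port;     /* port recorded in the binding file */
  bool cache_present;  /* does the binding file exist? */
};

/* Models __ypbind(): trust the binding file if it exists, otherwise
   ask the local ypbind daemon, which is assumed to revalidate the
   binding and rewrite the file. */
static int
sim_ypbind (struct sim_state *s)
{
  if (s->cache_present)
    return s->cached_port;

  s->cached_port = s->server_port;
  s->cache_present = true;
  return s->cached_port;
}

/* Models clnt_call(): succeeds only when the port is current. */
static bool
sim_clnt_call (const struct sim_state *s, int port)
{
  return port == s->server_port;
}

/* Models do_ypcall(): two attempts, optionally removing the cached
   binding after a failed attempt (the proposed change). */
static bool
sim_do_ypcall (struct sim_state *s, bool remove_cache_on_failure)
{
  for (int try = 0; try < 2; ++try)
    {
      int port = sim_ypbind (s);

      if (sim_clnt_call (s, port))
        return true;

      /* The in-memory unbind has no effect on the file, so there is
         nothing to model for it here. */
      if (remove_cache_on_failure)
        s->cache_present = false;
    }

  return false;
}
```

With a stale cached_port and remove_cache_on_failure set to false, both
attempts use the same stale port and the call fails. With it set to true,
the second attempt goes through the daemon path and succeeds.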


Code excerpt, showing where the new code is in the existing source:

[EXISTING CODE in do_ypcall() routine in nis/ypclnt.c]
....
      result = clnt_call (ydb->dom_client, prog,
                          xargs, req, xres, resp, RPCTIMEOUT);

      if (result != RPC_SUCCESS)
        {
          /* Don't print the error message on the first try. It
             could be that we use cached data which is now invalid. */
          if (try != 0)
            clnt_perror (ydb->dom_client, "do_ypcall: clnt_call");

          if (use_ypbindlist)
            {
              /* We use ypbindlist, and the old cached data is
                 invalid. unbind now and create a new binding */
              yp_unbind_locked (domain);
              __libc_lock_unlock (ypbindlist_lock);
              use_ypbindlist = FALSE;
            }
          else
            {
              __yp_unbind (ydb);
              free (ydb);
            }

[NEW CODE ADDED HERE]
          /* Nuke the cached binding in BINDINGDIR because __ypbind()
           * will just give us back the binding that failed. The
           * removal will force __ypbind() to make an RPC call to
           * YPBINDPROC_DOMAIN in the localhost's ypbind process, which
           * will update the cached binding with something that
           * works. Otherwise, we must depend upon the YPBIND daemon
           * to eventually figure it out, depending on how often it
           * may or may not test its bindings ( time <= ping_interval
           * secs in ypbind-mt). Until YPBIND figures it out, there
           * is a window where the NIS client calls will fail. */
#if USE_BINDINGDIR
          {
             char path[sizeof (BINDINGDIR) + strlen (domain) + 10];
             snprintf (path, sizeof (path), "%s/%s.%d",
                       BINDINGDIR, domain, YPBINDVERS);
             unlink (path);
          }
#endif /* USE_BINDINGDIR */
[NEW CODE ENDS HERE]
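
For testing the idea outside of glibc, the removal step can be written as a
standalone helper. This is a minimal sketch, not part of the proposed patch:
remove_cached_binding() is a hypothetical name, the directory and version are
passed in as parameters instead of using the BINDINGDIR and YPBINDVERS macros,
and the fixed 4096-byte buffer is an arbitrary choice. Unlike the patch, it
reports errors, but it treats an already missing file as success, since
another process may have removed the file first.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: remove DIR/DOMAIN.VERS.
   Returns 0 if the file is gone afterwards, -1 (with errno set)
   otherwise. */
static int
remove_cached_binding (const char *dir, const char *domain, int vers)
{
  char path[4096];
  int n = snprintf (path, sizeof (path), "%s/%s.%d", dir, domain, vers);

  /* Refuse to act on a truncated path. */
  if (n < 0 || (size_t) n >= sizeof (path))
    {
      errno = ENAMETOOLONG;
      return -1;
    }

  /* A missing file is fine: the goal is only that __ypbind() cannot
     read a stale binding from it. */
  if (unlink (path) != 0 && errno != ENOENT)
    return -1;

  return 0;
}
```

A call such as remove_cached_binding ("/var/yp/binding", domain, 2) would
correspond to the patch above, assuming the default binding directory and
ypbind protocol version 2.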





