Re: [Sks-devel] [PATCH] auto-refresh membership DNS


From: Phil Pennock
Subject: Re: [Sks-devel] [PATCH] auto-refresh membership DNS
Date: Sun, 22 Mar 2009 16:46:17 -0700

On 2009-03-22 at 13:45 +0000, Kim Minh Kaplan wrote:
> I see what you mean here...  Except that periodic DNS lookups are *not*
> The Right Thing.  This is one area where I think SKS got it wrong: it
> should call out to the resolver each time it needs to connect to a
> server and let the caching happen in normal ways (DNS TTL).  Please have
> a look at my other message "Keep DNS mappings fresh"[1]

Oh, right, sorry -- I forgot about that, because it was incompatible
as-is with the IPv6 work and so I didn't apply it.

> With my patch the "additional load" is bigger but it will still be
> minuscule when compared to the rest of the traffic needed for the
> reconciliation protocol anyway.  This is not where we should look for
> optimization.

You're quite right -- relying on a functioning DNS cache is the correct
way to go (but see below) -- I was going for the quick and easy solution
(and feeling uncomfortable while doing so).

However, either the membership_reload_interval option needs to be
completely removed or the reconserver.ml needs to support it -- leaving
it as a dbserver-only option seems sub-optimal.

Call it paranoia resulting from maintaining mail-server code in previous
employment, where mtime collisions in a cluster were possible so relying
on mtime-changed was a bad plan.

So, "sks-mshp-timed2.patch" should probably go in -- it fixes the
mailsync reload (as both those patches do) and adds the event handler
for reload.  With the default reload interval of 5 hours, the load
addition is minimal (understatement) and the benefit is that you gain
assurance that the file change *will* be picked up.  Eventually.

> OTOH if the membership reload takes more than the gossip_interval and
> reconciliation_config_timeout setting (typically one minute) then the
> loading never finishes and the server never reconciles.  It happened to
> me when three of my partners' nameservers went out of service.  Making
> the lookup as needed solves this problem.

Yes, the reliance upon functional DNS is good.  But not the looking up
in the main flow of control.  This is where things get very sticky very
quickly.

As is, doesn't your patch turn a recon connection from a non-peer,
arriving while one of your peers is without DNS, into a mini-DoS attack?
So once you have a peer with bad DNS, you become susceptible to recon
service DDoS?

When there's no DNS for a peer, and you try to find the DNS, then the
local DNS cache will respond quickly for a period of time which is the
negative cache TTL imposed by that server for SERVFAIL caching.  So you
go from the old scenario, where you're hung up once every
membership_reload_interval/mtime-changed period, which is O(hours), to
being hung up at every negative cache-entry TTL expiry, which is O(minutes).

Provided you only get recon connections from peers, this only bites when
you get gossip from a peer with bad DNS.  Which isn't going to be too
often, but still more often than the old reload interval.  If you also
get recon connections from non-peers, suddenly your recon thread is hung
up at the whim of anyone willing to issue a connection every few
minutes.  Fortunately, the level of impact only scales up with the
number of peers with bad DNS, so you'll still *mostly* be serving.

Thus while your patch is clearly trying to do the right thing, I think
it's a step backwards in resilience.  (One more than offset by your
memory usage stability patch, but still ...)

The clearest way out of this is to require dbserver/reconserver to have
event handler callbacks for DNS, use asynchronous DNS callback
resolution; populate membership with None entries and at load/reload
fire off lookup for these.  During connection check, if an IP entry is
None and the last reload was more than N seconds ago (a !Settings knob,
defaulting to 3?) then (1) fire off another async DNS resolution and then
(2) return failure immediately, so that the peer gets penalised for
flaky DNS and your server isn't hanging in the main flow of control.
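
To make that flow concrete, here is a rough sketch of the idea, in Python
rather than O'Caml and with a thread pool standing in for real async DNS
callbacks.  Every name in it (Membership, is_peer, retry_after, the
injectable resolver and clock) is made up for illustration; none of it is
SKS code, and I'm assuming the check is "does this IP belong to a peer".

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def system_resolver(host):
    """Blocking lookup; only ever runs on a worker thread."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


class Membership:
    """Hypothetical peer table whose lookups never block the caller."""

    def __init__(self, hosts, resolver=system_resolver,
                 retry_after=3.0, clock=time.monotonic):
        self.resolver = resolver
        self.retry_after = retry_after          # the "N seconds" knob
        self.clock = clock
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.addrs = {h: None for h in hosts}   # None = not resolved (yet)
        self.last_try = {}
        self.pending = {}
        for host in hosts:                      # load/reload: fire lookups
            self._fire_lookup(host)

    def _lookup(self, host):
        try:
            self.addrs[host] = set(self.resolver(host))
        except OSError:
            pass                                # entry stays None

    def _fire_lookup(self, host):
        fut = self.pending.get(host)
        if fut is not None and not fut.done():
            return                              # one lookup in flight is enough
        self.last_try[host] = self.clock()
        self.pending[host] = self.pool.submit(self._lookup, host)

    def is_peer(self, ip):
        """Connection check: answers from what is known right now."""
        now = self.clock()
        for host, addrs in list(self.addrs.items()):
            if addrs is None:
                # (1) re-fire the lookup if the last try is old enough,
                # (2) but never wait for it here.
                if now - self.last_try[host] > self.retry_after:
                    self._fire_lookup(host)
            elif ip in addrs:
                return True
        return False
```

The point of the shape is that is_peer only ever reads the table, so a
peer with broken DNS costs it a failed check rather than a stalled recon
thread.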

The gotcha here is async DNS support in O'Caml.

I found an announcement for an O'Caml async DNS library called netdns:
  http://groups.google.com/group/fa.caml/browse_thread/thread/7bd2ae0a9415340d?pli=1
  http://oss.wink.com/netdns/
which is BSD-licensed and at version 0.1.  Its main documented
incompleteness is that it requires a full resolver -- which is what we
want here anyway.

In addition, from looking at it: it doesn't support AAAA records, it uses
incremental xids and I think it's using a constant source port, so you'd
really want to be using a localhost resolver; even then, since it's not
matching source port, you're vulnerable.  I don't think this library is
ready for use.
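
For comparison, this is roughly what I'd expect a stub resolver to do
instead.  It's a Python sketch, not netdns code and not an O'Caml API;
the function names are invented here.  It picks an unpredictable xid per
query, lets the OS choose the source port, and only accepts a reply that
carries the same xid and comes from the server that was asked.

```python
import secrets
import socket
import struct


def build_query(name, qtype=28, xid=None):
    """Return (xid, packet) for a recursive query.

    qtype 28 is AAAA and 1 is A; the xid is random, not incremental.
    """
    if xid is None:
        xid = secrets.randbelow(1 << 16)
    # Flags 0x0100 = recursion desired; one question, no other records.
    header = struct.pack("!HHHHHH", xid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(l)]) + l for l in labels) + b"\x00"
    return xid, header + qname + struct.pack("!HH", qtype, 1)  # class IN


def open_query_socket():
    """Fresh UDP socket per query, bound to port 0 so the OS picks the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 0))
    return sock


def reply_matches(packet, xid, sender, expected_server):
    """Accept only a response with our xid from the server we queried."""
    if len(packet) < 12 or sender != expected_server:
        return False
    reply_xid, flags = struct.unpack("!HH", packet[:4])
    return reply_xid == xid and bool(flags & 0x8000)  # QR bit = response
```

How unpredictable the port from bind-to-0 really is depends on the OS, so
this narrows the spoofing window rather than closing it.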

There's then "adns", which is GPL'd, with bindings in some languages but
not O'Caml (does include Haskell); "c-ares" which is BSD licensed, does
include IPv6 support, is widely used and I'm pretty sure it will have
dealt with the xid/port attacks.  There are various smaller libraries
too, which don't manage to keep their websites working.  But all of
these options will require wrapping the C calls with O'Caml and tying
into the event system -- I frankly lack the knowledge in O'Caml to even
estimate how much work this is.

So, in short, you've bitten off a bigger problem.
-Phil
