chicken-hackers

Re: [Chicken-hackers] [PATCH] Avoid context switch during TCP errno repo


From: Jörg F . Wittenberger
Subject: Re: [Chicken-hackers] [PATCH] Avoid context switch during TCP errno reporting
Date: 20 Mar 2013 14:27:41 +0100

Hi all,

I'm not yet convinced that this patch will fix everything screwed
up by use of the tcp implementation.

Over the past few days I wrote a replacement for my own use.  (It is
a bit incomplete wrt. API compatibility with the tcp unit, and it is
thrown into a module I use to drive the SSL/TLS implementation I have
been using for the past couple of years; hence the module contains
that code too, plus some SOCKS4a setup code for use with Tor...  If
anybody is interested I'll forward the code or post it here, as you
guys like.)

During the development I learned that Peter is *absolutely
correct* about the "strange error message" I needed help to
interpret a few days back.  (The logged error indicated that a
type test, which happened to read (struct <uri>), failed although
the rejected value had the same type as the required one.)

This is obviously due to some stack corruption.
The same appears to apply to several other errors I observed
all of a sudden, most prominently "heap full while resizing".

With the tcp replacement code, all those frequent errors are
suddenly gone.  The code runs as stable as before now.

However I found a way to reliably trigger the problem anyway.
(Just run PLT's dns resolver code against a bind server
AND close the port underneath.  That is, run a slightly modified
version which re-uses the tcp connection.)
I'll explore this in the next few days.  For the time being I'm not
re-using the tcp connection.


The code I wrote however is a major deviation from the existing
tcp code internally.

1.)  No timeout parameters.  (At least not at the lowest level.)
    Why?

 The Askemos/BALL code implements replication of sqlite3 databases
 and files in a way similar to bittorrent.  This type of p2p
 network application is subversive.  You have to deal with all
 sorts of failures in the network, plus hostile clients.

 a) In such a context it's little fun to maintain the timeout
    on a call-by-call basis.
 b) You want to have all sorts of different timeouts.  E.g.,
    wait for the HTTP keep-alive time for the next request line,
    wait a reasonably short amount of time for the next chunk
    in chunked encoding, even less for the next chunk header line.
 c) Almost all timeouts never kick in.  Thus the overhead of
    inserting them into the timeout queue just to remove them
    a fraction of a second later turns out to be expensive
    and a huge slowdown for the overall i/o throughput.

    This is even true with the scheduler improvements I posted
    here (or at chicken-users ?) before, which would replace
    the linear list for timeouts with an LLRB tree.

    Therefore I'm using a different timeout handling, where
    timeouts are inserted into a mailbox and the entry is
    kept at the call site.  Instead of removing the timeout
    from the full list, the entry is invalidated.  Once a second
    the timeout queue/mailbox is replaced with a fresh one,
    and in the next run those timeouts which were not yet
    invalidated are actually made active.  Rather complicated
    to describe, but much, much faster to execute.
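
The mailbox scheme described above can be sketched roughly as follows.
This is not the code from the message (which is written for CHICKEN and
was not posted); it is a minimal Python illustration, and the names
LazyTimeouts, TimeoutEntry, add and tick are hypothetical.  It assumes
the scheduler calls tick about once a second, and that a heap is an
acceptable stand-in for the real timeout queue.

```python
import heapq
import itertools


class TimeoutEntry:
    """Handle kept at the call site; cancelling only clears a flag."""
    __slots__ = ("deadline", "callback", "valid")

    def __init__(self, deadline, callback):
        self.deadline = deadline
        self.callback = callback
        self.valid = True

    def invalidate(self):
        # O(1): nothing is searched for or removed from any queue.
        self.valid = False


class LazyTimeouts:
    def __init__(self):
        self.mailbox = []               # cheap append-only inbox
        self.active = []                # heap of (deadline, seq, entry)
        self._seq = itertools.count()   # tie-breaker for equal deadlines

    def add(self, deadline, callback):
        entry = TimeoutEntry(deadline, callback)
        self.mailbox.append(entry)      # O(1), no ordered insert yet
        return entry

    def tick(self, now):
        """Swap the mailbox, activate survivors, fire what is due."""
        pending, self.mailbox = self.mailbox, []
        for entry in pending:
            if entry.valid:             # most entries are already dead here
                heapq.heappush(
                    self.active, (entry.deadline, next(self._seq), entry))
        fired = 0
        while self.active and self.active[0][0] <= now:
            _, _, entry = heapq.heappop(self.active)
            if entry.valid:             # may have been invalidated later
                entry.valid = False
                entry.callback()
                fired += 1
        return fired
```

The price of this design is granularity: a timeout can fire late by
roughly the tick interval, which seems acceptable for the kind of
network timeouts listed under b).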

2.)  The low-level code structure is kept more akin to the
    way it's handled in RScheme, because this avoids those
    tricks to distinguish ports by their port-data to
    eventually figure out the tcp addresses.

3.)  Avoid passing DNS names to tcp-connect.  It depends on the
    obsolete (as per the Linux manual at least) gethostbyname,
    which could block the threading for too long.
    Do a DNS host lookup instead.

4.)  Don't duplicate code from library.scm ##sys#thread-yield!
    to "yield".  Use srfi-18 thread-yield! instead.
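
To illustrate point 3.) in a language-neutral way: the idea is to
separate name resolution from connecting, so that the connect call
only ever sees a numeric address.  The following Python sketch is an
assumption about the shape of such code, not the implementation from
this message; resolve, resolve_async and connect_numeric are
hypothetical names.  Python's getaddrinfo blocks as well, so the
sketch pushes it onto a worker thread; with CHICKEN's green threads
that option is not directly available, and a resolver that speaks the
DNS protocol over non-blocking sockets would play this role instead.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Small pool so a slow lookup never stalls the caller's main loop.
_resolver_pool = ThreadPoolExecutor(max_workers=4)


def resolve(host, port):
    """Blocking lookup; returns a list of (family, sockaddr) pairs."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family, sockaddr)
            for family, _type, _proto, _canonname, sockaddr in infos]


def resolve_async(host, port):
    """Run the lookup in the pool; returns a Future."""
    return _resolver_pool.submit(resolve, host, port)


def connect_numeric(family, sockaddr, timeout=None):
    """Connect to an already resolved address; no DNS work happens here."""
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock
```

A caller would wait on the Future (with its own timeout policy) and
then try the returned addresses in order with connect_numeric.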

Best

/Jörg

PS/BTW: in "extras" read-line there is a local definition
"fixup", which is unused.


On Mar 18 2013, Jim Ursetto wrote:

Here's a full patch to avoid context switches screwing up the error message
reported to the user, and also consolidates much of the error handling.

I think this patch is sufficient because the only actual issue, as I understand it, is that under high load you will occasionally get an incorrect error message (typically, "operation in progress") instead of the real error message; an exception will still fire regardless. Disabling interrupts instead is probably overkill, unless you know that won't cause hangs.

Also the patch doesn't do any harm and cleans up the code a bit, so you
can still apply a different fix on top of it.

Jim





