
Re: [Chicken-hackers] [PATCH] Avoid context switch during TCP errno repo


From: Jörg F . Wittenberger
Subject: Re: [Chicken-hackers] [PATCH] Avoid context switch during TCP errno reporting
Date: 20 Mar 2013 20:12:18 +0100

On Mar 20 2013, Peter Bex wrote:

On Wed, Mar 20, 2013 at 02:27:41PM +0100, Jörg F. Wittenberger wrote:
Hi all,

During the development I learned that Peter is *absolutely
correct* about the "strange error message" I needed help
interpreting a few days back.  (The logged error indicated
that a type test, (struct <uri>) as it happened to read, failed
even though the offending type was the same as the required one.)

This is obviously due to some stack corruption.
The same appears to apply to several other errors I observed
all of a sudden.  Most prominent among them: "heap full while resizing".

With the tcp replacement code, all those frequent errors are
suddenly gone.  The code now runs as stably as before.

Is your replacement code also using select()?  It was pointed out that
tcp is still using select(), which is susceptible to a buffer overrun
just like the scheduler.  This will happen in any program which has
a sufficient number of open file descriptors and result in strange
random-seeming errors.

Not at all.  I forgot to mention: it relies only on
##sys#thread-block-for-i/o! and skips the single-fd ##net#select-write stuff entirely
for the sake of fairer scheduling.
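
For context, the select() overrun Peter describes comes from fd_set
being a fixed-size bitmap: FD_SET on a descriptor numbered FD_SETSIZE
or higher writes past its end.  A minimal C sketch of a guard follows.
The helper names fd_fits_in_fd_set and checked_fd_set are made up for
illustration and are not taken from tcp.scm or the scheduler.

```c
#include <sys/select.h>

/* Hypothetical helper: 1 if fd can be stored in an fd_set without
   writing outside the bitmap, 0 otherwise. */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: add fd to the set only when it fits.
   Returns 0 on success, -1 if the descriptor number is out of range
   (the case that would silently corrupt memory with a bare FD_SET). */
static int checked_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A program with many open descriptors hits the -1 branch as soon as one
of them is numbered FD_SETSIZE or above, which is why the failures look
random and load-dependent.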

I'll need some time to dig in and see why this is being done and
if we need to replace the select() with poll() like we did in the
scheduler or whether the select() stuff can be completely ripped
out of the tcp unit, and rely on the scheduler.

Save your time.  The tcp.scm diff I posted before has been in use
for several months at least.  About as long as my first posting
to the list under a subject line like "poll works somewhat".
Since then I've been running with poll instead of select.
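
As a rough illustration of what waiting with poll instead of select
looks like for a single descriptor, here is a C sketch.  wait_writable
is a hypothetical helper, not the code from the tcp.scm diff.  poll()
takes an array of descriptors rather than a bitmap, so it has no
FD_SETSIZE limit.

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
   become writable.
   Returns 1 if writable, 0 on timeout, -1 if poll() failed or the
   descriptor reported only an error condition (POLLERR, POLLHUP,
   POLLNVAL) without POLLOUT. */
static int wait_writable(int fd, int timeout_ms)
{
    struct pollfd p;
    int rc;

    p.fd = fd;
    p.events = POLLOUT;
    p.revents = 0;

    rc = poll(&p, 1, timeout_ms);
    if (rc < 0)
        return -1;              /* errno is set by poll() */
    if (rc == 0)
        return 0;               /* timed out, nothing ready */
    return (p.revents & POLLOUT) ? 1 : -1;
}
```

A caller that gets -1 with errno set to EINTR would normally just
retry the wait.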

Nevertheless the tcp code is currently my suspected reason for all
this stack corruption.  Though let me dig deeper.  I'll post my
results once I know precisely why the hell I can still almost
reliably trigger the "[panic] out of memory - heap full while
resizing - execution terminated".  (Though I'm glad that this is
down to once every 2-5h instead of at least once every 20min.)

But I'm not yet 100% sure that this is really the reason.

Unfortunately debugging this heavily threaded and network-i/o
loaded stuff is really not possible with synthetic tests.
I can't trigger it on my laptop, whatever I try.  Not even httperf
will help.  But I can trigger it on an ARM plug, which gets some
traffic from search engines AND sits behind a customer-grade ADSL
line so far.  There it's pretty reliable within a few minutes.
Sigh: if only I had set up my environment for cross-compilation.


CU

/Jörg


Cheers,
Peter



