gnunet-developers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[GNUnet-developers] Re: lockup


From: Christian Grothoff
Subject: [GNUnet-developers] Re: lockup
Date: Wed, 19 Nov 2003 17:20:47 -0500
User-agent: KMail/1.4.3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

My comments are below.

On Tuesday 18 November 2003 06:15 pm, Igor wrote:
> The gnunetd deadlock looks like this
>
> (gdb) bt
> #0  0x4009c0d8 in send () from /lib/libpthread.so.0
> #1  0x400334e9 in SEND_BLOCKING_ALL (s=18, buf=0x827078c, len=1048) at
> io.c:284
> #2  0x400393b6 in writeToSocket (sock=0xbe9ffa64, buffer=0x827078c)
>     at tcpio.c:206
> #3  0x4021de99 in tellClient3HashReply (sock=0x2d7, hc=0x1d1,
> result=0x2d7)
>     at routing.c:463
> #4  0x4021f341 in execSingleQuery (sender=0x0, sock=0xbe9ffa64, prio=11,
>     ttl=727, query=0x9d, superHash=0) at routing.c:1115
> #5  0x4021ffbf in execQuery (qp=135208836, msg=0x80f1f64, sock=0xbe9ffa64)
>     at routing.c:1492
> #6  0x4021864e in csHandleRequestQuery (sock=0x2d7,
> queryRequest=0x8248d2c)
>     at handler.c:373
> #7  0x080547e8 in processHelper (msg=0x8248d2c, sender=0x2d7)
>     at tcpserver.c:148
> #8  0x080548ac in processData (sockDescriptor=18) at tcpserver.c:174
> #9  0x40095e31 in pthread_start_thread () from /lib/libpthread.so.0
> #10 0x40095eaf in pthread_start_thread_event () from /lib/libpthread.so.0
> #11 0x401b830a in clone () from /lib/libc.so.6
>
> gnunetd has apparently lost all network connections and is
> not accepting gnunet-stats call (hangs). The other threads
> look normal. The respective gnunet-gtk has no search
> open (has been closed), 4 large dls pending at 99%. May
> be just the bug reported by Markku Tavasti.
>
> I included all the thread bts as an attachment.

Very nice.  gnunet-gtk thread 7 shows signs of despair that can not be 
decoded, which is said since we may very well have the culprit right there.

Let me explain what I see.

gnunetd is written to 'trust' the TCP client (here: gnunet-gtk) in that they 
will always (!) read whatever gnunetd writes to the TCP pipe.  If they don't, 
gnunetd may block indefinitely (until the client exits).  This is what we see 
in the gnunetd traces: gnunetd wrote a result to the pipe and gnunet-gtk 
never reads it.  Since the gnunetd thread is holding some locks, this 
eventually blocks the peer entirely.  This will 
a) stall downloads from other peers
b) stall all local downloads
c) be reverted to normal operation once gnunet-gtk is killed and the TCP 
stream is closed.

So all of this is consistent with all of our reports.  It raises the question 
if we should re-visit the decision to 'trust' the TCP clients to this extent.
The problem with not trusting the TCP clients to not block is that the 
communications become unreliable which would break major parts of the 
protocol and make both clients and gnunetd much harder to write, debug and 
understand.  A light-weight alternative might be to have a policy that 
gnunetd 'must not' hold any 'global' locks while doing blocking operations 
with clients.  That is more difficult to enforce but would keep it simple.  
In the example, gnunetd would have to use a separate thread (per download) to 
send the results to gnunet-gtk and consequently never block more than that 
thread if gnunet-gtk stops reading.

Secondly, we clearly have some bug here in gnunet-gtk (see Igor's backtraces, 
attached).  Anyway, let's start with the gnuentd traces.   They fail to show 
what the gnunet-tcpserver (listen) thread is doing, which is why we can not 
really tell why it does not process gnunet-stats or gnunet-gtk's cron-"are 
you there"-request.  Since that cron-request blocks gnunet-gtk's cron (which 
could cause other problems) I'd have really liked to know why tcpserver is 
blocked. Oh well. What we can see is that gnunetd thread 10 while holding at 
least some routing-lock is in the blocking write to gnunet-gtk (I'm just 
guessing where the socket goes here, but  I kind of doubt Igor had anything 
else running). So what is gnunet-gtk doing?

Threads 4, 6, 8, 10, 12, 13 and 14 are trying to do a read.  Since they should 
hold no relevant locks at this point and not be responsible for doing 
anything else, we can probably idemnify them from any guilt. Thread 1 and 2 
are also innocent by virtue.  3 is more critical, since we don't really want 
cron to ever block indefinitely.  Now, I can't think of anything that cron 
should do (other than trigger additional requests) that would possibly 
unblock gnunetd, so it's probably ok.  Nevertheless, the 'checkDaemonRunning' 
as a synchronous cron-job should probably be revisited (extra thread?).

That leaves threads 5, 7 and 9 about which the trace sadly only reveils that 
they are blocked -- but we don't see any trace.  Considering the addresses on 
the stack, I suspect they may just be zombies kept alive by gdb (we have a 
couple of these in the gnunetd traces, too!).

Hmm, so this is bad.  We have a socket that, according to Igor's report, was 
opened by gnunet-gtk (since killing gtk solves the problem), that deadlocks 
gnunetd (with all of the dreaded consequences) and for which I can not easily 
pin-point which gnunet-gtk thread should have been reading from it to unlock 
gnunet-gtk.  

Interesting.

Comments, remarks, patches, dead-threads, etc. welcome.

:-)

Christian


> gnunet-gtk was also hung in the sense that it hung
> when I quit it and didn't exit (one of the stack
> traces clearly show that it was already locked up on
> the read checking the gnunetd up status).
>
> Killing gnunet-gtk with SIGKILL got gnunetd running
> again and answering the stats query. gnunetd also
> spewed a pseudointeresting list of stuff,
>
> Nov 19 01:11:47 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4346334 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:11:47 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4166692 from 0E03810E5DA409B42FEFB871303121DEDF2AFFB3
> Nov 19 01:11:47 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4411892 from 0E03810E5DA409B42FEFB871303121DEDF2AFFB3
> Nov 19 01:11:47 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4435332 from 0E03810E5DA409B42FEFB871303121DEDF2AFFB3
> Nov 19 01:11:47 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4372859 from 0E03810E5DA409B42FEFB871303121DEDF2AFFB3
> Nov 19 01:11:47 DEBUG: received content
> 33C1B8312CD45019EAE61D83FEB9532630686EDB from peer
> 0E03810E5DA409B42FEFB871303121DEDF2AFFB3
> Nov 19 01:11:47 DEBUG: no matching query pending for content (not
> indirected)
> Nov 19 01:11:47 DEBUG: received content
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D from peer self
> Nov 19 01:11:47 DEBUG: content is not new.
> Nov 19 01:11:47 DEBUG: received content
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D from peer self
> Nov 19 01:11:47 DEBUG: content is not new.
> Nov 19 01:11:47 DEBUG: received content
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D from peer self
> Nov 19 01:11:47 DEBUG: content is not new.
> Nov 19 01:11:47 DEBUG: received content
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D from peer self
> Nov 19 01:11:47 DEBUG: content is not new.
> Nov 19 01:11:47 DEBUG: received content
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D from peer self
> Nov 19 01:11:47 DEBUG: content is not new.
> <and same line repeated n times>
> Nov 19 01:12:02 INFO: ideal: 9600 bpm, previously transmitted: 67108864
> bpm, will transmit: YES - my limit: 9600 bpm
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4391077 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 INFO: received 1-query
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D with ttl 4379108 and priority 0
> Nov 19 01:12:02 INFO: GROW - equal existing query exists without replies
> (-16689, 0)
> Nov 19 01:12:02 DEBUG: 0 slots free in routing table for query with 1 hash
> codes, none forwarded
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4373361 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 INFO: received 1-query
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D with ttl 4358739 and priority 0
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-41, 0)
> Nov 19 01:12:02 DEBUG: slots free in routing table, forwarded 1 out of 1
> queries
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4372490 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 INFO: received 1-query
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D with ttl 4358719 and priority 0
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-2, 0)
> Nov 19 01:12:02 DEBUG: slots free in routing table, forwarded 1 out of 1
> queries
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4388999 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 INFO: received 1-query
> 8B329DC3A15FF37E1E23438F2316F4F0088C624D with ttl 4376869 and priority 0
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 INFO: REPLACE (seen was empty): existing query and TTL
> higher (-26177, 0)
> Nov 19 01:12:02 DEBUG: recq E30065F18040A0715978ADEEC1C551D2BDD078B2 ttl
> 1341449 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4322848 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4304553 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4313046 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4324897 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> Nov 19 01:12:02 DEBUG: recq 8B329DC3A15FF37E1E23438F2316F4F0088C624D ttl
> 4282784 from 6CEB45E0D57DA750C25E3A71C2C6BFF3863B7CC8
> <cut>
>
> Which seems to indicate that some nasty bs was buffered
> there. Or maybe not. Hope this helps any.

Definitely helps. 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE/u+zA9tNtMeXQLkIRAuwnAKCZeHkk0vJG0N4N3n6KovMZX+n0WwCdEZMq
R5U30KArxVyNLyONcf8rHPA=
=Wt+a
-----END PGP SIGNATURE-----

Attachment: bt.txt
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]