[GSoC 2017] Multithreading
From: Joan Lledó
Subject: [GSoC 2017] Multithreading
Date: Sun, 9 Jul 2017 22:18:05 +0200
After finishing with IPv6, I spent the last week fixing some bugs,
mainly those related to multithreading. This is an issue I had never
thought about until now, and it turned out to be a real mess. I shall
give you a summary.
First, LwIP has a main thread called "tcpip_thread" where the stack
actually runs, so the stack itself is single-threaded and not
thread-safe. Its multithreading model consists of ensuring that the
tcpip thread is the only one with access to the stack resources, while
the other threads communicate with it by message passing. A global
semaphore makes sure the requests received by the main thread are
serviced sequentially. You may find further information in the LwIP
documentation[1]. Now, the tcpip thread is initialized when the
translator starts, but in our case the stack was also restarted on
each call to fsysopts. As a consequence, there were as many tcpip
threads as calls to fsysopts, plus one. For some reason the stack kept
working, but many threads were being wasted. This bug is fixed now:
the stack is no longer restarted on each call to fsysopts; instead,
old interfaces are removed and new ones are created on the same main
thread. However, the thread doesn't like having its interfaces changed
at run time, so new bugs have arisen that I'll have to work on later.
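
To make the single-owner model more concrete, here is a minimal sketch
of the pattern using plain pthreads. It is not LwIP's code: every name
in it (struct request, owner_thread, post_and_wait and so on) is
hypothetical, and a condition variable stands in for LwIP's own
mailbox and semaphore primitives. One thread owns the state and the
rest only post requests and wait for the reply.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* All names here are hypothetical; this mimics the pattern, not LwIP. */
struct request {
    int delta;             /* work the owner applies to the state */
    int done;              /* set by the owner once serviced */
    struct request *next;  /* mailbox link */
};

static pthread_mutex_t mbox_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t mbox_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t request_done = PTHREAD_COND_INITIALIZER;
static struct request *mbox_head, *mbox_tail;
static int stack_state;    /* only owner_thread() touches this */

enum { CLIENTS = 4, REQUESTS_PER_CLIENT = 100 };

/* The single thread that owns the state, like the tcpip thread. */
static void *owner_thread(void *arg)
{
    int remaining = *(int *)arg;
    while (remaining-- > 0) {
        pthread_mutex_lock(&mbox_lock);
        while (mbox_head == NULL)
            pthread_cond_wait(&mbox_nonempty, &mbox_lock);
        struct request *r = mbox_head;
        mbox_head = r->next;
        if (mbox_head == NULL)
            mbox_tail = NULL;
        pthread_mutex_unlock(&mbox_lock);

        stack_state += r->delta;   /* no lock needed: single owner */

        pthread_mutex_lock(&mbox_lock);
        r->done = 1;
        pthread_cond_broadcast(&request_done);
        pthread_mutex_unlock(&mbox_lock);
    }
    return NULL;
}

/* Called from any other thread: post a request, wait for the reply. */
static void post_and_wait(int delta)
{
    struct request r = { delta, 0, NULL };
    pthread_mutex_lock(&mbox_lock);
    if (mbox_tail != NULL)
        mbox_tail->next = &r;
    else
        mbox_head = &r;
    mbox_tail = &r;
    pthread_cond_signal(&mbox_nonempty);
    while (!r.done)
        pthread_cond_wait(&request_done, &mbox_lock);
    pthread_mutex_unlock(&mbox_lock);
}

static void *client_thread(void *unused)
{
    (void)unused;
    for (int i = 0; i < REQUESTS_PER_CLIENT; i++)
        post_and_wait(1);
    return NULL;
}

/* Returns the final state, which should equal the request count. */
int run_message_passing_demo(void)
{
    int total = CLIENTS * REQUESTS_PER_CLIENT;
    pthread_t owner, clients[CLIENTS];

    stack_state = 0;
    int rc = pthread_create(&owner, NULL, owner_thread, &total);
    assert(rc == 0);
    for (int i = 0; i < CLIENTS; i++) {
        rc = pthread_create(&clients[i], NULL, client_thread, NULL);
        assert(rc == 0);
    }
    (void)rc;
    for (int i = 0; i < CLIENTS; i++)
        pthread_join(clients[i], NULL);
    pthread_join(owner, NULL);
    return stack_state;
}
```

Calling run_message_passing_demo() should return 400 (4 clients times
100 requests each). The counter is never updated by two threads at
once, because only the owner thread writes to it.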
Another issue was related to the Hurd's architecture. The server has
a component called "Ethernet module[2]" that is responsible for taking
the data generated by the stack and sending it to the driver, and vice
versa. In the Hurd, the communication with the driver is done by
message passing through the device interface, and a thread is needed
to listen for incoming messages from the driver and call the demuxer.
In our case, there was a design error: a new listening thread was
created for each new interface added to the stack. Furthermore, each
call to fsysopts restarted the stack and created a new thread for each
new interface, without removing the previous ones. I fixed this
problem by starting a single listening thread from main() when the
translator is started.
I discovered the last threading error when I tried to run SSH over
LwIP. After a few seconds, the server would crash because of an
overflow in a variable used to count the number of threads blocked on
lwip_select(). The type of that variable is uint8_t, so there were...
255 waiting threads! I spent about three days trying to understand
what was going on, but I finally found the error and solved it. It's
worth examining the error carefully, because it's very useful for
understanding how the Hurd works.
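
As a small illustration of why 255 is the limit, the sketch below
counts waiters in a uint8_t. The function names are hypothetical and
this is not LwIP's code. The guarded variant only shows one way a
counter of this width could detect the problem; it is an assumption,
not a description of what LwIP does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-socket count of blocked threads. */
uint8_t count_waiters(unsigned blocked_threads)
{
    uint8_t waiting = 0;
    for (unsigned i = 0; i < blocked_threads; i++)
        waiting++;   /* the stored value wraps modulo 256 */
    return waiting;
}

/* Guarded increment: returns 0 on success, -1 if it would wrap. */
int add_waiter(uint8_t *waiting)
{
    if (*waiting == UINT8_MAX)
        return -1;
    (*waiting)++;
    return 0;
}
```

With 255 blocked threads the counter still holds 255; the 256th
increment brings it back to 0, so the count no longer says anything
true about the number of waiters.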
Let's take a look at the hurdselect.c file in Glibc, particularly at
two sections: the one starting at line 280[3] and the two if
statements at lines 494 and 498[4]. At line 280 and the following
lines, we can see how a call to select() from a user program may lead
to many RPC calls to io_select(). Each RPC call is responsible for one
single socket, so if the user has set, say, three sockets among all
the FD_SETs, then three RPCs will be performed simultaneously, one for
each socket and each one with its own thread. When an event occurs on
one of the three sockets, its io_select() operation returns and its
thread is destroyed, but the other two RPCs remain blocked until their
timeout expires. If no timeout is given, the threads are blocked
forever. The SSH server calls select() with no timeout over three
sockets for each character it sends or receives, so it can generate
hundreds of blocked threads in a matter of seconds.
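
For reference, this is roughly what such a call looks like from the
user's side. The sketch is portable POSIX rather than Hurd-specific,
it uses pipes instead of sockets so that it needs no network, and
wait_on_three() is a hypothetical helper. On the Hurd, a call like
this one is what fans out into one io_select() RPC per descriptor.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Watches three descriptors with no timeout after making one of
   them readable. Returns the index reported ready, or -1 on error. */
int wait_on_three(int writer_index)
{
    int pipes[3][2];
    int opened = 0, ready = -1;

    assert(writer_index >= 0 && writer_index < 3);
    while (opened < 3 && pipe(pipes[opened]) == 0)
        opened++;

    if (opened == 3 && write(pipes[writer_index][1], "x", 1) == 1) {
        fd_set readfds;
        int maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < 3; i++) {
            FD_SET(pipes[i][0], &readfds);
            if (pipes[i][0] > maxfd)
                maxfd = pipes[i][0];
        }
        /* NULL timeout: block until at least one descriptor is ready. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) == 1)
            for (int i = 0; i < 3; i++)
                if (FD_ISSET(pipes[i][0], &readfds))
                    ready = i;
    }
    for (int i = 0; i < opened; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    return ready;
}
```

Only one of the three descriptors ever becomes ready here, which is
the situation described above: one request is answered and the other
two have nothing to report.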
This design is pretty smart actually, because it allows the user to
work transparently over multiple TCP/IP stacks. We can see it using
SSHD[5] as an example. As we can see in the code, the server doesn't
assume it's working over a dual stack, in which case it would be
enough to create an IPv6 socket to receive messages addressed to IPv4
addresses as well. Instead, SSH creates IPv4 and IPv6 sockets
explicitly and sets the IPV6_V6ONLY option on the latter, to prevent
it from listening on IPv4 addresses in case there's a dual stack below
it. In the Hurd, the RPC to get the IPv4 socket would be addressed to
/servers/socket/2, while the IPv6 socket would be obtained from
/servers/socket/26. Therefore, if the user calls select() and includes
both sockets, as SSHD does, then one io_select() RPC will go to
/servers/socket/2 and the other to /servers/socket/26.
But how can we cancel the pending io_select() threads that have no
timeout? And, more importantly, when there's an event on one socket,
how can we know which threads were created at the same time and are no
longer useful? The answer is at lines 494 and 498[4] of hurdselect.c.
Each thread has a reply port that is destroyed when the thread is no
longer useful, and the operation receives a copy of the port name, so
it can use it to receive notifications. Libports has a particular
function[6] for that. If we call
ports_interrupt_self_on_notification(), the current thread is canceled
if something happens on the given port, for instance, when it's
destroyed.
However, even after all this, the pending threads were still not being
canceled. The problem was that the standard function the threads were
blocked on, pthread_cond_wait(), didn't respond to cancel requests
from hurd_thread_cancel(). It was strange, because pthread_cond_wait()
is a valid cancellation point. But in the Hurd servers we need to call
our own non-standard version, pthread_hurd_cond_wait_np()[7], which
reacts to requests from hurd_thread_cancel() and stops blocking the
thread.
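
For comparison, the sketch below shows what "valid cancellation point"
means in standard POSIX terms: a thread blocked in pthread_cond_wait()
does react to pthread_cancel(). It is portable code with hypothetical
names and it does not use hurd_thread_cancel() or
pthread_hurd_cond_wait_np(), so it shows only the standard mechanism,
which is a different one from the Hurd's.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t never_signaled = PTHREAD_COND_INITIALIZER;
static int waiter_started;

/* Runs when the waiter is canceled: the mutex is held again at that
   point, so it has to be released here. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *waiter(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&wait_lock);
    pthread_cleanup_push(unlock_on_cancel, &wait_lock);
    waiter_started = 1;
    for (;;)   /* nobody signals: only cancellation ends this wait */
        pthread_cond_wait(&never_signaled, &wait_lock);
    pthread_cleanup_pop(1);   /* not reached, pairs with the push */
    return NULL;
}

/* Returns 1 if the blocked thread was canceled, 0 otherwise. */
int cancel_blocked_waiter(void)
{
    pthread_t thread;
    void *result = NULL;

    pthread_mutex_lock(&wait_lock);
    waiter_started = 0;
    pthread_mutex_unlock(&wait_lock);

    if (pthread_create(&thread, NULL, waiter, NULL) != 0)
        return 0;

    /* Once the flag is visible under the lock, the waiter has
       released the mutex, i.e. it is inside pthread_cond_wait(). */
    for (;;) {
        pthread_mutex_lock(&wait_lock);
        int started = waiter_started;
        pthread_mutex_unlock(&wait_lock);
        if (started)
            break;
        sched_yield();
    }

    pthread_cancel(thread);
    pthread_join(thread, &result);
    return result == PTHREAD_CANCELED;
}
```

In a Hurd server the cancel request comes from hurd_thread_cancel()
instead, which is why the wait itself has to be
pthread_hurd_cond_wait_np().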
----------------------------------
[1] http://www.nongnu.org/lwip/2_0_x/raw_api.html
[2] https://github.com/jlledom/lwip-hurd/blob/master/port/netif/hurdethif.c
[3] http://git.savannah.gnu.org/cgit/hurd/glibc.git/tree/hurd/hurdselect.c?h=tschwinge/Roger_Whittaker#n280
[4] http://git.savannah.gnu.org/cgit/hurd/glibc.git/tree/hurd/hurdselect.c?h=tschwinge/Roger_Whittaker#n494
[5] https://github.com/openssh/openssh-portable/blob/151c6e433a5f5af761c78de87d7b5d30a453cf5e/sshd.c#L1014
[6] https://www.gnu.org/software/hurd/doc/hurd_4.html#IDX41
[7] https://github.com/ragingwind/libpthread/blob/master/sysdeps/mach/hurd/bits/pthread-np.h#L27