From: Launchpad Bug Tracker
Subject: [Bug 1805256] Re: qemu-img hangs on rcu_call_ready_event logic in Aarch64 when converting images
Date: Thu, 18 Jun 2020 09:23:26 -0000
This bug was fixed in the package qemu - 1:4.2-3ubuntu6.2
---------------
qemu (1:4.2-3ubuntu6.2) focal; urgency=medium
* d/p/ubuntu/lp-1805256*: Fixes for QEMU on aarch64 ARM hosts
- async: use explicit memory barriers (LP: #1805256)
- aio-wait: delegate polling of main AioContext if BQL not held
-- Rafael David Tinoco <rafaeldtinoco@ubuntu.com> Wed, 27 May 2020 21:19:20 +0000
** Changed in: qemu (Ubuntu Focal)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1805256
Title:
qemu-img hangs on rcu_call_ready_event logic in Aarch64 when
converting images
Status in kunpeng920:
Fix Committed
Status in kunpeng920 ubuntu-18.04 series:
Fix Committed
Status in kunpeng920 ubuntu-18.04-hwe series:
Fix Committed
Status in kunpeng920 ubuntu-19.10 series:
Fix Committed
Status in kunpeng920 ubuntu-20.04 series:
Fix Committed
Status in kunpeng920 upstream-kernel series:
Invalid
Status in QEMU:
Fix Released
Status in qemu package in Ubuntu:
Fix Released
Status in qemu source package in Bionic:
Fix Committed
Status in qemu source package in Eoan:
Fix Committed
Status in qemu source package in Focal:
Fix Released
Bug description:
[Impact]
* QEMU locking primitives might face a race condition in QEMU Async
I/O bottom-half scheduling. This leads to a deadlock that makes either
QEMU or one of its tools hang indefinitely.
[Test Case]
* qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Hangs indefinitely in approximately 30% of runs on Aarch64.
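
Because the hang is intermittent, a single successful run proves little. A minimal sketch of a repeat-and-count harness follows; the function name, the run count and the timeout are illustrative assumptions, not part of the original test case. It treats any run that exceeds the timeout as a hang.

```python
import subprocess
import sys


def count_hangs(cmd, runs=10, timeout_s=300.0):
    """Run cmd `runs` times and count the runs that exceed timeout_s seconds.

    A timed-out run is treated as a hang. The timeout is an assumption and
    must be comfortably longer than a normal run on the test machine.
    """
    hangs = 0
    for i in range(runs):
        try:
            subprocess.run(cmd, timeout=timeout_s, check=False,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child process before raising.
            print(f"run {i + 1} timed out", file=sys.stderr)
            hangs += 1
    return hangs


# Example call with the command from the test case (paths as in the report):
# count_hangs(["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2",
#              "./disk01.qcow2", "./output.qcow2"], runs=20)
```

With a hang rate near 30%, 20 runs without a timeout would make it very unlikely that the problem is still present.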
[Regression Potential]
* This is a change to a core part of QEMU: the AIO scheduling. It
works like a "kernel" scheduler: whereas the kernel schedules OS tasks,
the QEMU AIO code is responsible for scheduling QEMU coroutines and
event listener callbacks.
* There was a long discussion upstream about primitives and Aarch64.
After quite some time Paolo released this patch, and it solves the
issue. According to his commit log, the tested platforms were amd64
and aarch64.
* Christian suggests that this fix stay a little longer in -proposed to
make sure it won't cause any regressions.
* dannf suggests we also check for performance regressions; e.g. how
long it takes to convert a cloud image on high-core systems.
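
One way to approach the performance check is to time the same conversion several times with the old and the new package and compare medians. The sketch below assumes this approach; the helper names and the choice of the median are illustrative, not something the bug report prescribes.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5, timeout_s=600.0):
    """Return the wall-clock duration in seconds of each run of cmd.

    check=True raises CalledProcessError on a non-zero exit status, and a
    hang surfaces as subprocess.TimeoutExpired, so only completed,
    successful runs contribute a duration.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, timeout=timeout_s, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def relative_slowdown(baseline, candidate):
    """Median slowdown of candidate relative to baseline (0.10 means 10% slower)."""
    return statistics.median(candidate) / statistics.median(baseline) - 1.0
```

Collecting `time_command(...)` results once per package version and passing both lists to `relative_slowdown` gives a single figure to compare across high-core systems.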
[Other Info]
* Original Description below:
Command:
qemu-img convert -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Hangs indefinitely in approximately 30% of runs.
----
Workaround:
qemu-img convert -m 1 -f qcow2 -O qcow2 ./disk01.qcow2 ./output.qcow2
Run "qemu-img convert" with "a single coroutine" to avoid this issue.
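
On a host that still runs an unfixed build, the workaround can be applied automatically when the default invocation times out. This is only a sketch: the helper names, the timeout value and the injectable `run` parameter are assumptions made for illustration.

```python
import subprocess


def convert_cmd(src, dst, coroutines=None):
    """Build the qemu-img convert command line used in this report.

    When coroutines is given, "-m N" is added; "-m 1" is the workaround.
    """
    cmd = ["qemu-img", "convert"]
    if coroutines is not None:
        cmd += ["-m", str(coroutines)]
    cmd += ["-f", "qcow2", "-O", "qcow2", src, dst]
    return cmd


def convert_with_fallback(src, dst, timeout_s=300.0, run=subprocess.run):
    """Try the default conversion; on timeout, retry with a single coroutine.

    Returns which variant completed. `run` is injectable so the logic can
    be exercised without qemu-img installed.
    """
    try:
        run(convert_cmd(src, dst), timeout=timeout_s, check=True)
        return "default"
    except subprocess.TimeoutExpired:
        # The timed-out attempt is treated as a hang; fall back to "-m 1".
        run(convert_cmd(src, dst, coroutines=1), timeout=timeout_s, check=True)
        return "single-coroutine"
```

The single-coroutine run may be slower than the default, so the timeout should be chosen with that variant in mind.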
----
(gdb) thread 1
...
(gdb) bt
#0 0x0000ffffbf1ad81c in __GI_ppoll
#1 0x0000aaaaaabcf73c in ppoll
#2 qemu_poll_ns
#3 0x0000aaaaaabd0764 in os_host_main_loop_wait
#4 main_loop_wait
...
(gdb) thread 2
...
(gdb) bt
#0 syscall ()
#1 0x0000aaaaaabd41cc in qemu_futex_wait
#2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
#3 0x0000aaaaaabed05c in call_rcu_thread
#4 0x0000aaaaaabd34c8 in qemu_thread_start
#5 0x0000ffffbf25c880 in start_thread
#6 0x0000ffffbf1b6b9c in thread_start ()
(gdb) thread 3
...
(gdb) bt
#0 0x0000ffffbf11aa20 in __GI___sigtimedwait
#1 0x0000ffffbf2671b4 in __sigwait
#2 0x0000aaaaaabd1ddc in sigwait_compat
#3 0x0000aaaaaabd34c8 in qemu_thread_start
#4 0x0000ffffbf25c880 in start_thread
#5 0x0000ffffbf1b6b9c in thread_start
----
(gdb) run
Starting program: /usr/bin/qemu-img convert -f qcow2 -O qcow2
./disk01.ext4.qcow2 ./output.qcow2
[New Thread 0xffffbec5ad90 (LWP 72839)]
[New Thread 0xffffbe459d90 (LWP 72840)]
[New Thread 0xffffbdb57d90 (LWP 72841)]
[New Thread 0xffffacac9d90 (LWP 72859)]
[New Thread 0xffffa7ffed90 (LWP 72860)]
[New Thread 0xffffa77fdd90 (LWP 72861)]
[New Thread 0xffffa6ffcd90 (LWP 72862)]
[New Thread 0xffffa67fbd90 (LWP 72863)]
[New Thread 0xffffa5ffad90 (LWP 72864)]
[Thread 0xffffa5ffad90 (LWP 72864) exited]
[Thread 0xffffa6ffcd90 (LWP 72862) exited]
[Thread 0xffffa77fdd90 (LWP 72861) exited]
[Thread 0xffffbdb57d90 (LWP 72841) exited]
[Thread 0xffffa67fbd90 (LWP 72863) exited]
[Thread 0xffffacac9d90 (LWP 72859) exited]
[Thread 0xffffa7ffed90 (LWP 72860) exited]
<HUNG w/ the 3 threads shown in the stack traces above>
"""
All the remaining tasks are blocked in a system call, so no task is left
to call qemu_futex_wake() to unblock thread #2 (in futex()), which would
in turn unblock thread #1 (doing poll() on a pipe with thread #2).
Those 7 threads exit before disk conversion is complete (sometimes in
the beginning, sometimes at the end).
----
On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img
frequently hangs (~50% of the time) with this command:
qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2
Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This
qcow2->qcow2 conversion happens to be something uvtool does every time
it fetches images.
Once hung, attaching gdb gives the following backtrace:
(gdb) bt
#0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760,
timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950)
at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>,
__fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1)
at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980
#6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456
#7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975
Reproduced w/ latest QEMU git (@ 53744e0a182)
To manage notifications about this bug go to:
https://bugs.launchpad.net/kunpeng920/+bug/1805256/+subscriptions