
Re: deadlock when using iothread during backup_clean()


From: Fiona Ebner
Subject: Re: deadlock when using iothread during backup_clean()
Date: Tue, 17 Oct 2023 15:37:11 +0200
User-agent: Mozilla Thunderbird

On 17.10.23 at 14:12, Kevin Wolf wrote:
> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>> I ran into similar issues now with mirror (both deadlocks and stuck
>> guest IO at other times) and, interestingly, also during job start.
>>
>> Also had a backtrace similar to [0] once, so I took a closer look.
>> Probably was obvious to others already, but for the record:
>>
>> 1. the graph is locked by the main thread
>> 2. the iothread holds the AioContext lock
>> 3. the main thread waits on the AioContext lock
>> 4. the iothread waits for coroutine spawned by blk_is_available()
> 
> Where does this blk_is_available() in the iothread come from? Having it
> wait without dropping the AioContext lock sounds like something that
> we'd want to avoid. Ideally, devices using iothreads shouldn't use
> synchronous requests at all, but I think scsi-disk might have some of
> them.
> 

It's part of the request handling in virtio-scsi:

> #0  0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, 
> timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1  0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, 
> timeout=-1) at ../util/qemu-timer.c:339
> #2  0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, 
> ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at 
> ../util/aio-posix.c:670
> #4  0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at 
> /home/febner/repos/qemu/block/block-gen.h:43
> #5  0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at 
> block/block-gen.c:1426
> #6  0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, 
> d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
> #7  0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare 
> (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
> #8  0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, 
> vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
> #9  0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, 
> vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at 
> ../hw/virtio/virtio.c:2263
> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) 
> at ../hw/virtio/virtio.c:3575
> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, 
> node=0x558716771000) at ../util/aio-posix.c:372
> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, 
> ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
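
For context, blk_is_available() is a generated co_wrapper. From memory,
the code generated into block/block-gen.c looks roughly like the
following (struct and field names simplified), i.e. the caller spins in
a nested aio_poll(), without giving up the AioContext lock, until the
spawned coroutine has finished:

typedef struct BlkCoIsAvailable {
    BdrvPollCo poll_state;
    bool ret;
    BlockBackend *blk;
} BlkCoIsAvailable;

static void coroutine_fn blk_co_is_available_entry(void *opaque)
{
    BlkCoIsAvailable *s = opaque;

    /* the generated entry takes the graph read lock around the call */
    bdrv_graph_co_rdlock();
    s->ret = blk_co_is_available(s->blk);
    bdrv_graph_co_rdunlock();

    s->poll_state.in_progress = false;
    aio_wait_kick();
}

bool blk_is_available(BlockBackend *blk)
{
    if (qemu_in_coroutine()) {
        return blk_co_is_available(blk);
    } else {
        BlkCoIsAvailable s = {
            .poll_state.ctx = qemu_get_current_aio_context(),
            .poll_state.in_progress = true,
            .blk = blk,
        };

        s.poll_state.co = qemu_coroutine_create(blk_co_is_available_entry, &s);
        /* bdrv_poll_co() is the nested event loop in frame #4 above; it
         * keeps calling aio_poll() until in_progress is cleared */
        return bdrv_poll_co(&s.poll_state);
    }
}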


>> As for why it doesn't progress, blk_co_is_available_entry() uses
>> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
>> write lock. Should be fixed once the AioContext locks are gone, but not
>> sure what should be done to avoid it until then.
> 
> Then the nested event loop in blk_is_available() would probably be
> enough to make progress, yes.
> 
> Maybe we could actually drop the lock (and immediately reacquire it) in
> AIO_WAIT_WHILE() even if we're in the home thread? That should give the
> main thread a chance to make progress.
> 

Seems to work :) With the following change I haven't run into the issue
anymore. I have to say, though, that hitting this specific deadlock only
happened every 10-15 tries or so before; I did 30 runs with the change
now. Unfortunately, the stuck guest IO issue is still there.

> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index 5449b6d742..da159501ca 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -88,7 +88,13 @@ extern AioWait global_aio_wait;
>      smp_mb__after_rmw();                                           \
>      if (ctx_ && in_aio_context_home_thread(ctx_)) {                \
>          while ((cond)) {                                           \
> +            if (unlock && ctx_) {                                  \
> +                aio_context_release(ctx_);                         \
> +            }                                                      \
>              aio_poll(ctx_, true);                                  \
> +            if (unlock && ctx_) {                                  \
> +                aio_context_acquire(ctx_);                         \
> +            }                                                      \
>              waited_ = true;                                        \
>          }                                                          \
>      } else {                                                       \
> 
> But I think we're actually not very far from removing the AioContext
> lock, so if we can't find an easy fix in the meantime, waiting for that
> could be a realistic option.
> 

Glad to hear :) Do you think it will land in time for QEMU 8.2? If not,
I'll go ahead and send what I have for fixing the deadlocks from this
mail thread in the next few days. The stuck guest IO can happen even
without any of those changes (on current master, i.e.
ebca80bbdb5c1650e4b753a3d13b43634e7dfe05, at least when starting a
mirror job).

Best Regards,
Fiona



