Re: deadlock when using iothread during backup_clean()


From: Vladimir Sementsov-Ogievskiy
Subject: Re: deadlock when using iothread during backup_clean()
Date: Wed, 4 Oct 2023 20:08:05 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1

On 28.09.23 11:06, Fiona Ebner wrote:
On 05.09.23 at 13:42, Paolo Bonzini wrote:
On 9/5/23 12:01, Fiona Ebner wrote:
Can we assume block_job_remove_all_bdrv() to always hold the job's
AioContext?

I think so, see job_unref_locked(), job_prepare_locked() and
job_finalize_single_locked().  These call the callbacks that ultimately
get to block_job_remove_all_bdrv().

And if yes, can we just tell bdrv_graph_wrlock() that it
needs to release that before polling to fix the deadlock?

No, but I think it should be released and re-acquired in
block_job_remove_all_bdrv() itself.
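
The release-and-re-acquire idea above can be sketched as a toy model with plain pthreads. This is not QEMU code: ctx_lock merely plays the role of the job's AioContext lock, the worker thread plays the iothread, and every name below is a hypothetical stand-in chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not QEMU APIs. */
static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;  /* "AioContext" lock */
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER; /* protects request_done */
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static bool request_done;

/* Plays the iothread: it needs ctx_lock before it can finish its request. */
static void *iothread_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ctx_lock);
    pthread_mutex_unlock(&ctx_lock);

    pthread_mutex_lock(&done_lock);
    request_done = true;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&done_lock);
    return NULL;
}

/* Plays the blocking wait inside the write-lock: wait for the iothread. */
static void wait_for_iothread(void)
{
    pthread_mutex_lock(&done_lock);
    while (!request_done) {
        pthread_cond_wait(&done_cond, &done_lock);
    }
    pthread_mutex_unlock(&done_lock);
}

/* The caller enters holding ctx_lock, drops it around the wait, retakes it. */
bool remove_nodes_release_before_wait(void)
{
    pthread_t iothread;
    int rc;

    request_done = false;
    pthread_mutex_lock(&ctx_lock);      /* caller holds the context */

    rc = pthread_create(&iothread, NULL, iothread_fn, NULL);
    assert(rc == 0);

    pthread_mutex_unlock(&ctx_lock);    /* release before the blocking wait */
    wait_for_iothread();
    pthread_mutex_lock(&ctx_lock);      /* re-acquire once the wait is over */

    /* ... the graph modification would happen here ... */

    pthread_mutex_unlock(&ctx_lock);
    rc = pthread_join(iothread, NULL);
    assert(rc == 0);
    (void)rc;
    return request_done;
}
```

If the pthread_mutex_unlock() before wait_for_iothread() is dropped, the worker blocks on ctx_lock while the caller waits on the condition variable, which has the same shape as the deadlock discussed here. The sketch only models a single lock and a single waiter, so it says nothing about further lock dependencies in the real code paths. Build with -pthread.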


For fixing the backup cancel deadlock, I tried the following:

diff --git a/blockjob.c b/blockjob.c
index 58c5d64539..fd6132ebfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
       * one to make sure that such a concurrent access does not attempt
       * to process an already freed BdrvChild.
       */
+    aio_context_release(job->job.aio_context);
      bdrv_graph_wrlock(NULL);
+    aio_context_acquire(job->job.aio_context);
      while (job->nodes) {
          GSList *l = job->nodes;
          BdrvChild *c = l->data;

but it's not enough unfortunately. And I don't just mean with the later
deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
other mail.

Even when that deadlock didn't trigger, either by chance or with
an additional change to try to avoid it

diff --git a/block.c b/block.c
index e7f349b25c..02d2c4e777 100644
--- a/block.c
+++ b/block.c
@@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
          bs->drv = NULL;
      }
-    bdrv_graph_wrlock(NULL);
+    bdrv_graph_wrlock(bs);
      QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
          bdrv_unref_child(bs, child);
      }

often guest IO would get completely stuck after canceling the backup.
There's nothing obvious to me in the backtraces at that point, but it
seems the vCPU and main threads are running as usual, while the IO thread
is stuck in aio_poll(), i.e. it never returns from the __ppoll() call. This
would happen with both a VirtIO SCSI and a VirtIO block disk, and with
both aio=io_uring and aio=threads.

When IO is stuck, it may be helpful to look at the bs->tracked_requests list in
gdb. If there are requests, each one has a .co field, which points to the
coroutine of that request.

The next step is to look at the coroutine's stack.

Something like (in gdb):

source scripts/qemu-gdb.py
qemu coroutine <coroutine pointer>

may work. ("may", because it has been a long time since I last used this.)
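
Spelled out a bit more, walking the list could look roughly like the session below. This is a sketch from memory rather than a tested recipe: it assumes `bs` is a BlockDriverState pointer visible in the selected frame, and `$req` is just a gdb convenience variable introduced here. The `lh_first` and `le_next` members come from QEMU's QLIST macros.

```
(gdb) source scripts/qemu-gdb.py
(gdb) set $req = bs->tracked_requests.lh_first
(gdb) print *$req
(gdb) qemu coroutine $req->co
(gdb) set $req = $req->list.le_next
```

Repeat the last three commands until `$req` is 0, i.e. the end of the list.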


I should also mention I'm using

fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k --ioengine=psync --numjobs=5 --runtime=6000 --time_based

inside the guest during canceling of the backup.

I'd be glad for any pointers on what to look for and happy to provide more
information.

Best Regards,
Fiona


--
Best regards,
Vladimir



