qemu-devel

Re: [PULL 00/28] Block layer patches


From: Stefan Hajnoczi
Subject: Re: [PULL 00/28] Block layer patches
Date: Tue, 19 Sep 2023 13:35:14 -0400

On Tue, 19 Sept 2023 at 06:26, Kevin Wolf <kwolf@redhat.com> wrote:
>
> On 18.09.2023 at 20:56, Stefan Hajnoczi wrote:
> > Hi Kevin,
> > I believe that my own commit "block-coroutine-wrapper: use
> > qemu_get_current_aio_context()" breaks this test. The failure is
> > non-deterministic (it happens in about 1 out of 4 runs).
> >
> > It seems the job hangs and the test times out in vm.run_job('job1', wait=5.0).
> >
> > I haven't debugged it yet but wanted to share this information to save
> > some time. Tomorrow I'll investigate further.
>
> Yes, it's relatively easily reproducible if I run the test in a loop,
> and I can't seem to reproduce it without the last patch. Should I
> unstage the full series again, or do you think that the last patch is
> really optional this time?

Please drop the last patch. I'm not aware of anything else in the series that depends on it.

> However, I'm unsure how the stack traces I'm seeing are related to your
> patch. Maybe it just made an existing bug more likely to be triggered?

I'll share my thoughts once I've looked at the hang today.

Regarding AioContext lock removal: I'll work on that and see what
still depends on the lock.

Stefan

> What I'm seeing is that the reader lock is held by an iothread that is
> waiting for its AioContext lock to make progress:
>
> Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
> #0  0x00007f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
> #1  0x00007f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
> #2  0x000055b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
> #3  0x000055b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
> #4  0x000055b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at ../util/async.c:728
> #5  0x000055b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at ../util/async.c:565
> #6  0x000055b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
> #7  0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at ../util/async.c:216
> #8  0x000055b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at ../util/aio-posix.c:722
> #9  0x000055b7b1518604 in iothread_run (opaque=0x55b7b2904460) at ../iothread.c:63
> #10 0x000055b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at ../util/qemu-thread-posix.c:541
> #11 0x00007f81250ae15d in start_thread () at /lib64/libc.so.6
> #12 0x00007f812512fc00 in clone3 () at /lib64/libc.so.6
>
> On the other hand, the main thread wants to acquire the writer lock, but
> it holds the iothread's AioContext lock (taken in job_prepare_locked()):
>
> Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
> #0  0x00007f8125122356 in ppoll () at /lib64/libc.so.6
> #1  0x000055b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) at ../util/qemu-timer.c:339
> #2  0x000055b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x000055b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at ../util/aio-posix.c:670
> #4  0x000055b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
> #5  0x000055b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at ../block/block-backend.c:916
> #6  0x000055b7b1554779 in blk_delete (blk=0x55b7b4425800) at ../block/block-backend.c:497
> #7  0x000055b7b1554133 in blk_unref (blk=0x55b7b4425800) at ../block/block-backend.c:557
> #8  0x000055b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at ../block/mirror.c:696
> #9  0x000055b7b1577015 in mirror_prepare (job=0x55b7b4419000) at ../block/mirror.c:807
> #10 0x000055b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at ../job.c:988
> #11 0x000055b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
> #12 0x000055b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at ../job.c:1011
> #13 0x000055b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) at ../job.c:1068
> #14 0x000055b7b1539372 in job_completed_locked (job=0x55b7b4419000) at ../job.c:1082
> #15 0x000055b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
> #16 0x000055b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
> #17 0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at ../util/async.c:216
> #18 0x000055b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at ../util/aio-posix.c:423
> #19 0x000055b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, callback=0x0, user_data=0x0) at ../util/async.c:358
> #20 0x00007f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:3454
> #21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
> #22 0x000055b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
> #23 0x000055b7b1729572 in os_host_main_loop_wait (timeout=27462700) at ../util/main-loop.c:313
> #24 0x000055b7b1729452 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #25 0x000055b7b119a1eb in qemu_main_loop () at ../softmmu/runstate.c:772
> #26 0x000055b7b14c102d in qemu_default_main () at ../softmmu/main.c:37
> #27 0x000055b7b14c1068 in main (argc=44, argv=0x7ffc90b05d58) at ../softmmu/main.c:48
>
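
To restate the cycle outside of QEMU: below is a minimal, self-contained
sketch of the same lock-order inversion, with a plain pthread mutex standing
in for the iothread's AioContext lock and a pthread rwlock standing in for
the block graph lock. The names, sleeps and comments are purely illustrative
(not the real QEMU primitives); when the timing works out like in the
backtraces above, the program simply hangs.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-ins, not the real QEMU primitives. */
static pthread_mutex_t aio_context_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t graph_lock = PTHREAD_RWLOCK_INITIALIZER;

/* "Iothread": already a graph reader, then needs its AioContext lock to run
 * the scheduled BH (like co_schedule_bh_cb() in thread 3 above). */
static void *iothread_run(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&graph_lock);
    usleep(300 * 1000);                     /* let main() take the mutex */
    pthread_mutex_lock(&aio_context_lock);  /* blocks: main() holds it */
    pthread_mutex_unlock(&aio_context_lock);
    pthread_rwlock_unlock(&graph_lock);
    return NULL;
}

int main(void)
{
    pthread_t iothread;
    pthread_create(&iothread, NULL, iothread_run, NULL);
    usleep(100 * 1000);                     /* let the iothread become a reader */

    /* "Main thread": holds the iothread's AioContext lock (taken in
     * job_prepare_locked()), then the write lock waits for all readers to
     * leave -- which this reader never does. */
    pthread_mutex_lock(&aio_context_lock);
    pthread_rwlock_wrlock(&graph_lock);     /* deadlocks here */

    pthread_rwlock_unlock(&graph_lock);
    pthread_mutex_unlock(&aio_context_lock);
    pthread_join(iothread, NULL);
    printf("no deadlock this time\n");
    return 0;
}

Build with -pthread. Breaking either side of the cycle is enough: don't hold
the mutex while taking the write lock, or don't make a reader wait for the
mutex.
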
> At first I thought we just need to look into the AioContext locking in
> job completion and drop it in the right places.
>
> But in fact, first of all, blk_remove_bs() needs to make up its mind
> whether it wants the caller to hold the AioContext lock or not, and document
> that, because it calls both bdrv_drained_begin() (which requires holding the
> AioContext lock) and bdrv_graph_wrlock(NULL) (which forbids it).
>
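
In other words, the two callees impose mutually exclusive preconditions on
their caller, so no single contract for blk_remove_bs() satisfies both as it
stands. A tiny illustrative model of that contradiction (just assertions
around a flag; the function names are made up, not QEMU code):

#include <assert.h>
#include <stdbool.h>

/* Flag standing in for "the caller holds the AioContext lock of bs". */
static bool holds_aio_context_lock;

static void drained_begin_model(void)
{
    /* Like bdrv_drained_begin(): expects the lock to be held. */
    assert(holds_aio_context_lock);
}

static void graph_wrlock_model(void)
{
    /* Like bdrv_graph_wrlock(NULL): must run without the lock, otherwise a
     * reader in an iothread can never drain out (see the backtraces above). */
    assert(!holds_aio_context_lock);
}

static void remove_bs_model(void)
{
    drained_begin_model();
    graph_wrlock_model();   /* whichever value the flag has, one assert fires */
}

int main(void)
{
    holds_aio_context_lock = true;  /* pick either value... */
    remove_bs_model();              /* ...and the contradiction shows up */
    return 0;
}

Presumably that means either dropping the lock around the write lock inside
blk_remove_bs(), or documenting that callers must not hold it at all.
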
> If we could fully get rid of the AioContext lock (which was our originally
> stated goal), that would automatically solve this kind of deadlock.
>
> Kevin
>


