Re: [PATCH v8 17/20] job.c: enable job lock/unlock and remove Aiocontext
From: Emanuele Giuseppe Esposito
Subject: Re: [PATCH v8 17/20] job.c: enable job lock/unlock and remove Aiocontext locks
Date: Wed, 6 Jul 2022 23:29:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0
Am 05/07/2022 um 15:07 schrieb Stefan Hajnoczi:
> On Wed, Jun 29, 2022 at 10:15:35AM -0400, Emanuele Giuseppe Esposito wrote:
>> Change job_{lock/unlock} and the macros to use job_mutex.
>>
>> Now that they are no longer no-ops, remove the AioContext
>> locks to avoid deadlocks.
>>
>> Therefore:
>> - when possible, remove the AioContext lock/unlock pair entirely
>> - if it is also used by some other function, reduce the locked
>>   section as much as possible, leaving the job API outside it.
>>
>> There is only one JobDriver callback, ->free(), that assumes the
>> AioContext lock is held (because it calls bdrv_unref()), so for
>> now keep that under the AioContext lock.
>
> This discussion shouldn't hold up the patch series, it's a separate
> issue:
>
> Why does bdrv_unref() need the AioContext lock? The reference count
> itself is protected by the BQL (bdrv_ref() is GS too). I/O requests
> should be using fine-grained locks now, so I'm not sure if we still need
> to hold the AioContext lock to drain them?
If I remove the AioContext lock/unlock in job_unref_locked(), test 200
and test-bdrv-drain fail.
The reason is that job->free() calls block_job_free(), and we
eventually get to bdrv_detach_child() calling bdrv_try_set_aio_context()
from the main loop, on a node that is in another AioContext.
So it isn't really about bdrv_unref(), but rather bdrv_try_set_aio_context().
Until we find a solution to that, we can't get rid of this
AioContext lock.
That's the shared call stack from both tests:

#0  __pthread_kill_implementation (threadid=<optimized out>,
    signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff66a64a3 in __pthread_kill_internal (signo=6,
    threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff6659d06 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/posix/raise.c:26
#3  0x00007ffff662c7d3 in __GI_abort () at abort.c:79
#4  0x0000555555d6411f in error_exit (err=<optimized out>,
    msg=msg@entry=0x555555f49b30 <__func__.20> "qemu_mutex_unlock_impl")
    at ../util/qemu-thread-posix.c:38
#5  0x0000555555d64965 in qemu_mutex_unlock_impl (mutex=0x5555568d0620,
    file=<optimized out>, line=<optimized out>)
    at ../util/qemu-thread-posix.c:118
#6  0x0000555555c39c06 in bdrv_set_aio_context_ignore (bs=0x5555575133e0,
    new_context=0x5555566a2ad0, ignore=0x7fffffffd110) at ../block.c:7399
#7  0x0000555555c3a100 in bdrv_child_try_set_aio_context (
    bs=bs@entry=0x5555575133e0, ctx=ctx@entry=0x5555566a2ad0,
    ignore_child=ignore_child@entry=0x0, errp=errp@entry=0x0)
    at ../block.c:7493
#8  0x0000555555c3a1f6 in bdrv_try_set_aio_context (errp=0x0,
    ctx=0x5555566a2ad0, bs=0x5555575133e0) at ../block.c:7503
#9  bdrv_detach_child (childp=0x7fffffffd168) at ../block.c:3130
#10 bdrv_root_unref_child (child=<optimized out>,
    child@entry=0x55555732dc00) at ../block.c:3228
#11 0x0000555555c4265f in block_job_remove_all_bdrv (job=0x5555575b6a30)
    at ../blockjob.c:195
#12 0x0000555555c426b5 in block_job_free (job=0x5555575b6a30)
    at ../blockjob.c:76
#13 0x0000555555c4472e in job_unref_locked (job=0x5555575b6a30)
    at ../job.c:464
#14 job_unref_locked (job=0x5555575b6a30) at ../job.c:450
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>