Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()
From: Fiona Ebner
Subject: Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()
Date: Mon, 13 Nov 2023 15:38:24 +0100
User-agent: Mozilla Thunderbird
On 03.11.23 14:12, Fiona Ebner wrote:
> Hi,
>
> On 30.05.23 18:32, Kevin Wolf wrote:
>> From: Stefan Hajnoczi <stefanha@redhat.com>
>>
>> Detach ioeventfds during drained sections to stop I/O submission from
>> the guest. virtio-blk is no longer reliant on aio_disable_external()
>> after this patch. This will allow us to remove the
>> aio_disable_external() API once all other code that relies on it is
>> converted.
>>
>> Take extra care to avoid attaching/detaching ioeventfds if the data
>> plane is started/stopped during a drained section. This should be rare,
>> but maybe the mirror block job can trigger it.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Message-Id: <20230516190238.8401-18-stefanha@redhat.com>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>
> A while ago I ran into a strange issue where guest IO would get
> completely stuck during certain block jobs, and I finally managed to
> find a small reproducer [0]. I'm using a VM with virtio-blk-pci (or
> virtio-scsi-pci) with an iothread and running
>
> fio --name=file --size=100M --direct=1 --rw=randwrite --bs=4k
> --ioengine=psync --numjobs=5 --runtime=1200 --time_based
>
> in the guest. Then I'm issuing the QMP command from the reproducer in a
> loop (sketched below). Usually, the guest IO will get stuck after about
> 1-3 minutes. Sometimes fio manages to continue at a lower speed for a
> while (but trying to Ctrl+C it or doing other IO in the guest is
> already broken), which I guess could be a hint that it's an issue with
> notifiers?
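>
> In case it's useful, the loop is essentially the following. This is
> just a sketch: the socket path /tmp/qmp.sock, the drive name "drive0"
> and the target path are from my setup, and with the patched
> qmp_drive_mirror() from [0] the arguments only need to pass QAPI
> validation:
>
> # assumes -qmp unix:/tmp/qmp.sock,server=on,wait=off on the QEMU command line
> while true; do
>     { printf '%s\n' \
>         '{"execute": "qmp_capabilities"}' \
>         '{"execute": "drive-mirror", "arguments": {"device": "drive0", "target": "/tmp/mirror.img", "sync": "full", "format": "raw"}}'
>       # keep the connection open briefly so QEMU processes the commands
>       sleep 1
>     } | socat - UNIX-CONNECT:/tmp/qmp.sock
> done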
>
> Bisecting (waiting 10 minutes before declaring a commit good) led me to
> this patch, i.e. commit 1665d9326f ("virtio-blk: implement
> BlockDevOps->drained_begin()"), and for SCSI, I verified that the issue
> similarly starts happening after 766aa2de0f ("virtio-scsi: implement
> BlockDevOps->drained_begin()").
>
> Both issues are still present on current master (i.e. 1c98a821a2
> ("tests/qtest: Introduce tests for AMD/Xilinx Versal TRNG device"))
>
> Happy to provide more information and hints about how to debug the issue
> further.
>
Of course, I meant "and for hints" ;)

I should also mention that when IO is stuck, in_flight and
quiesce_counter are 0 for the two BlockDriverStates (i.e. bdrv_raw and
bdrv_file) as well as for the BlockBackend, their tracked_requests
(respectively queued_requests) lists are empty, and quiesced_parent is
false for the parents.

Two observations:

1. I found that using QMP 'stop' and 'cont' will allow guest IO to get
unstuck (see the snippet after this list). I'm pretty sure it's the
virtio_blk_data_plane_stop/start calls they trigger.

2. While experimenting, I found that after the change [1] below in
aio_poll(), i.e. unconditionally taking the branch that disables poll
mode and goes through ctx->fdmon_ops->wait(), I wasn't able to trigger
the issue anymore (letting my reproducer run for 40 minutes).
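
For reference, this is roughly how I issue the stop/cont pair. Again
just a sketch, with the same /tmp/qmp.sock socket assumption as above:

# assumes -qmp unix:/tmp/qmp.sock,server=on,wait=off on the QEMU command line
{ printf '%s\n' \
    '{"execute": "qmp_capabilities"}' \
    '{"execute": "stop"}' \
    '{"execute": "cont"}'
  # keep the connection open briefly so QEMU processes the commands
  sleep 1
} | socat - UNIX-CONNECT:/tmp/qmp.sock
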
Best Regards,
Fiona

[1]:
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 7f2c99729d..dff9ad4148 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -655,7 +655,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
> /* If polling is allowed, non-blocking aio_poll does not need the
> * system call---a single round of run_poll_handlers_once suffices.
> */
> - if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
> + if (1) { //timeout || ctx->fdmon_ops->need_wait(ctx)) {
> /*
> * Disable poll mode. poll mode should be disabled before the call
> * of ctx->fdmon_ops->wait() so that guest's notification can wake
> [0]:
>
>> diff --git a/blockdev.c b/blockdev.c
>> index db2725fe74..bf2e0fc22c 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -2986,6 +2986,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>> bool zero_target;
>> int ret;
>>
>> + bdrv_drain_all_begin();
>> + bdrv_drain_all_end();
>> + return;
>> +
>> +
>> bs = qmp_get_root_bs(arg->device, errp);
>> if (!bs) {
>> return;