
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread


From: Hanna Czenczek
Subject: Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
Date: Mon, 29 Jan 2024 17:30:42 +0100
User-agent: Mozilla Thunderbird

On 23.01.24 18:10, Kevin Wolf wrote:
On 23.01.2024 at 17:40, Hanna Czenczek wrote:
On 21.12.23 22:23, Kevin Wolf wrote:
From: Stefan Hajnoczi <stefanha@redhat.com>

Stop depending on the AioContext lock and instead access
SCSIDevice->requests from only one thread at a time:
- When the VM is running, only the BlockBackend's AioContext may access
  the requests list.
- When the VM is stopped, only the main loop may access the requests
  list.

These constraints protect the requests list without the need for locking
in the I/O code path.

Note that multiple IOThreads are not supported yet because the code
assumes all SCSIRequests are executed from a single AioContext. Leave
that as future work.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
   include/hw/scsi/scsi.h |   7 +-
   hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
   2 files changed, 131 insertions(+), 57 deletions(-)
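
(The pattern the commit message describes, funneling every access to the requests list through a bottom half in the list's home context, looks roughly like the following paraphrased sketch; the names are simplified and this is not the actual diff:)

static void for_each_req_bh(void *opaque)
{
    SCSIDevice *s = opaque;
    SCSIRequest *req, *next_req;

    /* Runs in the BlockBackend's AioContext (or in the main loop while
     * the VM is stopped), so the list can be walked without a lock. */
    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next_req) {
        /* ... process req ... */
    }
}

static void scsi_for_each_req_async(SCSIDevice *s)
{
    /* Hop over to the only context that may touch the list. */
    aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                            for_each_req_bh, s);
}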
My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more
often because of this commit than because of the original bug, i.e. when
repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device,
this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context:
Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x00007f32a668d83c in  () at /usr/lib/libc.so.6
#1  0x00007f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x00007f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x00007f32a66253dc in  () at /usr/lib/libc.so.6
#4  0x00007f32a6635d26 in  () at /usr/lib/libc.so.6
#5  0x0000556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at
../block/block-backend.c:2417
#7  0x0000556e6b112d87 in scsi_device_for_each_req_async_bh
(opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x0000556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at
../util/async.c:218
#9  0x0000556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290,
blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x0000556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920)
at ../iothread.c:63
#11 0x0000556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at
../util/qemu-thread-posix.c:541
#12 0x00007f32a668b9eb in  () at /usr/lib/libc.so.6
#13 0x00007f32a670f7cc in  () at /usr/lib/libc.so.6
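
(For context: the assertion in frame #5 is in blk_get_aio_context(), which, paraphrasing from memory rather than quoting the file, does roughly this:)

AioContext *blk_get_aio_context(BlockBackend *blk)
{
    BlockDriverState *bs = blk_bs(blk);

    if (bs) {
        AioContext *ctx = bdrv_get_aio_context(bs);
        /* The BDS's context and the BB's cached context are expected to
         * agree; this is the assertion that fails in the backtrace: */
        assert(ctx == blk->ctx);
    }
    return blk->ctx;
}

(So it trips exactly when the BDS has already been moved to a new context while blk->ctx still holds the old one, or vice versa.)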

I don’t know anything about the problem yet, but as usual, I like
speculation and discovering how wrong I was later on, so one thing I came
across that’s funny about virtio-scsi is that requests can happen even while
a disk is being attached or detached.  That is, Linux seems to probe all
LUNs when a new virtio-scsi device is being attached, and it won’t stop just
because a disk is being attached or removed.  So maybe that’s part of the
problem: we get a request while the BB is being detached and is
temporarily in an inconsistent state (BDS context differs from BB context).
I don't know anything about the problem either, but since you already
speculated about the cause, let me speculate about the solution:
Can we hold the graph writer lock for the tran_commit() call in
bdrv_try_change_aio_context()?
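
(Concretely, that suggestion would look something like the sketch below; hypothetical and untested, and the exact wrlock entry points have changed between trees:)

/* In bdrv_try_change_aio_context(): commit the context switch under the
 * graph writer lock so that no reader can observe a half-switched graph. */
bdrv_graph_wrlock();
tran_commit(tran);
bdrv_graph_wrunlock();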

I tried removing the drain, to allow all of bdrv_try_change_aio_context() to require GRAPH_WRLOCK, but that broke tests/unit/test-block-iothread: without draining, block jobs would need to switch AioContexts while running, and job_set_aio_context() doesn’t like that.

Similarly to blk_get_aio_context(), I assume we could in theory just drop the assertion there and change the context while the job is running, because the job would then switch AioContexts at its next pause point (and in the meantime send requests from the old context, which is fine), but this does get quite murky.  (One rather theoretical (but real) problem is that test-block-iothread itself contains assertions in the job that its AioContext is actually the one it is running in, and those assertions would no longer necessarily hold true.)
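
(The assertion in question is the quiescence check in job_set_aio_context(), which from memory looks roughly like this:)

void job_set_aio_context(Job *job, AioContext *ctx)
{
    GLOBAL_STATE_CODE();
    JOB_LOCK_GUARD();
    /* The job must be quiescent while its context changes; this is the
     * check that "doesn't like" a running job switching contexts. */
    assert(job->paused || job_is_completed_locked(job));
    job->aio_context = ctx;
}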

I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed.  (Though it’s not quite trivial either.  We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.)
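
(For illustration, the in-flight variant would be roughly the following sketch, where data is a hypothetical opaque pointer and scsi_device_for_each_req_async_bh() is the BH from the backtrace above.  Bumping the counter before scheduling the BH makes any concurrent drained section, and therefore any AioContext change, wait for the BH to finish:)

blk_inc_in_flight(blk);
aio_bh_schedule_oneshot(blk_get_aio_context(blk),
                        scsi_device_for_each_req_async_bh, data);

/* ...and at the end of scsi_device_for_each_req_async_bh(): */
blk_dec_in_flight(blk);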

Hanna



