qemu-devel

Re: virtio-blk using a single iothread


From: Sagi Grimberg
Subject: Re: virtio-blk using a single iothread
Date: Sun, 11 Jun 2023 15:27:57 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0



On 6/8/23 19:08, Stefan Hajnoczi wrote:
On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
Hey Stefan, Paolo,

I just had a report from a user experiencing lower virtio-blk
performance than expected. The user is running virtio-blk on top of
an nvme-tcp device. The guest has 12 CPU cores.

The guest read/write throughput is capped at around 30% of the available
throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
host, on a 25Gb/s NIC). The workload running in the guest is a
multi-threaded fio workload.

What we observed is that virtio-blk uses a single disk-wide
iothread to process all the vqs. nvme-tcp specifically (like other
TCP-based protocols) is negatively impacted by the lack of thread
concurrency that could distribute I/O requests across different TCP
connections.
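
For reference, on the host the nvme-tcp controller gets one TCP
connection per I/O queue; the connect step looks something like this
(address and NQN are placeholders):

   nvme connect -t tcp -a 192.168.1.10 -s 4420 \
       -n nqn.2023-06.example:subsys1 --nr-io-queues=12

So the host has 12 connections available to spread I/O across, but with
a single submitting thread the I/O ends up funneled into one of them.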

We also attempted to move the iothread to a dedicated core, however that
did not yield any meaningful performance improvement. The reason appears
to be less about CPU utilization on the iothread core and more about
single TCP connection serialization.

Moving to io=threads does increase the throughput, however it sacrifices
latency significantly.
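(By io=threads I mean the QEMU thread-pool AIO backend, i.e. aio=threads
on the -drive, for example:

   -drive if=none,id=drive0,cache=none,aio=threads,format=raw,file=/dev/nvme0n1

where the device path is just an example.)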

So the user finds themselves with available host CPUs and TCP
connections that could easily provide maximum throughput, but no way
to leverage them. True, other guests will use different
threads/contexts, however the goal here is to get the full performance
from a single device.

I've seen several discussions and attempts in the past to let a
virtio-blk device leverage multiple iothreads, but those discussions
paused around 2 years ago. So I wanted to ask: are there any plans, or
anything in the works, to address this limitation?

I've seen that the spdk folks are heading in this direction with their
vhost-blk implementation:
https://review.spdk.io/gerrit/c/spdk/spdk/+/16068

Hi Sagi,
Yes, there is an ongoing QEMU multi-queue block layer effort to make it
possible for multiple IOThreads to process disk I/O for the same
--blockdev in parallel.

Great to know.

Most of my recent QEMU patches have been part of this effort. There is a
work-in-progress branch that supports mapping virtio-blk virtqueues to
specific IOThreads:
https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping

Thanks for the pointer.

The syntax is:

   --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'

This says "assign virtqueues round-robin to iothread0 and iothread1".
Half the virtqueues will be processed by iothread0 and the other half by
iothread1. There is also syntax for assigning specific virtqueues to
each IOThread, but usually the automatic round-robin assignment is all
that's needed.
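
For example, explicit assignment looks something like this (a sketch
based on the branch; the virtqueue indices here are arbitrary, and the
IOThread objects are created with --object iothread):

   --object iothread,id=iothread0 \
   --object iothread,id=iothread1 \
   --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0","vqs":[0,1]},{"iothread":"iothread1","vqs":[2,3]}],"drive":"drive0"}'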

This work is not finished yet. Basic I/O (e.g. fio) works without
crashes, but expect to hit issues if you use blockjobs, hotplug, etc.

Performance optimization work has just begun, so it won't deliver all
the benefits yet. I ran a benchmark yesterday where going from 1 to 2
IOThreads increased performance by 25%. That's much less than we're
aiming for; attaching two independent virtio-blk devices improves the
performance by ~100%. I know we can get there eventually. Some of the
bottlenecks are known (e.g. block statistics collection causes lock
contention) and others are yet to be investigated.

Hmm, I rebased this branch on top of mainline master and ran a naive
test, and it seems that performance regressed quite a bit :(

I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
@1.90GHz), so this is more of a qualitative test, for BW only.
I use null_blk as the host device.
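(I load null_blk with something along these lines; my exact module
parameters may vary:

   modprobe null_blk nr_devices=1 submit_queues=4 queue_mode=2 \
       irqmode=0 completion_nsec=0

which creates /dev/nullb0 in multiqueue mode with no artificial
completion latency.)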

With mainline master I get ~9GB/s for 64k randread, and with your branch
I get ~5GB/s, regardless of whether I assign iothreads (one or
two) or not.

my qemu command:
taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
    -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
    -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
    -device virtio-blk-pci,drive=drive0,scsi=off \
    -nographic
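
For the iothread runs I added iothread objects and used the JSON device
syntax from your branch, roughly like this (my transcription; the
iothread names are placeholders):

   -object iothread,id=iothread0 -object iothread,id=iothread1 \
   -device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'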

my guest fio jobfile:
--
[global]
group_reporting
runtime=3000
time_based
loops=1
direct=1
invalidate=1
randrepeat=0
norandommap
exitall
cpus_allowed=0-3
cpus_allowed_policy=split

[read]
filename=/dev/vda
numjobs=4
iodepth=32
bs=64k
rw=randread
ioengine=io_uring
--

Maybe I'm doing something wrong? I didn't expect to find a regression
against mainline in the default setup.


