qemu-devel

Re: virtio-blk using a single iothread


From: Stefan Hajnoczi
Subject: Re: virtio-blk using a single iothread
Date: Wed, 21 Jun 2023 14:23:07 +0200

Hi Sagi,
I just got back from a conference and am going to be offline for a
week starting tomorrow. I haven't had time to look through your email
but will reply when I'm back from vacation.

Stefan

On Sun, 11 Jun 2023 at 14:29, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 6/8/23 19:08, Stefan Hajnoczi wrote:
> > On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> >> Hey Stefan, Paolo,
> >>
> >> I just had a report from a user experiencing lower virtio-blk
> >> performance than he expected. This user is running virtio-blk on top of
> >> an nvme-tcp device. The guest is running with 12 CPU cores.
> >>
> >> The guest read/write throughput is capped at around 30% of the available
> >> throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
> >> host - 25Gb/s NIC). The workload running on the guest is a
> >> multi-threaded fio workload.
> >>
> >> What we observed is that virtio-blk is using a single disk-wide
> >> iothread to process all the vqs. Specifically, nvme-tcp (similar to other
> >> TCP-based protocols) is negatively impacted by the lack of thread
> >> concurrency that could distribute I/O requests to different TCP
> >> connections.
> >>
> >> We also attempted to move the iothread to a dedicated core, however that
> >> did not yield any meaningful performance improvement. The reason appears
> >> to be less about CPU utilization on the iothread core and more about
> >> single TCP connection serialization.
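
For reference, moving an IOThread to a dedicated core is typically done by looking up its host thread id via the QMP query-iothreads command and pinning it with taskset. A minimal sketch, assuming the VM was started with a QMP socket at /tmp/qmp.sock (the socket path and core number are placeholders):

  # query-iothreads reports each IOThread's host thread id
  (echo '{"execute":"qmp_capabilities"}'; \
   echo '{"execute":"query-iothreads"}'; sleep 1) | nc -U /tmp/qmp.sock
  # -> ... {"id": "iothread0", "thread-id": 12345, ...}

  # pin that host thread to an otherwise idle core, e.g. core 5
  taskset -pc 5 12345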
> >>
> >> Moving to io=threads does increase the throughput, however it sacrifices
> >> latency significantly.
> >>
> >> So the user finds itself with available host CPUs and TCP connections
> >> that it could easily use to get maximum throughput, but without the
> >> ability to leverage them. True, other guests will use different
> >> threads/contexts, however the goal here is to get the full performance
> >> from a single device.
> >>
> >> I've seen several discussions and attempts in the past to allow a
> >> virtio-blk device to leverage multiple iothreads, but the discussions
> >> paused around 2 years ago. So I wanted to ask: are there any plans
> >> or anything in the works to address this limitation?
> >>
> >> I've seen that the spdk folks are heading in this direction with their
> >> vhost-blk implementation:
> >> https://review.spdk.io/gerrit/c/spdk/spdk/+/16068
> >
> > Hi Sagi,
> > Yes, there is an ongoing QEMU multi-queue block layer effort to make it
> > possible for multiple IOThreads to process disk I/O for the same
> > --blockdev in parallel.
>
> Great to know.
>
> > Most of my recent QEMU patches have been part of this effort. There is a
> > work-in-progress branch that supports mapping virtio-blk virtqueues to
> > specific IOThreads:
> > https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping
>
> Thanks for the pointer.
>
> > The syntax is:
> >
> >    --device 
> > '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'
> >
> > This says "assign virtqueues round-robin to iothread0 and iothread1".
> > Half the virtqueues will be processed by iothread0 and the other half by
> > iothread1. There is also syntax for assigning specific virtqueues to
> > each IOThread, but usually the automatic round-robin assignment is all
> > that's needed.
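
The iothread0/iothread1 ids referenced in that mapping are ordinary IOThread objects defined with standard QEMU syntax, so a minimal sketch of the relevant options (drive0 is a placeholder for an existing -drive or -blockdev) would be:

  -object iothread,id=iothread0 \
  -object iothread,id=iothread1 \
  -device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'

The explicit per-virtqueue assignment mentioned above presumably adds a list of virtqueue indices to each mapping entry; the exact key is not shown in this thread, so the sketch sticks to the round-robin form.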
> >
> > This work is not finished yet. Basic I/O (e.g. fio) works without
> > crashes, but expect to hit issues if you use blockjobs, hotplug, etc.
> >
> > Performance optimization work has just begun, so it won't deliver all
> > the benefits yet. I ran a benchmark yesterday where going from 1 to 2
> > IOThreads increased performance by 25%. That's much less than we're
> > aiming for; attaching two independent virtio-blk devices improves the
> > performance by ~100%. I know we can get there eventually. Some of the
> > bottlenecks are known (e.g. block statistics collection causes lock
> > contention) and others are yet to be investigated.
>
> Hmm, I rebased this branch on top of mainline master and ran a naive
> test, and it seems that performance regressed quite a bit :(
>
> I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
> @1.90GHz), so this is more of a qualitative test for BW only.
> I use null_blk as the host device.
>
> With mainline master I get ~9GB/s 64k randread, and with your branch
> I get ~5GB/s. This is regardless of whether I assign iothreads (one or
> two) or not.
>
> my qemu command:
> taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
>   -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
>   -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
>   -device virtio-blk-pci,drive=drive0,scsi=off -nographic
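
For comparison, a variant of this command that assigns two IOThreads using the mapping syntax quoted earlier in the thread might look roughly like the following (a sketch, not a tested command line; the iothread ids are placeholders):

  taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
    -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
    -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
    -object iothread,id=iothread0 -object iothread,id=iothread1 \
    -device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}' \
    -nographic

Note that with the whole QEMU process confined to cores 0-3, the four vCPUs and the IOThreads contend for the same cores, which may mask any benefit from the extra IOThread.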
>
> my guest fio jobfile:
> --
> [global]
> group_reporting
> runtime=3000
> time_based
> loops=1
> direct=1
> invalidate=1
> randrepeat=0
> norandommap
> exitall
> cpus_allowed=0-3
> cpus_allowed_policy=split
>
> [read]
> filename=/dev/vda
> numjobs=4
> iodepth=32
> bs=64k
> rw=randread
> ioengine=io_uring
> --
>
> Maybe I'm doing something wrong? I didn't expect to find a regression
> against mainline with the default setup.
>


