qemu-devel
From: 王洪浩
Subject: PING: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
Date: Tue, 29 Sep 2020 11:24:14 +0800

Hi, I'd like to know if there are any remaining problems with this patch,
or if there is a better implementation to improve the coroutine pool.
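
To recap the design for anyone skimming the thread: each thread keeps one
local batch of coroutines for its fast path, full batches are pushed onto a
global lock-free stack, and a thread whose local batch is empty takes exactly
one batch from that stack at a time. Below is a rough, standalone sketch of
the idea, not the patch itself; the names (Batch, pool_get, pool_put), the
stubbed-out Coroutine type and the plain C11 atomics are only for
illustration, and it leaves out the ABA/memory-reclamation handling that a
real lock-free stack needs.

#include <stdatomic.h>
#include <stdlib.h>

#define POOL_BATCH_SIZE 16

typedef struct Coroutine { int dummy; } Coroutine;  /* stand-in for QEMU's Coroutine */

typedef struct Batch {
    struct Batch *next;                  /* link in the global stack */
    unsigned size;                       /* number of valid entries */
    Coroutine *co[POOL_BATCH_SIZE];
} Batch;

static _Atomic(Batch *) global_stack;    /* lock-free (Treiber) stack of full batches */
static __thread Batch *local_batch;      /* per-thread batch, used on the fast path */

static void push_batch(Batch *b)         /* hand one full batch to the global stack */
{
    Batch *old = atomic_load(&global_stack);
    do {
        b->next = old;
    } while (!atomic_compare_exchange_weak(&global_stack, &old, b));
}

static Batch *pop_batch(void)            /* take exactly one batch, or NULL if empty */
{
    Batch *old = atomic_load(&global_stack);
    while (old && !atomic_compare_exchange_weak(&global_stack, &old, old->next)) {
        /* retry; a real implementation must also handle ABA and safe reclamation */
    }
    return old;
}

Coroutine *pool_get(void)                /* returns NULL when the caller must allocate */
{
    if (!local_batch || local_batch->size == 0) {
        Batch *b = pop_batch();          /* slow path: refill one whole batch */
        if (!b) {
            return NULL;                 /* fall back to a fresh qemu_coroutine_new() */
        }
        free(local_batch);
        local_batch = b;
    }
    return local_batch->co[--local_batch->size];    /* fast path: no atomics */
}

void pool_put(Coroutine *co)             /* release a coroutine back to the pool */
{
    if (!local_batch) {
        local_batch = calloc(1, sizeof(*local_batch));
    }
    if (local_batch->size == POOL_BATCH_SIZE) {
        push_batch(local_batch);         /* local batch full: publish it globally */
        local_batch = calloc(1, sizeof(*local_batch));
    }
    local_batch->co[local_batch->size++] = co;      /* fast path: no atomics */
}

Taking exactly one batch at a time is what keeps the per-thread pool bounded
at roughly POOL_BATCH_SIZE coroutines while both fast paths stay atomic-free.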

王洪浩 <wanghonghao@bytedance.com> wrote on Wed, Aug 26, 2020 at 2:06 PM:

>
> The purpose of this patch is to improve performance without increasing
> memory consumption.
>
> My test case:
> QEMU command line arguments
> -drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
>     -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
> -drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
>     -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \
>
> run these two fio jobs at the same time
> [job-vda]
> filename=/dev/vda
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=80
> direct=1
> numjobs=2
> runtime=60
>
> [job-vdb]
> filename=/dev/vdb
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=90
> direct=1
> numjobs=2
> loops=1
> runtime=60
>
> without this patch, tested 3 times:
> total iops: 278548.1, 312374.1, 276638.2
> with this patch, tested 3 times:
> total iops: 368370.9, 335693.2, 327693.1
>
> That's an 18.9% improvement on average.
>
> In addition, we are also using a distributed block storage whose I/O
> latency is much higher than that of local NVMe devices because of the
> network overhead, so it needs a higher iodepth (>=256) to reach its
> maximum throughput.
> Without this patch, there is a more than 5% chance of calling
> `qemu_coroutine_new` and the iops is less than 100K, while the iops is
> about 260K with this patch.
>
> On the other hand, a simpler way to reduce or eliminate the cost of
> `qemu_coroutine_new` would be to increase POOL_BATCH_SIZE, but that would
> also bring much more memory consumption, which we don't want. Hence this
> patch.
>
> Stefan Hajnoczi <stefanha@redhat.com> wrote on Tue, Aug 25, 2020 at 10:52 PM:
> >
> > On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> > > This patch replaces the global coroutine queue with a lock-free stack
> > > whose elements are coroutine queues. Threads can put coroutine queues
> > > into the stack or take queues from it, and each coroutine queue holds
> > > exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly
> > > LIFO, but that's good enough for a buffer pool.
> > >
> > > Coroutines are put into thread-local pools first on release. Now the
> > > fast paths of both allocation and release are atomic-free, and not too
> > > many coroutines remain in a single thread since POOL_BATCH_SIZE has
> > > been reduced to 16.
> > >
> > > In practice, I've run a VM with two block devices bound to two different
> > > iothreads, and run fio with iodepth 128 on each device. Without this
> > > patch it maintains around 400 coroutines and has about a 1% chance of
> > > calling `qemu_coroutine_new`. With this patch, it maintains no more than
> > > 273 coroutines and doesn't call `qemu_coroutine_new` after the initial
> > > allocations.
> >
> > Does throughput or IOPS change?
> >
> > Is the main purpose of this patch to reduce memory consumption?
> >
> > Stefan


