qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] coroutine: cap per-thread local pool size


From: Kevin Wolf
Subject: Re: [PATCH] coroutine: cap per-thread local pool size
Date: Tue, 19 Mar 2024 18:41:28 +0100

Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > > or "failed to set up stack guard page" during coroutine creation.
> > > > 
> > > > This happens because per-thread pools can grow to tens of thousands of
> > > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > > 
> > > This sounds quite alarming. What usage scenario is justified in
> > > creating so many coroutines?
> > 
> > Basically we try to allow pooling coroutines for as many requests as
> > there can be in flight at the same time. That is, adding a virtio-blk
> > device increases the maximum pool size by num_queues * queue_size. If
> > you have a guest with many CPUs, the default num_queues is relatively
> > large (the bug referenced by Stefan had 64), and queue_size is 256 by
> > default. That's 16k potential requests in flight per disk.
> 
> If we have more than 1 virtio-blk device, does that scale up the max
> coroutines too ?
> 
> eg would 32 virtio-blks devices imply 16k * 32 -> 512k potential
> requests/coroutines ?

Yes. This is the number of request descriptors that fit in the
virtqueues, and if you add another device with additional virtqueues,
then obviously that increases the number of theoretically possible
parallel requests.

The limits of what you can actually achieve in practice might be lower
because I/O might complete faster than the time we need to process all
of the queued requests, depending on how many vcpus are trying to
"compete" with how many iothreads. Of course, the practical limits in
five years might be different from today.

> > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > coroutines implies 10's of GB of memory just on stacks alone.
> > 
> > That's only virtual memory, though. Not sure how much of it is actually
> > used in practice.
> 
> True, by default Linux wouldn't care too much about virtual memory,
> Only if 'vm.overcommit_memory' is changed from its default, such
> that Linux applies an overcommit ratio on RAM, then total virtual
> memory would be relevant.

That's a good point and one that I don't have a good answer for, short
of just replacing the whole QEMU block layer with rsd and switching to
stackless coroutines/futures this way.

> > > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > > 
> > > On my system max_map_count is 1048576, quite alot higher than
> > > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > > ~500 GB of stacks !
> > 
> > Did you change the configuration some time in the past, or is this just
> > a newer default? I get 65530, and that's the same default number I've
> > seen in the bug reports.
> 
> It turns out it is a Fedora change, rather than a kernel change:
> 
>   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

Good to know, thanks.

> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > > 
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +    g_autofree char *contents = NULL;
> > > > +    int max_map_count;
> > > > +
> > > > +    /*
> > > > +     * Linux processes can have up to max_map_count virtual memory 
> > > > areas
> > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this 
> > > > limit. We
> > > > +     * must limit the coroutine pool to a safe size to avoid running 
> > > > out of
> > > > +     * VMAs.
> > > > +     */
> > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, 
> > > > NULL,
> > > > +                            NULL) &&
> > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +        /*
> > > > +         * This is a conservative upper bound that avoids exceeding
> > > > +         * max_map_count. Leave half for non-coroutine users like 
> > > > library
> > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 
> > > > VMAs so
> > > > +         * halve the amount again.
> > > > +         */
> > > > +        return max_map_count / 4;
> > > 
> > > That's 256,000 coroutines, which still sounds incredibly large
> > > to me.
> > 
> > The whole purpose of the limitation is that you won't ever get -ENOMEM
> > back, which will likely crash your VM. Even if this hard limit is high,
> > that doesn't mean that it's fully used. Your setting of 1048576 probably
> > means that you would never have hit the crash anyway.
> > 
> > Even the benchmarks that used to hit the problem don't even get close to
> > this hard limit any more because the actual number of coroutines stays
> > much smaller after applying this patch.
> 
> I'm more thinking about what's the worst case behaviour that a
> malicious guest can inflict on QEMU, and cause unexpectedly high
> memory usage in the host.
> 
> ENOMEM is bad for a friendy VM, but there's also the risk to the host
> from a unfriendly VM exploiting the high limits

But from a QEMU perspective, what is the difference between a friendly
high-performance VM that exhausts the available bandwidth to do its job
as good and fast as possible, and a malicious VM that does that same
just to waste host resources? I don't think QEMU can decide this, they
look the same.

If you want a VM not to send 16k requests in parallel, you can configure
its disk to expose less queues or a smaller queue size. The values I
mentioned above are only defaults that allow friendly VMs to perform
well out of the box, nothing prevents you from changing them to restrict
the amount of resources a VM can use.

Kevin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]