From: David Hildenbrand
Subject: Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU
Date: Mon, 11 Oct 2021 09:43:53 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

virtio-mem currently relies on having a single sparse memory region (anonymous
mmap, mmapped file, mmapped huge pages, mmapped shmem) per VM. Although we can
share memory with other processes, sharing with other VMs is not intended.
Instead of actually mmapping parts dynamically (which can be quite
expensive), virtio-mem relies on punching holes into the backend and
dynamically allocating memory/file blocks/... on access.

So the easy way to make it work is:

a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
memory getting managed by the buddy on a separate NUMA node.


The Linux kernel buddy system? How do we guarantee that other applications
don't allocate memory from it?

Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
such that even if some other allocation ended up there, it could get
migrated somewhere else.

For example, "daxctl reconfigure-device" tries doing that as default:

https://pmem.io/ndctl/daxctl-reconfigure-device.html
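
(For a hypothetical device dax0.0, that would be something along the lines of
"daxctl reconfigure-device dax0.0 --mode=system-ram", which hotplugs the
device memory and, by default, onlines it to ZONE_MOVABLE.)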

However, I agree that we might actually want to tell the system to not
use this CPU-less node as fallback for other allocations, and that we
might not want to swap out such memory etc.


But in the end, all that virtio-mem needs in order to work in the hypervisor is:

a) A sparse memmap (anonymous RAM, memfd, file)
b) A way to populate memory within that sparse memmap (e.g., on fault,
   using madvise(MADV_POPULATE_WRITE), fallocate())
c) A way to discard memory (madvise(MADV_DONTNEED),
   fallocate(FALLOC_FL_PUNCH_HOLE))

So instead of using anonymous memory+mbind, you can also mmap a sparse file
and rely on populate-on-demand. One alternative for your use case would be
to create a DAX filesystem on that CXL memory (IIRC that should work) and
simply provide virtio-mem with a sparse file located on that filesystem.
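
To make a), b) and c) concrete, here is a minimal userspace sketch (not the
actual virtio-mem code; it assumes a Linux 5.14+ kernel and headers that
define MADV_POPULATE_WRITE, and the region/block sizes are made up):

/* Sketch of a sparse mapping with explicit populate/discard. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (1024UL * 1024 * 1024)  /* 1 GiB sparse region */
#define BLOCK_SIZE  (2UL * 1024 * 1024)     /* 2 MiB "memory block" */

int main(void)
{
    /* a) Reserve a large sparse mapping; nothing is populated yet. */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* b) Populate one block up front; this fails gracefully (e.g., with
     *    ENOMEM) instead of crashing later on access. */
    if (madvise(region, BLOCK_SIZE, MADV_POPULATE_WRITE)) {
        perror("madvise(MADV_POPULATE_WRITE)");
        munmap(region, REGION_SIZE);
        return EXIT_FAILURE;
    }
    memset(region, 0xaa, BLOCK_SIZE);   /* already backed, no faults here */

    /* c) Discard the block again; the backing memory is freed and the
     *    range reads back as zeroes on the next access. */
    if (madvise(region, BLOCK_SIZE, MADV_DONTNEED)) {
        perror("madvise(MADV_DONTNEED)");
    }

    munmap(region, REGION_SIZE);
    return EXIT_SUCCESS;
}

For a shared file-backed mapping (memfd, hugetlbfs file), the discard step
would instead be fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, length), so that the file blocks / huge pages actually get freed.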

Of course, you can also use some other mechanism as you might have in
your approach, as long as it supports a,b,c.



b) (optional) Allocate huge pages on that separate NUMA node.
c) Use an ordinary memory-backend-ram or memory-backend-memfd (for huge
pages), *binding* the memory backend to that special NUMA node.

"-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
How to bind backend memory to NUMA node


I think the syntax is "policy=bind,host-nodes=X"

whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for
"5" "host-nodes=0x20", etc.


This will dynamically allocate memory from that special NUMA node, resulting
in the virtio-mem device being completely backed by that device memory while
still being able to dynamically resize the memory allocation.
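
As a rough sketch (IDs, sizes and node numbers are made up, and this uses the
plain node-list form of "host-nodes" that -object accepts on the command
line, so double-check the binding syntax against your QEMU version), the
relevant options could look like:

  -m 4G,maxmem=772G \
  -object memory-backend-memfd,id=mem0,size=768G,policy=bind,host-nodes=1 \
  -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0

The device is then resized at runtime by adjusting its "requested-size"
property (e.g., via qom-set).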


Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
isn't really what we want and won't work without major design changes. Also,
I'm not so sure it's a very clean design: exposing memory belonging to other
VMs to unrelated QEMU processes. This sounds like a serious security hole:
if you managed to escalate to the QEMU process from inside the VM, you can
access unrelated VM memory quite happily. You want an abstraction
in-between, that makes sure each VM/QEMU process only sees private memory:
for example, the buddy via dax/kmem.

Hi David
Thanks for your suggestion, and sorry for my delayed reply due to a long
vacation.
How does the current virtio-mem dynamically attach memory to the guest - via
page faults?

Essentially you have a large sparse mmap. Within that mmap, memory is
populated on demand. Instead of mmap/munmap, you perform a single large
mmap and then dynamically populate/discard memory.

Right now, memory is populated via page faults on access. This is
sub-optimal when dealing with limited resources (e.g., hugetlbfs,
file blocks), where you might run out of backend memory.

I'm working on a "prealloc" mode, which will preallocate/populate memory
necessary for exposing the next block of memory to the VM, and which
fails gracefully if preallocation/population fails in the case of such
limited resources.

The patch resides at:
        https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next

commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 2 19:51:36 2021 +0200

    virtio-mem: support "prealloc=on" option

    Especially for hugetlb, but also for file-based memory backends, we'd
    like to be able to prealloc memory, especially to make user errors less
    severe: crashing the VM when there are not sufficient huge pages around.

    A common option for hugetlb will be using "reserve=off,prealloc=off" for
    the memory backend and "prealloc=on" for the virtio-mem device. This
    way, no huge pages will be reserved for the process, but we can recover
    if there are no actual huge pages when plugging memory.

    Signed-off-by: David Hildenbrand <david@redhat.com>
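
(For illustration, with made-up IDs and sizes, the combination described in
that commit message would look roughly like:

  -object memory-backend-memfd,id=mem0,hugetlb=on,size=64G,reserve=off,prealloc=off \
  -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0,prealloc=on

where "prealloc=on" on the virtio-mem device is the new option added by the
patch above.)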


--
Thanks,

David / dhildenb



