From: David Hildenbrand
Subject: Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU
Date: Wed, 13 Oct 2021 10:33:39 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 13.10.21 10:13, david.dai wrote:
On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) 
wrote:



virtio-mem currently relies on having a single sparse memory region (anon
mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
share memory with other processes, sharing with other VMs is not intended.
Instead of actually mmaping parts dynamically (which can be quite
expensive), virtio-mem relies on punching holes into the backend and
dynamically allocating memory/file blocks/... on access.

So the easy way to make it work is:

a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
memory getting managed by the buddy on a separate NUMA node.


Linux kernel buddy system? How do we guarantee that other applications don't
allocate memory from it?

Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
such that even if some other allocation ended up there, it could get migrated
somewhere else.

For example, "daxctl reconfigure-device" tries doing that as default:

https://pmem.io/ndctl/daxctl-reconfigure-device.html
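
IIRC the documented invocation is something like

    daxctl reconfigure-device --mode=system-ram dax0.0

which converts the devdax instance into ordinary system RAM on its own NUMA
node, onlined movable by default (untested here, adjust the device name to
your setup).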

However, I agree that we might actually want to tell the system to not
use this CPU-less node as fallback for other allocations, and that we
might not want to swap out such memory etc.


But, in the end all that virtio-mem needs to work in the hypervisor is

a) A sparse memmap (anonymous RAM, memfd, file)
b) A way to populate memory within that sparse memmap (e.g., on fault,
    using madvise(MADV_POPULATE_WRITE), fallocate())
c) A way to discard memory (madvise(MADV_DONTNEED),
    fallocate(FALLOC_FL_PUNCH_HOLE))

So instead of using anonymous memory+mbind, you can also mmap a sparse file
and rely on populate-on-demand. One alternative for your use case would be to
create a DAX filesystem on that CXL memory (IIRC that should work) and simply
provide virtio-mem with a sparse file located on that filesystem.

Of course, you can also use some other mechanism as you might have in
your approach, as long as it supports a,b,c.
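
Just to illustrate a), b) and c) with anonymous memory, a rough userspace
sketch (not how QEMU implements it; MADV_POPULATE_WRITE needs Linux 5.14+
headers, and with a memfd/file backend you'd discard via
fallocate(FALLOC_FL_PUNCH_HOLE) instead):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1ULL << 30)  /* 1 GiB sparse region */
#define BLOCK_SIZE  (2ULL << 20)  /* 2 MiB block to plug/unplug */

int main(void)
{
    /* a) sparse memmap: reserve address space, nothing is backed yet */
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* b) populate one block up front; fails gracefully if memory is short */
    if (madvise(region, BLOCK_SIZE, MADV_POPULATE_WRITE)) {
        perror("madvise(MADV_POPULATE_WRITE)");
        return 1;
    }

    /* ... the block is now backed and can be exposed to the guest ... */

    /* c) discard the block again, freeing the backing memory */
    if (madvise(region, BLOCK_SIZE, MADV_DONTNEED)) {
        perror("madvise(MADV_DONTNEED)");
        return 1;
    }

    munmap(region, REGION_SIZE);
    return 0;
}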



b) (optional) allocate huge pages on that separate NUMA node.
c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
*binding* the memory backend to that special NUMA node.

"-object memory-backend/device-ram or memory-device-memfd, id=mem0, size=768G"
How do we bind the backend memory to a NUMA node?


I think the syntax is "policy=bind,host-nodes=X"

whereby X is the host NUMA node ID (the property is a list, so ranges work as
well). So for node "0" you'd use "host-nodes=0", for "5" "host-nodes=5", etc.


This will dynamically allocate memory from that special NUMA node, resulting
in the virtio-mem device being completely backed by that device memory and
able to dynamically resize the memory allocation.


Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
isn't really what we want and won't work without major design changes. Also,
I'm not so sure it's a very clean design: exposing memory belonging to other
VMs to unrelated QEMU processes. This sounds like a serious security hole:
if you managed to escalate to the QEMU process from inside the VM, you can
access unrelated VM memory quite happily. You want an abstraction
in-between, that makes sure each VM/QEMU process only sees private memory:
for example, the buddy via dax/kmem.

Hi David
Thanks for your suggestion, and sorry for my delayed reply due to my long
vacation.
How does the current virtio-mem dynamically attach memory to the guest, via page fault?

Essentially you have a large sparse mmap. Within that mmap, memory is
populated on demand. Instead of mmap/munmap, you perform a single large
mmap and then dynamically populate/discard memory.

Right now, memory is populated via page faults on access. This is
sub-optimal when dealing with limited resources (i.e., hugetlbfs,
file blocks) and you might run out of backend memory.

I'm working on a "prealloc" mode, which will preallocate/populate memory
necessary for exposing the next block of memory to the VM, and which
fails gracefully if preallocation/population fails in the case of such
limited resources.

The patch resides on:
        https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next

commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Aug 2 19:51:36 2021 +0200

     virtio-mem: support "prealloc=on" option
     Especially for hugetlb, but also for file-based memory backends, we'd
     like to be able to prealloc memory, especially to make user errors less
     severe: crashing the VM when there are not sufficient huge pages around.
     A common option for hugetlb will be using "reserve=off,prealloc=off" for
     the memory backend and "prealloc=on" for the virtio-mem device. This
     way, no huge pages will be reserved for the process, but we can recover
     if there are no actual huge pages when plugging memory.
     Signed-off-by: David Hildenbrand <david@redhat.com>
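
With that patch, the hugetlb setup described in the commit message would look
something like this (sketch only, using the option names from that branch):

-object memory-backend-memfd,id=mem0,hugetlb=on,hugetlbsize=2M,size=768G,reserve=off,prealloc=off \
-device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,prealloc=on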


--
Thanks,

David / dhildenb


Hi David,

After reading the virtio-mem code, I understand what you have expressed. Please
allow me to describe my understanding of virtio-mem, so that we have an aligned
view.

Virtio-mem:
  The virtio-mem device initializes and reserves a memory area (GPA); later
  dynamic growing/shrinking of memory will not exceed this scope.
  memory-backend-ram has mapped anonymous memory to the whole area, but no RAM
  is attached because Linux has a policy to delay allocation.

Right, but it can also be any sparse file (memory-backend-memfd, memory-backend-file).
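
For example (untested), with a DAX filesystem created on the CXL memory and
mounted at, say, /mnt/cxl-daxfs, something like

-object memory-backend-file,id=mem0,mem-path=/mnt/cxl-daxfs/vm0,size=768G

gives you a sparse file as the backing store, populated only for plugged
blocks.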

  When the virtio-mem driver wants to dynamically add memory to the guest, it
  first requests a region from the reserved memory area, then notifies the
  virtio-mem device to record the information (the virtio-mem device doesn't
  make a real memory allocation). After receiving the response from

In the upcoming prealloc=on mode I referenced, the allocation will happen before the guest is notified about success and starts using the memory.

With vfio/mdev support, the allocation will happen nowadays already, when vfio/mdev is notified about the populated memory ranges (see RamDiscardManager). That's essentially what makes virtio-mem device passthrough work.

  the virtio-mem device, the virtio-mem driver will online the requested region
  and add it to the Linux page allocator. Real RAM allocation will happen via
  page fault when the guest CPU accesses it. Memory shrinking will be achieved
  by madvise().

Right, but you could write a custom virtio-mem driver that pools this memory differently.

Memory shrinking in the hypervisor is either done using madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).


Questions:
1. Heterogeneous computing: memory may be accessed by CPUs on the host side and
    on the device side. Delayed memory allocation is not suitable here. Host
    software (for instance, OpenCL) may allocate a buffer for the computing
    device to place the computing result in.

That already works with virtio-mem and vfio/mdev via the RamDiscardManager infrastructure introduced recently. With "prealloc=on", the delayed memory allocation can also be avoided without vfio/mdev.

2. We hope to build our own page allocator in the host kernel, so it can offer a
    customized mmap() method to build the va->pa mapping in the MMU and IOMMU.

Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev via the RamDiscardManager. From there, you can issue whatever syscall you need to populate memory when plugging new memory blocks. All you need to support is a sparse mmap and a way to populate/discard memory. Populate/discard could be wired up in QEMU virtio-mem code as you need it.
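
As a rough sketch of what that contract could boil down to for such a driver
(the device node, ioctl numbers and struct below are made up for illustration,
not a real API):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical ioctls of a custom host memory-manager driver. */
struct mymem_range {
    unsigned long long offset;  /* offset into the sparse mapping */
    unsigned long long size;
};
#define MYMEM_IOC_POPULATE _IOW('M', 1, struct mymem_range)
#define MYMEM_IOC_DISCARD  _IOW('M', 2, struct mymem_range)

int main(void)
{
    int fd = open("/dev/mymem0", O_RDWR);   /* hypothetical device node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Sparse mapping of the device-managed region; nothing is backed yet. */
    size_t region_size = 16ULL << 30;
    void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Plug a block: ask the driver to back [0, 2 MiB) with real memory,
     * setting up MMU/IOMMU mappings as it sees fit. */
    struct mymem_range r = { .offset = 0, .size = 2ULL << 20 };
    if (ioctl(fd, MYMEM_IOC_POPULATE, &r)) {
        perror("populate");
        return 1;
    }

    /* ... the plugged block is used ... */

    /* Unplug the block: hand it back to the driver's allocator. */
    if (ioctl(fd, MYMEM_IOC_DISCARD, &r)) {
        perror("discard");
        return 1;
    }

    munmap(region, region_size);
    close(fd);
    return 0;
}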

3. Some potential requirements also require our driver to manage memory, so that
    the page size granularity can be controlled to fit a small device IOTLB cache.
    CXL has a bias mode for HDM (host-managed device memory); it needs the physical
    address to switch the bias mode between host access and device access. These
    tell us that having the driver manage memory is mandatory.

I think if you write your driver in a certain way and wire it up in QEMU virtio-mem accordingly (e.g., using a new memory-backend-whatever), that shouldn't be an issue.


--
Thanks,

David / dhildenb



