qemu-devel
From: david.dai
Subject: Re: [PATCH] hw/misc: Add a virtual pci device to dynamically attach memory to QEMU
Date: Wed, 13 Oct 2021 16:13:37 +0800

On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand (david@redhat.com) 
wrote:
> 
> 
> 
> > > virtio-mem currently relies on having a single sparse memory region (anon
> > > mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can
> > > share memory with other processes, sharing with other VMs is not intended.
> > > Instead of actually mmaping parts dynamically (which can be quite
> > > expensive), virtio-mem relies on punching holes into the backend and
> > > dynamically allocating memory/file blocks/... on access.
> > > 
> > > So the easy way to make it work is:
> > > 
> > > a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device
> > > memory getting managed by the buddy on a separate NUMA node.
> > > 
> > 
> > The Linux kernel buddy system? How do we guarantee that other applications
> > don't allocate memory from it?
> 
> Excellent question. Usually, you would online the memory to ZONE_MOVABLE,
> such that even if some other allocation ended up there, it could get
> migrated somewhere else.
> 
> For example, "daxctl reconfigure-device" tries doing that as default:
> 
> https://pmem.io/ndctl/daxctl-reconfigure-device.html
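For reference, the default reconfiguration is a single daxctl command; the
device name below is just an example:

    # example device name; onlines the dax memory as (movable) system RAM
    daxctl reconfigure-device --mode=system-ram dax0.0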
> 
> However, I agree that we might actually want to tell the system to not
> use this CPU-less node as fallback for other allocations, and that we
> might not want to swap out such memory etc.
> 
> 
> But, in the end all that virtio-mem needs to work in the hypervisor is
> 
> a) A sparse memmap (anonymous RAM, memfd, file)
> b) A way to populate memory within that sparse memmap (e.g., on fault,
>    using madvise(MADV_POPULATE_WRITE), fallocate())
> c) A way to discard memory (madvise(MADV_DONTNEED),
>    fallocate(FALLOC_FL_PUNCH_HOLE))
> 
> So instead of using anonymous memory+mbind, you can also mmap a sparse file
> and rely on populate-on-demand. One alternative for your use case would be
> to create a DAX filesystem on that CXL memory (IIRC that should work) and
> simply provide virtio-mem with a sparse file located on that filesystem.
> 
> Of course, you can also use some other mechanism as you might have in
> your approach, as long as it supports a,b,c.
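To make a), b) and c) concrete, below is a minimal userspace sketch of those
three operations against a sparse file; the path and sizes are made up, and
error handling is omitted:

    /* a) sparse mapping, b) populate a block, c) discard it again.
     * Purely illustrative; MADV_POPULATE_WRITE (Linux 5.14+) could be
     * used instead of fallocate() for population. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t region = 1UL << 30;      /* 1 GiB sparse region */
        size_t block  = 2UL << 20;      /* one 2 MiB block */
        int fd = open("/mnt/cxl-dax/vm0.mem", O_CREAT | O_RDWR, 0600);

        ftruncate(fd, region);          /* a) sparse file, no blocks yet */
        char *base = mmap(NULL, region, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);

        fallocate(fd, 0, 0, block);     /* b) populate one file block */

        /* c) discard the block again, returning it to the backend */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, block);

        munmap(base, region);
        close(fd);
        return 0;
    }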
> 
> > 
> > > 
> > > b) (optional) allocate huge pages on that separate NUMA node.
> > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages),
> > > *binding* the memory backend to that special NUMA node.
> > > 
> > "-object memory-backend/device-ram or memory-device-memfd, id=mem0, 
> > size=768G"
> > How do we bind the backend memory to a NUMA node?
> > 
> 
> I think the syntax is "policy=bind,host-nodes=X"
> 
> whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for
> "5" "host-nodes=0x20", etc.
> 
> > > 
> > > This will dynamically allocate memory from that special NUMA node, 
> > > resulting
> > > in the virtio-mem device completely being backed by that device memory,
> > > being able to dynamically resize the memory allocation.
> > > 
> > > 
> > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs
> > > isn't really what we want and won't work without major design changes. 
> > > Also,
> > > I'm not so sure it's a very clean design: exposing memory belonging to 
> > > other
> > > VMs to unrelated QEMU processes. This sounds like a serious security hole:
> > > if you managed to escalate to the QEMU process from inside the VM, you can
> > > access unrelated VM memory quite happily. You want an abstraction
> > > in-between, that makes sure each VM/QEMU process only sees private memory:
> > > for example, the buddy via dax/kmem.
> > > 
> > Hi David
> > Thanks for your suggestion, also sorry for my delayed reply due to my long 
> > vacation.
> > How does the current virtio-mem dynamically attach memory to the guest?
> > Via page fault?
> 
> Essentially you have a large sparse mmap. Within that mmap, memory is
> populated on demand. Instead of mmap/munmap, you perform a single large
> mmap and then dynamically populate memory/discard memory.
> 
> Right now, memory is populated via page faults on access. This is
> sub-optimal when dealing with limited resources (i.e., hugetlbfs,
> file blocks) and you might run out of backend memory.
> 
> I'm working on a "prealloc" mode, which will preallocate/populate memory
> necessary for exposing the next block of memory to the VM, and which
> fails gracefully if preallocation/population fails in the case of such
> limited resources.
> 
> The patch resides on:
>       https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> 
> commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Aug 2 19:51:36 2021 +0200
> 
>     virtio-mem: support "prealloc=on" option
>     Especially for hugetlb, but also for file-based memory backends, we'd
>     like to be able to prealloc memory, especially to make user errors less
>     severe: crashing the VM when there are not sufficient huge pages around.
>     A common option for hugetlb will be using "reserve=off,prealloc=off" for
>     the memory backend and "prealloc=on" for the virtio-mem device. This
>     way, no huge pages will be reserved for the process, but we can recover
>     if there are no actual huge pages when plugging memory.
>     Signed-off-by: David Hildenbrand <david@redhat.com>
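The prealloc idea can be sketched in plain userspace terms: populate the
backing memory for the next block before exposing it, and fail gracefully
instead of letting the guest fault on unbacked memory later. This is only an
illustration, not the actual QEMU code:

    /* Populate one block up front; MADV_POPULATE_WRITE needs Linux 5.14+.
     * Names and the block concept are illustrative only. */
    #include <errno.h>
    #include <sys/mman.h>

    int prealloc_next_block(void *block_addr, size_t block_size)
    {
        if (madvise(block_addr, block_size, MADV_POPULATE_WRITE) == 0)
            return 0;   /* block is backed, safe to expose to the guest */

        /* e.g. ENOMEM on hugetlbfs: keep the block unplugged instead of
         * risking a crash when the guest touches unbacked memory later. */
        return -errno;
    }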
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

Hi David,

After reading the virtio-mem code, I understand what you have expressed.
Please allow me to describe my understanding of virtio-mem, so that we have
an aligned view.

Virtio-mem:
 The virtio-mem device reserves a memory area (GPA) at initialization; later
 dynamic growing/shrinking of memory will not exceed this scope. The
 memory-backend-ram maps anonymous memory over the whole area, but no RAM is
 attached yet because Linux delays allocation. When the virtio-mem driver
 wants to dynamically add memory to the guest, it first requests a region
 from the reserved memory area, then notifies the virtio-mem device to record
 that information (the virtio-mem device doesn't perform a real memory
 allocation). After receiving the response from the virtio-mem device, the
 virtio-mem driver onlines the requested region and adds it to the Linux page
 allocator. The real RAM allocation happens via page fault when a guest CPU
 accesses it. Memory shrinking is achieved via madvise().
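In host-userspace terms, that life cycle can be sketched roughly as follows,
with a plain anonymous mapping standing in for memory-backend-ram (sizes are
made up; this is not QEMU code):

    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t area  = 1UL << 30;   /* whole reserved area (1 GiB here) */
        size_t block = 2UL << 20;   /* one block the guest onlines      */

        /* Reservation only: no RAM is attached yet, allocation is delayed. */
        char *base = mmap(NULL, area, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Guest CPU touches the onlined block: pages are allocated on fault. */
        memset(base, 0, block);

        /* Shrink: discard the block's backing memory again. */
        madvise(base, block, MADV_DONTNEED);

        munmap(base, area);
        return 0;
    }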

Questions:
1. In heterogeneous computing, memory may be accessed by CPUs on the host
   side and on the device side, so delayed memory allocation is not suitable.
   Host software (for instance, OpenCL) may allocate a buffer for the
   computing device to place its results in.
2. We hope to build our own page allocator in the host kernel, so that it can
   offer a customized mmap() method to build the VA->PA mapping in the MMU
   and IOMMU (see the sketch after this list).
3. Some potential requirements also need our driver to manage memory, so that
   the page-size granularity can be controlled to fit a small device IOTLB
   cache. CXL also has a bias mode for HDM (host-managed device memory),
   which needs the physical address to switch the bias between host access
   and device access. These points tell us that driver-managed memory is
   mandatory.
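To illustrate point 2, such a driver-managed allocator would typically hook
the CPU (MMU) side through its own mmap handler, roughly like the
hypothetical sketch below; the IOMMU side would be mapped separately through
the kernel's IOMMU API. All names here are made up:

    /* Hypothetical character-device driver that owns a pool of device/CXL
     * pages and builds the VA->PA mapping itself in its mmap handler. */
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    /* assume the driver's own allocator hands out a physical range */
    extern phys_addr_t my_pool_alloc(size_t size);

    static int my_dev_mmap(struct file *file, struct vm_area_struct *vma)
    {
        size_t size = vma->vm_end - vma->vm_start;
        phys_addr_t pa = my_pool_alloc(size);

        if (!pa)
            return -ENOMEM;

        /* insert the VA->PA translation into the process page tables */
        return remap_pfn_range(vma, vma->vm_start, pa >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }

    static const struct file_operations my_dev_fops = {
        .owner = THIS_MODULE,
        .mmap  = my_dev_mmap,
    };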

My opinion:
 I hope this patch can enter the QEMU main tree. It is a self-contained
 virtual device which doesn't impact QEMU stability.
 It is a mechanism to dynamically attach memory to the guest: virtio-mem does
 it via page fault, while this patch creates a new memory region.
 In addition, the user has a lot of room to customize the frontend and
 backend implementation.
 It can be regarded as sample code that gives other people more ideas and
 help.

Thanks,
David




