
Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory


From: David Hildenbrand
Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Date: Tue, 26 Sep 2023 18:54:53 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1

On 26.09.23 16:52, Ankit Agrawal wrote:
>>> Good idea.  Fundamentally the device should not be creating NUMA
>>> nodes; the VM should be configured with NUMA nodes and the device
>>> memory associated with those nodes.

>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>> would want memory-less nodes as well (imagine passing CXL memory to a VM
>> using virtio-mem).
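(As an aside: a memory-less node consumed by virtio-mem could be sketched
roughly as below; the ids, sizes, and node numbers are made up for
illustration and are not from the patch series:

         -m 4G,maxmem=20G \
         -object memory-backend-ram,id=m0,size=4G \
         -numa node,nodeid=0,memdev=m0 \
         -numa node,nodeid=1 \
         -object memory-backend-ram,id=vmem0,size=16G \
         -device virtio-mem-pci,id=vm0,memdev=vmem0,node=1,requested-size=4G

Node 1 has no boot memory; everything it gets comes from the virtio-mem
device.)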


> We actually do not add the device memory on the host; instead we map it
> into the QEMU VMA using remap_pfn_range(). Please check out the mmap
> function in the vfio-pci variant driver code managing the device:
> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> And I think a host memory backend would need memory that is added on the
> host.
>
> Moreover, since we want to pass through the entire device memory, the
> -object memory-backend-ram would have to be passed a size equal to the
> device memory. I wonder if that would be too much trouble for an admin
> (or libvirt) launching the QEMU process.
>
> Both these items are avoided by exposing the device memory as a BAR, as
> in the current implementation (referenced above), since it lets QEMU
> naturally discover the device memory region and mmap it.
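(To make the remap_pfn_range() point concrete, here is a minimal sketch of
such an mmap handler; the function name and device-memory base are
hypothetical stand-ins, not code from the referenced series:

    #include <linux/mm.h>
    #include <linux/vfio.h>

    /* Map device memory directly into the caller's (QEMU's) VMA. No host
     * RAM is allocated and no struct pages back this range. */
    static int demo_vfio_mmap(struct vfio_device *vdev,
                              struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        phys_addr_t dev_mem_pa = 0x100000000000ULL; /* hypothetical base */

        return remap_pfn_range(vma, vma->vm_start,
                               dev_mem_pa >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }

Since the pages never become host RAM, a host memory backend has nothing
to allocate or manage here.)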


Just to clarify: NUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
on the device, not on the memory backend.

e.g., -device pc-dimm,node=3,memdev=mem1,... (fuller example below)

Also CCing Gavin: I remember he once experimented with virtio-mem +
multiple memory-less nodes and it wasn't quite working (because of
MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).
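(For illustration, a fuller pc-dimm invocation of that shape; the ids,
sizes, and node numbers are made up for the example:

         -m 4G,slots=4,maxmem=20G \
         -object memory-backend-ram,id=m0,size=4G \
         -numa node,nodeid=0,memdev=m0 \
         -numa node,nodeid=1 \
         -object memory-backend-ram,id=mem1,size=8G \
         -device pc-dimm,node=1,memdev=mem1

The DIMM's memory is associated with node 1 via the device's "node"
property; the memory backend itself carries no node information.)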

> Agreed, but we will still have the aforementioned issues, viz.:
> 1. The backing memory for the memory device would need to be allocated
> on the host. However, we do not add the device memory on the host in
> this case. Instead, the QEMU VMA is mapped to the device memory physical
> address using remap_pfn_range().

I don't see why that would be necessary ...

> 2. The memory device needs to be passed an allocation size such that all
> of the device memory is mapped into the QEMU VMA. This may not be readily
> available to the admin/libvirt.

... or that. But your proposal roughly looks like what I had in mind, so let's focus on that.


> Based on the suggestions here, can we consider something like the
> following?
> 1. Introduce a new -numa subparam 'devnode', which tells QEMU to mark
> the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> structure to make it hotpluggable.

Is that "devnode=on" parameter required? Can't we simply expose any node that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

Right now, with "ordinary", fixed-location memory devices
(DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an SRAT entry that covers
the device memory region for these devices with MEM_AFFINITY_HOTPLUGGABLE.
We use the highest NUMA node in the machine, which does not quite work
IIRC. All applicable nodes that don't have boot memory would need
MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
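(Concretely, that "highest node" behavior: QEMU's SRAT build code covers
the whole hotpluggable device-memory region with a single entry, along the
lines of this sketch, paraphrased from hw/i386/acpi-build.c rather than
copied verbatim:

    if (hotpluggable_address_space_size) {
        build_srat_memory(table_data, machine->device_memory->base,
                          hotpluggable_address_space_size,
                          machine->numa_state->num_nodes - 1,
                          MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
    }

Only that last node ends up marked hotpluggable, which is why Linux won't
create the other memory-less nodes.)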

In your example, which memory ranges would we use for these nodes in SRAT?

> 2. Create several NUMA nodes with 'devnode' which are supposed to be
> associated with the vfio-pci device.
> 3. Pass the NUMA node start and count to associate the nodes created.

> So, the command would look something like the following:
> ...
>          -numa node,nodeid=2,devnode=on \
>          -numa node,nodeid=3,devnode=on \
>          -numa node,nodeid=4,devnode=on \
>          -numa node,nodeid=5,devnode=on \
>          -numa node,nodeid=6,devnode=on \
>          -numa node,nodeid=7,devnode=on \
>          -numa node,nodeid=8,devnode=on \
>          -numa node,nodeid=9,devnode=on \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Better would be an array/list, like "numa-nodes=2-9".

... but how would the device actually use these nodes? (Which node is used
for which part of the device memory?)

--
Cheers,

David / dhildenb



