From: Jonathan Cameron
Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Date: Wed, 27 Sep 2023 12:33:18 +0100

On Wed, 27 Sep 2023 07:14:28 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> > >
> > > Based on the suggestions here, can we consider something like the
> > > following?
> > > 1. Introduce a new -numa subparam 'devnode', which tells QEMU to mark
> > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > > structure to make it hotpluggable.  
> >
> > Is that "devnode=on" parameter required? Can't we simply expose any node
> > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

That needs some checking against the extra node types we instantiate: CPU-only
nodes and, once we implement them, Generic Initiator / Generic Port nodes.
I'm definitely not keen on doing so for Generic Ports (which QEMU doesn't yet
support, though I think there have been some RFCs).

> > Right now, with "ordinary", fixed-location memory devices
> > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an SRAT entry that
> > covers the device memory region for these devices with
> > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> > which does not quite work IIRC. All applicable nodes that don't have
> > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.  
> 
> Yeah, you're right that it isn't required. Exposing nodes without any memory
> as MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
> "devnode=on".
> 
> > In your example, which memory ranges would we use for these nodes in SRAT?  
> 
> We are setting the Base Address and Size to 0 in the SRAT memory affinity
> structures. This is done through the following call:
> build_srat_memory(table_data, 0, 0, i,
>                   MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> 
> This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug
> 
> I would reiterate that we are just emulating the bare-metal behavior here.
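
For reference, here is a minimal sketch of how such zero-length hotpluggable
entries could be generated in QEMU's ACPI table build code. It reuses the
existing build_srat_memory() helper and NUMA state, but the wrapper function
itself (build_hotpluggable_nodes) and its call site are only illustrative,
not what the patch actually does:

    #include "qemu/osdep.h"
    #include "hw/boards.h"          /* MachineState */
    #include "sysemu/numa.h"        /* NumaState, NodeInfo */
    #include "hw/acpi/aml-build.h"  /* build_srat_memory(), MEM_AFFINITY_* */

    static void build_hotpluggable_nodes(GArray *table_data, MachineState *ms)
    {
        NumaState *ns = ms->numa_state;
        int i;

        for (i = 0; i < ns->num_nodes; i++) {
            /* Nodes with boot memory already get a real SRAT range. */
            if (ns->nodes[i].node_mem) {
                continue;
            }
            /*
             * Memory-less nodes intended for device memory: emit a
             * zero-length entry so Linux still creates the node and
             * treats it as hotpluggable.
             */
            build_srat_memory(table_data, 0, 0, i,
                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
        }
    }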
> 
> 
> > I don't see how these numa-node args on a vfio-pci device have any
> > general utility.  They're only used to create a firmware table, so why
> > not be explicit about it and define the firmware table as an
> > object?  For example:
> >
> >        -numa node,nodeid=2 \
> >        -numa node,nodeid=3 \
> >        -numa node,nodeid=4 \
> >        -numa node,nodeid=5 \
> >        -numa node,nodeid=6 \
> >        -numa node,nodeid=7 \
> >        -numa node,nodeid=8 \
> >        -numa node,nodeid=9 \
> >        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
> >        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \  
> 
> Yeah, that is fine with me. If we agree with this approach, I can go
> implement it.
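
If it helps, here is an illustrative-only QOM skeleton for such a
user-creatable object, with the "devid" and "nodeset" properties from the
example command line above. The type name and properties are hypothetical
(this is not an existing QEMU interface), and the ACPI build hook that would
consume them is left out:

    #include "qemu/osdep.h"
    #include "qemu/module.h"
    #include "qom/object.h"
    #include "qom/object_interfaces.h"

    #define TYPE_NVIDIA_GPU_MEM_ACPI "nvidia-gpu-mem-acpi"
    OBJECT_DECLARE_SIMPLE_TYPE(NvidiaGpuMemAcpi, NVIDIA_GPU_MEM_ACPI)

    struct NvidiaGpuMemAcpi {
        Object parent_obj;
        char *devid;    /* id= of the vfio-pci device being described */
        char *nodeset;  /* NUMA nodes holding its device memory, e.g. "2-9" */
    };

    static char *get_devid(Object *obj, Error **errp)
    {
        return g_strdup(NVIDIA_GPU_MEM_ACPI(obj)->devid);
    }

    static void set_devid(Object *obj, const char *value, Error **errp)
    {
        NvidiaGpuMemAcpi *s = NVIDIA_GPU_MEM_ACPI(obj);
        g_free(s->devid);
        s->devid = g_strdup(value);
    }

    static char *get_nodeset(Object *obj, Error **errp)
    {
        return g_strdup(NVIDIA_GPU_MEM_ACPI(obj)->nodeset);
    }

    static void set_nodeset(Object *obj, const char *value, Error **errp)
    {
        NvidiaGpuMemAcpi *s = NVIDIA_GPU_MEM_ACPI(obj);
        g_free(s->nodeset);
        s->nodeset = g_strdup(value);
    }

    static void nvidia_gpu_mem_acpi_class_init(ObjectClass *oc, void *data)
    {
        object_class_property_add_str(oc, "devid", get_devid, set_devid);
        object_class_property_add_str(oc, "nodeset", get_nodeset, set_nodeset);
    }

    static const TypeInfo nvidia_gpu_mem_acpi_info = {
        .name          = TYPE_NVIDIA_GPU_MEM_ACPI,
        .parent        = TYPE_OBJECT,
        .instance_size = sizeof(NvidiaGpuMemAcpi),
        .class_init    = nvidia_gpu_mem_acpi_class_init,
        .interfaces    = (InterfaceInfo[]) {
            { TYPE_USER_CREATABLE },
            { }
        },
    };

    static void register_types(void)
    {
        type_register_static(&nvidia_gpu_mem_acpi_info);
    }

    type_init(register_types)

The machine's ACPI build code would then look the object up by its "devid"
and emit the corresponding firmware table entries.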
> 
> 
> > There are some suggestions in this thread that CXL could have similar
> > requirements,

On the CXL side of things, if we're talking about memory devices (type 3), I'm
not sure what the use case for this feature would be.
Either we treat them as normal memory, in which case it will all be static
at boot of the VM (for SRAT anyway; we might still plug things in and
out of those ranges), or it will be whole-device hotplug and look like pc-dimm
hotplug (which should go into a statically defined range in SRAT).
Longer term, if we look at virtualizing dynamic capacity
devices (not sure we need to, other than possibly to leverage
sparse DAX etc. on top of them), then we might want to provide
emulated CXL fixed memory windows in the guest (which get their own
NUMA nodes anyway) and plug the memory into those. We'd probably hide
away interleaving etc. in the host, as all the guest should care about
is performance information, and I doubt we'd want to emulate the
complexity of address routing.

Similar to the host PA ranges used for CXL fixed memory windows, I'm not sure
we wouldn't just let the guest cover 'all' possible setups that
might get plugged later, by burning a lot of HPA space up front and hence
being able to use static SRAT nodes covering each region.
This would be less painful than with real PAs because, as we are
emulating the CXL devices (probably one emulated type 3 device per
potential set of real devices in an interleave set), we can avoid
the ordering constraints of CXL address decoders that end up eating
host PA space.

Virtualizing DCD is going to be a fun topic (that's next year's
plumbers CXL uconf session sorted ;), but I can see it might be done completely
differently and look nothing like a CXL device, in which case maybe
what you have here will make sense.

Come to think of it, you 'could' potentially do that for your use case,
if the regions are reasonably bounded in maximum size, at the cost of
large GPA usage?

CXL accelerators / GPUs etc. are a different question, but who has one
of those anyway? :)


> > but I haven't found any evidence that these
> > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> > any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> > rather than an NVIDIA specific one.  
> 
> Maybe Jason, Jonathan can chime in on this?

I'm not aware of anything general around this.  A PCI device
can have a _PXM, and I think you could define subdevices each with a
_PXM of their own?  Those subdevices would need drivers to interpret
the structure anyway, so there's no real benefit over a _DSD that I can
immediately think of...
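
For concreteness, here is a sketch of how dev-mem-pxm-start / dev-mem-pxm-count
could be emitted as _DSD device properties with QEMU's AML helpers. The
property names are the ones quoted above and are not standardized; the wrapper
function itself is hypothetical:

    /*
     * Hypothetical only: attach dev-mem-pxm-start / dev-mem-pxm-count as
     * _DSD device properties on an AML device node, using QEMU's existing
     * AML helpers from hw/acpi/aml-build.h.
     */
    static void build_dev_mem_pxm_dsd(Aml *dev, uint64_t pxm_start,
                                      uint64_t pxm_count)
    {
        Aml *dsd = aml_package(2);
        Aml *props = aml_package(2);
        Aml *start = aml_package(2);
        Aml *count = aml_package(2);

        aml_append(start, aml_string("dev-mem-pxm-start"));
        aml_append(start, aml_int(pxm_start));
        aml_append(count, aml_string("dev-mem-pxm-count"));
        aml_append(count, aml_int(pxm_count));
        aml_append(props, start);
        aml_append(props, count);

        /* Standard ACPI "Device Properties" UUID for _DSD */
        aml_append(dsd, aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301"));
        aml_append(dsd, props);
        aml_append(dev, aml_name_decl("_DSD", dsd));
    }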

If we think this will be common long term, anyone want to take
multiple _PXM per device support as a proposal to ACPI?

So agreed, it's not general. If it's acceptable to have zero-length
NUMA nodes (and I think we have to emulate them, given that's what the real
hardware is doing, even if some of us think the real hardware shouldn't
have done that!), then just spinning them up explicitly as nodes
plus device-specific stuff for the NVIDIA device seems fine to me.

> 
> 
> > It seems like we could almost meet the requirement for this table via
> > -acpitable, but I think we'd like to avoid having the VM orchestration tool
> > create, compile, and pass ACPI data blobs into the VM.  
> 



