qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] numa: Set default distance map if needed


From: Gavin Shan
Subject: Re: [PATCH 1/2] numa: Set default distance map if needed
Date: Wed, 13 Oct 2021 10:32:18 +1100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.0

Hi Igor,

On 10/13/21 12:53 AM, Igor Mammedov wrote:
On Tue, 12 Oct 2021 15:13:08 +0200
Andrew Jones <drjones@redhat.com> wrote:
On Tue, Oct 12, 2021 at 02:27:54PM +0200, Igor Mammedov wrote:
On Tue, 12 Oct 2021 12:37:54 +0200
Andrew Jones <drjones@redhat.com> wrote:
On Tue, Oct 12, 2021 at 11:40:16AM +0200, Igor Mammedov wrote:
On Wed,  6 Oct 2021 18:22:08 +0800
Gavin Shan <gshan@redhat.com> wrote:
The following option is used to specify the distance map. It's
possible the option isn't provided by user. In this case, the
distance map isn't populated and exposed to platform. On the
other hand, the empty NUMA node, where no memory resides, is
allowed on ARM64 virt platform. For these empty NUMA nodes,
their corresponding device-tree nodes aren't populated, but
their NUMA IDs should be included in the "/distance-map"
device-tree node, so that kernel can probe them properly if
device-tree is used.

   -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance>

So when user doesn't specify distance map, we need to generate
the default distance map, where the local and remote distances
are 10 and 20 separately. This adds an extra parameter to the
exiting complete_init_numa_distance() to generate the default
distance map for this case.

Signed-off-by: Gavin Shan <gshan@redhat.com>


how about error-ing out if distance map is required but
not provided by user explicitly and asking user to fix
command line?

Reasoning behind this that defaults are hard to maintain
and will require compat hacks and being raod blocks down
the road.
Approach I was taking with generic NUMA code, is deprecating
defaults and replacing them with sanity checks, which bail
out on incorrect configuration and ask user to correct command line.
Hence I dislike approach taken in this patch.

If you really wish to provide default, push it out of
generic code into ARM specific one
(then I won't oppose it that much (I think PPC does
some magic like this))
Also behavior seems to be ARM specific so generic
NUMA code isn't a place for it anyways

The distance-map DT node and the default 10/20 distance-map values
aren't arch-specific. RISCV is using it too.

I'm on the fence with this. I see erroring-out to require users
to provide explicit command lines as a good thing, but I also
see it as potentially an unnecessary burden for those that want
the default map anyway. The optional nature of the distance-map
node and the specification of the default map is here [1]

[1] Linux source: Documentation/devicetree/bindings/numa.txt

Looking at proposed linux patches [ https://lkml.org/lkml/2021/9/27/31 ],
using optional distance table as source for numa-node-ids,
looks like a hack around kernel's inability to fish them out
from CPU &| PCI nodes (using those nodes as source should
cover memory-less node use-case).

I consider including optional node as a policy decision.
So user shall include it explicitly on QEMU command line
if necessary (that works just fine for x86), or guest OS
can make up defaults on its own in absence of data.

OK, so erroring-out on configs that must provide distance-maps, rather
than automatically generating them for all configs is better.

So, my r-b stands for this patch, but I also wouldn't complain
about respinning it to error out instead.
I would complain about
moving the logic to Arm specific code, though, since RISCV would
then need to duplicate it.

Instead of putting workaround in QEMU and then making them generic,
I'd prefer to:
  1. make QEMU to be able generate DT with memory-less nodes

How? DT syntax doesn't allow this, because each node needs a unique
name which is derived from its base address, which an empty numa
you are talking about memory@foo nodes, aren't you?

node doesn't have.

Looking at Documentation/devicetree/bindings/numa.txt

mem/cpu/pci nodes also contain numa-node-id attribute,
so idea is to collect IDs from all present sources
instead of abusing distance map.
That would allow QEMU to skip memory@foo elements for
memory-less nodes because they obviously do not exist
and there is no way to describe them using 'memory' nodes.


I don't think it's feasible because it's hard to elaborate NUMA node IDs
from this sort of sources. Apart from mem/cpu/pci, the NUMA node IDs
can be included into platform devices, which could be vendor specific
sometimes. Other type of devices, which I don't know, could include
NUMA node IDs either.

Besides, things become more complicated when hotplug is considered.
For example, the hot-added CPU is associated with a non-existing
NUMA node. The CPU hot-add fails until the associated NUMA node
is initialized. This means CPU/mem hotplug have to be twisted.

So the point is to elaborate the NUMA node IDs from the limited
source: mem/cpu/distance-map. The distance-map is optional in
current Linux implementation.

  2. fix guest to get numa-node-id from CPU/PCI nodes if
     memory node isn't present,

I'm not sure that's possible with DT. If it is, then proposing it
upstream to Linux DT maintainers would be the next step.
Added Rob to CC.


As explained above.


or use ACPI tables which can
     describe memory-less NUMA nodes if fixing how DT is
     parsed unfeasible.

We use ACPI already for our guests, but we also generate a DT (which
edk2 consumes). We can't generate a valid DT when empty numa nodes
does edk2 actually uses numa info from QEMU?

are put on the command line unless we follow a DT spec saying how
to do that. The current spec says we should have a distance-map
that contains those nodes.

can you point out to the spec and place within it, pls?


https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20211012&id=58ae0b51506802713aa0e9956d1853ba4c722c98
("Documentation, dt, numa: Add note to empty NUMA node")

Thanks,
Gavin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]