qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 6/6] specs/ppc-spapr-numa: update with new NUMA support


From: David Gibson
Subject: Re: [PATCH v2 6/6] specs/ppc-spapr-numa: update with new NUMA support
Date: Fri, 25 Sep 2020 13:43:43 +1000

On Thu, Sep 24, 2020 at 04:50:58PM -0300, Daniel Henrique Barboza wrote:
> This update provides more in depth information about the
> choices and drawbacks of the new NUMA support for the
> spapr machine.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  docs/specs/ppc-spapr-numa.rst | 213 ++++++++++++++++++++++++++++++++++
>  1 file changed, 213 insertions(+)
> 
> diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
> index e762038022..994bfb996f 100644
> --- a/docs/specs/ppc-spapr-numa.rst
> +++ b/docs/specs/ppc-spapr-numa.rst
> @@ -189,3 +189,216 @@ QEMU up to 5.1, as follows:
>  
>  This also means that user input in QEMU command line does not change the
>  NUMA distancing inside the guest for the pseries machine.
> +
> +New NUMA mechanics for pseries in QEMU 5.2
> +==========================================

I'd suggest putting the new mechanics - which we intend to become the
normal mechanis - at the top of the document, with the old version later.

> +
> +Starting in QEMU 5.2, the pseries machine now considers user input when
> +setting NUMA topology of the guest. The following changes were made:
> +
> +* ibm,associativity-reference-points was changed to {0x4, 0x3, 0x2, 0x1}, 
> allowing
> +  for 4 distinct NUMA distance values based on the NUMA levels
> +
> +* ibm,max-associativity-domains was changed to support multiple associativity
> +  domains in all NUMA levels. This is needed to ensure user flexibility
> +
> +* ibm,associativity for all resources now varies with user input
> +
> +These changes are only effective for pseries-5.2 and newer machines that are
> +created with more than one NUMA node (disconsidering NUMA nodes created by
> +the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been
> +around for such a long time, with users seeing NUMA distances 10 and 40
> +(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
> +existing experience of those guests.
> +
> +To bring the user experience x86 users have when tuning up NUMA, we had
> +to operate under the current pseries Linux kernel logic described in
> +`How the pseries Linux guest calculates NUMA distances`_. The result
> +is that we needed to translate NUMA distance user input to pseries
> +Linux kernel input.
> +
> +Translating user distance to kernel distance
> +--------------------------------------------
> +
> +User input for NUMA distance can vary from 10 to 254. We need to translate
> +that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
> +This is how it is being done:
> +
> +* user distance 11 to 30 will be interpreted as 20
> +* user distance 31 to 60 will be interpreted as 40
> +* user distance 61 to 120 will be interpreted as 80
> +* user distance 121 and beyond will be interpreted as 160
> +* user distance 10 stays 10
> +
> +The reasoning behind this aproximation is to avoid any round up to the local
> +distance (10), keeping it exclusive to the 4th NUMA level (which is still
> +exclusive to the node_id). All other ranges were chosen under the developer
> +discretion of what would be (somewhat) sensible considering the user input.
> +Any other strategy can be used here, but in the end the reality is that we'll
> +have to accept that a large array of values will be translated to the same
> +NUMA topology in the guest, e.g. this user input:
> +
> +::
> +
> +      0   1   2
> +  0  10  31 120
> +  1  31  10  30
> +  2 120  30  10
> +
> +And this other user input:
> +
> +::
> +
> +      0   1   2
> +  0  10  60  61
> +  1  60  10  11
> +  2  61  11  10
> +
> +Will both be translated to the same values internally:
> +
> +::
> +
> +      0   1   2
> +  0  10  40  80
> +  1  40  10  20
> +  2  80  20  10
> +
> +Users are encouraged to use only the kernel values in the NUMA definition to
> +avoid being taken by surprise with that the guest is actually seeing in the
> +topology. There are enough potential surprises that are inherent to the
> +associativity domain assignment process, discussed below.
> +
> +
> +How associativity domains are assigned
> +--------------------------------------
> +
> +LOPAPR allows more than one associativity array (or 'string') per allocated
> +resource. This would be used to represent that the resource has multiple
> +connections with the board, and then the operational system, when deciding
> +NUMA distancing, should consider the associativity information that provides
> +the shortest distance.
> +
> +The spapr implementation does not support multiple associativity arrays per
> +resource, neither does the pseries Linux kernel. We'll have to represent the
> +NUMA topology using one associativity per resource, which means that choices
> +and compromises are going to be made.
> +
> +Consider the following NUMA topology entered by user input:
> +
> +::
> +
> +      0   1   2   3
> +  0  10  20  20  40
> +  1  20  10  80  40
> +  2  20  80  10  20
> +  3  40  40  20  10
> +
> +Honoring just the relative distances of node 0 to every other node, one 
> possible
> +value for all associativity arrays would be:
> +
> +* node 0: 0 B A 0
> +* node 1: 0 0 A 1
> +* node 2: 0 0 A 2
> +* node 3: 0 B 0 3
> +
> +With the reference points {0x4, 0x3, 0x2, 0x1}, for node 0:
> +
> +* distance from 0 to 1 is 20 (no match at 0x4, will match at 0x3)
> +* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
> +* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
> +  at 0x2)
> +
> +The distances related to node 0 are well represented. Doing for node 1, and 
> keeping
> +in mind that we don't need to revisit node 0 again, the distance from node 1 
> to
> +2 is 80, matching at 0x4:
> +
> +* node 1: C 0 A 1
> +* node 2: C 0 A 2
> +
> +Over here we already have the first conflict. Even if we assign a new 
> associativity
> +domain at 0x4 for 1 and 2, and we do that in the code, the kernel will define
> +the distance between 1 and 2 as 20, not 80, because both 1 and 2 have the "A"
> +associativity domain from the previous step. If we decide to discard the
> +associativity with "A" then the node 0 distances are compromised.
> +
> +Following up with the distance from 1 to 3 being 40 (a match in 0x2) we have 
> another
> +decision to make. These are the current associativity domains of each:
> +
> +* node 1: C 0 A 1
> +* node 3: 0 B 0 3
> +
> +There is already an associativity domain at 0x2 in node 3, "B", which was 
> assigned
> +by the node 0 distances. If we define a new associativity domain at this 
> level
> +for 1 and 3 we will overwrite the existing associativity between 0 and 3. 
> What
> +the code is doing in this case is to assign the existing domain to the
> +current associativity, in this case, "B" is now assigned to the 0x2 of node 
> 1,
> +resulting in the following associativity arrays:
> +
> +* node 0: 0 B A 0
> +* node 1: C 0 A 1
> +* node 2: C B A 2
> +* node 3: 0 B 0 3
> +
> +In the last step we will analyze just nodes 2 and 3. The desired distance 
> between
> +2 and 3 is 20, i.e. a match in 0x3. Node 2 already has a domain assigned in 
> 0x3,
> +A, so we do the same as we did in the previous case and assign it to node 3
> +at 0x3. This is the end result for the associativity arrays:
> +
> +* node 0: 0 B A 0
> +* node 1: C 0 A 1
> +* node 2: C B A 2
> +* node 3: 0 B A 3
> +
> +The kernel will read these arrays and will calculate the following NUMA 
> topology for
> +the guest:
> +
> +::
> +
> +      0   1   2   3
> +  0  10  20  20  20
> +  1  20  10  20  20
> +  2  20  20  10  20
> +  3  20  20  20  10
> +
> +Which is not what the user wanted, but it is what the current logic and 
> implementation
> +constraints of the kernel and QEMU will provide inside the LOPAPR 
> specification.
> +
> +Changing a single value, specially a low distance value, makes for drastic 
> changes
> +in the result. For example, with the same user input from above, but 
> changing the
> +node distance from 0 to 1 to 40:
> +
> +::
> +
> +      0   1   2   3
> +  0  10  40  20  40
> +  1  40  10  80  40
> +  2  20  80  10  20
> +  3  40  40  20  10
> +
> +This is the result inside the guest, applying the same heuristics:
> +
> +::
> +
> +  $ numactl -H
> +  available: 4 nodes (0-3)
> +  (...)
> +  node distances:
> +  node   0   1   2   3
> +    0:  10  40  20  20
> +    1:  40  10  80  40
> +    2:  20  80  10  20
> +    3:  20  40  20  10
> +
> +This result is much closer to the user input and only a single distance was 
> changed
> +from the original.
> +
> +The kernel will always match with the shortest associativity domain 
> possible, and we're
> +attempting to retain the previous established relations between the nodes. 
> This means
> +that a distance equal to 20 between nodes A and B and the same distance 20 
> between nodes
> +A and F will cause the distance between B and F to also be 20. The same will 
> happen to
> +other distances, but shorter distances has precedent over it to the distance 
> calculation.
> +
> +Users are welcome to use this knowledge and experiment with the input to get 
> the
> +NUMA topology they want, or as closer as they want. The important thing is 
> to keep
> +expectations up to par with what we are capable of provide at this moment: an
> +approximation.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]