[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-ppc] [RFC PATCH v0 for 2.7] spapr: Work around the memory hotp
From: |
Michael Roth |
Subject: |
Re: [Qemu-ppc] [RFC PATCH v0 for 2.7] spapr: Work around the memory hotplug failure with DDW |
Date: |
Tue, 03 May 2016 18:30:51 -0500 |
User-agent: |
alot/0.3.6 |
Quoting Bharata B Rao (2016-04-19 06:51:41)
> Memory hotplug can fail for some combinations of RAM and maxmem when
> DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> on maximum addressable memory by guest and this value is currently
> being calculated wrongly by the guest kernel routine memory_hotplug_max().
> While there is an attempt to fix the guest kernel(*), this patch works
> around the problem within QEMU itself.
>
> memory_hotplug_max() routine in the guest kernel arrives at max
> addressable memory by multiplying lmb-size with the lmb-count obtained
> from ibm,dyanmic-memory property. There are two assumptions here:
>
> - All LMBs are part of ibm,dynamic memory: This is not true for PowerKVM
> where only hot-pluggable LMBs are present in this property.
> - The memory area comprising of RAM and hotplug region is contiguous: This
> needn't be true always for PowerKVM as there can be gap between
> boot time RAM and hotplug region.
>
> This work around involves having all the LMBs (RMA, rest of the boot time
> LMBs and hot-pluggable LMBs) as part of ibm,dynamic-memory so that
> guest kernel's calculation of max addressable memory comes out correct
> resulting in correct DDW value which prevents memory hotplug failures.
> address@hidden is created for RMA, but RMA LMBs are also represented as
> "reserved" LMBs in ibm,dynamic-memory. Parts of this are essenitally a
> revert of e8f986fc57a664a74b9f685b466506366a15201b.
>
> In addition to this, the alignment of hotplug memory region is reduced from
> current 1G to 256M (LMB size in PowerKVM) so that we don't end up with any
> gaps between boot time RAM and hotplug region.
I don't see the actual change to SPAPR_HOTPLUG_MEM_ALIGN here? Is it aligned
by some other means?
>
> This change has a side effect on how the memory nodes in DT are
> represented before and after this change. As an example consider
> a guest with the following memory related command line options:
>
> -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
>
> Before this change, the guest would have
>
> Scenario 1
> ----------
> address@hidden for RMA
> address@hidden for rest of the boot time memory
> ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
>
> After this commit, the guest will have
>
> Scenario 2
> ----------
> address@hidden for RMA
> ibm,dynamic-reconfiguration-memory for the entire memory including
> RMA, boot time memory as well as hot-pluggable memory.
>
> If an existing guest having DT nodes as in Scenario 1 above is migrated
> to a QEMU which has this change, at the target, it continues to have the
> DT nodes as in Scenario 1. However after 1st reboot, the DT representation
> changes over to Scenario 2.
>
> I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> if this change works well with DRC migration.
>
> (*) https://patchwork.ozlabs.org/patch/606912/
>
> Signed-off-by: Bharata B Rao <address@hidden>
> Cc: Nathan Fontenot <address@hidden>
> Cc: Michael Roth <address@hidden>
> ---
> hw/ppc/spapr.c | 59
> +++++++++++++++++++++++++++++++++++---------------
> include/hw/ppc/spapr.h | 1 +
> 2 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 79a70a9..6d8de2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState
> *spapr, void *fdt)
> }
> if (!mem_start) {
> /* ppc_spapr_init() checks for rma_size <= node0_size already */
> - spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> mem_start += spapr->rma_size;
> node_size -= spapr->rma_size;
> }
> @@ -759,18 +758,13 @@ static int
> spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
> int ret, i, offset;
> uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> - uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> + uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> + uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> + uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
> uint32_t *int_buf, *cur_index, buf_len;
> int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
>
> /*
> - * Don't create the node if there are no DR LMBs.
> - */
> - if (!nr_lmbs) {
> - return 0;
> - }
> -
> - /*
> * Allocate enough buffer size to fit in ibm,dynamic-memory
> * or ibm,associativity-lookup-arrays
> */
> @@ -802,9 +796,15 @@ static int
> spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
> for (i = 0; i < nr_lmbs; i++) {
> sPAPRDRConnector *drc;
> sPAPRDRConnectorClass *drck;
> - uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> + uint64_t addr;
> uint32_t *dynamic_memory = cur_index;
>
> + if (i < nr_assigned_lmbs) {
> + addr = i * lmb_size;
> + } else {
> + addr = (i - nr_assigned_lmbs) * lmb_size +
> + spapr->hotplug_memory.base;
If the fix relies on there being no gap between hotplug_memory.base and
machine->ram_size, could we instead assert that (nr_assigned_lmbs *
lmb_size == spapr->hotplug_memory.base) and then use the same addr
calculation for all lmbs (here and elsewhere)?
Otherwise the patch looks good, and it seems like a reasonable
workaround to have on the QEMU side even if we still pursue the kernel
fix.
> + }
> drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
> addr/lmb_size);
> g_assert(drc);
> @@ -817,7 +817,11 @@ static int
> spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
> dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
> if (addr < machine->ram_size ||
> memory_region_present(get_system_memory(), addr)) {
> - dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> + if (i < nr_rma_lmbs) {
> + dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
> + } else {
> + dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> + }
> } else {
> dynamic_memory[5] = cpu_to_be32(0);
> }
> @@ -879,6 +883,8 @@ int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
> /* Generate ibm,dynamic-reconfiguration-memory node if required */
> if (memory_update && smc->dr_lmb_enabled) {
> _FDT((spapr_populate_drconf_memory(spapr, fdt)));
> + } else {
> + _FDT((spapr_populate_memory(spapr, fdt)));
> }
>
> /* Pack resulting tree */
> @@ -916,10 +922,23 @@ static void spapr_finalize_fdt(sPAPRMachineState *spapr,
> /* open out the base tree into a temp buffer for the final tweaks */
> _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
>
> - ret = spapr_populate_memory(spapr, fdt);
> - if (ret < 0) {
> - fprintf(stderr, "couldn't setup memory nodes in fdt\n");
> - exit(1);
> + /*
> + * Add address@hidden node to represent RMA. Rest of the memory is either
> + * represented by memory nodes or ibm,dynamic-reconfiguration-memory
> + * node later during ibm,client-architecture-support call.
> + *
> + * If NUMA is configured, ensure that address@hidden ends up in the
> + * first memory-less node.
> + */
> + if (nb_numa_nodes) {
> + for (i = 0; i < nb_numa_nodes; ++i) {
> + if (numa_info[i].node_mem) {
> + spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> + break;
> + }
> + }
> + } else {
> + spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> }
>
> ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
> @@ -1659,14 +1678,20 @@ static void
> spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
> {
> MachineState *machine = MACHINE(spapr);
> uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> - uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> + uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> + uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
> int i;
>
> for (i = 0; i < nr_lmbs; i++) {
> sPAPRDRConnector *drc;
> uint64_t addr;
>
> - addr = i * lmb_size + spapr->hotplug_memory.base;
> + if (i < nr_assigned_lmbs) {
> + addr = i * lmb_size;
> + } else {
> + addr = (i - nr_assigned_lmbs) * lmb_size +
> + spapr->hotplug_memory.base;
> + }
> drc = spapr_dr_connector_new(OBJECT(spapr),
> SPAPR_DR_CONNECTOR_TYPE_LMB,
> addr/lmb_size);
> qemu_register_reset(spapr_drc_reset, drc);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..9f2050d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -627,5 +627,6 @@ int spapr_rng_populate_dt(void *fdt);
> * property under ibm,dynamic-reconfiguration-memory node.
> */
> #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> +#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
>
> #endif /* !defined (__HW_SPAPR_H__) */
> --
> 2.1.0
>
- Re: [Qemu-ppc] [RFC PATCH v0 for 2.7] spapr: Work around the memory hotplug failure with DDW,
Michael Roth <=