Re: [Qemu-ppc] [for-2.9 3/5] pseries: Implement HPT resizing

From:	David Gibson
Subject:	Re: [Qemu-ppc] [for-2.9 3/5] pseries: Implement HPT resizing
Date:	Fri, 9 Dec 2016 20:19:32 +1100
User-agent:	Mutt/1.7.1 (2016-10-04)

On Fri, Dec 09, 2016 at 09:36:17AM +0100, Thomas Huth wrote:
> On 09.12.2016 03:23, David Gibson wrote:
> > This patch implements hypercalls allowing a PAPR guest to resize its own
> > hash page table.  This will eventually allow for more flexible memory
> > hotplug.
> > 
> > The implementation is partially asynchronous, handled in a special thread
> > running the hpt_prepare_thread() function.  The state of a pending resize
> > is stored in SPAPR_MACHINE->pending_hpt.
> > 
> > The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new HPT, or,
> > if one is already in progress, monitor it for completion.  If there is an
> > existing HPT resize in progress that doesn't match the size specified in
> > the call, it will cancel it, replacing it with a new one matching the
> > given size.
> > 
> > The H_RESIZE_HPT_COMMIT completes transition to a resized HPT, and can only
> > be called successfully once H_RESIZE_HPT_PREPARE has successfully
> > completed initialization of a new HPT.  The guest must ensure that there
> > are no concurrent accesses to the existing HPT while this is called (this
> > effectively means stop_machine() for Linux guests).
> > 
> > For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing each
> > HPTE into the new HPT.  This can have quite high latency, but it seems to
> > be of the order of typical migration downtime latencies for HPTs of size
> > up to ~2GiB (which would be used in a 256GiB guest).
> > 
> > In future we probably want to move more of the rehashing to the "prepare"
> > phase, by having H_ENTER and other hcalls update both current and
> > pending HPTs.  That's a project for another day, but should be possible
> > without any changes to the guest interface.
> > 
> > Signed-off-by: David Gibson <address@hidden>
> > ---
> >  hw/ppc/spapr.c          |   4 +-
> >  hw/ppc/spapr_hcall.c    | 346 +++++++++++++++++++++++++++++++++++++++++++++++-
> 
> I wonder whether it makes sense to put all the hpt related code into a
> separate new file?

Maybe, but not really in scope here.

> >  include/hw/ppc/spapr.h  |   6 +
> >  target-ppc/mmu-hash64.h |   4 +
> >  4 files changed, 354 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index ecb0822..f3b74dc 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -93,8 +93,6 @@
> >  
> >  #define PHANDLE_XICP            0x00001111
> >  
> > -#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > -
> >  static XICSState *try_create_xics(const char *type, int nr_servers,
> >                                    int nr_irqs, Error **errp)
> >  {
> > @@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState *spapr)
> >      spapr->htab_fd = -1;
> >  }
> >  
> > -static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> >  {
> >      int shift;
> >  
> > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > index 72a9c4d..9f421c1 100644
> > --- a/hw/ppc/spapr_hcall.c
> > +++ b/hw/ppc/spapr_hcall.c
> > @@ -2,6 +2,7 @@
> >  #include "qapi/error.h"
> >  #include "sysemu/sysemu.h"
> >  #include "qemu/log.h"
> > +#include "qemu/error-report.h"
> >  #include "cpu.h"
> >  #include "exec/exec-all.h"
> >  #include "helper_regs.h"
> > @@ -355,20 +356,320 @@ static target_ulong h_read(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> >      return H_SUCCESS;
> >  }
> >  
> > +struct sPAPRPendingHPT {
> > +    /* These fields are read-only after initialization */
> > +    int shift;
> > +    QemuThread thread;
> > +
> > +    /* These fields are protected by the BQL */
> > +    bool complete;
> > +
> > +    /* These fields are private to the preparation thread if
> > +     * !complete, otherwise protected by the BQL */
> > +    int ret;
> > +    void *hpt;
> > +};
> > +
> > +static void free_pending_hpt(sPAPRPendingHPT *pending)
> > +{
> > +    if (pending->hpt) {
> > +        qemu_vfree(pending->hpt);
> > +    }
> > +
> > +    g_free(pending);
> > +}
> > +
> > +static void *hpt_prepare_thread(void *opaque)
> > +{
> > +    sPAPRPendingHPT *pending = opaque;
> > +    size_t size = 1ULL << pending->shift;
> > +
> > +    pending->hpt = qemu_memalign(size, size);
> > +    if (pending->hpt) {
> > +        memset(pending->hpt, 0, size);
> > +        pending->ret = H_SUCCESS;
> > +    } else {
> > +        pending->ret = H_NO_MEM;
> > +    }
> > +
> > +    qemu_mutex_lock_iothread();
> > +
> > +    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt != pending) {
> > +        /* We've been cancelled, clean ourselves up */
> > +        free_pending_hpt(pending);
> > +        goto out;
> > +    }
> > +
> > +    pending->complete = true;
> > +
> > +out:
> 
> You could easily avoid the goto here:
> 
>     if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt != pending) {
>         /* We've been cancelled, clean ourselves up */
>         free_pending_hpt(pending);
>     } else {
>         pending->complete = true;
>     }
> 
> ?

Ah, yeah, I guess so.  I think I had more cases that used the same
bailout path during development.

> > +    qemu_mutex_unlock_iothread();
> > +    return NULL;
> > +}
> > +
> > +/* Must be called with BQL held */
> > +static void cancel_hpt_prepare(sPAPRMachineState *spapr)
> > +{
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +
> > +    /* Let the thread know it's cancelled */
> > +    spapr->pending_hpt = NULL;
> > +
> > +    if (!pending) {
> > +        /* Nothing to do */
> > +        return;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* thread will clean itself up */
> > +        return;
> > +    }
> > +
> > +    free_pending_hpt(pending);
> > +}
> > +
> > +static int build_dimm_list(Object *obj, void *opaque)
> > +{
> > +    GSList **list = opaque;
> > +
> > +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> > +        DeviceState *dev = DEVICE(obj);
> > +        if (dev->realized) { /* only realized DIMMs matter */
> > +            *list = g_slist_prepend(*list, dev);
> > +        }
> > +    }
> > +
> > +    object_child_foreach(obj, build_dimm_list, opaque);
> > +    return 0;
> > +}
> > +
> > +static ram_addr_t get_current_ram_size(void)
> > +{
> > +    GSList *list = NULL, *item;
> > +    ram_addr_t size = ram_size;
> > +
> > +    build_dimm_list(qdev_get_machine(), &list);
> > +    for (item = list; item; item = g_slist_next(item)) {
> > +        Object *obj = OBJECT(item->data);
> > +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
> > +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> > +                                            &error_abort);
> > +        }
> > +    }
> > +    g_slist_free(list);
> > +
> > +    return size;
> > +}
> > +
> >  static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
> >                                           sPAPRMachineState *spapr,
> >                                           target_ulong opcode,
> >                                           target_ulong *args)
> >  {
> >      target_ulong flags = args[0];
> > -    target_ulong shift = args[1];
> > +    int shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_prepare(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (shift && ((shift < 18) || (shift > 46))) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (pending) {
> > +        /* something already in progress */
> > +        if (pending->shift == shift) {
> > +            /* and it's suitable */
> > +            if (pending->complete) {
> > +                return pending->ret;
> > +            } else {
> > +                return H_LONG_BUSY_ORDER_100_MSEC;
> > +            }
> > +        }
> > +
> > +        /* not suitable, cancel and replace */
> > +        cancel_hpt_prepare(spapr);
> > +    }
> > +
> > +    if (!shift) {
> > +        /* nothing to do */
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    /* start new prepare */
> > +
> > +    /* We only allow the guest to allocate an HPT one order above what
> > +     * we'd normally give them (to stop a small guest claiming a huge
> > +     * chunk of resources in the HPT */
> > +    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size()) + 1)) {
> > +        return H_RESOURCE;
> > +    }
> > +
> > +    pending = g_malloc0(sizeof(*pending));
> 
> Maybe use g_new0() instead?

Ah, good idea.
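
For anyone reading along who hasn't used it: GLib's g_new0(type, n) allocates n zero-filled elements and returns a pointer already cast to type *, so the element type is named once at the call site instead of being implied by a sizeof expression. A minimal plain-C sketch of the same idea follows. NEW0, PendingHPT and pending_hpt_new are hypothetical names made up for illustration, not part of the patch or of GLib, and the struct is a simplified version of sPAPRPendingHPT.

```c
#include <stdlib.h>

/* Hypothetical stand-in for GLib's g_new0(). Unlike GLib, this returns
 * NULL on allocation failure instead of aborting the program. */
#define NEW0(type, n) ((type *)calloc((n), sizeof(type)))

/* Simplified version of the patch's sPAPRPendingHPT, for illustration only */
typedef struct PendingHPT {
    int shift;
    int complete;
    int ret;
    void *hpt;
} PendingHPT;

/* Allocate a zero-filled PendingHPT and record the requested shift */
static PendingHPT *pending_hpt_new(int shift)
{
    PendingHPT *pending = NEW0(PendingHPT, 1);

    if (pending) {
        pending->shift = shift;
    }
    return pending;
}
```

In the patch itself the change would presumably just be g_new0(sPAPRPendingHPT, 1) in place of g_malloc0(sizeof(*pending)).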

> > +    pending->shift = shift;
> > +    pending->ret = H_HARDWARE;
> > +
> > +    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
> > +                       hpt_prepare_thread, pending, QEMU_THREAD_DETACHED);
> > +
> > +    spapr->pending_hpt = pending;
> > +
> > +    /* In theory we could estimate the time more accurately based on
> > +     * the new size, but there's not much point */
> > +    return H_LONG_BUSY_ORDER_100_MSEC;
> > +}
> > +
> > +static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +    return  ldq_p(addr);
> > +}
> > +
> > +static void new_hpte_store(void *htab, uint64_t pteg, int slot,
> > +                           uint64_t pte0, uint64_t pte1)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +
> > +    stq_p(addr, pte0);
> > +    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
> > +}
> > +
> > +static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
> > +                       void *old, uint64_t oldsize,
> > +                       void *new, uint64_t newsize,
> > +                       uint64_t pteg, int slot)
> > +{
> > +    uint64_t old_hash_mask = (oldsize >> 7) - 1;
> > +    uint64_t new_hash_mask = (newsize >> 7) - 1;
> > +    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
> > +    target_ulong pte1;
> > +    uint64_t avpn;
> > +    unsigned shift;
> > +    uint64_t hash, new_pteg, replace_pte0;
> > +
> > +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> > +
> > +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> > +    assert(shift); /* H_ENTER should never have allowed a bad encoding */
> > +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> > +
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        pteg = ~pteg;
> > +    }
> > +
> > +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 28 - 23 bits of offset in avpn */
> > +        offset = (avpn & 0x1f) << 23;
> > +        vsid = avpn >> 5;
> > +        /* We can find more bits from the pteg value */
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> > +        }
> > +
> > +        hash = vsid ^ (offset >> shift);
> > +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 40 - 23 bits of seg_off in avpn */
> > +        offset = (avpn & 0x1ffff) << 23;
> > +        vsid = avpn >> 17;
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask) << shift;
> > +        }
> > +
> > +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> > +    } else {
> > +        error_report("rehash_pte: Bad segment size in HPTE");
> > +        return H_HARDWARE;
> > +    }
> > +
> > +    new_pteg = hash & new_hash_mask;
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        assert(~pteg == (hash & old_hash_mask));
> > +        new_pteg = ~new_pteg;
> > +    } else {
> > +        assert(pteg == (hash & old_hash_mask));
> > +    }
> > +    assert((oldsize != newsize) || (pteg == new_pteg));
> > +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> > +    if (replace_pte0 & HPTE64_V_VALID) {
> > +        assert(newsize < oldsize);
> > +        if (replace_pte0 & HPTE64_V_BOLTED) {
> > +            if (pte0 & HPTE64_V_BOLTED) {
> > +                /* Bolted collision, nothing we can do */
> > +                return H_PTEG_FULL;
> > +            } else {
> > +                /* Discard this hpte */
> > +                return H_SUCCESS;
> > +            }
> > +        }
> > +    }
> > +
> > +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> > +    return H_SUCCESS;
> > +}
> > +
> > +static int rehash_hpt(PowerPCCPU *cpu,
> > +                      void *old, uint64_t oldsize,
> > +                      void *new, uint64_t newsize)
> > +{
> > +    CPUPPCState *env = &cpu->env;
> > +    uint64_t n_ptegs = oldsize >> 7;
> > +    uint64_t pteg;
> > +    int slot;
> > +    int rc;
> > +
> > +    assert(env->external_htab == old);
> > +
> > +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> > +        uint64_t token = ppc_hash64_start_access(cpu, pteg * HPTES_PER_GROUP);
> > +
> > +        if (!token) {
> > +            return H_HARDWARE;
> > +        }
> > +
> > +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> > +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> > +                             pteg, slot);
> > +            if (rc != H_SUCCESS) {
> > +                ppc_hash64_stop_access(cpu, token);
> > +                return rc;
> > +            }
> > +        }
> > +        ppc_hash64_stop_access(cpu, token);
> > +    }
> > +
> > +    return H_SUCCESS;
> > +}
> > +
> > +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> > +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> > +
> > +    cpu_synchronize_state(cs);
> > +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> > +                                &error_fatal);
> >  }
> >  
> >  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > @@ -378,13 +679,52 @@ static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> >  {
> >      target_ulong flags = args[0];
> >      target_ulong shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +    int rc;
> > +    size_t newsize;
> 
> Why size_t here? size_t could be 32-bit on some systems, and don't you
> rather want to always have 64-bit (or at least target_ulong) here?

In that case (a 32-bit size_t, with newsize above 4G) we're stuffed anyway:
we allocate the htab with qemu_memalign(), which only takes a size_t, and
even if it didn't, there's nowhere we could actually allocate the hash table.
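
To make the size_t point concrete, here is a rough sketch of how a host could reject a shift its size_t cannot represent, rather than silently truncating. The helper name hpt_size_from_shift is made up for illustration, and the patch contains no such check.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: compute 2^shift bytes as a
 * size_t, failing if the host's size_t is too narrow to hold the result
 * (for example shift 32 on a host with a 32-bit size_t). */
static int hpt_size_from_shift(unsigned shift, size_t *size)
{
    if (shift >= sizeof(size_t) * CHAR_BIT) {
        return -1;   /* 2^shift does not fit in size_t */
    }
    *size = (size_t)1 << shift;
    return 0;
}
```

On a 64-bit host every shift in the accepted 18..46 range passes. On a 32-bit host anything from 32 up would be refused before the allocation is attempted.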

> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_commit(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (!pending || (pending->shift != shift)) {
> > +        /* no matching prepare */
> > +        return H_CLOSED;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* prepare has not completed */
> > +        return H_BUSY;
> > +    }
> > +
> > +    newsize = 1ULL << pending->shift;
> > +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> > +                    pending->hpt, newsize);
> > +    if (rc == H_SUCCESS) {
> > +        CPUState *cs;
> > +
> > +        qemu_vfree(spapr->htab);
> > +        spapr->htab = pending->hpt;
> > +        spapr->htab_shift = pending->shift;
> > +
> > +        CPU_FOREACH(cs) {
> > +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> > +        }
> > +
> > +        pending->hpt = NULL; /* so it's not free()d */
> > +    }
> > +
> > +    /* Clean up */
> > +    spapr->pending_hpt = NULL;
> > +    free_pending_hpt(pending);
> > +
> > +    return rc;
> >  }
> 
>  Thomas
> 
> 
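
As a worked illustration of the sizes discussed in the commit message, the sketch below relates RAM size, HPT shift, PTEG count and hash mask. It assumes the HPT is sized at 1/128 of RAM, rounded up to a power of two. That rule is an assumption here, chosen because it is consistent with the "~2GiB HPT for a 256GiB guest" figure above; the body of spapr_hpt_shift_for_ramsize() is not shown in this patch. The 18..46 clamp and the 128-byte PTEG (the ">> 7") do come from the patch. All function names are hypothetical.

```c
#include <stdint.h>

#define PTEG_SIZE_SHIFT 7   /* 128-byte PTEG, matching ">> 7" in the patch */

/* Assumed sizing rule: HPT = RAM / 128, rounded up to a power of two,
 * clamped to the 18..46 range that H_RESIZE_HPT_PREPARE accepts. */
static int hpt_shift_for_ramsize_sketch(uint64_t ramsize)
{
    int shift = 0;

    /* Smallest shift with 2^shift >= ramsize, i.e. ceil(log2(ramsize)) */
    while (shift < 64 && ((uint64_t)1 << shift) < ramsize) {
        shift++;
    }

    shift -= PTEG_SIZE_SHIFT;
    if (shift < 18) {
        shift = 18;
    }
    if (shift > 46) {
        shift = 46;
    }
    return shift;
}

/* Number of PTEGs in an HPT of 2^shift bytes */
static uint64_t hpt_n_ptegs(int shift)
{
    return (uint64_t)1 << (shift - PTEG_SIZE_SHIFT);
}

/* Mask applied to the hash to select a PTEG, as in rehash_hpte() */
static uint64_t hpt_hash_mask(int shift)
{
    return hpt_n_ptegs(shift) - 1;
}
```

Under that assumption a 256GiB guest gets shift 31 (a 2GiB HPT), so the "one order above" limit in H_RESIZE_HPT_PREPARE would let it request at most shift 32.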

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
