From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH qemu v2 1/4] vfio/spapr: Fix indirect levels calculation
Date: Fri, 15 Feb 2019 12:03:17 +1100
User-agent: Mutt/1.10.1 (2018-07-13)
On Thu, Feb 14, 2019 at 04:21:41PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can, but there is no way of knowing the actual limit.
>
> This makes a better guess at the number of levels and, if the kernel
> fails to allocate that many, increases the number of levels until the
> allocation succeeds or the 64bit limit is reached.
>
> This also adds the number of levels to the trace point.
>
> This may cause the kernel to warn about failed allocation:
> [65122.837458] Failed to allocate a TCE memory, level shift=28
> which can happen if MAX_ORDER is not large enough, as it varies by configuration:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
>
> Signed-off-by: Alexey Kardashevskiy <address@hidden>
So, looking at this, it seems kind of bogus that qemu chooses the
number of levels, when those levels aren't really visible to it. The
kernel has better information to choose a sensible number of levels -
witness qemu here attempting to guess what the kernel will do.
But, I guess the interface is what it is now, so,
Reviewed-by: David Gibson <address@hidden>
> ---
> Changes:
> v2:
> * replace systempagesize with getpagesize() when calculating
> bits_per_level/max_levels
> ---
> hw/vfio/spapr.c | 41 ++++++++++++++++++++++++++++++++---------
> hw/vfio/trace-events | 2 +-
> 2 files changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..302d6b0 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -146,7 +146,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
> int ret;
> IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
> - unsigned entries, pages;
> + unsigned entries, bits_total, bits_per_level, max_levels;
> struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> long systempagesize = qemu_getrampagesize();
>
> @@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container,
> create.window_size = int128_get64(section->size);
> create.page_shift = ctz64(pagesize);
> /*
> - * SPAPR host supports multilevel TCE tables, there is some
> - * heuristic to decide how many levels we want for our table:
> - * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> + * SPAPR host supports multilevel TCE tables. We try to guess the optimal
> + * number of levels and if this fails (for example, due to host memory
> + * fragmentation), we increase the number of levels. The DMA address structure is:
> + * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
> + * where:
> + * r = reserved (bits >= 55 are reserved in the existing hardware)
> + * i = IOMMU page offset (64K in this example)
> + * x = bits to index a TCE which can be split into equal chunks to index
> + * within each level.
> + * The aim is to split "x" across the smallest possible number of levels.
> */
> entries = create.window_size >> create.page_shift;
> - pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> - pages = MAX(pow2ceil(pages), 1); /* Round up */
> - create.levels = ctz64(pages) / 6 + 1;
> -
> - ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> + /* bits_total is the number of "x" bits needed */
> + bits_total = ctz64(entries * sizeof(uint64_t));
> + /*
> + * bits_per_level is a safe guess of how much we can allocate per level:
> + * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
> + * is usually bigger than that.
> + * Below we look at getpagesize() as TCEs are allocated from system pages.
> + */
> + bits_per_level = ctz64(getpagesize()) + 8;
> + create.levels = bits_total / bits_per_level;
> + if (bits_total % bits_per_level) {
> + ++create.levels;
> + }
> + max_levels = (64 - create.page_shift) / ctz64(getpagesize());
> + for ( ; create.levels <= max_levels; ++create.levels) {
> + ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> + if (!ret) {
> + break;
> + }
> + }
> if (ret) {
> error_report("Failed to create a window, ret = %d (%m)", ret);
> return -errno;
> @@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
> return -EINVAL;
> }
> trace_vfio_spapr_create_window(create.page_shift,
> + create.levels,
> create.window_size,
> create.start_addr);
> *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index f41ca96..579be19 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -128,6 +128,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
> vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
> vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
> vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
> vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson