[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-ppc] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 su
From: |
Alex Williamson |
Subject: |
Re: [Qemu-ppc] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) |
Date: |
Wed, 08 Jul 2015 08:51:13 -0600 |
On Wed, 2015-07-08 at 16:26 +1000, Alexey Kardashevskiy wrote:
> On 07/08/2015 02:24 AM, Alex Williamson wrote:
> > On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote:
> >> On 07/07/2015 02:13 AM, Alex Williamson wrote:
> >>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote:
> >>>> On 07/06/2015 11:42 PM, Alex Williamson wrote:
> >>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote:
> >>>>>> This makes use of the new "memory registering" feature. The idea is
> >>>>>> to provide the userspace ability to notify the host kernel about pages
> >>>>>> which are going to be used for DMA. Having this information, the host
> >>>>>> kernel can pin them all once per user process, do locked pages
> >>>>>> accounting (once) and not spent time on doing that in real time with
> >>>>>> possible failures which cannot be handled nicely in some cases.
> >>>>>>
> >>>>>> This adds a guest RAM memory listener which notifies a VFIO container
> >>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions
> >>>>>> (i.e. "skip dump" regions) are skipped.
> >>>>>>
> >>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> >>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this
> >>>>>> does
> >>>>>> not call it when v2 is detected and enabled.
> >>>>>>
> >>>>>> This does not change the guest visible interface.
> >>>>>>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> >>>>>> Reviewed-by: David Gibson <address@hidden>
> >>>>>> ---
> >>>>>> Changes:
> >>>>>> v9:
> >>>>>> * since there is no more SPAPR-specific data in container::iommu_data,
> >>>>>> the memory preregistration fields are common and potentially can be
> >>>>>> used
> >>>>>> by other architectures
> >>>>>>
> >>>>>> v7:
> >>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
> >>>>>> * s'ramlistener'register_listener'
> >>>>>>
> >>>>>> v6:
> >>>>>> * fixed commit log (s/guest/userspace/), added note about no guest
> >>>>>> visible
> >>>>>> change
> >>>>>> * fixed error checking if ram registration failed
> >>>>>> * added alignment check for section->offset_within_region
> >>>>>>
> >>>>>> v5:
> >>>>>> * simplified the patch
> >>>>>> * added trace points
> >>>>>> * added round_up() for the size
> >>>>>> * SPAPR IOMMU v2 used
> >>>>>> ---
> >>>>>> hw/vfio/common.c | 109
> >>>>>> ++++++++++++++++++++++++++++++++++++++----
> >>>>>> include/hw/vfio/vfio-common.h | 3 ++
> >>>>>> trace-events | 1 +
> >>>>>> 3 files changed, 104 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>> index 8eacfd7..0c7ba8c 100644
> >>>>>> --- a/hw/vfio/common.c
> >>>>>> +++ b/hw/vfio/common.c
> >>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer
> >>>>>> *container)
> >>>>>>
> >>>>>> memory_listener_unregister(&container->iommu_data.type1.listener);
> >>>>>> }
> >>>>>>
> >>>>>> +static void vfio_ram_do_region(VFIOContainer *container,
> >>>>>> + MemoryRegionSection *section, unsigned
> >>>>>> long req)
> >>>>>> +{
> >>>>>> + int ret;
> >>>>>> + struct vfio_iommu_spapr_register_memory reg = { .argsz =
> >>>>>> sizeof(reg) };
> >>>>>
> >>>>> This function is not as general as the name would imply, it's spapr
> >>>>> specific due to this. How about vfio_spapr_register_memory() with a
> >>>>> bool parameter toggling register vs unregister so we're not passing an
> >>>>> arbitrary ioctl number?
> >>>>
> >>>> Ok. Although I am quite often asked not to do such a thing and rather
> >>>> add 2
> >>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits.
> >>>
> >>> I'm not a fan of functions that do the reverse process based on a bool
> >>> arg either, but I dislike them less than passing an arbitrary ioctl
> >>> number for a parameter. The former is ugly, but the latter is difficult
> >>> to use and difficult to maintain because it would be subtle later to
> >>> spot an unsupported ioctl being passed to the function.
> >>>
> >>>>>> +
> >>>>>> + if (!memory_region_is_ram(section->mr) ||
> >>>>>> + memory_region_is_skip_dump(section->mr)) {
> >>>>>> + return;
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (unlikely((section->offset_within_region & (getpagesize() -
> >>>>>> 1)))) {
> >>>>>
> >>>>> s/getpagesize()/qemu_real_host_page_size/?
> >>>>
> >>>>
> >>>> Oh, right, I guess it reached upstream now.
> >>>>
> >>>>
> >>>>>> + error_report("%s received unaligned region", __func__);
> >>>>>> + return;
> >>>>>> + }
> >>>>>> +
> >>>>>> + reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) +
> >>>>>> + section->offset_within_region;
> >>>>>> + reg.size = ROUND_UP(int128_get64(section->size),
> >>>>>> TARGET_PAGE_SIZE);
> >>>>>> +
> >>>>>> + ret = ioctl(container->fd, req, ®);
> >>>>>> + trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr,
> >>>>>> reg.size,
> >>>>>> + ret ? -errno : 0);
> >>>>>> + if (!ret) {
> >>>>>> + return;
> >>>>>> + }
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * On the initfn path, store the first error in the container so
> >>>>>> we
> >>>>>> + * can gracefully fail. Runtime, there's not much we can do other
> >>>>>> + * than throw a hardware error.
> >>>>>> + */
> >>>>>> + if (!container->iommu_data.ram_reg_initialized) {
> >>>>>> + if (!container->iommu_data.ram_reg_error) {
> >>>>>> + container->iommu_data.ram_reg_error = -errno;
> >>>>>> + }
> >>>>>> + } else {
> >>>>>> + hw_error("vfio: RAM registering failed, unable to continue");
> >>>>>> + }
> >>>>>
> >>>>> I'd rather see:
> >>>>>
> >>>>> if (ret) {
> >>>>> if (!container...) {
> >>>>> ...
> >>>>> } else {
> >>>>> ...
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> Exiting early on success and otherwise falling into error handling is a
> >>>>> strange code flow.
> >>>>
> >>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not
> >>>> that
> >>>> strict :)
> >>>
> >>> It would be nice to clean it up there too.
> >>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener,
> >>>>>> + MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> + VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +
> >>>>>> iommu_data.register_listener);
> >>>>>> + memory_region_ref(section->mr);
> >>>>>> + vfio_ram_do_region(container, section,
> >>>>>> VFIO_IOMMU_SPAPR_REGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, true);
> >>>>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener,
> >>>>>> + MemoryRegionSection *section)
> >>>>>> +{
> >>>>>> + VFIOContainer *container = container_of(listener, VFIOContainer,
> >>>>>> +
> >>>>>> iommu_data.register_listener);
> >>>>>> + vfio_ram_do_region(container, section,
> >>>>>> VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY);
> >>>>>
> >>>>> vfio_spapr_register_memory(container, section, false);
> >>>>>
> >>>>>> + memory_region_unref(section->mr);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const MemoryListener vfio_ram_memory_listener = {
> >>>>>> + .region_add = vfio_ram_listener_region_add,
> >>>>>> + .region_del = vfio_ram_listener_region_del,
> >>>>>> +};
> >>>>>
> >>>>> These are all spapr specific, please reflect that in the name;
> >>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del.
> >>>>
> >>>> ok.
> >>>>
> >>>>
> >>>>> Actually, can't we determine what type of IOMMU we have and make the
> >>>>> existing MemoryListener handle either type1 or spapr or spapr-v2?
> >>>>
> >>>>
> >>>> Sorry, I do not follow you here. How? The existing listener listens on
> >>>> PCI
> >>>> address space (at least, on pseries), new one listens on RAM address
> >>>> space
> >>>> (address_space_memory). What do I miss?
> >>>
> >>> Isn't that simply a difference of the address space the listener is
> >>> attached to? Type1 maps RAM, spapr-v1 maps guest IOMMU space and these
> >>> are already both handled by the same listener.
> >>
> >>
> >> Ok, I tried merging 2 listeners and realized that the PCI listener works
> >> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be
> >> using an IOMMU page size which is not easily available there but this is a
> >> different story) and RAM listener with the qemu_real_host_page_size
> >> granularity (64K for my case) so depending on the address space type,
> >> vfio_listener_region_add() will have to use different page sizes. I like
> >> the idea of merging less now...
> >
> > Sounds like you're already solving something that needs to be fixed for
> > both. The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a
> > bitmap of supported iommu page sizes. It's really all but useless for
> > anything except determining the minimum page size.
>
> btw what sizes can really come from there?
I think it was originally intended to be a bitmap of native IOMMU page
sizes. I think AMD-Vi still does this, and reports essentially
PAGE_MASK minus a few bits that the hardware doesn't support for
whatever reason. Not to be outdone, VT-d reports PAGE_MASK even though
their hardware supports a small set of discrete page sizes. I think the
theory there was that software can breakdown any mapping to supported
sizes. The result is that we have no idea whether a bit in the bitmap
means native support or not, so we ignore it and assume host page size
is the minimum alignment.
> > For the most part we
> > just assume that it's the same as the host page size, so those existing
> > checks could actually change to host page alignment pretty safely. I
> > think we both actually want pages that are both host and target aligned,
> > don't we? What would you do on a 64k host if the guest tried to map a
> > region that only had 4k alignment?
>
> I will get_user_pages_fast(va & PAGE_MASK) and then write
> (gpa_to_hpa(va&PAGE_MASK)|(va & ~PAGE_MASK)) to the table, this is what we
> do now as our typical host uses 64k pages and default 32bit window always
> uses 4K (irrelevant to the guest page size).
Ok, so the windowed IOMMU with native 4k pages prevents you from
allowing access to more than the guest mapped.
> > Anyway, if that's the only problem,
> > it looks more like an opportunity than a barrier.
>
> Oh. Ok :)
Sorry ;) Thanks,
Alex
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed, (continued)
Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Thomas Huth, 2015/07/07
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Alexey Kardashevskiy, 2015/07/07
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Thomas Huth, 2015/07/07
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Alexey Kardashevskiy, 2015/07/07
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), David Gibson, 2015/07/08
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Thomas Huth, 2015/07/08
- Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), David Gibson, 2015/07/08
Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Alexey Kardashevskiy, 2015/07/08
Re: [Qemu-ppc] [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering), Alex Williamson, 2015/07/08
[Qemu-ppc] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge, Alexey Kardashevskiy, 2015/07/05