Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable


From: Joao Martins
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Thu, 24 Feb 2022 20:04:48 +0000


On 2/24/22 19:54, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>> +{
>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>> +
>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>> It's really about what we present to the guest though,
>>>>>>>> isn't it?
>>>>>>>
>>>>>>> Using cpuid was the easier catch-all, without going into Linux UAPI
>>>>>>> specifics. But it doesn't have to be tied to that; it only matters
>>>>>>> for systems with an IOMMU present.
>>>>>>>
>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>
>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>>>> in util/vfio-helpers.c which checks if there are any sysfs child entries
>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>> hardcoded in the IOMMU reserved regions since >=4.11 (up to latest), I
>>>>>>> don't think it's even worth checking that the range exists in:
>>>>>>>
>>>>>>>         /sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>
>>>>>>> (Also that sysfs ABI is >= 4.11 only)
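
For reference, such a check against reserved_regions would only take a few
lines of C. A minimal sketch, assuming the usual "start end type" per-line
format of that sysfs attribute (range_is_reserved() is an illustrative name,
not code from the series):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: does [start, end] fall inside a region listed in an IOMMU
 * group's reserved_regions file, e.g.
 * /sys/kernel/iommu_groups/0/reserved_regions?
 */
static bool range_is_reserved(const char *path, uint64_t start, uint64_t end)
{
    char line[256];
    bool found = false;
    FILE *f = fopen(path, "r");

    if (!f) {
        return false;
    }

    while (fgets(line, sizeof(line), f)) {
        uint64_t rstart, rend;

        /* each line looks like: 0x<start> 0x<end> <type> */
        if (sscanf(line, "%" SCNx64 " %" SCNx64, &rstart, &rend) == 2 &&
            rstart <= start && end <= rend) {
            found = true;
            break;
        }
    }

    fclose(f);
    return found;
}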
>>>>>>
>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>
>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>> all the tests and whatnot.
>>>>>>
>>>>>> I am not entirely sure this is the right place to put such a helper, open
>>>>>> to suggestions. wrt the naming of the helper, I tried to follow the rest
>>>>>> of the file's style.
>>>>>>
>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>>>> --- a/hw/i386/pc.c
>>>>>> +++ b/hw/i386/pc.c
>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>                                            uint64_t pci_hole64_size)
>>>>>>  {
>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> -    uint32_t eax, vendor[3];
>>>>>>
>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>>>          return;
>>>>>>      }
>>>>>>
>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>>>> --- a/include/qemu/osdep.h
>>>>>> +++ b/include/qemu/osdep.h
>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>>>   */
>>>>>>  size_t qemu_get_host_physmem(void);
>>>>>>
>>>>>> +/**
>>>>>> + * qemu_amd_iommu_is_present:
>>>>>> + *
>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>>>> + * is present.
>>>>>> + *
>>>>>> + */
>>>>>> +bool qemu_amd_iommu_is_present(void);
>>>>>> +
>>>>>>  /*
>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>>>   * for the current thread.
>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>>>> index f2be7321c59f..54cef21217c4 100644
>>>>>> --- a/util/oslib-posix.c
>>>>>> +++ b/util/oslib-posix.c
>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>>>  #endif
>>>>>>      return 0;
>>>>>>  }
>>>>>> +
>>>>>> +bool qemu_amd_iommu_is_present(void)
>>>>>> +{
>>>>>> +    bool found = false;
>>>>>> +#ifdef CONFIG_LINUX
>>>>>> +    struct dirent *entry;
>>>>>> +    char *path;
>>>>>> +    DIR *dir;
>>>>>> +
>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>>>> +    dir = opendir(path);
>>>>>> +    if (!dir) {
>>>>>> +        g_free(path);
>>>>>> +        return found;
>>>>>> +    }
>>>>>> +
>>>>>> +    do {
>>>>>> +            entry = readdir(dir);
>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>>>> +                found = true;
>>>>>> +                break;
>>>>>> +            }
>>>>>> +    } while (entry);
>>>>>> +
>>>>>> +    g_free(path);
>>>>>> +    closedir(dir);
>>>>>> +#endif
>>>>>> +    return found;
>>>>>> +}
>>>>>
>>>>> why are we checking whether an AMD IOMMU is present
>>>>> on the host? 
>>>>> Isn't what we care about whether it's
>>>>> present in the VM? What we are reserving after all
>>>>> is a range of GPAs, not actual host PA's ...
>>>>>
>>>> Right.
>>>>
>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>> so we need to avoid that portion of the address space entirely and
>>>> use some other portion of the GPA space. This has to do with the host
>>>> IOMMU, which is where the DMA mapping of guest PAs takes place. So, if
>>>> you do not have a host IOMMU, you can't service guest DMA/PCIe via VFIO
>>>> through the host IOMMU, and therefore you don't have this problem.
>>>>
>>>> Whether one uses a guest IOMMU or not does not affect the result,
>>>> it would be more of a case of mimicking real hardware not fixing
>>>> the issue of this series.
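
To make the failure mode concrete, the maps in question are the VFIO DMA maps
of guest RAM keyed by GPA. A minimal sketch of such a map from userspace
against a vfio-type1 container (map_guest_ram() is an illustrative helper, not
QEMU's actual VFIO code):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch: map a chunk of guest RAM for DMA, using the GPA as the IOVA.
 * On kernels that enforce IOMMU reserved regions, this fails when the
 * GPA range overlaps e.g. the AMD HT hole below 1T.
 */
static int map_guest_ram(int container_fd, void *hva,
                         uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)hva,
        .iova  = gpa,
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}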
>>>
>>>
>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>> And ideally move the logic reporting reserved ranges there too.
>>> However, I think vdpa has the same issue too.
>>> CC Jason for an opinion.
>>
>> It does have the same problem.
>>
>> This is not specific to VFIO, it's to anything that uses the iommu.
> 
> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> that we don't have a global "enable-vfio" flag since vfio devices
> can be hot-plugged. I think this is not the first time we wanted
> something like this, right Alex?
> 
>> It's
>> just that VFIO doesn't let you shoot in the foot :)
>>
>> vDPA would need to validate iova ranges as well to prevent mapping on
>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>> map request is within a valid iova range"). Now you could argue that
>> it's an IOMMU core problem, maybe.
>>
>> But I see this as a separate problem, because regardless of said
>> validation, your guest provisioning is still going to fail for guests
>> with >=1010G of max GPA, and even if it doesn't fail you shouldn't be
>> letting it DMA map those ranges in the first place (e.g. you could see
>> spurious INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is
>> attempted over the first megabyte of that 1T hole).
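
A sketch of the kind of IOVA validation meant here, in the spirit of the
vfio/type1 check cited above (struct resv_range and iova_range_is_valid() are
illustrative stand-ins, not real kernel interfaces):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a list of reserved IOVA ranges. */
struct resv_range {
    uint64_t start;
    uint64_t end;    /* inclusive */
};

/*
 * Sketch of the validation: reject a DMA map request whose
 * [iova, iova + size) window overlaps any reserved range.
 */
static bool iova_range_is_valid(const struct resv_range *resv, size_t n,
                                uint64_t iova, uint64_t size)
{
    uint64_t last = iova + size - 1;

    for (size_t i = 0; i < n; i++) {
        if (iova <= resv[i].end && last >= resv[i].start) {
            return false;    /* overlaps a reserved range */
        }
    }
    return true;
}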
> 
> I wonder what's the status on a system without an IOMMU.
> 
In theory you should be OK. It's also worth keeping in mind that AMD machines
with >=1T have this 12G hole marked as Reserved, so DMA, at least when the CPU
is the initiator, should be fine even in the worst-case scenario.
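For concreteness, the hole being discussed is the HyperTransport range in the
12G just below 1T. A sketch of the relocation decision under that assumption
(the constants and pick_above_4g_start() are illustrative; the actual series
has the authoritative values and logic):

#include <stdint.h>

#define AMD_HT_HOLE_START   0xFD00000000ULL    /* 1012 GiB */
#define ABOVE_1T_START      0x10000000000ULL   /* 1 TiB */

/* Sketch of the relocation decision, not the exact code in the patch. */
static uint64_t pick_above_4g_start(uint64_t above_4g_start,
                                    uint64_t above_4g_size,
                                    uint64_t pci_hole64_size)
{
    uint64_t max_gpa = above_4g_start + above_4g_size + pci_hole64_size;

    /* would the above-4G area (RAM + 64-bit PCI hole) grow into the hole? */
    if (max_gpa > AMD_HT_HOLE_START) {
        return ABOVE_1T_START;
    }
    return above_4g_start;
}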

>>> Also, some concerns
>>> - I think this changes memory layout for working VMs not using VFIO.
>>>   Better preserve the old layout for old machine types?
>>>
>> Oh definitely, and I do that in this series. See the last commit, and
>> in the past it was also a machine-property exposed to userspace.
>> Otherwise I would be breaking (badly) compat/migration.
>>
>> I would like to emphasize that these memory layout changes are *exclusive*
>> to, and *only* for, AMD hosts with guest memory bigger than ~950G-~1010G.
>> It's not every guest, but a fairly limited subset of big-memory guests that
>> are not the norm.
>>
>> That said, the phys-bits property errors might give us a bit more trouble,
>> so it might help to be more conservative.
>>
>>> - You mention the phys bits issue very briefly, and it's pretty
>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>   bits is too small? 
>>
>> We are doing that here (well, in v4), as suggested by Igor. Note that in
>> this series it's a warning, but in v4 I have it as an error, along with the
>> 32-bit issues that I found through qtest.
>>
>> I share the same concern as you over making this an error because of
>> compatibility. Perhaps we could go back to the previous version, turn it
>> back into a warning, and additionally even disable the relocation
>> altogether. Or, if desired, even make it depend on a machine property to
>> enable it.
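
A sketch of the phys-bits check being discussed, i.e. refusing (or warning
about) a layout whose highest guest address cannot be addressed with the
configured CPU phys-bits (max_gpa_fits_phys_bits() is an illustrative name,
not the series' code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: can the highest used guest address still be addressed with
 * the configured phys-bits?  In this series a failure is a warning;
 * in v4 it becomes an error.
 */
static bool max_gpa_fits_phys_bits(uint64_t max_used_gpa, unsigned phys_bits)
{
    uint64_t limit = (phys_bits >= 64) ? UINT64_MAX
                                       : (1ULL << phys_bits) - 1;

    return max_used_gpa <= limit;
}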
> 
> I would be inclined to disable the relocation. And maybe block vfio? I'm
> not 100% sure about that last one, but this can be a vfio decision to
> make.
> 
I don't see this as a VFIO problem (VFIO is actually being a good citizen and
doing the right thing :)). From my perspective this fix is also useful to vDPA
(which we also care about), and thus will also let us avoid DMA mapping in
these GPA ranges.

The reason I mention VFIO in the cover letter is that it's the visible UAPI
through which users will notice that their 1TB+ VMs suddenly break.

>>> Also, we'd need to test a bunch of old
>>>   guests to see what happens. Which guests were tested? 
>>>
>> Do you envision that old guests would run into (or care about) this sort of
>> 1TB config? I've mainly been testing more modern Linux guests (>= 4.14) over
>> this last set of versions, and in the past Windows 2012+. Let me be extra
>> extra pedantic on this part for the next submission (and report back if
>> something odd happens).
> 
> Not 100% sure but let's at least document.
> 


