From: David Hildenbrand
Subject: Re: [qemu-s390x] [Qemu-devel] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()
Date: Thu, 28 Mar 2019 10:26:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

On 28.03.19 02:18, David Gibson wrote:
> On Wed, Mar 27, 2019 at 03:22:58PM +0100, David Hildenbrand wrote:
>> On 27.03.19 10:03, David Gibson wrote:
>>> On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
>>>> On 27.03.19 01:12, David Gibson wrote:
>>>>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>>>>>> On 26.03.19 15:08, Igor Mammedov wrote:
>>>>>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>>>>>> David Gibson <address@hidden> wrote:
>>>>>>>
>>>>>>>> qemu_getrampagesize() works out the minimum host page size
>>>>>>>> backing any of guest RAM.  This is required in a few places,
>>>>>>>> such as for POWER8 PAPR KVM guests, because limitations of the
>>>>>>>> hardware virtualization mean the guest can't use pagesizes
>>>>>>>> larger than the host pages backing its memory.
>>>>>>>>
>>>>>>>> However, it currently checks against *every* memory backend,
>>>>>>>> whether or not it is actually mapped into guest memory at the
>>>>>>>> moment.  This is incorrect.
>>>>>>>>
>>>>>>>> This can cause a problem attempting to add memory to a POWER8
>>>>>>>> pseries KVM guest which is configured to allow hugepages in the
>>>>>>>> guest (e.g. -machine cap-hpt-max-page-size=16m).  If you attempt
>>>>>>>> to add non-hugepage backed memory, you can (correctly) create a
>>>>>>>> memory backend, however it (correctly) will throw an error when
>>>>>>>> you attempt to map that memory into the guest by 'device_add'ing
>>>>>>>> a pc-dimm.
>>>>>>>>
>>>>>>>> What's not correct is that if you then reset the guest, a
>>>>>>>> startup check against qemu_getrampagesize() will cause a fatal
>>>>>>>> error because of the new memory object, even though it's not
>>>>>>>> mapped into the guest.
>>>>>>> I'd say that the backend should be removed by the mgmt app since
>>>>>>> device_add failed, instead of leaving it to hang around. (but a
>>>>>>> fatal error is not nice behavior on QEMU's part either)
>>>>>>
>>>>>> Indeed, it should be removed. Depending on the options (huge pages with
>>>>>> prealloc?), host memory might be consumed for an unused backend. Undesired.
>>>>>
>>>>> Right, but if the guest initiates a reboot before the management gets
>>>>> to that, we'll have a crash.
>>>>>
>>>>
>>>> Yes, I agree.
>>>>
>>>> At least on s390x (extending on what Igor said):
>>>>
>>>> mc->init() -> s390_memory_init() ->
>>>> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
>>>>
>>>>
>>>> ac->init_machine() -> kvm_arch_init() ->
>>>> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
>>>>
>>>>
>>>> And in vl.c
>>>>
>>>> configure_accelerator(current_machine, argv[0]);
>>>> ...
>>>> machine_run_board_init()
>>>>
>>>> So memory is indeed not mapped before calling qemu_getrampagesize().
>>>>
>>>>
>>>> We *could* move the call to kvm_s390_configure_mempath_backing() to
>>>> s390_memory_init().
>>>>
>>>> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
>>>>
>>>> We could then eventually make qemu_getrampagesize() assert if no
>>>> backends are mapped at all, to catch other users that rely on this
>>>> being correct.
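
A rough sketch of both ideas (this is not the actual patch; the helpers
host_memory_backend_is_mapped() and host_memory_backend_pagesize() are
taken from the current hostmem API):

/* Sketch only: skip memory backends that are not mapped into the guest. */
static int find_max_supported_pagesize(Object *obj, void *opaque)
{
    long *hpsize_min = opaque;

    if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
        HostMemoryBackend *backend = MEMORY_BACKEND(obj);

        if (!host_memory_backend_is_mapped(backend)) {
            return 0;   /* backend exists, but is not guest RAM (yet) */
        }
        *hpsize_min = MIN(*hpsize_min, host_memory_backend_pagesize(backend));
    }
    return 0;   /* keep iterating over the children of /objects */
}

If qemu_getrampagesize() then asserts (or warns) when the walk finds no
mapped backend and no -mem-path is in use, a caller that runs before
machine_run_board_init() is caught immediately instead of silently
getting a wrong answer.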
>>>
>>> So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
>>> and I'm pretty sure it's broken.  It will work in the case where
>>> there's only one backend.  And if that's the default -mem-path rather
>>> than an explicit memory backend then my patch won't break it any
>>> further.
>>
>> On second look, I think I get your point.
>>
>> 1. Why on earth does "find_max_supported_pagesize" find the "minimum
>> page size"? What kind of nasty stuff is this?
> 
> Ah, yeah, the naming is bad because of history.
> 
> The original usecase of this was because on POWER (before POWER9) the
> way MMU virtualization works, pages inserted into the guest MMU view
> have to be host-contiguous: there's no 2nd level translation that lets
> them be broken into smaller host pages.
> 
> The upshot is that a KVM guest can only use large pages if it's backed
> by large pages on the host.  We have to advertise the availability of
> large pages to the guest at boot time though, and there's no way to
> restrict it to certain parts of guest RAM.
> 
> So, this code path was finding the _maximum_ page size the guest could
> use... which depends on the _minimum_ page size used on the host.
> When this was moved to (partly) generic code we didn't think to
> improve all the names.
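
Put differently, the guest's maximum usable page size is the minimum host
page size backing any of its RAM. A tiny standalone illustration (toy
code, not QEMU):

#include <stdio.h>

/* Toy model: each guest page must be backed host-contiguously on HV KVM
 * before POWER9, so the largest page the guest may use is bounded by the
 * smallest host page size backing any part of its RAM. */
static long max_guest_pagesize(const long *backing_pagesizes, int n)
{
    long min = backing_pagesizes[0];

    for (int i = 1; i < n; i++) {
        if (backing_pagesizes[i] < min) {
            min = backing_pagesizes[i];
        }
    }
    return min;   /* "find_max_supported_pagesize" really tracks this minimum */
}

int main(void)
{
    /* 16M hugepage-backed DIMM plus 64k base pages: guest is limited to 64k. */
    long backing[] = { 16 * 1024 * 1024, 64 * 1024 };

    printf("max usable guest page size: %ld\n", max_guest_pagesize(backing, 2));
    return 0;
}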
> 
>> 2. qemu_mempath_getpagesize() is not affected by your patch
> 
> Correct.
> 
>> and that
>> seems to be the only thing used on s390x for now.
> 
> Uh.. what?
> 
>> I sent a patch to move the call on s390x. But we really have to detect
>> the maximum page size (what find_max_supported_pagesize promises), not
>> the minimum page size.
> 
> Well.. sort of.  In the ppc case it really is the minimum page size we
> care about, in the sense that if some part of RAM has a larger page
> size, that's fine - even if it's a weird size that we didn't expect.
> 
> IIUC for s390 the problem is that KVM doesn't necessarily support
> putting large pages into the guest at all, and what large page sizes
> it can put into the guest depends on the KVM version.

Yes.

> 
> Finding the maximum backing page size answers that question only on
> the assumption that a KVM which supports page size X will support all
> smaller page sizes.  I imagine that's true in practice, but does it
> have to be true in theory?  If not, it seems safer to me to explicitly
> step through every (mapped) backend and individually check if it has a
> supported pagesize, rather than only testing the max page size.


We only support 4k and 1MB page sizes in KVM, so for now the question
does not apply. There are also 2GB pages in HW.

HW that supports 2GB pages also supports 1MB pages. The only way
something else (e.g. 512KB pages) could be added would be by making
massive changes to the architecture. I don't consider this realistic;
it would be like suddenly changing the page size from 4k to 8k on HW
that was never prepared for it. However, if that ever happens, we'll
get new KVM capabilities to check/enable support.

E.g. for 2GB pages, we'll have another KVM capability to enable gigantic
pages. If 1MB pages were no longer supported by KVM, that capability
would not be indicated.

Once we support 2GB pages, we might have to think about what you
describe here, depending on what the KVM interface promises us. If the
interface promises "if 2GB pages are enabled, 1MB pages are enabled
implicitly", we are fine; otherwise we would have to check per mapped
backend.
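
If it came to that, a per-mapped-backend check would be easy to do. A
hypothetical sketch (the 4k/1M size set and the capability test are
illustrative only; the hostmem helpers are the existing API):

/* Hypothetical: verify that every *mapped* backend uses a page size this
 * KVM version can actually handle, instead of only testing one extreme. */
static int check_backend_pagesize(Object *obj, void *opaque)
{
    Error **errp = opaque;

    if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
        long psize;

        if (!host_memory_backend_is_mapped(backend)) {
            return 0;   /* not guest RAM, ignore */
        }
        psize = host_memory_backend_pagesize(backend);
        if (psize == 4 * KiB) {
            return 0;
        }
        if (psize == 1 * MiB &&
            kvm_check_extension(kvm_state, KVM_CAP_S390_HPAGE_1M)) {
            return 0;
        }
        error_setg(errp, "unsupported backing page size: %ld", psize);
        return 1;   /* non-zero stops object_child_foreach() */
    }
    return 0;
}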

> 
> It also occurs to me: why does this logic need to be in qemu at all?
> KVM must know what pagesizes it supports, and I assume it will throw
> an error if you try to put something with the wrong size into a
> memslot.  So can't qemu just report that error, rather than checking
> the pagesizes itself?

There are multiple aspects to that:

1. "KVM must know what pagesizes it supports"

Yes it does, and this is reflected via KVM capabilities (e.g.
KVM_CAP_S390_HPAGE_1M). To make use of these capabilities, they have to
be enabled by user space. Once we have support for 2G pages, we'll have
KVM_CAP_S390_HPAGE_2G.

In case the capability is enabled, certain things have to be changed in KVM:
- CMMA can no longer be used (user space has to properly take care of that)
- Certain HW assists (PFMF interpretation, Storage Key Facility) have to
  be disabled early.


2. "it will throw an error if you try to put something with the wrong
    size into a memslot"

An error will be reported when trying to map a huge page into the GMAP
(e.g. on a host page fault in virtualization mode). So not when the
memslot is configured, but during kvm_run.

Checking the memslot might be

a) complicated (check all VMAs)
b) a waste of time (many VMAs)
c) incorrect - the content of a memslot can change at any time (KVM
synchronous MMU). Think of someone wanting to remap some pages that are
part of a memslot using huge pages.

3. Can you elaborate on "So can't qemu just report that error, rather
than checking the pagesizes itself?"? We effectively check against the
capabilities of KVM and the page size, and based on that we report the
error in QEMU. Reporting an error only after the guest has already
started and then crashed during kvm_run due to a huge page is way too
late.
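
For reference, the early check is also where the capability gets enabled,
before any VCPU runs. Roughly, with error handling trimmed
(KVM_CAP_S390_HPAGE_2G is only the expected future name, it does not
exist today):

/* Sketch: user space opts in to 1M huge page backing up front; enabling
 * the cap is what makes KVM drop CMMA and the affected HW assists. */
if (pagesize == 1 * MiB) {
    if (kvm_vm_enable_cap(kvm_state, KVM_CAP_S390_HPAGE_1M, 0)) {
        error_report("KVM does not support 1M huge page backing");
        return -EINVAL;
    }
    cap_hpage_1m = 1;
}
/* A future 2G-backed setup would analogously enable KVM_CAP_S390_HPAGE_2G. */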

-- 

Thanks,

David / dhildenb


