
From: Maciej S. Szmigiero
Subject: Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
Date: Fri, 25 Sep 2020 00:37:32 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0

On 23.09.2020 14:48, David Hildenbrand wrote:
(...)
>>
>> I know that this a source of significant hot removal slowdown, especially
>> when a "ripple effect" happens on removal:
>> 1) There are 3 extra DIMMs plugged into the guest: A, B, C.
>>    A and B are nearly empty, but C is nearly full.
>>
>> 2) The host does not know which DIMM is empty and which is full,
>>    so it requests the guest to unplug stick C,
> 
> In theory, the host can simply track inflation requests. In practice,
> guests tend to sometimes re-access balloon-inflated memory (e.g., simple
> kexec-style reboot under Linux, kdump on older Linux versions), so it's
> not completely safe to do.

You are describing the Linux situation here, while this driver targets
Windows.

I think the issues you describe (kexec, etc.) are probably fixable once
somebody is determined enough to do so.

I mean, either the old kernel needs to transfer information about
"forbidden" memory areas to the new kernel or the new kernel needs to
query these somehow (probably only if it receives a "do it" flag from
the old kernel).

>>
>> 3) The guest copies the content of the stick C to the stick B,
>>
>> 4) Once again, the host does not know which DIMM is empty and which
>>    is full, so it requests the guest to unplug stick B,
>>
>> 5) The guest now has to copy the same data from stick B to stick A,
>>    once again.
>>
>> With virtual DIMM sticks + this driver it is the guest which chooses
>> which particular pages to release, hopefully choosing the already unused
>> ones.
>> Once the whole memory behind a DIMM stick is released the host knows
>> that it can be unplugged now without any copying.
>>
>> While it might seem like this will cause a lot of fragmentation, in
>> practice Windows seems to try to give out the largest contiguous range
>> of pages it is able to find.
>>
>> One can also see in the hv_balloon client driver from the Linux kernel
>> that this driver tries to do 2 MB allocations for as long as it can
>> before giving out single pages.
> 
> Yeah, something similar was proposed for virtio-balloon already (and
> there is a paper about that - see below). For virtio-balloon, we were
> not yet convinced that stealing most hugepages from the guest is always
> such a good idea. It at least would have to be configurable, to not mess
> with existing use cases.

Thanks for the paper.

The hv_balloon Linux client driver does large contiguous allocations of
ordinary 4k pages, not just transparent hugepages.

Even if the virtio-balloon client driver is changed to do higher-order
allocations, the protocol itself only supports transporting individual
page numbers, not ranges.

Either the virtio-balloon protocol will need to be changed to allow
sending page ranges (good), or there will need to be some implicit
agreement between the host and client drivers that pages from a
contiguous range will be sent consecutively.
Then the host driver can reassemble the whole range, ending either
when it receives a page outside the range or when a reassembly timeout
happens.

But the latter is really ugly - realistically, the virtio-balloon
protocol will need to be changed in such a case.
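
Just to illustrate the first option (this is purely a sketch of mine,
not any existing virtio definition): a range-based queue entry would
not need to carry more than a PFN plus a length, for example:

#include <stdint.h>

/*
 * Hypothetical range-based balloon queue entry - NOT part of any
 * existing virtio spec, just a sketch of what such an extension could
 * carry: one entry per contiguous run of pages instead of one array
 * slot per individual PFN.
 */
struct balloon_page_range {
        uint64_t start_pfn;     /* first page frame number of the run */
        uint64_t nr_pages;      /* number of contiguous pages in the run */
};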

> 
>>
>> The reason why ballooning and DIMMs wasn't being used together previously
>> is probably because virtio-balloon will (at least on Windows) mark the
>> ballooned out pages as in use inside the guest, preventing the removal
>> of the DIMM stick backing them.
> 
> Same on Linux - but the pages are movable, so they can at least be
> moved around to offline + unplug a DIMM.
> 
> Some of the reasons why ballooning + DIMMs are not used as far as I know is:
> - Management issues. Using it without some managing instance
>   (plug/unplug DIMM, control balloon) is impossible. Try using it with
>   bare QEMU - basically impossible.

Well, it's not impossible, it just needs some manual reconfiguration.

But such advanced usage scenarios usually use QEMU with an external
controller anyway.

> - Memory Hotplug limitations: Maximum DIMM count. Minimum DIMM size that
>   an OS can use (e.g., >= 128MiB under Linux, sometimes even 1GB). The
>   granularity restrictions you mentioned apply.

Windows has a memory hotplug granularity of 1 MB.

> - Memory Hotplug reliability: It can happen easily that hotplugging a
>   DIMM / onlining it under Linux fails (e.g., minimum DIMM size). "What
>   you think the guests actually has as available memory might be wrong".
>   If you ignore that (and you don't even get notified) and adjust the
>   balloon later, your (Linux) guest might be in trouble. Assume
>   you hotplug a 8GB to a 2G VM and later try to inflate the balloon to
>   4GB ... so you need reliable monitoring and error handling.

The DM "hot add" message has a "page_count" field to tell the host
how much memory the guest has actually added.

Quoting the DM protocol header file:
> Hot adds may also fail due to low resources; in this case, the guest must
> not complete this message until the hot add can succeed, and the host must
> not send a new hot add request until the response is sent.
> If VSC fails to hot add memory DYNMEM_NUMBER_OF_UNSUCCESSFUL_HOTADD_ATTEMPTS
> times it fails the request.
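
Concretely, the hot add reply is tiny - paraphrased below from the
message definitions in the Linux hv_balloon driver
(drivers/hv/hv_balloon.c), with the kernel types swapped for stdint
ones, so treat it as a sketch rather than the authoritative layout:

#include <stdint.h>

/* Common header carried by every DM message (paraphrased). */
struct dm_header {
        uint16_t type;          /* DM message type */
        uint16_t size;          /* message size, including this header */
        uint32_t trans_id;      /* transaction id for matching replies */
};

/* Guest -> host reply to a hot add request (paraphrased). */
struct dm_hot_add_response {
        struct dm_header hdr;
        uint32_t page_count;    /* pages the guest actually hot added */
        uint32_t result;        /* nonzero if the hot add succeeded */
};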

The host also knows the current guest memory size from the guest's
STATUS messages.

> So yeah, I can understand how hv-balloon tries to work around some of
> these issues.
>
>>
>> In addition to the above, virtio-balloon is also very slow, as the whole
>> protocol operates on single pages only, not on page ranges.
>> There might also be some interference with Windows memory management
>> causing an extra slowdown in comparison to the native Windows DM
>> protocol.
> 
> Yes, because I assume Windows doesn't really care too much about
> optimizing for virtio-balloon. There isn't too much external developers
> can do about that. See below for hugepage ballooning in virtio-balloon.

Exactly, that's why the DM protocol is the best thing we have to offer
for Windows guests right now.

>>
>> If the KVM slot limit starts to be a problem in practice then we can
>> think what can be done about it.
>> It's always one obstacle less.
> 
> I'm not a friend of leaving the challenging/fundamental problems to be
> sorted out in the future (e.g., resizing initial boot memory, dealing
> with fundamental limits - like KVM memory slots or VMA). But I get how
> it's easier to get something running this way :)

Constraints can be removed step by step, when they actually start to bite.

It is unrealistic to have a perfect guest resize solution in a single
patch series upfront; the issue is just too complex.

>>
>> I see that the same KVM slot limit probably applies also for virtio-mem,
>> since it uses memory-backend-ram as its backing memory device, too,
>> right?
> 
> Yes, one memory backend per virtio-mem device. You usually have one
> device per NUMA node.

So if you want to dynamically manage most of the guest memory, these
virtio-mem devices + their backends will be very large
(circa 1/4 of the guest size each for a 4-node machine).

That means that in practice they will never be able to be hot-removed
before the VM is rebooted, since there will very likely be at least a
single stuck page in each of these backing devices preventing their
removal.
(That is, if hot-removal support is ever enabled for virtio-mem - it
looks like it is not possible yet even on a VM reboot.)

And I can see that removing a single RAM block in virtio-mem is done
by discarding it via MADV_DONTNEED, just like in ballooning.
Only the minimum block size is 2 MB rather than 4 KB, so all 512
consecutive pages in a block need to be free in order to discard it.
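
Stripped of the surrounding QEMU plumbing, that discard step boils down
to something like the sketch below (a minimal illustration of the
madvise() call, not the actual virtio-mem code):

#include <stdio.h>
#include <sys/mman.h>

/* Minimum virtio-mem block size discussed above: 2 MiB. */
#define BLOCK_SIZE (2 * 1024 * 1024)

/*
 * Drop the host memory backing one block: the mapping stays intact,
 * but the pages are freed and read back as zeroes afterwards.
 * All 512 4k pages covered by the block must be unused by the guest
 * for this to be safe.
 */
int discard_block(void *block_addr)
{
        if (madvise(block_addr, BLOCK_SIZE, MADV_DONTNEED) != 0) {
                perror("madvise(MADV_DONTNEED)");
                return -1;
        }
        return 0;
}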

>>
>> If not, then how do you map a totally new memory range into the guest
>> address space without consuming a KVM memory slot?
>> If that's somehow possible then maybe the same mechanism can simply be
>> reused for this driver.
> 
> So, virtio-mem will (in the future, still to be upstreamed by me) use
> resizeable memory regions / ramblocks / KVM memory slots. The region can
> grow (e.g., memory hotplug) and shrink (e.g., during reboot, but later
> also if unplugged memory would allow for it). Nice thing is that
> migration code fully supports resizeable ramblocks already.
> 
> Resizes are triggered by the virtio-mem device, so stuff is completely
> handled inside QEMU. For hv-balloon, you could grow the region when
> required (e.g., balloon X, whereby X is > ram size after inflating), and
> shrink during reboot (or whenever it might be valid to shrink). However,
> you cannot "rip out" anything in between, you'll have to rely on
> MADV_DONTNEED until the guest reboots (well, just like basic
> ballooning) and you can definitely shrink the region.
> 
> That's the tradeoff virtio-mem decided to take for now to be able to
> manage any size changes + migration completely in QEMU, avoiding any
> coordination with an external entity (e.g., libvirt in your example)
> when resizing a guest.

For hv-balloon (that is, Windows guests), it seems to me that the slot
resizing would be a viable solution for growing the guest without
hitting the KVM slot limit or requiring a VM reboot.

For shrinking such guests it isn't really necessary, since DM ballooning
is enough until a reboot happens.

During a reboot, virtual DIMM config optimization can be used for now,
while the best long-term solution would be to actually do what the
Hyper-V hypervisor does in this case: resize the boot memory to match
the target guest size.

This way, the whole issue of the guest seeing only the boot memory part,
not the dynamic part, during its boot will no longer be there.

(...)
>>
>> Or it can be started with a small boot memory size, but this means that
>> few virtual DIMMs might always be inserted (their size and / or count
>> can be optimized during the next reboot or if they become unused due
>> to ballooning).
>>
>> Or one can choose some point in between these two scenarios.
>>  
>> I think a virtio-mem user has to choose a similar trade-off between
>> the boot memory size and the size and count of plugged-in virtio-mem
>> devices, right?
> 
> Partially yes, partially no. It doesn't really care about the second
> case you mention ("few virtual DIMMs might always be inserted") due to
> the way it works. And it never has to deal with "inflate balloon
> after/during reboot".
> 
> A virtio-mem device manages only its assigned memory, it does not work
> on random system memory like memory ballooning. So you can never unplug
> initial memory.
> 
> However, you can do something like
> 
> -m 4G,maxmem=104G
> 
> and define a virtio-mem device with a maximum size of 100G and an
> initial size of - say 16G. When booting up, the guest will detect the
> additional 16GB and have effectively 20GB. However, you can only ever
> shrink back down to 4GB (e.g., reliably during a reboot).
>
> And it might not always be desirable (at least under Linux) to have
> little boot memory - say the CMA allocator might want to reserve bigger
> chunks of memory early during boot - if the initial memory size is too
> small, this can fail easily.
> 

So the virtio-mem trade-off is between the size of the boot memory and
the dynamically managed part(s).
And also the block size since, as far as I can see, a single stuck page
in a block will prevent it from being discarded.

Of course, we are talking about Linux guests here - a Windows guest
will see just the boot memory part.
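
For reference, as far as I understand the current QEMU virtio-mem
support, the setup you describe above would look roughly like this on
the command line (sizes only as an example):

qemu-system-x86_64 \
    -m 4G,maxmem=104G \
    -object memory-backend-ram,id=vmem0,size=100G \
    -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=16G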

(...)
>>>
>>>>>>
>>>>>> Future directions:
>>>>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>>>>   virtio-balloon drivers.
>>>>>>   Currently, only one of them can be added to the VM at the same time.
>>>>>
>>>>> Yeah, that makes sense. Only one at a time.
>>>>
>>>> Having only one *active* at a time makes sense, however it ultimately
>>>> would be nice to be able to have them both inserted into a VM:
>>>> one for Windows guests and one for Linux ones.
>>>> Even though only one obviously would be active at the same time.
>>>
>>> I don't think that's the right way forward - that should be configured
>>> when the VM is started.
>>>
>>> Personal opinion: I can understand the motivation to implement
>>> hypervisor-specific devices to better support closed-source operating
>>> systems. But I doubt we want to introduce+support ten different
>>> proprietary devices based on proprietary standards doing roughly the
>>> same thing just because closed-source operating systems are too lazy to
>>> support open standards properly.
>>>
>>
>> What do you mean by "ten" proprietary devices?
>> Is there another balloon protocol driver currently in the tree other
>> than virtio-balloon running over various buses?
> 
> Maybe OSX wants to be next and re-invent the wheel with a proprietary
> balloon driver for a custom hypervisor. I think you get the idea.
> 
> What I'm saying is that I'd much rather want to see Windows
> improve+extend virtio-balloon and such (virtio-mem), instead of
> requiring hypervisors to implement undocumented, proprietary devices to
> get stuff running somewhat smoothly in modern cloud environments. I have
> the feeling that quite some stuff you mention can simply be "fixed" by
> extending/improving virtio-balloon under Windows - for example, inflation
> speed can be improved significantly by inflating in bigger chunks. See
> 
> https://dl.acm.org/doi/10.1145/3240302.3240420
> 
> In contrast to hv-balloon *we* can extend/improve the
> interface/standard/implementation on both sides (host/guest).

Even if we switched virtio-balloon to bigger allocations and made the
protocol return page ranges, the allocation would still be done by
simply using an ordinary alloc_pages()-equivalent API.
I don't see any exported Windows kernel API for allocating balloon
memory.

The same goes for adding new RAM.

The above alone means that supporting virtio-mem semantics on current
Windows versions is likely not possible.

>>
>> People are running Windows guests using QEMU, too.
>>
>> That's why there are a dozen or so Hyper-V enlightenments implemented,
>> even though they duplicate KVM PV stuff or that there is kvmvapic
>> with its Windows guest live-patching.
> 
> IIRC the Hyper-V enlightenment stuff is properly publicly documented -

They are only documented from the guest perspective - basically, what a
guest of the Hyper-V hypervisor can possibly use.

There is nothing in the Hyper-V TLFS about which of the functionality it
documents is supported or required by any particular Windows version, or
about how Windows guests actually make use of these features.

Not to mention that the documented interface could say the guest can
expect values A, B, or C for parameter X, which is technically true,
while the actual Hyper-V hypervisor always uses A and that's what
Windows will expect.
 
> whereby last time I checked, the hv-balloon is completely undocumented
> and has to be reverse engineered from the Linux implementation. Please
> correct me if I'm wrong - I am not able to spot references in your cover
> letter as well - I'd be interested into that!

The DM protocol is rather straightforward - the Linux driver contains
well-commented definitions of its messages.

For a hot add, the host simply provides the start page frame number and
the count of pages to add, and in response receives the number of pages
the guest was actually able to hot add.

For a balloon request, the host provides the count of pages it wants the
guest to free, and the guest responds with the page ranges it has
managed to release.
The reverse happens for an unballoon request.

The protocol consists of just a few simple messages, well described
in the Linux driver.

The VMBus part of the protocol works in the same way as in other VMBus
devices.
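
To give an idea of how small the message set is, the balloon request
and response look roughly as follows - again paraphrased from
drivers/hv/hv_balloon.c with the kernel types replaced, so a sketch
rather than the authoritative layout (struct dm_header as sketched
earlier in this mail):

#include <stdint.h>

/* Host -> guest: please free this many 4k pages. */
struct dm_balloon {
        struct dm_header hdr;
        uint32_t num_pages;
        uint32_t reservedz;
};

/* A contiguous run of guest pages, packed into a single 64-bit value. */
union dm_mem_page_range {
        struct {
                uint64_t start_page:40; /* PFN of the first page in the run */
                uint64_t page_cnt:24;   /* number of pages in the run */
        } finfo;
        uint64_t page_range;
};

/* Guest -> host: the page ranges it has actually managed to release. */
struct dm_balloon_response {
        struct dm_header hdr;
        uint32_t reservedz;
        uint32_t more_pages:1;          /* more response messages follow */
        uint32_t range_count:31;        /* entries in range_array below */
        union dm_mem_page_range range_array[];
};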

>>
>> Not to mention many, many devices like e1000 or VMware vmxnet3 even
>> though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
>> Or the applesmc driver, which is cleanly designed to help run just
>> one proprietary OS.
> 
> IIRC we need the devices either to bootstrap - e.g., use e1000 until we
> can install virtio-net once the guest is up and running, or to support
> older, unmodified guests. I'd like to stress that what you are proposing
> is different in that sense. Your Windows VM will work just fine without
> a hv-balloon device.

There are also "accelerator" devices like kvmvapic, and proprietary
devices for which an older open standard exists, like sii3112 SATA,
even though IDE / AHCI can be used to bootstrap the guest.
The sii3112 driver was only added in 2017, although the hardware it
emulates comes from the early 21st century.

> Again, just my personal opinion, I don't make any decisions around here :)
> 

Thanks for taking the time to provide your feedback and insight,
Maciej


