From: Auger Eric
Subject: Re: [Qemu-arm] [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion and PCDIMM/NVDIMM support
Date: Thu, 28 Feb 2019 08:48:18 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

Hi Igor, Shameer,

On 2/27/19 6:51 PM, Igor Mammedov wrote:
> On Wed, 27 Feb 2019 10:41:45 +0000
> Shameerali Kolothum Thodi <address@hidden> wrote:
> 
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Auger Eric [mailto:address@hidden
>>> Sent: 27 February 2019 10:27
>>> To: Igor Mammedov <address@hidden>
>>> Cc: address@hidden; address@hidden; address@hidden;
>>> address@hidden; Shameerali Kolothum Thodi
>>> <address@hidden>; address@hidden;
>>> address@hidden; address@hidden;
>>> address@hidden
>>> Subject: Re: [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion
>>> and PCDIMM/NVDIMM support
>>>
>>> Hi Igor, Shameer,
>>>
>>> On 2/27/19 11:10 AM, Igor Mammedov wrote:  
>>>> On Tue, 26 Feb 2019 18:53:24 +0100
>>>> Auger Eric <address@hidden> wrote:
>>>>  
>>>>> Hi Igor,
>>>>>
>>>>> On 2/26/19 5:56 PM, Igor Mammedov wrote:  
>>>>>> On Tue, 26 Feb 2019 14:11:58 +0100
>>>>>> Auger Eric <address@hidden> wrote:
>>>>>>  
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> On 2/26/19 9:40 AM, Auger Eric wrote:  
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>> On 2/25/19 10:42 AM, Igor Mammedov wrote:  
>>>>>>>>> On Fri, 22 Feb 2019 18:35:26 +0100
>>>>>>>>> Auger Eric <address@hidden> wrote:
>>>>>>>>>  
>>>>>>>>>> Hi Igor,
>>>>>>>>>>
>>>>>>>>>> On 2/22/19 5:27 PM, Igor Mammedov wrote:  
>>>>>>>>>>> On Wed, 20 Feb 2019 23:39:46 +0100
>>>>>>>>>>> Eric Auger <address@hidden> wrote:
>>>>>>>>>>>  
>>>>>>>>>>>> This series aims to bump the 255GB RAM limit in machvirt and to
>>>>>>>>>>>> support device memory in general, and especially PCDIMM/NVDIMM.
>>>>>>>>>>>>
>>>>>>>>>>>> In machvirt versions < 4.0, the initial RAM starts at 1GB and can
>>>>>>>>>>>> grow up to 255GB. From 256GB onwards we find IO regions such as
>>>>>>>>>>>> the additional GICv3 RDIST region, the high PCIe ECAM region and
>>>>>>>>>>>> the high PCIe MMIO region. The address map was 1TB large. This
>>>>>>>>>>>> corresponded to the max IPA capacity KVM was able to manage.
>>>>>>>>>>>>
>>>>>>>>>>>> Since 4.20, the host kernel is able to support a larger and dynamic
>>>>>>>>>>>> IPA range. So the guest physical address can go beyond 1TB. The
>>>>>>>>>>>> max GPA size depends on the host kernel configuration and the
>>>>>>>>>>>> physical CPUs.
>>>>>>>>>>>>
>>>>>>>>>>>> In this series we use this feature and allow the RAM to grow
>>>>>>>>>>>> without any other limit than the one imposed by the host kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> The RAM still starts at 1GB. First comes the initial RAM (-m) of
>>>>>>>>>>>> size ram_size, and then comes the device memory (,maxmem) of size
>>>>>>>>>>>> maxram_size - ram_size. The device memory is potentially
>>>>>>>>>>>> hotpluggable depending on the instantiated memory objects.
>>>>>>>>>>>>
>>>>>>>>>>>> IO regions previously located between 256GB and 1TB are moved
>>>>>>>>>>>> after the RAM. Their offset is dynamically computed and depends on
>>>>>>>>>>>> ram_size and maxram_size. Size alignment is enforced.
>>>>>>>>>>>>
>>>>>>>>>>>> If the maxmem value is lower than 255GB, the legacy memory map is
>>>>>>>>>>>> still used. The change of memory map becomes effective from 4.0
>>>>>>>>>>>> onwards.
>>>>>>>>>>>>
>>>>>>>>>>>> As we keep the initial RAM at the 1GB base address, we do not need
>>>>>>>>>>>> to make invasive changes in the EDK2 FW. It seems nobody is eager
>>>>>>>>>>>> to do that job at the moment.
>>>>>>>>>>>>
>>>>>>>>>>>> Device memory being put just after the initial RAM, it is possible
>>>>>>>>>>>> to get access to this feature while keeping a 1TB address map.
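As a rough illustration of the layout described above (a sketch only: the constants, types and helper names below are made up for the example and are not the actual hw/arm/virt.c code):

/* Sketch of the extended memory map described above; only the
 * maxram_size > 255GB case is shown, the legacy map being kept otherwise.
 * All names are illustrative, not the actual hw/arm/virt.c code. */
#include <stdint.h>

#define GiB (1ULL << 30)
#define VIRT_MEM_BASE (1 * GiB)   /* initial RAM keeps its 1GB base */

typedef struct VirtMemMapSketch {
    uint64_t ram_base, ram_size;               /* initial RAM (-m) */
    uint64_t device_mem_base, device_mem_size; /* PC-DIMM/NVDIMM (,maxmem) */
    uint64_t high_io_base;  /* relocated RDIST, high ECAM, high MMIO */
} VirtMemMapSketch;

static uint64_t align_up(uint64_t v, uint64_t a)
{
    return (v + a - 1) & ~(a - 1);
}

static void compute_map(VirtMemMapSketch *map,
                        uint64_t ram_size, uint64_t maxram_size)
{
    map->ram_base = VIRT_MEM_BASE;
    map->ram_size = ram_size;

    /* Device memory starts right after the initial RAM. */
    map->device_mem_base = align_up(map->ram_base + ram_size, GiB);
    map->device_mem_size = maxram_size - ram_size;

    /* IO regions that used to sit between 256GB and 1TB now float after
     * the end of device memory, their offsets computed from ram_size and
     * maxram_size, with size alignment enforced. */
    map->high_io_base = align_up(map->device_mem_base + map->device_mem_size,
                                 GiB);
}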
>>>>>>>>>>>>
>>>>>>>>>>>> This series reuses/rebases patches initially submitted by Shameer
>>>>>>>>>>>> in [1] and Kwangwoo in [2] for the PC-DIMM and NV-DIMM parts.
>>>>>>>>>>>>
>>>>>>>>>>>> Functionally, the series is split into 3 parts:
>>>>>>>>>>>> 1) bump of the initial RAM limit [1 - 9] and change in
>>>>>>>>>>>>    the memory map  
>>>>>>>>>>>  
>>>>>>>>>>>> 2) Support of PC-DIMM [10 - 13]  
>>>>>>>>>>> Is this part complete ACPI-wise (for coldplug)? I haven't noticed
>>>>>>>>>>> DSDT AML here nor E820 changes, so ACPI-wise pc-dimm shouldn't be
>>>>>>>>>>> visible to the guest. It might be that DT is masking the problem,
>>>>>>>>>>> but, well, that won't work on ACPI-only guests.
>>>>>>>>>>
>>>>>>>>>> guest /proc/meminfo or "lshw -class memory" reflects the amount of
>>>>>>>>>> mem added with the DIMM slots.
>>>>>>>>> Question is how does it get there? Does it come from DT or from
>>>>>>>>> firmware via UEFI interfaces?
>>>>>>>>>  
>>>>>>>>>> So it looks fine to me. Isn't E820 a pure x86 matter?  
>>>>>>>>> Sorry for misleading, I meant UEFI GetMemoryMap().
>>>>>>>>> On x86, I'm wary of adding PC-DIMMs to E820, which then gets exposed
>>>>>>>>> via UEFI GetMemoryMap(), as the guest kernel might start using it as
>>>>>>>>> normal memory early at boot and later put that memory into zone
>>>>>>>>> normal, hence making it non-hot-unpluggable. The same concerns apply
>>>>>>>>> to DT-based means of discovery.
>>>>>>>>> (That's a guest issue, but it's easy to work around by not putting
>>>>>>>>> hotpluggable memory into UEFI GetMemoryMap() or DT and letting the
>>>>>>>>> DSDT describe it properly.)
>>>>>>>>> That way the memory doesn't get (ab)used by firmware or early boot
>>>>>>>>> kernel stages and doesn't get locked up.
>>>>>>>>>  
>>>>>>>>>> What else would you expect in the dsdt?  
>>>>>>>>> Memory device descriptions, look for code that adds PNP0C80 with
>>>>>>>>> _CRS describing memory ranges.
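For reference, a minimal sketch of such a memory device built with QEMU's AML helpers (aml_device(), aml_name_decl(), aml_append(), ...). This only illustrates the PNP0C80 + _CRS idea: the helper function is made up and the _CRS range is hard-coded, whereas the real build_memory_hotplug_aml() code derives the ranges from the hotplug MMIO interface.

/* Illustration only: a static PNP0C80 memory device with a fixed _CRS.
 * sketch_add_memory_device() is a made-up helper; the real QEMU code
 * (build_memory_hotplug_aml) generates _CRS from the memory-hotplug
 * MMIO registers instead of hard-coding the range. */
#include "qemu/osdep.h"
#include "hw/acpi/aml-build.h"

static void sketch_add_memory_device(Aml *scope, int slot,
                                     uint64_t base, uint64_t size)
{
    Aml *dev = aml_device("MP%02X", slot);
    Aml *crs = aml_resource_template();

    aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0C80")));
    aml_append(dev, aml_name_decl("_UID", aml_int(slot)));

    /* Describe the memory range backing this slot. */
    aml_append(crs, aml_qword_memory(AML_POS_DECODE, AML_MIN_FIXED,
                                     AML_MAX_FIXED, AML_CACHEABLE,
                                     AML_READ_WRITE,
                                     0, base, base + size - 1, 0, size));
    aml_append(dev, aml_name_decl("_CRS", crs));

    aml_append(scope, dev);
}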
>>>>>>>>
>>>>>>>> OK, thank you for the explanations. I will work on the PNP0C80
>>>>>>>> addition then.
>>>>>>>> Does it mean that in ACPI mode we must not output DT hotplug memory
>>>>>>>> nodes, or, assuming that PNP0C80 is properly described, will it
>>>>>>>> "override" the DT description?
>>>>>>>
>>>>>>> After further investigation, I think the pieces you pointed out are
>>>>>>> added by Shameer's series, i.e. through the build_memory_hotplug_aml()
>>>>>>> call. So I suggest we separate the concerns: this series brings support
>>>>>>> for DIMM coldplug; hotplug, including all the relevant ACPI structures,
>>>>>>> will be added later on by Shameer.
>>>>>>
>>>>>> Maybe we should not put pc-dimms in DT for this series until it is
>>>>>> clear whether it conflicts with ACPI in some way.
>>>>>
>>>>> I guess you mean removing the DT hotpluggable memory nodes only in ACPI
>>>>> mode? Otherwise you simply remove the DIMM feature, right?  
>>>> Something like this, so DT won't get in conflict with ACPI.
>>>> Except that we don't have a switch for it, something like -machine fdt=on
>>>> (with default off).
>>>>  
>>>>> I double checked and if you remove the hotpluggable memory DT nodes in
>>>>> ACPI mode:
>>>>> - you do not see the PCDIMM slots in guest /proc/meminfo anymore. So I
>>>>> guess you're right, if the DT nodes are available, that memory is
>>>>> considered as not unpluggable by the guest.
>>>>> - You can see the NVDIMM slots using ndctl list -u. You can mount a DAX
>>>>> system.
>>>>>
>>>>> Hotplug/unplug is clearly not supported by this series and any attempt
>>>>> results in "memory hotplug is not supported". Is it really an issue if
>>>>> the guest does not consider DIMM slot memory as hot-unpluggable? I am
>>>>> not even sure the guest kernel would support unplugging that memory.
>>>>>
>>>>> If we want all the ACPI tables to be ready for making this memory seen
>>>>> as hot-unpluggable, we need some of Shameer's patches on top of this
>>>>> series.
>>>> Maybe we should push for this (into 4.0); it's just a few patches after
>>>> all, or we could even merge them into your series (I'd guess they would
>>>> need to be rebased on top of your latest work).
>>>
>>> Shameer, would you agree if we merge PATCH 1 of your RFC hotplug series
>>> (without the reduced hw_reduced_acpi flag) in this series and isolate in
>>> a second PATCH the acpi_memory_hotplug_init() + build_memory_hotplug_aml
>>> called in virt code?  
> Probably we can do it as a transitional step, as we need a working mmio
> interface in place for build_memory_hotplug_aml() to work, provided it won't
> create migration issues (do we need VMSTATE_MEMORY_HOTPLUG for the cold-plug
> case?).
> 
> What about a dummy initial GED (an empty device) that manages the mmio
> region only and is later filled with the remaining logic and IRQ? In that
> case the mmio region and vmstate won't change (maybe), so it won't cause ABI
> or migration issues.


> 
> 
>> Sure, that’s fine with me. So what would you use for the 
>> event_handler_method in
>> build_memory_hotplug_aml()? GPO0 device?
> 
> A method name not defined in the spec, so it won't be called, might do.

At this point the event_handler_method, i.e. \_SB.GPO0._E02, is not
supposed to be called, right? So effectively we should be able to use
any other method name (not linked to any GPIO/GED). I guess at this stage
only the PNP0C80 definition blocks + methods are used.
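In other words, something along these lines, assuming the build_memory_hotplug_aml() signature from the x86 code of that era; both the resource root and the "\_SB.MHPJ" method name below are placeholders I made up, the latter deliberately a name no spec defines so that nothing ever invokes it:

/* Cold-plug-only wiring sketch: generate the PNP0C80 devices and slot
 * methods, but hang the event handler off a method name ("\\_SB.MHPJ",
 * made up here) that no GPE/GED/GPIO path ever calls. */
build_memory_hotplug_aml(dsdt, ms->ram_slots, "\\_SB", "\\_SB.MHPJ");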

What still remains fuzzy to me is that, in the cold-plug case, only the
mmio hotplug control region is read (apart from the slot selection, of
course), and it returns 0 for addr/size as well as for the flags, meaning
the slot is not enabled. So although the slots are advertised as
hotpluggable/enabled in the SRAT, I am not sure it actually makes any
difference to the OS whether the DSDT definition blocks are described or
not.
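To make the flow concrete, here is a hypothetical, heavily simplified model of such a control region; the register offsets and flag bits are invented for the sketch and do not match QEMU's actual hw/acpi/memory_hotplug.c layout. The AML first selects a slot and then reads back that slot's address, size and flags:

/* Hypothetical, simplified model of the hotplug control region, only to
 * illustrate the select-then-read flow discussed above. Offsets and flag
 * bits are invented; NOT QEMU's actual memory-hotplug register layout. */
#include <stdbool.h>
#include <stdint.h>

typedef struct SlotSketch {
    uint64_t addr;
    uint64_t size;
    bool enabled;        /* slot populated and usable by the OS */
    bool insert_event;   /* pending hotplug notification */
} SlotSketch;

static uint64_t mhp_read_sketch(const SlotSketch *slots, unsigned selector,
                                unsigned offset)
{
    const SlotSketch *s = &slots[selector];

    switch (offset) {
    case 0x00: return s->addr & 0xffffffffu;   /* base address, low  */
    case 0x04: return s->addr >> 32;           /* base address, high */
    case 0x08: return s->size & 0xffffffffu;   /* size, low  */
    case 0x0c: return s->size >> 32;           /* size, high */
    case 0x14: return (s->insert_event << 1) | s->enabled;  /* flags */
    default:   return 0;
    }
}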

To be honest, I am afraid it is too late to add those additional
features for 4.0 now. This would jeopardize the first preliminary part,
which is the introduction of the new memory map, allowing the expansion
of the initial RAM and paving the way for the introduction of device
memory. So I think I am going to resend the first 10 patches as a
standalone series, and we can iterate on the PCDIMM/NVDIMM parts
independently.

Thanks

Eric
> 
> 
>>
>> Thanks,
>> Shameer
>>
>>> Then would remain the GED/GPIO actual integration.
>>>
>>> Thanks
>>>
>>> Eric  
>>>>  
>>>>> Also, don't DIMM slots already make sense in DT mode? Usually we accept
>>>>> adding a feature in DT first and then in ACPI. For instance we can benefit
>>>> Usually they don't conflict with each other (at least I'm not aware of it),
>>>> but I see a problem with it in this case.
>>>>
>>>>> from nvdimm in DT mode, right? So, considering an incremental approach, I
>>>>> would be in favour of keeping the DT nodes.
>>>> I'd guess it is the same as for DIMMs: ACPI support for NVDIMMs is much
>>>> more versatile.
>>>>
>>>> I consider the target application of arm/virt to be a board that is used
>>>> in production to run generic ACPI-capable guests in most use cases, with
>>>> various DT-only guests as secondary ones. It's hard to make both use cases
>>>> happy with the defaults (that's probably one of the reasons why the 'sbsa'
>>>> board is being added).
>>>>
>>>> So I'd give priority to ACPI-based arm/virt over DT where defaults are
>>>> concerned.
>>>>  
>>>>> Thanks
>>>>>
>>>>> Eric  