Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving"
From: Zhijian Li (Fujitsu)
Subject: Re: [PATCH v2 1/1] migration: skip poisoned memory pages on "ram saving" phase
Date: Mon, 18 Sep 2023 03:47:58 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0

On 15/09/2023 19:31, William Roche wrote:
> On 9/15/23 05:13, Zhijian Li (Fujitsu) wrote:
>>
>>
>> I'm okay with "RDMA isn't touched".
>> BTW, could you share the program or hack you use to poison the page, so
>> that I can take a look at the RDMA part later when I'm free?
>>
>> Not sure it's suitable to acknowledge an untouched part. Anyway
>> Acked-by: Li Zhijian <lizhijian@fujitsu.com> # RDMA
>>
>
> Thanks.
> As you asked for a procedure to inject memory errors into a running VM,
> I've attached to this email the source code (mce_process_react.c) of a
> program that will help to target the error injection in the VM.
>
Thank you very much for the details. Noted :)
Thanks
Zhijian
> (Be careful: error injection is currently not working on AMD
> platforms -- this is a work in progress in a separate qemu thread)
>
>
> The general idea:
> We are going to target a process memory page running inside a VM to see
> what happens when we inject an error on the underlying physical page at
> the platform (hypervisor) level.
> To have a better view of what's going on, we'll use a process made for
> this: Its goal is to allocate a memory page and install a SIGBUS
> handler that reports when it receives this signal. It will also wait before
> touching this page to see what happens next.
>
> Compiling this tool:
> $ gcc -o mce_process_react_x86 mce_process_react.c
>
>
> Let's try that:
> This procedure shows the best case scenario, where an error injected at
> the platform level is reported up to the guest process using it.
> Note that qemu should be started with root privilege.
>
> 1. Choose a process running in the VM (and identify a memory page
> you want to target, and get its physical address – crash(8) vtop can
> help with that) or run the attached mce_process_react example (compiled
> for your platform mce_process_react_[x86|arm]) with an option to be
> informed early of _AO errors (-e) and to wait for ENTER before reading
> the allocated page (-w 0):
>
> [root@VM ]# ./mce_process_react_x86 -e -w 0
> Setting Early kill... Ok
>
> Data pages at 0x7fa0f9b25000 physically 0x200f2fa000
>
> Press ENTER to continue with page reading
>
>
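For readers who want to obtain the "physically" address themselves, here is a minimal sketch of one common way to do it from inside the guest: reading /proc/<pid>/pagemap. This is an assumption about a possible approach, not a description of what the attached mce_process_react.c does. The function names are hypothetical. It relies on the documented pagemap layout (one 64-bit entry per virtual page, PFN in bits 0-54, "present" flag in bit 63) and needs root, since the kernel hides PFNs from unprivileged readers.

```python
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # 4096 on typical x86_64 Linux
PFN_MASK = (1 << 55) - 1                # bits 0-54 of a pagemap entry
PRESENT_BIT = 1 << 63                   # bit 63: page is present in RAM


def decode_pagemap_entry(entry, vaddr, page_size=PAGE_SIZE):
    """Return the physical address backing vaddr, or None if unknown."""
    if not entry & PRESENT_BIT:
        return None                     # page not present (never touched, swapped)
    pfn = entry & PFN_MASK
    if pfn == 0:
        return None                     # PFN hidden: caller is not privileged
    return pfn * page_size + vaddr % page_size


def virt_to_phys(vaddr, pid="self"):
    """Look up vaddr in /proc/<pid>/pagemap (hypothetical helper, needs root)."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)  # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        return None
    (entry,) = struct.unpack("=Q", data)
    return decode_pagemap_entry(entry, vaddr)
```

With the addresses shown above, a present entry carrying PFN 0x200f2fa for virtual address 0x7fa0f9b25000 decodes to the guest physical address 0x200f2fa000.
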
> 2. Go into the VM monitor to get the translation from "Guest
> Physical Address to Host Physical Address" or "Host Virtual Address":
>
> (qemu) gpa2hpa 0x200f2fa000
> Host physical address for 0x200f2fa000 (ram-node1) is 0x46f12fa000
>
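If this step is scripted, the monitor reply can be picked apart with a small helper. The sketch below assumes the reply has exactly the shape shown above. The regular expression and the function name are illustrative, not part of qemu.

```python
import re

# Matches the gpa2hpa reply format shown in step 2 (assumed stable).
GPA2HPA_RE = re.compile(
    r"Host physical address for (0x[0-9a-fA-F]+) \((.+?)\) is (0x[0-9a-fA-F]+)"
)


def parse_gpa2hpa(line):
    """Return (guest_phys, memory_region_name, host_phys) from a gpa2hpa reply."""
    m = GPA2HPA_RE.search(line)
    if m is None:
        raise ValueError(f"unexpected gpa2hpa output: {line!r}")
    return int(m.group(1), 16), m.group(2), int(m.group(3), 16)
```

For the reply above, this yields the guest physical address 0x200f2fa000, the region name "ram-node1", and the host physical address 0x46f12fa000.
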
>
> 3. Before we inject the error, we want to keep track of the VM
> console output (in a separate window).
> If you are using libvirt: # virsh console myvm
>
>
> 4. We now prepare for the error injection at the platform level to
> the address we found. To do so, we'll need to use the hwpoison-inject
> module (x86).
> Be careful, as hwpoison takes Page Frame Numbers and this PFN is not the
> physical address – you need to drop the low 12 bits (the last 3 hex
> digits, here zeros, of the above address)!
>
> [root@hv ]# modprobe hwpoison-inject
> [root@hv ]# echo 0x46f12fa > /sys/kernel/debug/hwpoison/corrupt-pfn
>
> If you see an "Operation not permitted" error when writing as root
> to corrupt-pfn, you may be hitting kernel_lockdown(7), which is
> enabled on SecureBoot systems (can be verified with
> "mokutil --sb-state"). In this case, turn SecureBoot off (at the UEFI
> level for example).
>
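The address-to-PFN conversion in step 4 is a right shift by the page shift. A minimal sketch, assuming 4 KiB pages (a page shift of 12, which matches "remove the last 12 bits"). The function name is hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages, as on x86_64


def phys_to_pfn(phys_addr, page_shift=PAGE_SHIFT):
    """Drop the in-page offset bits to get the Page Frame Number."""
    return phys_addr >> page_shift


# Host physical address from step 2 -> value written to corrupt-pfn in step 4.
print(hex(phys_to_pfn(0x46f12fa000)))  # prints 0x46f12fa
```

The same shift applied to the guest physical address 0x200f2fa000 gives the guest PFN 0x200f2fa.
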
> 5. Look at the qemu output (either on the terminal where qemu was
> started or, if you are using libvirt: tail /var/log/libvirt/qemu/myvm)
>
> 2022-08-31T13:52:25.645398Z qemu-system-x86_64: warning: Guest MCE Memory
> Error at QEMU addr 0x7eeeace00000 and GUEST addr 0x200f200 of type
> BUS_MCEERR_AO injected
>
> 6. On the guest console:
> We'll see the VM reaction to the injected error:
>
> [ 155.805149] Disabling lock debugging due to kernel taint
> [ 155.806174] mce: [Hardware Error]: Machine check events logged
> [ 155.807120] Memory failure: 0x200f200: Killing mce_process_rea:3548 due to
> hardware memory corruption
> [ 155.808877] Memory failure: 0x200f200: recovery action for dirty LRU page:
> Recovered
>
> 7. The Guest process that we started at the first step gives:
>
> Signal 7 received
> BUS_MCEERR_AO on vaddr: 0x7fa0f9b25000
>
> At this stage, the VM has a poisoned page, and a migration of this VM
> needs to be fixed in order to avoid accessing the poisoned page.
>
> 8. The process continues to run (as it handled the SIGBUS).
> Now if you press ENTER on this process terminal, it will try to read the
> page which will generate a new MCE (a synchronous one) at VM level which
> will be sent to this process:
>
> Signal 7 received
> BUS_MCEERR_AR on vaddr: 0x7fa0f9b25000
> Exit from the signal handler on BUS_MCEERR_AR
>
> 9. The VM console shows:
> [ 2520.895263] MCE: Killing mce_process_rea:3548 due to hardware memory
> corruption fault at 7f45e5265000
>
> 10. The VM continues to run...
> With a poisoned page in its address space
>
> HTH,
> William.