From: Maxim Levitsky
Subject: NFS client deadlock while doing nested virtualization tests on my systems
Date: Tue, 05 Oct 2021 17:08:59 +0300
User-agent: Evolution 3.36.5 (3.36.5-2.fc32)
Lately I have been trying to narrow down what now looks like an NFSv4 client bug. I believe I have enough information to pass it on to people who are more knowledgeable in this area.

My setup:

- My main work computer (a Zen2 AMD machine) exports an NFS share with all the files that all my other lab computers and VMs use.

- Nested virtualization is used to reproduce the issue, but I believe it only contributes the unusual timings that make this deadlock more reproducible.

- The first level of virtualization (L1) (that is, a regular VM running on KVM on bare metal) is my Fedora 32 test VM with a 5.14 kernel, built from exactly the same source as the host kernel (but with a different .config). The L1 VM uses a regular qcow2 file located on the main computer's filesystem and runs from it. The L1 VM uses the kernel's NFSv4 client to access L0's shared directory, where all the VMs' qcow2 files reside (including its own, but it pretends it doesn't know this to avoid escaping to the real world ;-) ). To rule out some kind of deadlock caused by L2 still somehow accessing the L1 qcow2 file, I moved L1's qcow2 files out of the shared folder, so now L1 mounts only a single folder over NFS, which contains the test VM and nothing else. That didn't help either.

- The second level of virtualization (L2) (a nested VM running in the L1 VM) is a Windows 10 VM (build 1909, as I didn't bother updating this test VM) which uses a pair of qcow2 files (a base and a snapshot on top of it), both located on that NFS-shared folder. The L2 VM used to run yet another level of virtualization with Hyper-V, but I was able to reproduce the issue with Hyper-V disabled.

- For all testing, a custom-compiled 5.14.0 kernel was used on both the host (L0) and the L1 guest (with patches from the kvm tree and some of my own patches, which I believe don't contribute to the issue). I also tested older kernels in the L1 guest, and I tested 5.15-rc3 on both L0 and L1, with the same result.

- Qemu 5.2.0 was used for all userspace parts of KVM in both L0 and L1 (I usually use the latest master branch of the qemu git, but since I suspected qemu, I used a relatively old version that I had been using for a long time before).

- The issue was reproduced with vhost disabled in both L0 and L1; virtio-net was used for networking on top of a tap interface. The issue was also reproduced without '-overcommit cpu_pm=on' (a qemu option I tend to use too often).

- aio=native and cache=none were used for the IO of the L2 VM:

    -device virtio-scsi,id=scsi-ctrl,bus=rport.0
    -drive if=none,id=os_image,file=./disk_s1.qcow2,aio=native,discard=unmap,cache=none
    -device scsi-hd,drive=os_image,bus=scsi-ctrl.0,bootindex=1

  However, I also tested L2 with discard=ignore,aio=threads, and the guest eventually crashed in the same way.

- I also tested with e1000e in L1, and the test failed there as well, indicating that this isn't related to virtio.

- The test seems to survive if the Windows VM is started on a different system (also a Zen2 machine), with its qcow2 file still streamed over NFS. Network speed, though, is vastly different in that case, since I use a 1GbE network.

- I tested NFSv3 as well. The NFS client also hangs, but when it recovers the VM doesn't crash. On the other hand, when I force-stop L1, the NFS server deadlocks when attempting to restart it, and once it even oopsed the host. I attached the backtraces for this.
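Given the qemu configuration above (aio=native on an O_DIRECT-opened image, plus a thread-pool worker issuing fallocate), the load on the NFS file boils down to concurrent io_submit writes and fallocate calls, which is exactly what the backtraces below show. In case it helps to reproduce this without the whole nested setup, here is a minimal standalone sketch of just that pattern. Note that I have NOT verified that it triggers the hang, and the file path, sizes and offsets are made up (build with 'gcc -O2 repro.c -laio -lpthread'):

    /* repro.c - hypothetical reproducer: concurrent io_submit(O_DIRECT)
     * and fallocate on one file on an NFS mount. Untested sketch;
     * path, sizes and offsets are examples only. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK 4096                /* O_DIRECT alignment */
    static int fd;

    /* Mimics qemu's thread-pool worker: preallocate ranges in a loop. */
    static void *falloc_thread(void *arg)
    {
        for (long off = 0; ; off = (off + (1 << 20)) % (1 << 30))
            if (fallocate(fd, 0, off, 1 << 20))
                perror("fallocate");
        return NULL;
    }

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        pthread_t t;

        /* Test file on the NFS mount (path is an example). */
        fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || io_setup(64, &ctx) || posix_memalign(&buf, BLK, BLK))
            return 1;

        pthread_create(&t, NULL, falloc_thread, NULL);

        /* Mimics the qemu main thread: keep O_DIRECT writes in flight. */
        for (long off = 0; ; off = (off + BLK) % (1 << 30)) {
            io_prep_pwrite(&cb, fd, buf, BLK, off);
            if (io_submit(ctx, 1, cbs) != 1)  /* qemu blocks here in the hang */
                break;
            if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                break;
        }
        return 0;
    }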
What is happening: the nested VM (L2) boots, and after a while hangs on IO. L1's NFS client deadlocks, which causes L1 itself to mostly hang, "nfs style". After a few minutes it recovers by reconnecting and returns IO errors to the nested guest, which bluescreens and crashes.

Qemu is usually blocked on NFS IO, but not always; in one case it was just hung, probably blocking on the GUI or something else which was in turn blocked on NFS.

The dmesg in L1 usually shows qemu blocked on NFS twice. One is the qemu main thread doing 'io_submit' on the file, which was opened with O_DIRECT:

[ 1844.377677] task:qemu-system-x86 state:D stack: 0 pid: 4517 ppid: 4511 flags:0x00000000
[ 1844.379164] Call Trace:
[ 1844.379613]  __schedule+0x2dd/0x1440
[ 1844.380245]  schedule+0x4e/0xb0
[ 1844.380834]  rwsem_down_read_slowpath+0x2e4/0x330
[ 1844.381662]  down_read+0x43/0x90
[ 1844.382277]  nfs_start_io_direct+0x1e/0x70 [nfs]
[ 1844.383181]  nfs_file_direct_write+0x1f7/0x340 [nfs]
[ 1844.384113]  nfs_file_write+0x26f/0x320 [nfs]
[ 1844.384907]  aio_write+0xfc/0x1f0
[ 1844.385511]  ? __pollwait+0xd0/0xd0
[ 1844.386187]  ? __fget_files+0x5f/0x90
[ 1844.386859]  ? __fget_light+0x32/0x80
[ 1844.387520]  io_submit_one+0x1f6/0x790
[ 1844.388255]  __x64_sys_io_submit+0x84/0x180
[ 1844.389049]  ? fput+0x13/0x20
[ 1844.389603]  ? ksys_read+0xce/0xe0
[ 1844.390225]  do_syscall_64+0x36/0xb0
[ 1844.390870]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The other is one of the qemu thread-pool workers (which is still used for the fallocate syscall) running that syscall:

[ 1844.403632] task:worker state:D stack: 0 pid: 4555 ppid: 4511 flags:0x00000000
[ 1844.405108] Call Trace:
[ 1844.405587]  __schedule+0x2dd/0x1440
[ 1844.406215]  ? hrtimer_cancel+0x15/0x30
[ 1844.406904]  ? futex_wait+0x223/0x250
[ 1844.407601]  schedule+0x4e/0xb0
[ 1844.408194]  inode_dio_wait+0xd1/0x100
[ 1844.408871]  ? var_wake_function+0x30/0x30
[ 1844.409617]  nfs_sync_inode+0x13/0x30 [nfs]
[ 1844.410426]  nfs42_proc_fallocate.isra.0+0x91/0x100 [nfsv4]
[ 1844.411504]  nfs42_proc_allocate+0x8c/0xe0 [nfsv4]
[ 1844.412426]  nfs42_fallocate+0x59/0x80 [nfsv4]
[ 1844.413222]  vfs_fallocate+0x152/0x2f0
[ 1844.413926]  __x64_sys_fallocate+0x44/0x70
[ 1844.414635]  do_syscall_64+0x36/0xb0

While this is happening, the NFS client in L1 hangs it, blocking all IO on the NFS mount. The NFS server continues to work normally: I am able to do IO from my other test systems to the server, e.g. start another VM on a test system which also uses NFS to stream the VM's qcow2 file.

The bug is not 100% reproducible and is relatively random, but it happens often in my current setup of VMs that always start from the same snapshot. Usually, leaving the VM running for an hour or two makes the bug reproduce. I know that a few months ago it didn't happen, or at least I didn't notice it despite a lot of testing with the same VMs. I still don't know what might have changed; it is possible that I didn't use NFS back then.

The Wireshark captures I did seem to show that all requests submitted to the NFS server are replied to, so it looks like the server is not the reason for the hang, but it is possible that some of the requests didn't reach the server in the first place. I attached such a trace (I have the full trace if needed). The last packet before the hang is frame 81322.

I attach more detailed dmesg captures in case it helps.
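For what it's worth, here is my (simplified, possibly wrong) reading of the two code paths from the 5.14 sources, matching the backtraces above. The fallocate side takes the inode lock exclusively and then waits for all in-flight direct IO to drain, while the io_submit side waits to take the same lock in nfs_start_io_direct; so if one already-submitted direct write never completes, both threads block forever:

    /* Simplified from fs/nfs/io.c, fs/nfs/inode.c and fs/nfs/nfs42proc.c
     * as I read them in 5.14; error handling and most details elided. */

    /* io_submit side (qemu main thread): */
    void nfs_start_io_direct(struct inode *inode)
    {
            down_read(&inode->i_rwsem);     /* <-- blocked here, see
                                               rwsem_down_read_slowpath above */
            /* ... */
    }

    /* fallocate side (qemu thread-pool worker): */
    long nfs42_proc_allocate(struct file *filep, loff_t offset, loff_t len)
    {
            /* ... */
            inode_lock(inode);              /* takes i_rwsem for write */
            err = nfs42_proc_fallocate(&msg, filep, offset, len);
                                            /* -> nfs_sync_inode() */
            /* ... */
    }

    int nfs_sync_inode(struct inode *inode)
    {
            inode_dio_wait(inode);          /* <-- blocked here while holding
                                               i_rwsem, waiting for i_dio_count
                                               to drop to zero */
            return nfs_wb_all(inode);
    }

If that reading is right, the interesting question is why the in-flight direct IO never drains; the Wireshark capture suggesting that the server replies to everything would point at the client side.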
I also captured the kernel backtraces of all qemu threads via /proc/pid/task/tid/stack:

202206 : qemu-system-x86
[<0>] nfs_start_io_direct+0x1e/0x70 [nfs]
[<0>] nfs_file_direct_write+0x1f7/0x340 [nfs]
[<0>] nfs_file_write+0x26f/0x320 [nfs]
[<0>] aio_write+0xfc/0x1f0
[<0>] io_submit_one+0x1f6/0x790
[<0>] __x64_sys_io_submit+0x84/0x180
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202219 : qemu-system-x86
[<0>] futex_wait_queue_me+0xa4/0x100
[<0>] futex_wait+0x105/0x250
[<0>] do_futex+0xea/0xa30
[<0>] __x64_sys_futex+0x74/0x1b0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202220 : gmain
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202221 : threaded-ml
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202224 : IO mon_iothread
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202225 : CPU 0/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202226 : CPU 1/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202227 : CPU 2/KVM
[<0>] futex_wait_queue_me+0xa4/0x100
[<0>] futex_wait+0x105/0x250
[<0>] do_futex+0xea/0xa30
[<0>] __x64_sys_futex+0x74/0x1b0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202228 : CPU 3/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
204640 : worker
[<0>] inode_dio_wait+0xd1/0x100
[<0>] nfs_sync_inode+0x13/0x30 [nfs]
[<0>] nfs42_proc_fallocate.isra.0+0x91/0x100 [nfsv4]
[<0>] nfs42_proc_allocate+0x8c/0xe0 [nfsv4]
[<0>] nfs42_fallocate+0x59/0x80 [nfsv4]
[<0>] vfs_fallocate+0x152/0x2f0
[<0>] __x64_sys_fallocate+0x44/0x70
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================

As you can see, all nested vCPUs are blocked in the L1 kernel, so when the hang happens, a hung vCPU should not be the problem. It is possible, though, that L0 is not letting L1's vCPUs run for some reason, and that is what is causing the NFS client to stall.

Once I was able to capture a backtrace like the above where qemu was not stuck on NFS but was still hung, with the NFS client still hanging L1; in that case it is likely that qemu got stuck on something else (the GUI?).

Any help debugging this is welcome. I can provide more info if needed.

Best regards,
Maxim Levitsky
Attachments:
- backtrace.txt
- backtrace2.txt
- nfs3_crash1.txt
- nfs3_crash2.txt
- wireshark_capture.tar.xz