From: Maxim Levitsky
Subject: NFS client deadlock while doing nested virtualization tests on my systems
Date: Tue, 05 Oct 2021 17:08:59 +0300
User-agent: Evolution 3.36.5 (3.36.5-2.fc32)
Lately I have been trying to narrow down what now looks like an NFSv4 client bug. I believe I have enough information to pass it on to people who are more knowledgeable in this area.

My setup:

- My main work computer (a Zen2 AMD machine) exports an NFS share with all the files that all my other lab computers and VMs use.

- Nested virtualization is used to reproduce the issue, but I believe it only contributes the unusual timings that make this deadlock more reproducible.

- The first level of virtualization (L1) (that is, a regular VM running on KVM on bare metal) is my Fedora 32 test VM with a 5.14 kernel, built from exactly the same source as the host kernel (but with a different .config). The L1 VM uses a regular qcow2 file located on the main computer's filesystem and runs from it. The L1 VM uses the kernel's NFSv4 client to access L0's shared directory, where all the VMs' qcow2 files reside (including its own, but it pretends it doesn't know this to avoid escaping to the real world ;-) ). To rule out some kind of deadlock caused by L2 still somehow accessing the L1 qcow2 file, I moved L1's qcow2 files out of the shared folder, so now L1 mounts only a single folder over NFS, which contains the test VM and nothing else. That didn't help either.

- The second level of virtualization (L2) (a nested VM running in the L1 VM) is a Windows 10 VM (build 1909, as I didn't bother updating this test VM) which uses a pair of qcow2 files (a base and a snapshot on top of it), both located on that NFS-shared folder. The L2 VM used to run yet another level of virtualization with Hyper-V, but I was able to reproduce the issue with Hyper-V disabled.

- For all testing, a custom-compiled 5.14.0 kernel was used on both the host (L0) and the L1 guest (with patches from the kvm tree and some of my own patches, which I believe don't contribute to the issue). I also tested older kernels in the L1 guest, and I tested 5.15-rc3 on both L0 and L1, with the same result.

- Qemu 5.2.0 was used for all userspace parts of KVM in both L0 and L1 (I usually use the latest master branch of the qemu git, but since I suspected qemu, I used a relatively old version that I had been using for a long time before).

- The issue was reproduced with vhost disabled in both L0 and L1; virtio-net was used for networking on top of a tap interface. The issue was also reproduced without '-overcommit cpu_pm=on' (a qemu option I tend to use too often).

- aio=native and cache=none were used for the IO of the L2 VM:

    -device virtio-scsi,id=scsi-ctrl,bus=rport.0
    -drive if=none,id=os_image,file=./disk_s1.qcow2,aio=native,discard=unmap,cache=none
    -device scsi-hd,drive=os_image,bus=scsi-ctrl.0,bootindex=1

  However, I also tested L2 with discard=ignore,aio=threads, and the guest eventually crashed in the same way.

- I also tested with e1000e in L1, and the test failed there as well, indicating that this isn't related to virtio.

- The test seems to survive if the Windows VM is started on a different system (also a Zen2 machine), with its qcow2 file still streamed over NFS. Network speed, though, is vastly different in that case, since I use a 1GbE network.

- I tested NFSv3 as well. The NFS client also hangs, but when it recovers the VM doesn't crash. On the other hand, when I force-stop L1, the NFS server deadlocks when attempting to restart it, and once it even oopsed the host. I attached the backtraces for this.
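Given the qemu configuration above (aio=native on an O_DIRECT-opened image, plus a thread-pool worker issuing fallocate), the load on the NFS file boils down to concurrent io_submit writes and fallocate calls, which is exactly what the backtraces below show. In case it helps to reproduce this without the whole nested setup, here is a minimal standalone sketch of just that pattern. Note that I have NOT verified that it triggers the hang, and the file path, sizes and offsets are made up (build with 'gcc -O2 repro.c -laio -lpthread'):

    /* repro.c - hypothetical reproducer: concurrent io_submit(O_DIRECT)
     * and fallocate on one file on an NFS mount. Untested sketch;
     * path, sizes and offsets are examples only. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK 4096                /* O_DIRECT alignment */
    static int fd;

    /* Mimics qemu's thread-pool worker: preallocate ranges in a loop. */
    static void *falloc_thread(void *arg)
    {
        for (long off = 0; ; off = (off + (1 << 20)) % (1 << 30))
            if (fallocate(fd, 0, off, 1 << 20))
                perror("fallocate");
        return NULL;
    }

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        pthread_t t;

        /* Test file on the NFS mount (path is an example). */
        fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || io_setup(64, &ctx) || posix_memalign(&buf, BLK, BLK))
            return 1;

        pthread_create(&t, NULL, falloc_thread, NULL);

        /* Mimics the qemu main thread: keep O_DIRECT writes in flight. */
        for (long off = 0; ; off = (off + BLK) % (1 << 30)) {
            io_prep_pwrite(&cb, fd, buf, BLK, off);
            if (io_submit(ctx, 1, cbs) != 1)  /* qemu blocks here in the hang */
                break;
            if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
                break;
        }
        return 0;
    }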
What is happening: the nested VM (L2) boots, and after a while hangs on IO. L1's NFS client deadlocks, which causes L1 itself to mostly hang, "nfs style". After a few minutes it recovers by reconnecting and returns IO errors to the nested guest, which bluescreens and crashes.

Qemu is usually blocked on NFS IO, but not always; in one case it was just hung, probably blocking on the GUI or something else which was in turn blocked on NFS.

The dmesg in L1 usually shows qemu blocked on NFS twice. One is the qemu main thread doing 'io_submit' on the file, which was opened with O_DIRECT:

[ 1844.377677] task:qemu-system-x86 state:D stack: 0 pid: 4517 ppid: 4511 flags:0x00000000
[ 1844.379164] Call Trace:
[ 1844.379613]  __schedule+0x2dd/0x1440
[ 1844.380245]  schedule+0x4e/0xb0
[ 1844.380834]  rwsem_down_read_slowpath+0x2e4/0x330
[ 1844.381662]  down_read+0x43/0x90
[ 1844.382277]  nfs_start_io_direct+0x1e/0x70 [nfs]
[ 1844.383181]  nfs_file_direct_write+0x1f7/0x340 [nfs]
[ 1844.384113]  nfs_file_write+0x26f/0x320 [nfs]
[ 1844.384907]  aio_write+0xfc/0x1f0
[ 1844.385511]  ? __pollwait+0xd0/0xd0
[ 1844.386187]  ? __fget_files+0x5f/0x90
[ 1844.386859]  ? __fget_light+0x32/0x80
[ 1844.387520]  io_submit_one+0x1f6/0x790
[ 1844.388255]  __x64_sys_io_submit+0x84/0x180
[ 1844.389049]  ? fput+0x13/0x20
[ 1844.389603]  ? ksys_read+0xce/0xe0
[ 1844.390225]  do_syscall_64+0x36/0xb0
[ 1844.390870]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The other is one of the qemu thread-pool workers (which is still used for the fallocate syscall) running that syscall:

[ 1844.403632] task:worker state:D stack: 0 pid: 4555 ppid: 4511 flags:0x00000000
[ 1844.405108] Call Trace:
[ 1844.405587]  __schedule+0x2dd/0x1440
[ 1844.406215]  ? hrtimer_cancel+0x15/0x30
[ 1844.406904]  ? futex_wait+0x223/0x250
[ 1844.407601]  schedule+0x4e/0xb0
[ 1844.408194]  inode_dio_wait+0xd1/0x100
[ 1844.408871]  ? var_wake_function+0x30/0x30
[ 1844.409617]  nfs_sync_inode+0x13/0x30 [nfs]
[ 1844.410426]  nfs42_proc_fallocate.isra.0+0x91/0x100 [nfsv4]
[ 1844.411504]  nfs42_proc_allocate+0x8c/0xe0 [nfsv4]
[ 1844.412426]  nfs42_fallocate+0x59/0x80 [nfsv4]
[ 1844.413222]  vfs_fallocate+0x152/0x2f0
[ 1844.413926]  __x64_sys_fallocate+0x44/0x70
[ 1844.414635]  do_syscall_64+0x36/0xb0

While this is happening, the NFS client in L1 hangs it, blocking all IO on the NFS mount. The NFS server continues to work normally: I am able to do IO from my other test systems to the server, e.g. start another VM on a test system which also uses NFS to stream the VM's qcow2 file.

The bug is not 100% reproducible and is relatively random, but it happens often in my current setup of VMs that always start from the same snapshot. Usually, leaving the VM running for an hour or two makes the bug reproduce. I know that a few months ago it didn't happen, or at least I didn't notice it despite a lot of testing with the same VMs. I still don't know what might have changed; it is possible that I didn't use NFS back then.

The Wireshark captures I did seem to show that all requests submitted to the NFS server are replied to, so it looks like the server is not the reason for the hang, but it is possible that some of the requests didn't reach the server in the first place. I attached such a trace (I have the full trace if needed). The last packet before the hang is frame 81322.

I attach more detailed dmesg captures in case it helps.
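For what it's worth, here is my (simplified, possibly wrong) reading of the two code paths from the 5.14 sources, matching the backtraces above. The fallocate side takes the inode lock exclusively and then waits for all in-flight direct IO to drain, while the io_submit side waits to take the same lock in nfs_start_io_direct; so if one already-submitted direct write never completes, both threads block forever:

    /* Simplified from fs/nfs/io.c, fs/nfs/inode.c and fs/nfs/nfs42proc.c
     * as I read them in 5.14; error handling and most details elided. */

    /* io_submit side (qemu main thread): */
    void nfs_start_io_direct(struct inode *inode)
    {
            down_read(&inode->i_rwsem);     /* <-- blocked here, see
                                               rwsem_down_read_slowpath above */
            /* ... */
    }

    /* fallocate side (qemu thread-pool worker): */
    long nfs42_proc_allocate(struct file *filep, loff_t offset, loff_t len)
    {
            /* ... */
            inode_lock(inode);              /* takes i_rwsem for write */
            err = nfs42_proc_fallocate(&msg, filep, offset, len);
                                            /* -> nfs_sync_inode() */
            /* ... */
    }

    int nfs_sync_inode(struct inode *inode)
    {
            inode_dio_wait(inode);          /* <-- blocked here while holding
                                               i_rwsem, waiting for i_dio_count
                                               to drop to zero */
            return nfs_wb_all(inode);
    }

If that reading is right, the interesting question is why the in-flight direct IO never drains; the Wireshark capture suggesting that the server replies to everything would point at the client side.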
I also captured the kernel backtraces of all qemu threads via /proc/pid/task/tid/stack:

202206 : qemu-system-x86
[<0>] nfs_start_io_direct+0x1e/0x70 [nfs]
[<0>] nfs_file_direct_write+0x1f7/0x340 [nfs]
[<0>] nfs_file_write+0x26f/0x320 [nfs]
[<0>] aio_write+0xfc/0x1f0
[<0>] io_submit_one+0x1f6/0x790
[<0>] __x64_sys_io_submit+0x84/0x180
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202219 : qemu-system-x86
[<0>] futex_wait_queue_me+0xa4/0x100
[<0>] futex_wait+0x105/0x250
[<0>] do_futex+0xea/0xa30
[<0>] __x64_sys_futex+0x74/0x1b0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202220 : gmain
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202221 : threaded-ml
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202224 : IO mon_iothread
[<0>] do_sys_poll+0x3f9/0x610
[<0>] __x64_sys_poll+0x3b/0x140
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202225 : CPU 0/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202226 : CPU 1/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202227 : CPU 2/KVM
[<0>] futex_wait_queue_me+0xa4/0x100
[<0>] futex_wait+0x105/0x250
[<0>] do_futex+0xea/0xa30
[<0>] __x64_sys_futex+0x74/0x1b0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
202228 : CPU 3/KVM
[<0>] kvm_vcpu_block+0x5c/0x3a0 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x651/0x1710 [kvm]
[<0>] kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[<0>] __x64_sys_ioctl+0x8e/0xc0
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================
204640 : worker
[<0>] inode_dio_wait+0xd1/0x100
[<0>] nfs_sync_inode+0x13/0x30 [nfs]
[<0>] nfs42_proc_fallocate.isra.0+0x91/0x100 [nfsv4]
[<0>] nfs42_proc_allocate+0x8c/0xe0 [nfsv4]
[<0>] nfs42_fallocate+0x59/0x80 [nfsv4]
[<0>] vfs_fallocate+0x152/0x2f0
[<0>] __x64_sys_fallocate+0x44/0x70
[<0>] do_syscall_64+0x36/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
=================

As you can see, all nested vCPUs are blocked in the L1 kernel, so when the hang happens, a hung vCPU should not be the problem. It is possible, though, that L0 is not letting L1's vCPUs run for some reason, and that is what is causing the NFS client to stall.

Once I was able to capture a backtrace like the above where qemu was not stuck on NFS but was still hung, with the NFS client still hanging L1; in that case it is likely that qemu got stuck on something else (the GUI?).

Any help debugging this is welcome. I can provide more info if needed.

Best regards,
Maxim Levitsky
Attachments:
- backtrace.txt
- backtrace2.txt
- nfs3_crash1.txt
- nfs3_crash2.txt
- wireshark_capture.tar.xz