qemu-discuss

Re: Unpredictable performance degradation in QEMU KVMs


From: Parnell Springmeyer
Subject: Re: Unpredictable performance degradation in QEMU KVMs
Date: Wed, 6 Oct 2021 10:58:37 -0500

Hi Frantisek, thanks for replying.

I've not checked using `latencytop`. I will do that, thanks for the suggestion.

The most frustrating part is that the performance degradation has so far been very hard to reproduce manually, so we haven't really been able to determine whether it's a CPU performance issue, storage I/O, or contention.

Those aren't dumb questions; you're talking to someone who doesn't work with this sort of technology much, so it is very helpful to get an idea of what I should be looking at.

I know the host and guest use the same architecture, so we can eliminate that as an issue.

Thanks for the feedback; I'll see if I can discover anything interesting from the ideas you've suggested I poke around at.

On Wed, Oct 6, 2021 at 4:06 AM Frantisek Rysanek <Frantisek.Rysanek@post.cz> wrote:
On 5 Oct 2021 at 18:58, Parnell Springmeyer wrote:
>
> Hi, we use QEMU VMs for running our integration testing
> infrastructure and have run into a very difficult to debug problem:
> occasionally we will see a severe performance degradation in some of
> our QEMU VMs.
>
If memory serves, QEMU guests appear to run as processes in the Linux
host instance. I'm not "in the know" enough to tell you how much is
happening under the hood on the kernel support side of things, which
is potentially not well described by that superficial abstraction
visible in "top".

Esoteric issues aside (CPU arch incompatibilities between host and
guest), have you tried inspecting what the load looks like, in the
guest and in the host OS instance? What does "top" show? With CPU
cores expanded? (press "1")
Have you tried "latencytop" by any chance?
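For a quick look, the usual suspects would be something along these
lines (all standard tools; mpstat comes with the sysstat package, and
latencytop needs CONFIG_LATENCYTOP enabled in the kernel):

   top              # then press "1" to expand the per-core lines
   mpstat -P ALL 2  # per-core %usr/%sys/%iowait/%steal every 2 seconds
   latencytop       # interactive view of what tasks are waiting on

Inside the guest, a high %steal would hint that the host isn't
scheduling the vCPU as often as the guest would like.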

Are you sure this is a CPU performance/emulation issue?
What storage are your VMs using? Could storage be the bottleneck?
Isn't the observed "sluggishness" storage-I/O-bound rather than
CPU-bound? Can you tell the difference? (Heck... apologies, that's
probably a series of dumb questions to someone @arista.com)
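A rough way to tell the two apart, run in both the guest and the host
(iostat and pidstat are also from sysstat; the exact device names are
whatever your setup has):

   vmstat 2      # high "wa" = waiting on I/O, high "us"/"sy" = CPU-bound
   iostat -x 2   # %util and await per block device
   pidstat -d 2  # per-process disk I/O, to spot the culprit process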

Stuff can get sluggish when IRQs don't work right. Any signs of that
in the guest instance? Interesting messages in dmesg, interesting
numbers in /proc/interrupts?
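For example, nothing fancy, just to see whether any counters look
stuck or one CPU is eating all the interrupts:

   dmesg -T | tail -n 50               # recent kernel messages, readable timestamps
   watch -d -n 1 cat /proc/interrupts  # -d highlights counters that change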

CPU arch emulation issues (guest vs. host) might also be an issue. If
you specify a different CPU model for the guest than the host actually
has, some fringe parts of the instruction set, even within the x86_64
family, may need to be tediously emulated for the guest instance...
also, I'd hazard a guess that 32-bit vs. 64-bit *might* play a role,
albeit a marginal one. I have fond memories of the 387 math
co-processor emulation (and its effects on program runtime), but
that's a *long* time ago :-)
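If that's a suspicion, comparing "lscpu" output (model name and flags)
in the guest vs. on the host would show what is being hidden or faked.
And if you control the QEMU command line, passing the host CPU straight
through avoids most of that, e.g. (the rest of the command line being
your own, of course):

   qemu-system-x86_64 -enable-kvm -cpu host ...
   # or, with libvirt, <cpu mode='host-passthrough'/> in the domain XML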

I've seen EXT3 and EXT4 hang for no apparent reason, on bare metal,
under heavy IOps stress. CPU consumption at 0%, disk IOps a flat 0,
but the filesystem would block forever, at a standstill. If I recall
correctly, I used Bonnie++ to generate that kind of stress
reproducibly, against fast block storage (HW RAID back then). There
was no QEMU in the game.
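If you want to try to reproduce it yourself, bonnie++ or fio against
the same backing storage should do; the directory below is just a
placeholder for wherever your VM images or test filesystem live:

   bonnie++ -d /mnt/testdir -u nobody
   fio --name=randwrite --rw=randwrite --bs=4k --size=1G \
       --directory=/mnt/testdir --numjobs=4 --ioengine=libaio --direct=1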

Feel free to add some juicy detail for us to ponder :-)

Frank



--
Parnell Springmeyer
