View Issue Details
|0000258: qemu-kvm process goes to 100%, VM becomes unresponsive, Triggered by Creating a new VM
|Multiple times on two separate servers, qemu-kvm processes have suddenly spiked to 100% usage, and the VMs represented by those processes have become unresponsive to the point of needing force-power-cycled. Multiple VMs enter this state at the exact same moment. In two cases, the VM has recovered after multiple minutes, and in both cases, the VM's system clock was set in the future. It seems that the longer the hang goes on, the further in the future, as one machine thought it was the year 2217. Dates were correct before the hangs.
These servers were running CentOS 8 before converting to Alma, and didn't exhibit this bug on CentOS. The hanging VMs have also all been running Alma 8, though that's likely coincidence, as that's most of my VMs now. There hasn't been anything notable in the host machine's dmesg or logs that I have seen.
|Steps To Reproduce
|Run multiple Alma 8 VMs on a server. Migrate an additional VM to that server using a command like virsh migrate --live --persistent --undefinesource --copy-storage-all --verbose Gitlab qemu+ssh://10.10.3.2/system. During the migration process, there is a chance that multiple running VMs will hang in the manner described. Oddly, not every VM hangs, nor does it happen on every migration, but when it occurs, all the affected VMs hang simultaneously.
Additionally, this has been reproduced when generating a new VM on a server. One of the other Alma VMs already running on that server entered this odd failure state.
|Both servers are SuperMicro builds, one a AMD Opteron(TM) Processor 6272, the other a AMD EPYC 7302P 16-Core Processor. I'm using the Opteron_G3 processor type to host, as this allows live migration.
|Just reproduced this issue by *rebooting* a running VM, and a sistem VM jumped to 100% CPU and unresponsive.
|Dmesg from a VM as it hit this bug.
I believe I am seeing the same effect issue. I'm using Rocky but reporting here just because the situation seems very similar to this issue and it could be rare.
The environment is a little complex. Nested virtualization is being used on a Mac. The Outer host is a VirtualBox VM with Rocky 8 installed. The Outer VM hosts inner VMs using KVM. The inner VMs are running Rocky 8 as well.
The first two inner VMs work. When starting the third of fourth, all cpus allocated to inner VMs go to 100% and the inner VMs become almost unresponsive. One inner VM has been seen to produce the following message into /var/log/messages: chronyd: Forward time jump detected!
Stopping one of the inner VMs returns the system to a relatively normal state.
Are you both still seeing this issue?
I'm encountering an issue which I believe is similar, although it may not necessarily be exactly the same. My issue is just an unexplained freeze of the VM. The qemu-kvm process at or near 100% of given CPU. But the VM itself (running top inside the VM) doesn't show a high load or a lot going on.
Both nodes (and VMs) are running Almalinux 8.
There's not really a rhyme or reason as to why or when this will happen. Nothing in any logs, that I can find, that indicate any problem. It's happened twice within a couple of weeks on one VM. Another VM has had this happen once in the past couple of weeks.
A third AlmaLinux node with an AlmaLinux 8 VM has yet to experience this issue.
The one that has had this happen twice is an older Intel E3-1270v2.
The other one that's had this happen once is an Intel i9-11900K.
The one that has yet to have this happen is an AMD Ryzen 9 3900X.
When these freezes happen, my only recourse is to destroy the VM and start it back up.
|Tag Attached: almalinux8
|Tag Attached: QEMU-KVM
|Note Added: 0000585
|Note Added: 0000586
|File Added: log2.png
|File Added: log1.png
|Note Added: 0000757
|Note Added: 0000883