0000445: Fix max integer mmu_invalidate_seq hanging vCPUs - MantisBT

ID	Project	Category	View Status	Date Submitted	Last Update

0000445	AlmaLinux-8	qemu-kvm	public	2023-11-27 19:21	2023-12-04 02:51

Reporter	mutts	Assigned To	alukoshko
Priority	high	Severity	crash	Reproducibility	always
Status	resolved	Resolution	fixed
Platform	x86_64	OS	Almalinux	OS Version	8

Summary	0000445: Fix max integer mmu_invalidate_seq hanging vCPUs
Description	I'm not sure which upstream kernel the AlmaLinux 8 kernel is based on, but every kernel from 4.18.0-425.3.1.el8.x86_64 through the current 4.18.0-513.5.1.el8_9.x86_64 is susceptible to this bug.

In mainline kernel 6.1 this was addressed in commit 82d811ff566594de3676f35808e8a9e19c5c864c, which effectively changes mmu_seq from an int to an unsigned long: https://lore.kernel.org/lkml/2023082606-viper-accuracy-b0fd@gregkh/T/

In mainline kernel 6.3 it was instead fixed through a complete overhaul of the system in commit ba6e3fe25543: https://lore.kernel.org/lkml/2023082644-vaporizer-stuffy-b8bc@gregkh/T/

Either way, the AlmaLinux 8 kernel needs to be updated to resolve this issue in the is_page_fault_stale() function. Kernel 4.18.0-372.26.1.el8_6.x86_64 (and presumably 4.18.0-372.32.1.el8_6.x86_64) is not affected because it does not have the is_page_fault_stale() function.
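To illustrate why an int-typed mmu_seq matters, here is a minimal user-space sketch, not the kernel code. The helper names (stale_with_int_snapshot, stale_with_ulong_snapshot, narrow_to_int) are hypothetical, and the sketch assumes an LP64 target such as x86_64 Linux built with GCC or Clang, where narrowing an out-of-range value to int wraps. It shows that once the counter passes INT_MAX, a snapshot held in an int is sign-extended when compared against the unsigned long counter, so the two never match and the fault would be treated as stale on every retry.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: LP64 target, where unsigned long is wider than int. */
static_assert(sizeof(unsigned long) > sizeof(int), "sketch assumes LP64");

/*
 * Hypothetical stand-in for a staleness check whose snapshot parameter
 * is an int. The int is converted to unsigned long for the comparison,
 * and a negative value is sign-extended during that conversion.
 */
static bool stale_with_int_snapshot(unsigned long current_seq, int snapshot)
{
    return current_seq != (unsigned long)snapshot;
}

/* The same check with the snapshot kept as an unsigned long. */
static bool stale_with_ulong_snapshot(unsigned long current_seq,
                                      unsigned long snapshot)
{
    return current_seq != snapshot;
}

/*
 * Narrow a counter value the way an int parameter would. For values
 * above INT_MAX the result is implementation-defined in C; GCC and
 * Clang wrap modulo 2^32, which is what this sketch relies on.
 */
static int narrow_to_int(unsigned long seq)
{
    return (int)seq;
}
```

With the counter at INT_MAX the int snapshot still compares equal; at INT_MAX + 1 it becomes negative, and stale_with_int_snapshot() reports a mismatch even though nothing changed in between, while stale_with_ulong_snapshot() does not.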
Steps To Reproduce	1. Install AlmaLinux 8 using any kernel between 4.18.0-425.3.1.el8.x86_64 and 4.18.0-513.5.1.el8_9.x86_64.
2. Spin up a KVM guest on that AlmaLinux node.
3. Run a workload inside the KVM guest that repeatedly allocates and frees a lot of memory.
4. Eventually mmu_notifier_seq will hit the maximum signed 32-bit integer (2,147,483,647), at which point the KVM guest will freeze up and become unresponsive.
Additional Information	You can monitor the mmu_notifier_seq count from the host node by running this bpftrace script:

--SNIP--
#if defined(CONFIG_FUNCTION_TRACER)
#define CC_USING_FENTRY
#endif

#include <linux/kvm_host.h>

kprobe:direct_page_fault
{
    $ctr = ((struct kvm_vcpu*)arg0)->kvm->mmu_notifier_seq;
    @counts[pid] = $ctr;
}

interval:s:60
{
    $ts = nsecs + 300000;
    printf("%s\n", strftime("%m-%d-%y %H:%M:%S", $ts));
    print(@counts);
    print("---\n");
}
--SNIP--

Once the count hits 2,147,483,647 (max integer), the guest will become unresponsive. How long this takes depends on how much memory the guest uses and how often its memory pages are cleared. Guests that use very little memory may take months or years to reach 2,147,483,647. Running a program on the KVM guest that continuously consumes and dumps memory may make the issue easier to reproduce.

Looking at the kernel source packages, it would seem that AlmaLinux 9 was also susceptible to this up to kernel 5.14.0-284.11.1, but the latest AlmaLinux 9 kernel, 5.14.0-362.8.1, appears to have been refactored based on mainline kernel 6.3, which fixes this issue.
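As a rough sketch of the kind of guest-side program described above, the following C helpers map anonymous memory, touch every page, and unmap it again. The function names (churn_once, churn) are hypothetical and are not from the report. Whether and how quickly this advances mmu_notifier_seq on the host is an assumption that depends on the host configuration, so watch the counter with the bpftrace script rather than relying on a fixed number of rounds.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory, write one byte to every page so the
 * pages are actually faulted in, then unmap the region.
 * Returns the number of pages touched, or 0 on failure.
 */
static size_t churn_once(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || bytes == 0)
        return 0;

    unsigned char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    size_t touched = 0;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        p[off] = 1;  /* force a page fault for this page */
        touched++;
    }

    munmap(p, bytes);
    return touched;
}

/* Repeat the map/touch/unmap cycle `rounds` times; returns total pages touched. */
static size_t churn(size_t bytes, unsigned rounds)
{
    size_t total = 0;
    for (unsigned i = 0; i < rounds; i++)
        total += churn_once(bytes);
    return total;
}
```

A reproduction run would call churn() from main() in a long loop with a region size close to the guest's free memory; the sizes and round counts are left to the tester.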
Tags	almalinux8, kernel, QEMU-KVM

abrt_hash
URL

Date Modified	Username	Field	Change
2023-11-27 19:21	mutts	New Issue
2023-11-27 19:21	mutts	Tag Attached: almalinux8
2023-11-27 19:21	mutts	Tag Attached: kernel
2023-11-27 19:21	mutts	Tag Attached: QEMU-KVM
2023-11-30 13:03	alukoshko	Assigned To	=> alukoshko
2023-11-30 13:03	alukoshko	Status	new => confirmed
2023-12-03 23:55	alukoshko	Note Added: 0001003
2023-12-04 02:51	alukoshko	Note Added: 0001004
2023-12-04 02:51	alukoshko	Status	confirmed => resolved
2023-12-04 02:51	alukoshko	Resolution	open => fixed