View Issue Details

ID: 0000445
Project: AlmaLinux-8
Category: qemu-kvm
View Status: public
Last Update: 2023-12-04 02:51
Reporter: mutts
Assigned To: alukoshko
Priority: high
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: x86_64
OS: Almalinux
OS Version: 8
Summary: 0000445: Fix max integer mmu_invalidate_seq hanging vCPUs
Description:
I'm not sure which upstream kernel Almalinux 8's kernel is based on, but kernels from 4.18.0-425.3.1.el8.x86_64 through the current 4.18.0-513.5.1.el8_9.x86_64 are susceptible to this bug.

In mainline kernel 6.1 this was addressed in commit 82d811ff566594de3676f35808e8a9e19c5c864c effectively changing mmu_seq from an int to an unsigned long:

https://lore.kernel.org/lkml/2023082606-viper-accuracy-b0fd@gregkh/T/
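To illustrate why the int versus unsigned long distinction matters, here is a minimal Python model. It is not the kernel code: the function names are hypothetical, and it assumes a 64-bit counter whose snapshot passes through a signed 32-bit int before being compared back against the counter.

```python
import ctypes

INT_MAX = 2**31 - 1      # 2,147,483,647
U64_MASK = 2**64 - 1


def store_in_int(seq):
    """Model assigning a 64-bit counter to a C int: keep the low 32 bits, signed."""
    return ctypes.c_int32(seq & 0xFFFFFFFF).value


def widen_to_unsigned_long(value):
    """Model converting a C int back to a 64-bit unsigned long (sign extension)."""
    return value & U64_MASK


def fault_looks_stale_with_int(current_seq, int_snapshot):
    """Hypothetical staleness check where the snapshot was held in an int."""
    return widen_to_unsigned_long(int_snapshot) != current_seq


def fault_looks_stale_with_ulong(current_seq, ulong_snapshot):
    """Same check when the snapshot keeps the counter's full width."""
    return ulong_snapshot != current_seq


if __name__ == "__main__":
    for seq in (INT_MAX, INT_MAX + 1):
        snap = store_in_int(seq)
        print(seq, snap, fault_looks_stale_with_int(seq, snap),
              fault_looks_stale_with_ulong(seq, seq))
```

In this model, once the counter moves past 2,147,483,647 the int snapshot becomes negative and never matches the counter again, even though nothing changed between snapshot and check. A fault that always looks stale would be retried indefinitely, which is consistent with the hang described here.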

Meanwhile this was fixed in mainline kernel 6.3 through a complete overhaul of the system in commit ba6e3fe25543:

https://lore.kernel.org/lkml/2023082644-vaporizer-stuffy-b8bc@gregkh/T/

At any rate, the kernel for Almalinux 8 needs to be updated to resolve this issue in the is_page_fault_stale() function.

Kernel 4.18.0-372.26.1.el8_6.x86_64 (and presumably 4.18.0-372.32.1.el8_6.x86_64) is not affected by this because it does not have the is_page_fault_stale() function.
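A quick way to check whether a host falls inside the range reported above is to compare its `uname -r` string against the two endpoints. The sketch below is only a convenience helper with hypothetical names; it encodes the range from this report and says nothing about kernels released after 4.18.0-513.5.1.

```python
import re

AFFECTED_BASE = "4.18.0"
FIRST_AFFECTED = (425, 3, 1)        # 4.18.0-425.3.1.el8
LAST_KNOWN_AFFECTED = (513, 5, 1)   # 4.18.0-513.5.1.el8_9


def parse_el8_release(uname_r):
    """Split e.g. '4.18.0-425.3.1.el8.x86_64' into ('4.18.0', (425, 3, 1)).

    Returns None for strings that do not look like an el8 kernel release.
    """
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+(?:\.\d+)*)\.el8", uname_r)
    if match is None:
        return None
    base, release = match.groups()
    return base, tuple(int(part) for part in release.split("."))


def in_reported_range(uname_r):
    """True if the kernel release lies within the range named in this report."""
    parsed = parse_el8_release(uname_r)
    if parsed is None:
        return False
    base, release = parsed
    return base == AFFECTED_BASE and FIRST_AFFECTED <= release <= LAST_KNOWN_AFFECTED


if __name__ == "__main__":
    import platform
    print(platform.release(), in_reported_range(platform.release()))
```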
Steps To Reproduce:
Install Almalinux 8 using any kernel between 4.18.0-425.3.1.el8.x86_64 and 4.18.0-513.5.1.el8_9.x86_64.

Spin up a KVM guest on that Almalinux node.

Run a workload inside the KVM guest that repeatedly allocates and frees a large amount of memory.

Eventually mmu_notifier_seq (named mmu_invalidate_seq in newer kernels) will hit the maximum signed 32-bit integer value, 2,147,483,647, at which point the KVM guest will freeze up and become unresponsive.
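One possible way to generate the memory churn described in the steps above is a small script run inside the guest. This is a rough sketch with a hypothetical function name and arbitrary default sizes; whether it actually advances the host-side counter, and how fast, depends on the host configuration and is an assumption here.

```python
def churn_memory(chunk_mib=256, iterations=10, page_size=4096):
    """Repeatedly allocate a buffer, touch every page, then release it.

    Returns the total number of bytes allocated across all iterations.
    """
    total = 0
    for _ in range(iterations):
        buf = bytearray(chunk_mib * 1024 * 1024)
        # Write one byte per page so the pages are really backed by memory.
        for offset in range(0, len(buf), page_size):
            buf[offset] = 1
        total += len(buf)
        del buf  # drop the buffer so the next iteration allocates again
    return total
```

Calling `churn_memory()` in an endless loop inside the guest keeps the cycle going; adjust `chunk_mib` to fit the guest's memory.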
Additional Information:
You can monitor the mmu_notifier_seq count from the host node by running this bpftrace script:

--SNIP--
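// Preprocessor workaround so that the kernel header below compiles under bpftrace.
#if defined(CONFIG_FUNCTION_TRACER)
#define CC_USING_FENTRY
#endif

#include <linux/kvm_host.h>

// On each KVM page fault, record the VM's current mmu_notifier_seq, keyed by pid.
kprobe:direct_page_fault {
    $ctr = ((struct kvm_vcpu*)arg0)->kvm->mmu_notifier_seq;
    @counts[pid] = $ctr;
}

// Once a minute, print a timestamp followed by the latest counter for each pid.
interval:s:60 {
    $ts = nsecs + 300000;
    printf("%s\n", strftime("%m-%d-%y %H:%M:%S", $ts));
    print(@counts);
    print("---\n");
}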
--SNIP--

Once this hits 2,147,483,647 (max integer) the guest will become unresponsive.
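If the script's output is saved to a file, the remaining headroom per process can be computed with a few lines of Python. This sketch assumes the map is printed as lines of the form `@counts[PID]: VALUE`; the function names and the sample text are made up for illustration.

```python
import re

INT_MAX = 2_147_483_647
COUNT_LINE = re.compile(r"^@counts\[(\d+)\]:\s*(-?\d+)\s*$")


def parse_counts(output):
    """Return {pid: counter} from captured script output; later lines win."""
    counts = {}
    for line in output.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def headroom(counts):
    """Return {pid: increments left before the counter reaches INT_MAX}."""
    return {pid: INT_MAX - value for pid, value in counts.items()}


if __name__ == "__main__":
    sample = "11-27-23 19:21:00\n@counts[1234]: 2147000000\n@counts[5678]: 42\n---\n"
    for pid, left in headroom(parse_counts(sample)).items():
        print(f"pid {pid}: {left} increments left")
```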

Depending on how much memory the guest uses and how often its memory pages are cleared, this may take a while. Guests that use very little memory may take months or years to reach 2,147,483,647. Running a program on the KVM guest that continuously allocates and releases memory may let you reproduce this issue more quickly.
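To get a rough idea of how long a given guest has left, two counter readings taken some time apart can be extrapolated linearly. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes the counter keeps growing at the same average rate, which real workloads may not do.

```python
from datetime import timedelta

INT_MAX = 2_147_483_647


def estimate_time_to_limit(earlier_count, later_count, seconds_between):
    """Linearly extrapolate the time until the counter reaches INT_MAX.

    Returns a timedelta, or None if the counter did not grow between samples.
    """
    growth = later_count - earlier_count
    if growth <= 0 or seconds_between <= 0:
        return None
    remaining = max(INT_MAX - later_count, 0)
    return timedelta(seconds=remaining * seconds_between / growth)


if __name__ == "__main__":
    # Two readings taken 60 seconds apart (made-up numbers).
    print(estimate_time_to_limit(1_000_000, 1_060_000, 60))
```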

Looking at the kernel source packages, it would seem that Almalinux 9 was also susceptible to this up to kernel 5.14.0-284.11.1, but the latest Almalinux 9 kernel 5.14.0-362.8.1 appears to have been refactored based on the mainline kernel 6.3 which fixes this issue.
Tags: almalinux8, kernel, QEMU-KVM

Issue History

Date Modified Username Field Change
2023-11-27 19:21 mutts New Issue
2023-11-27 19:21 mutts Tag Attached: almalinux8
2023-11-27 19:21 mutts Tag Attached: kernel
2023-11-27 19:21 mutts Tag Attached: QEMU-KVM
2023-11-30 13:03 alukoshko Assigned To => alukoshko
2023-11-30 13:03 alukoshko Status new => confirmed
2023-12-03 23:55 alukoshko Note Added: 0001003
2023-12-04 02:51 alukoshko Note Added: 0001004
2023-12-04 02:51 alukoshko Status confirmed => resolved
2023-12-04 02:51 alukoshko Resolution open => fixed