View Issue Details

ID: 0000371
Project: AlmaLinux-9
Category: kernel
View Status: public
Last Update: 2023-03-11 21:10
Reporter: John Gong
Assigned To:
Priority: high
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Summary: 0000371: kernel complains "hardware error" when boots up
Description: With the 5.14.0-162.6.1.el9 and 5.14.0-162.12.1.el9 kernels shipped with AlmaLinux 9.1, an Ampere Altra machine prints the following errors at boot:

-------------------------------------------------------------------------------

Tue Feb 7 09:45:21 CST 2023 [ 16.412276] {1}[Hardware Error]: event severity: recoverable

Tue Feb 7 09:45:21 CST 2023 [ 16.417922] {1}[Hardware Error]: Error 0, type: recoverable

Tue Feb 7 09:45:21 CST 2023 [ 16.423569] {1}[Hardware Error]: section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f

Tue Feb 7 09:45:21 CST 2023 [ 16.432169] {1}[Hardware Error]: section length: 0x30

Tue Feb 7 09:45:21 CST 2023 [ 16.437384] {1}[Hardware Error]: 00000000: 00000005 ec30000e 00080110 80001001 ......0.........

Tue Feb 7 09:45:21 CST 2023 [ 16.446330] {1}[Hardware Error]: 00000010: 00000300 00000000 00000000 00000000 ................

Tue Feb 7 09:45:21 CST 2023 [ 16.455274] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
-------------------------------------------------------------------------------
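
To check whether another machine reports the same record, the section type GUID can be pulled out of the kernel log. The following is a minimal sketch, not part of the original report: the function name hw_error_section_guids is hypothetical, and it assumes the log lines have the same layout as the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report).
# Reads kernel log text on stdin and prints each distinct section-type GUID
# found in "[Hardware Error]" lines, one per line.
hw_error_section_guids() {
    grep -F '[Hardware Error]' |
        sed -n 's/.*section type: [^,]*, \([0-9a-fA-F-]\{36\}\).*/\1/p' |
        sort -u
}
```

It could be fed from the running system with `dmesg | hw_error_section_guids` or `journalctl -k -b | hw_error_section_guids`. Empty output means no line with a "section type:" field was found.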

This error was introduced by the build at https://kojihub.stream.centos.org/koji/buildinfo?buildID=23939 and still exists in the latest CentOS kernel.
Steps To Reproduce: Boot an Ampere Altra machine with kernel 5.14.0-162.6.1.el9 or 5.14.0-162.12.1.el9; the error messages are printed during boot.
Tags: kernel

Activities

toracat

2023-02-22 18:53

reporter   ~0000823

@John Gong

This is just to confirm ... That link points to kernel-5.14.0-136.el9. So kernel-5.14.0-135.el9 did not have the problem reported here?

John Gong

2023-02-27 03:14

reporter   ~0000826

Yes, kernel-5.14.0-135.el9 does not have this issue.
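
The build boundary confirmed above (135 clean, 136 affected) can be turned into a quick check on a kernel release string. This is only a sketch under stated assumptions: the function name el9_build_has_regression is hypothetical, it applies only to el9 5.14.0 kernels, and it assumes every build from 136 onward still carries the change, which this thread establishes only for the builds mentioned in it.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread).
# Exit status: 0 if the el9 5.14.0 build number is 136 or higher,
#              1 if it is lower,
#              2 if the string is not an el9 5.14.0 release.
el9_build_has_regression() {
    release=$1
    case $release in
        5.14.0-*el9*) ;;
        *) return 2 ;;
    esac
    build=${release#5.14.0-}   # e.g. "162.6.1.el9"
    build=${build%%.*}         # e.g. "162"
    [ "$build" -ge 136 ]
}
```

On a running system it could be called as `el9_build_has_regression "$(uname -r)"`.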

toracat

2023-03-02 02:23

reporter   ~0000828

@John Gong

To test if the fix for the bug has been added to the upstream kernel from kernel.org, can you do a test-install of kernel-ml from elrepo?

# dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
# dnf --enablerepo=elrepo-kernel install kernel-ml

That will install the latest kernel (6.2.1 at the moment). Please note that kernel-ml does not work if secure boot is enabled.
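
Because kernel-ml does not boot with secure boot enabled, it may help to check the secure boot state before installing. A minimal sketch follows; the function name secure_boot_enabled is hypothetical, and it assumes the mokutil tool is installed and that `mokutil --sb-state` reports the state as a line starting with "SecureBoot enabled" or "SecureBoot disabled".

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread).
# $1: the text printed by `mokutil --sb-state`.
# Exit status 0 if that text reports secure boot as enabled, non-zero otherwise.
secure_boot_enabled() {
    printf '%s\n' "$1" | grep -qi '^SecureBoot enabled'
}
```

A possible use is `secure_boot_enabled "$(mokutil --sb-state)" && echo "Disable secure boot before booting kernel-ml"`.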

John Gong

2023-03-03 02:23

reporter   ~0000830

Sorry for the late reply.
Red Hat has already identified the upstream commit that was backported to 5.14.0-162.6.1.el9 and causes this issue.
The upstream commit ID is c733ebb7cb67dfb146a07c0ae329a0de9ec52f36, and it is still present in the latest upstream kernel.
So the next step is to work with the Linux kernel developers to fix this issue. I will report this bug to LKML.
Thanks!

toracat

2023-03-03 02:41

reporter   ~0000831

Thank you for sharing the info. Glad to learn the problematic commit has been identified.

Hope the kernel devs can fix the issue quickly.

toracat

2023-03-04 19:51

reporter   ~0000832

Ref: https://lkml.org/lkml/2023/3/2/892
"Error reports at boot time in Ampere Altra machines since c733ebb7c"

toracat

2023-03-11 21:10

reporter   ~0000839

Some additional notes from Darren Hart (Ampere):

"Just to give a bit more detail here, these messages look scary, but they are benign as the error is managed by hardware and has no adverse effects to software other than the severe looking messages reported."

Issue History

Date Modified Username Field Change
2023-02-21 16:30 John Gong New Issue
2023-02-21 16:30 John Gong Tag Attached: kernel
2023-02-22 18:53 toracat Note Added: 0000823
2023-02-27 03:14 John Gong Note Added: 0000826
2023-02-28 19:47 alukoshko Description Updated
2023-03-02 02:23 toracat Note Added: 0000828
2023-03-03 02:23 John Gong Note Added: 0000830
2023-03-03 02:41 toracat Note Added: 0000831
2023-03-04 19:51 toracat Note Added: 0000832
2023-03-11 21:10 toracat Note Added: 0000839