Navi 21 amdgpu fatal Hardware Error on load
Brief summary of the problem:
After installing ROCm, loading the amdgpu module triggers a fatal PCIe Hardware Error in dmesg, and anything that attempts to access the GPU hangs indefinitely.
Hardware description:
- CPU: EPYC 7282
- GPU: 83:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800] [1002:73a3]
- System Memory: 128GB
System information:
- Distro name and Version: RHEL 9.1
- Kernel version: 5.14.0-162.6.1.el9_1.x86_64
- Custom kernel: N/A
- AMD official driver version: 5.4.50402-1528701.el9
How to reproduce the issue:
Install ROCm (latest version, 5.4.2):
amdgpu-install --usecase='rocm'
Load the amdgpu module:
modprobe amdgpu
Check dmesg:
...snip...
[Thu Jan 19 14:24:29 2023] Uhhuh. NMI received for unknown reason 2d on CPU 28.
[Thu Jan 19 14:24:29 2023] Do you have a strange power saving mode enabled?
[Thu Jan 19 14:24:29 2023] Dazed and confused, but trying to continue
[Thu Jan 19 14:24:29 2023] Do you have a strange power saving mode enabled?
[Thu Jan 19 14:24:29 2023] Dazed and confused, but trying to continue
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[Thu Jan 19 14:24:29 2023] amdgpu 0000:83:00.0: amdgpu: Using BACO for runtime pm
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: event severity: recoverable
[Thu Jan 19 14:24:29 2023] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: Error 0, type: fatal
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: section_type: PCIe error
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: port_type: 1, legacy PCI end point
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: version: 3.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: command: 0x0547, status: 0x4810
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: device_id: 0000:83:00.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: slot: 0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: secondary_bus: 0x00
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: vendor_id: 0x1002, device_id: 0x73a3
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: class_code: 000000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: aer_uncor_status: 0x00008000, aer_uncor_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_status: 0x00008000, aer_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: [15] CmpltAbrt (First)
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_layer=Transaction Layer, aer_agent=Completer ID
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: [drm] fb1: amdgpudrmfb frame buffer device
[Thu Jan 19 14:24:30 2023] [drm] PCI error: detected callback, state(2)!!
[Thu Jan 19 14:24:30 2023] snd_hda_intel 0000:83:00.1: AER: can't recover (no error_detected callback)
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000
[Thu Jan 19 14:24:30 2023] [drm:jpeg_v3_0_set_powergating_state.cold [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
[Thu Jan 19 14:24:30 2023] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v3_0> failed -110
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: Downstream Port link has been reset (0)
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: device recovery failed
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:34 param:0x00000001 message:SetWorkloadMask?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
[Thu Jan 19 14:24:33 2023] amdgpu 0000:83:00.0: amdgpu: Failed to disable gfxoff!
(full dmesg in attached tarball)
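For anyone triaging this, the AER values in the excerpt above can be decoded by hand. Below is a minimal sketch in Python, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register and the short names the Linux kernel prints (such as CmpltAbrt). The helper name decode_aer_uncor and the table AER_UNCOR_BITS are made up for illustration and are not part of any tool. The table lists only the commonly reported bits.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register,
# labelled with the short names the Linux kernel prints in dmesg.
# Illustrative subset only; not every defined bit is listed here.
AER_UNCOR_BITS = {
    4: "DLP",            # Data Link Protocol Error
    5: "SDES",           # Surprise Down Error
    12: "TLP",           # Poisoned TLP
    13: "FCP",           # Flow Control Protocol Error
    14: "CmpltTO",       # Completion Timeout
    15: "CmpltAbrt",     # Completer Abort
    16: "UnxCmplt",      # Unexpected Completion
    17: "RxOF",          # Receiver Overflow
    18: "MalfTLP",       # Malformed TLP
    19: "ECRC",          # ECRC Error
    20: "UnsupReq",      # Unsupported Request
    21: "ACSViol",       # ACS Violation
    22: "UncorrIntErr",  # Uncorrectable Internal Error
}


def decode_aer_uncor(status, mask, severity):
    """Return one dict per known error bit that is set in `status`.

    `masked` tells whether the same bit is set in the mask register,
    `fatal` whether the severity register classifies it as fatal.
    """
    findings = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            findings.append({
                "bit": bit,
                "name": name,
                "masked": bool(mask & flag),
                "fatal": bool(severity & flag),
            })
    return findings


# Values copied from the dmesg excerpt:
# aer_uncor_status, aer_uncor_mask, aer_uncor_severity
result = decode_aer_uncor(0x00008000, 0x00010000, 0x004EF030)
for finding in result:
    print(finding)
```

With the values from this log, the only status bit set is bit 15 (CmpltAbrt). It is not masked and is classified as fatal in the severity register, which matches the "[15] CmpltAbrt (First)" and "type: fatal" lines in dmesg.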
Attached files:
Log files (for system lockups / game freezes / crashes)
- Dmesg log (full log)
- dkms status
- dmidecode output
- lspci output
- lshw output