[navi2][5.10.20] amdgpu module crash on RX 6900 XT card
Hardware
6900 XT, IBM POWER9
Software
- Fedora 33 (kernel 5.10.20, ppc64le, 64K page size) with amdgpu (SMU firmware 58.49.0)
Context
Running modprobe amdgpu yields the following error in dmesg:
[ 263.680735] [drm] amdgpu kernel modesetting enabled.
[ 263.682186] CRAT table error: (null)
[ 263.682187] DSDT table not found for OEM information
[ 263.682189] IO link not available for non x86 platforms
[ 263.682190] Virtual CRAT table created for CPU
[ 263.682199] amdgpu: Topology: Add CPU node
[ 263.683458] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
[ 263.683472] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
[ 263.683476] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 263.683489] [drm] register mmio base: 0x80000000
[ 263.683491] [drm] register mmio size: 1048576
[ 263.683493] [drm] PCI I/O BAR is not found.
[ 263.683505] [drm] PCIE atomic ops is not supported
[ 263.685953] [drm] add ip block number 0 <nv_common>
[ 263.685955] [drm] add ip block number 1 <gmc_v10_0>
[ 263.685957] [drm] add ip block number 2 <navi10_ih>
[ 263.685958] [drm] add ip block number 3 <psp>
[ 263.685960] [drm] add ip block number 4 <smu>
[ 263.685962] [drm] add ip block number 5 <gfx_v10_0>
[ 263.685963] [drm] add ip block number 6 <sdma_v5_2>
[ 263.685965] [drm] add ip block number 7 <vcn_v3_0>
[ 263.685966] [drm] add ip block number 8 <jpeg_v3_0>
[ 263.717433] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 263.717437] amdgpu: ATOM BIOS: 113-E438XTX-UO2
[ 263.717449] [drm] VCN(0) decode is enabled in VM mode
[ 263.717450] [drm] VCN(1) decode is enabled in VM mode
[ 263.717452] [drm] VCN(0) encode is enabled in VM mode
[ 263.717453] [drm] VCN(1) encode is enabled in VM mode
[ 263.717456] [drm] JPEG decode is enabled in VM mode
[ 263.717463] [drm] GPU posting now...
[ 263.717519] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
[ 263.717523] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
[ 263.717530] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 263.717575] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 263.717580] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 263.717615] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x600403fffffff 64bit pref]
[ 263.717620] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 263.717624] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 263.717638] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 263.717645] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 263.717649] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 263.717655] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
[ 263.717667] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
[ 263.717680] pci 0001:00:00.0: PCI bridge to [bus 01-03]
[ 263.717687] pci 0001:00:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 263.717692] pci 0001:00:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 263.717699] pci 0001:01:00.0: PCI bridge to [bus 02-03]
[ 263.717708] pci 0001:01:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 263.717713] pci 0001:01:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 263.717720] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 263.717727] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 263.717732] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 263.717747] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 263.717751] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 263.717755] [drm] Detected VRAM RAM=16368M, BAR=16384M
[ 263.717757] [drm] RAM width 256bits GDDR6
[ 263.717820] [drm] amdgpu: 16368M of VRAM memory ready
[ 263.717827] [drm] amdgpu: 16368M of GTT memory ready.
[ 263.717838] [drm] GART: num cpu pages 8192, num gpu pages 131072
[ 263.717950] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 272.048495] [drm] use_doorbell being set to: [true]
[ 272.048552] [drm] use_doorbell being set to: [true]
[ 272.048605] [drm] use_doorbell being set to: [true]
[ 272.048662] [drm] use_doorbell being set to: [true]
[ 272.048976] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
[ 272.048986] [drm] PSP loading VCN firmware
[ 272.273424] [drm] reserve 0xa00000 from 0x83fe000000 for PSP TMR
[ 272.943503] amdgpu 0001:03:00.0: amdgpu: smu driver if version = 0x00000039, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
[ 272.943507] amdgpu 0001:03:00.0: amdgpu: SMU driver if version not matched
[ 272.943517] amdgpu 0001:03:00.0: amdgpu: use vbios provided pptable
[ 273.018737] amdgpu 0001:03:00.0: amdgpu: SMU is initialized successfully!
[ 273.023894] [drm] kiq ring mec 2 pipe 1 q 0
[ 273.085574] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 273.085784] [drm] JPEG decode initialized successfully.
[ 273.086032] kfd kfd: Allocated 3969056 bytes on gart
[ 273.086334] Virtual CRAT table created for GPU
[ 273.086837] amdgpu: Topology: Add dGPU node [0x73bf:0x1002]
[ 273.086845] kfd kfd: added device 1002:73bf
[ 273.086850] amdgpu 0001:03:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 80
[ 273.087044] amdgpu 0001:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 273.087048] amdgpu 0001:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 273.087051] amdgpu 0001:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 273.087055] amdgpu 0001:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 273.087058] amdgpu 0001:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 273.087062] amdgpu 0001:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 273.087065] amdgpu 0001:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 273.087069] amdgpu 0001:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 273.087072] amdgpu 0001:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 273.087076] amdgpu 0001:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 273.087079] amdgpu 0001:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 273.087083] amdgpu 0001:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 273.087086] amdgpu 0001:03:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
[ 273.087089] amdgpu 0001:03:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
[ 273.087093] amdgpu 0001:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 273.087096] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 273.087100] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 273.087103] amdgpu 0001:03:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
[ 273.087106] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
[ 273.087110] amdgpu 0001:03:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
[ 273.087113] amdgpu 0001:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
[ 273.094373] EEH: Recovering PHB#1-PE#0
[ 273.094380] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
[ 273.094385] EEH: Frozen PHB#1-PE#0 detected
[ 273.094386] EEH: Call Trace:
[ 273.094393] EEH: [0000000088d68852] __eeh_send_failure_event+0x7c/0x160
[ 273.094396] EEH: [0000000053433783] eeh_dev_check_failure.part.0+0x254/0x5e0
[ 273.094499] EEH: [000000000f3ba7f6] amdgpu_device_rreg+0x180/0x210 [amdgpu]
[ 273.094627] EEH: [0000000069e7642c] mmhub_v2_0_set_clockgating+0x1f8/0x320 [amdgpu]
[ 273.094738] EEH: [00000000a554a501] gmc_v10_0_set_clockgating_state+0x44/0xb0 [amdgpu]
[ 273.094841] EEH: [0000000063a011e7] amdgpu_device_ip_late_init+0x150/0x7d0 [amdgpu]
[ 273.094947] EEH: [00000000294ed418] amdgpu_device_init+0x19a8/0x1fc0 [amdgpu]
[ 273.095051] EEH: [00000000273acd85] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
[ 273.095153] EEH: [00000000f91deff0] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
[ 273.095158] EEH: [0000000028f6d7d4] local_pci_probe+0x68/0x110
[ 273.095161] EEH: [00000000b5bc188e] work_for_cpu_fn+0x38/0x60
[ 273.095163] EEH: [00000000bf267e16] process_one_work+0x300/0x5d0
[ 273.095166] EEH: [00000000ac280537] worker_thread+0x360/0x780
[ 273.095170] EEH: [00000000409ee3ee] kthread+0x1e4/0x1f0
[ 273.095176] EEH: [000000001c930e8a] ret_from_kernel_thread+0x5c/0x6c
[ 273.095178] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 273.095180] EEH: Notify device drivers to shutdown
[ 273.095185] EEH: Beginning: 'error_detected(IO frozen)'
[ 273.356962] [drm] Initialized amdgpu 3.40.0 20150101 for 0001:03:00.0 on minor 1
[ 273.357162] PCI 0001:03:00.0#0000: EEH: Invoking amdgpu->error_detected(IO frozen)
[ 273.357165] [drm] PCI error: detected callback, state(2)!!
[ 273.357588] PCI 0001:03:00.0#0000: EEH: amdgpu driver reports: 'need reset'
[ 273.357593] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
[ 273.357595] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 273.357601] EEH: Collect temporary log
[ 273.357639] EEH: of node=0001:03:00.0
[ 273.357642] EEH: PCI device/vendor: 73bf1002
[ 273.357644] EEH: PCI cmd/status register: 00100546
[ 273.357646] EEH: PCI-E capabilities and status follow:
[ 273.357656] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 273.357664] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 273.357665] EEH: PCI-E 20: 00000000
[ 273.357667] EEH: PCI-E AER capability register set follows:
[ 273.357676] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030
[ 273.357684] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 273.357691] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 273.357695] EEH: PCI-E AER 30: 00000000 00000000
[ 273.357697] EEH: of node=0001:03:00.1
[ 273.357700] EEH: PCI device/vendor: ab281002
[ 273.357703] EEH: PCI cmd/status register: 00100546
[ 273.357704] EEH: PCI-E capabilities and status follow:
[ 273.357713] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 273.357721] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 273.357722] EEH: PCI-E 20: 00000000
[ 273.357724] EEH: PCI-E AER capability register set follows:
[ 273.357733] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030
[ 273.357740] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 273.357748] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 273.357751] EEH: PCI-E AER 30: 00000000 00000000
[ 273.357754] PHB4 PHB#1 Diag-data (Version: 1)
[ 273.357755] brdgCtl: 00000002
[ 273.357757] RootSts: 00000020 00402000 a0440008 00100107 00001000
[ 273.357759] RootErrSts: 00000000 00008000 00000000
[ 273.357761] PhbSts: 0000001c00000000 0000001c00000000
[ 273.357762] Lem: 0000000100280000 0000000000000000 0000000100000000
[ 273.357764] PhbErr: 0000088000000000 0000008000000000 2148000098000240 a008400000000000
[ 273.357766] RxeArbErr: 8000200000000000 0000200000000000 00009fde30000000 0000000000000000
[ 273.357768] PblErr: 0000000008000000 0000000008000000 0000000000000000 0000000000000000
[ 273.357770] PcieDlp: 0000000000000000 0000000000000000 b000000000000000
[ 273.357771] RegbErr: 0000004000000000 0000004000000000 4800003c00000000 0000000000000200
[ 273.357773] PE[000] A/B: a480002a03000000 8000000000000000
[ 273.357776] EEH: Reset without hotplug activity
[ 273.357779] EEH: Removing 0001:03:00.1 without EEH sensitive driver
[ 273.463561] amdgpu 0001:03:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[ 273.463564] amdgpu 0001:03:00.0: amdgpu: Failed to enable gfxoff!
[ 273.488713] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
[ 273.948759] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
[ 274.353721] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
[ 274.353738] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.353755] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.353769] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.353782] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.353795] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.353807] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 274.389593] [drm] Register(0) [mmUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000
[ 274.389649] [drm:jpeg_v3_0_set_powergating_state [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
[ 274.389694] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v3_0> failed -110
[ 274.403707] amdgpu 0001:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
[ 274.403771] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
[ 274.625435] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[ 274.861011] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[ 275.097223] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[ 275.332748] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[ 275.568688] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[ 275.804270] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[ 275.804277] amdgpu 0001:03:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[ 275.804279] amdgpu 0001:03:00.0: amdgpu: Failed to power gate VCN!
[ 275.804336] [drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -5.
[ 276.244073] pci 0001:03:00.1: Removing from iommu group 1
[ 278.395265] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
[ 278.401960] EEH: Sleep 5s ahead of partial hotplug
[ 283.434989] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
[ 283.435009] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
[ 283.435067] pci 0001:03:00.1: BAR0 [mem size 0x00004000]: requesting alignment to 0x10000
[ 283.435131] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
[ 283.435698] pci 0001:03:00.1: can't claim BAR 0 [mem size 0x00004000]: no address assigned
[ 283.435706] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
[ 283.435711] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 283.435716] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 283.435720] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 283.435731] pci 0001:03:00.1: Added to existing PE#0
[ 283.435738] pci 0001:03:00.1: Adding to iommu group 1
[ 283.435833] pci 0001:03:00.1: D0 power state depends on 0001:03:00.0
[ 283.435903] snd_hda_intel 0001:03:00.1: enabling device (0140 -> 0142)
[ 283.435912] snd_hda_intel 0001:03:00.1: Force to snoop mode by module option
[ 283.435956] EEH: Beginning: 'slot_reset'
[ 283.435961] PCI 0001:03:00.0#0000: EEH: Invoking amdgpu->slot_reset()
[ 283.435963] [drm] PCI error: slot reset callback!!
[ 283.442319] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input11
[ 283.442436] input: HDA ATI HDMI HDMI/DP,pcm=7 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input12
[ 283.442513] input: HDA ATI HDMI HDMI/DP,pcm=8 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input13
[ 283.442587] input: HDA ATI HDMI HDMI/DP,pcm=9 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input14
[ 283.442658] input: HDA ATI HDMI HDMI/DP,pcm=10 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input15
[ 283.442730] input: HDA ATI HDMI HDMI/DP,pcm=11 as /devices/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:00.0/0001:03:00.1/sound/card0/input16
[ 284.283468] [drm] free PSP TMR buffer
[ 284.304489] amdgpu 0001:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 284.304576] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 284.304600] [drm] VRAM is lost due to GPU reset!
[ 284.305078] [drm] PSP is resuming...
[ 284.544795] [drm] reserve 0xa00000 from 0x83fe000000 for PSP TMR
[ 285.204874] amdgpu 0001:03:00.0: amdgpu: SMU is resuming...
[ 285.204882] amdgpu 0001:03:00.0: amdgpu: smu driver if version = 0x00000039, smu fw if version = 0x0000003b, smu fw version = 0x003a3100 (58.49.0)
[ 285.204885] amdgpu 0001:03:00.0: amdgpu: SMU driver if version not matched
[ 285.275239] amdgpu 0001:03:00.0: amdgpu: failed send message: GetDpmFreqByIndex (31) param: 0x000500ff response 0xfffffffb
[ 285.275242] amdgpu 0001:03:00.0: amdgpu: [smu_v11_0_set_single_dpm_table] failed to get dpm levels!
[ 285.275244] amdgpu 0001:03:00.0: amdgpu: Failed to setup default dpm clock tables!
[ 285.275246] amdgpu 0001:03:00.0: amdgpu: Failed to setup default dpm clock tables!
[ 285.275248] amdgpu 0001:03:00.0: amdgpu: Failed to setup smc hw!
[ 285.275315] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -5
[ 285.275397] [drm:amdgpu_pci_slot_reset [amdgpu]] *ERROR* PCIe error recovery failed, err:-5
[ 285.275401] PCI 0001:03:00.0#0000: EEH: amdgpu driver reports: 'disconnect'
[ 285.275406] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
[ 285.275408] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[ 285.275410] EEH: Unable to recover from failure from PHB#1-PE#0.
Please try reseating or replacing it
[ 285.275455] EEH: of node=0001:03:00.0
[ 285.275458] EEH: PCI device/vendor: 73bf1002
[ 285.275461] EEH: PCI cmd/status register: 00100546
[ 285.275463] EEH: PCI-E capabilities and status follow:
[ 285.275474] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 285.275483] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 285.275484] EEH: PCI-E 20: 00000000
[ 285.275486] EEH: PCI-E AER capability register set follows:
[ 285.275496] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030
[ 285.275505] EEH: PCI-E AER 10: 00000000 00002000 000001f4 60008002
[ 285.275513] EEH: PCI-E AER 20: 000000ff 00060044 00000458 00000000
[ 285.275517] EEH: PCI-E AER 30: 00000000 00000000
[ 285.275520] EEH: of node=0001:03:00.1
[ 285.275522] EEH: PCI device/vendor: ab281002
[ 285.275525] EEH: PCI cmd/status register: 00100546
[ 285.275527] EEH: PCI-E capabilities and status follow:
[ 285.275537] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 285.275545] EEH: PCI-E 10: 11040000 00000000 00000000 00000000
[ 285.275547] EEH: PCI-E 20: 00000000
[ 285.275548] EEH: PCI-E AER capability register set follows:
[ 285.275558] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030
[ 285.275567] EEH: PCI-E AER 10: 00000000 00002000 000001f4 60008002
[ 285.275575] EEH: PCI-E AER 20: 000000ff 00060044 00000458 00000000
[ 285.275579] EEH: PCI-E AER 30: 00000000 00000000
[ 285.275581] PHB4 PHB#1 Diag-data (Version: 1)
[ 285.275582] brdgCtl: 00000002
[ 285.275585] RootSts: 00000020 00402000 a0440008 00100107 00005000
[ 285.275587] RootErrSts: 00000024 00008000 00000000
[ 285.275588] sourceId: 03010000
[ 285.275590] PhbSts: 0000001c00000000 0000001c00000000
[ 285.275592] Lem: 0000000104280000 0000000000000000 0000000100000000
[ 285.275594] PhbErr: 0000088000000000 0000008000000000 2148000098000240 a008400000000000
[ 285.275596] RxeArbErr: 8000200000000020 0000200000000000 00009fde30000000 0000000000000000
[ 285.275598] PblErr: 0000000008000000 0000000008000000 0000000000000000 0000000000000000
[ 285.275600] PcieDlp: 0000000000000000 0000000000000000 b000000000000000
[ 285.275602] RegbErr: 0000004000000000 0000004000000000 4800003c00000000 0000000000000200
[ 285.275604] PE[000] A/B: a480002a03000000 8000000000000000
[ 285.275607] EEH: Beginning: 'error_detected(permanent failure)'
[ 285.275610] PCI 0001:03:00.0#0000: EEH: not actionable (1,1,1)
[ 285.275613] PCI 0001:03:00.1#0000: EEH: not actionable (1,1,1)
[ 285.275615] EEH: Finished:'error_detected(permanent failure)'
[ 286.001810] pci 0001:03:00.1: Removing from iommu group 1
[ 286.001983] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Hotplug removal is not supported
[ 286.002383] amdgpu 0001:03:00.0: amdgpu: amdgpu: finishing device.
[ 290.430911] amdgpu: cp queue pipe 4 queue 0 preemption failed
[ 290.871333] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 290.871376] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 291.201813] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 291.201876] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 292.408325] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 292.408380] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 292.848782] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 292.848846] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 293.179174] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 293.179217] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 293.179225] [drm] free PSP TMR buffer
[ 293.513528] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d8d7cfd5; ring_buffer_end = 000000004bc2dd70; write_frame = 00000000415de82c
[ 293.513593] [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
[ 297.650869] BUG: Unable to handle kernel data access on read at 0xf0a803030303a898
[ 297.650872] Faulting instruction address: 0xc000000000cc8298
[ 297.650875] Oops: Kernel access of bad area, sig: 11 [#1]
[ 297.650877] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 297.650879] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter rfkill ip6_tables iptable_filter snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf ipmi_msghandler powernv_flash snd_timer mtd rtc_opal snd opal_prd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper i2c_algo_bit ttm drm_kms_helper syscopyarea
[ 297.650935] sysfillrect sysimgblt fb_sys_fops cec drm tg3 vmx_crypto i2c_core crc32c_vpmsum drm_panel_orientation_quirks nvme nvme_core sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi fuse scsi_transport_iscsi
[ 297.650959] CPU: 23 PID: 177 Comm: eehd Not tainted 5.10.20-200.fc33.ppc64le #1
[ 297.650961] NIP: c000000000cc8298 LR: c000000000cc8bb0 CTR: c000000000cc8b30
[ 297.650963] REGS: c000000010e67630 TRAP: 0380 Not tainted (5.10.20-200.fc33.ppc64le)
[ 297.650965] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 84002822 XER: 00000000
[ 297.650973] CFAR: c000000000cc8bac IRQMASK: 0
GPR00: c000000000cc8bb0 c000000010e678c0 c0000000023dc800 f0a803030303a880
GPR04: 00000000000000c0 00000000c0000000 c00000000303a830 c00000000171f338
GPR08: 003ffff800000201 c00000000171f338 c008000004190000 c008000005f28338
GPR12: c000000000cc8b30 c000000fff6e7000 c0000000001af288 c000000010c704c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 c00000001ee96d90 c00000001ee85b70 c00000001ee85b90
GPR24: c00000001ee85b98 c00000001ee85b88 0000000000000000 c0080000060c8dc8
GPR28: 0000000000000003 0000000000000000 c00000001ee80000 f0a803030303a880
[ 297.651005] NIP [c000000000cc8298] free_fw_priv+0x28/0x280
[ 297.651007] LR [c000000000cc8bb0] release_firmware+0x80/0xe0
[ 297.651009] Call Trace:
[ 297.651011] [c000000010e67930] [c000000000cc8bb0] release_firmware+0x80/0xe0
[ 297.651062] [c000000010e67960] [c008000005b96b48] psp_sw_fini+0x90/0x120 [amdgpu]
[ 297.651116] [c000000010e679a0] [c008000005f1fe48] amdgpu_device_fini+0x3d0/0x630 [amdgpu]
[ 297.651151] [c000000010e67a60] [c008000005acce70] amdgpu_driver_unload_kms+0x1c8/0x330 [amdgpu]
[ 297.651185] [c000000010e67aa0] [c008000005ac08bc] amdgpu_pci_remove+0x64/0xa0 [amdgpu]
[ 297.651189] [c000000010e67b10] [c000000000b3c158] pci_device_remove+0x68/0x120
[ 297.651192] [c000000010e67b50] [c000000000c93688] device_release_driver_internal+0x2f8/0x410
[ 297.651195] [c000000010e67ba0] [c000000000b26668] pci_stop_and_remove_bus_device+0xb8/0x110
[ 297.651198] [c000000010e67be0] [c0000000000732f0] pci_hp_remove_devices+0x90/0x130
[ 297.651201] [c000000010e67c70] [c00000000004e9c0] eeh_handle_normal_event+0x510/0xa40
[ 297.651203] [c000000010e67d50] [c00000000004fdd8] eeh_event_handler+0x118/0x1a0
[ 297.651206] [c000000010e67db0] [c0000000001af464] kthread+0x1e4/0x1f0
[ 297.651208] [c000000010e67e20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
[ 297.651210] Instruction dump:
[ 297.651212] 60000000 4bffffd8 3c4c0171 38424590 7c0802a6 60000000 7c0802a6 fbe1fff8
[ 297.651218] fbc1fff0 7c7f1b78 f8010010 f821ff91 <ebc30018> 7fc3f378 48601309 60000000
[ 297.651226] ---[ end trace 87a3804e7d686ea3 ]---
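In a log like the one above, the amdgpu entries in the EEH call trace are the most useful frames, since they show which driver path was performing the register read when the frozen PE was detected. Below is a minimal sketch for pulling them out, assuming the dmesg output has been saved as text. The helper name eeh_frames and the regular expression are illustrative assumptions, not part of any kernel or distribution tooling.

```python
import re

# Matches EEH backtrace lines such as:
# [ 273.094499] EEH: [000000000f3ba7f6] amdgpu_device_rreg+0x180/0x210 [amdgpu]
# The trailing "[module]" part is optional (core kernel frames have none).
EEH_FRAME = re.compile(
    r"EEH: \[[0-9a-f]+\] (?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def eeh_frames(dmesg_text, module="amdgpu"):
    """Return symbol names of EEH call-trace frames owned by `module`, in log order."""
    frames = []
    for line in dmesg_text.splitlines():
        m = EEH_FRAME.search(line)
        if m and m.group("module") == module:
            frames.append(m.group("symbol"))
    return frames


# A few lines copied from the log above.
sample = """\
[ 273.094393] EEH: [0000000088d68852] __eeh_send_failure_event+0x7c/0x160
[ 273.094396] EEH: [0000000053433783] eeh_dev_check_failure.part.0+0x254/0x5e0
[ 273.094499] EEH: [000000000f3ba7f6] amdgpu_device_rreg+0x180/0x210 [amdgpu]
[ 273.094627] EEH: [0000000069e7642c] mmhub_v2_0_set_clockgating+0x1f8/0x320 [amdgpu]
"""

print(eeh_frames(sample))
# ['amdgpu_device_rreg', 'mmhub_v2_0_set_clockgating']
```

Running the same helper over two complete logs makes it easy to compare the code path in which each freeze was detected.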
I suspected that the firmware might not be loaded correctly when the kernel page size is 64K, so I tried again with a custom 4K page size kernel, but the result is the same:
[ 69.457441] amdgpu: Topology: Add CPU node
[ 69.458707] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
[ 69.458717] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
[ 69.458720] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 69.458732] [drm] register mmio base: 0x80000000
[ 69.458733] [drm] register mmio size: 1048576
[ 69.458735] [drm] PCI I/O BAR is not found.
[ 69.458744] [drm] PCIE atomic ops is not supported
[ 69.461020] [drm] add ip block number 0 <nv_common>
[ 69.461022] [drm] add ip block number 1 <gmc_v10_0>
[ 69.461023] [drm] add ip block number 2 <navi10_ih>
[ 69.461025] [drm] add ip block number 3 <psp>
[ 69.461026] [drm] add ip block number 4 <smu>
[ 69.461028] [drm] add ip block number 5 <gfx_v10_0>
[ 69.461029] [drm] add ip block number 6 <sdma_v5_2>
[ 69.461031] [drm] add ip block number 7 <vcn_v3_0>
[ 69.461032] [drm] add ip block number 8 <jpeg_v3_0>
[ 69.492308] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 69.492311] amdgpu: ATOM BIOS: 113-E438XTX-UO2
[ 69.492324] [drm] VCN(0) decode is enabled in VM mode
[ 69.492325] [drm] VCN(1) decode is enabled in VM mode
[ 69.492327] [drm] VCN(0) encode is enabled in VM mode
[ 69.492328] [drm] VCN(1) encode is enabled in VM mode
[ 69.492330] [drm] JPEG decode is enabled in VM mode
[ 69.492336] [drm] GPU posting now...
[ 69.492367] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
[ 69.492370] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
[ 69.492374] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 69.492401] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 69.492404] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 69.492432] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x600403fffffff 64bit pref]
[ 69.492435] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 69.492438] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 69.492447] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 69.492451] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 69.492454] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 69.492458] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
[ 69.492467] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
[ 69.492477] pci 0001:00:00.0: PCI bridge to [bus 01-03]
[ 69.492482] pci 0001:00:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 69.492485] pci 0001:00:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 69.492490] pci 0001:01:00.0: PCI bridge to [bus 02-03]
[ 69.492495] pci 0001:01:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 69.492499] pci 0001:01:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 69.492504] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 69.492509] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 69.492512] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 69.492523] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 69.492526] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 69.492529] [drm] Detected VRAM RAM=16368M, BAR=16384M
[ 69.492531] [drm] RAM width 256bits GDDR6
[ 69.492572] [drm] amdgpu: 16368M of VRAM memory ready
[ 69.492577] [drm] amdgpu: 16368M of GTT memory ready.
[ 69.492585] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 69.499431] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 69.500569] EEH: Recovering PHB#1-PE#0
[ 69.500574] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
[ 69.500576] EEH: Frozen PHB#1-PE#0 detected
[ 69.500578] EEH: Call Trace:
[ 69.500583] EEH: [00000000d9e7d323] __eeh_send_failure_event+0x7c/0x160
[ 69.500588] EEH: [00000000d61ba426] eeh_dev_check_failure.part.0+0x254/0x5e0
[ 69.500693] EEH: [0000000061d1df81] amdgpu_device_rreg+0x180/0x210 [amdgpu]
[ 69.500803] EEH: [00000000ed1fb3ed] gfxhub_v2_1_set_fault_enable_default+0x68/0x150 [amdgpu]
[ 69.500913] EEH: [000000001cce1aab] gmc_v10_0_hw_init+0x198/0x290 [amdgpu]
[ 69.501014] EEH: [0000000009744e54] amdgpu_device_init+0x1a74/0x1fc0 [amdgpu]
[ 69.501110] EEH: [000000005aac3e93] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
[ 69.501204] EEH: [0000000044cf3143] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
[ 69.501208] EEH: [00000000827393ff] local_pci_probe+0x68/0x110
[ 69.501211] EEH: [00000000e5937af3] work_for_cpu_fn+0x38/0x60
[ 69.501214] EEH: [0000000027a7f486] process_one_work+0x300/0x5d0
[ 69.501217] EEH: [0000000041c5aee3] worker_thread+0x360/0x780
[ 69.501219] EEH: [00000000787f3030] kthread+0x1e4/0x1f0
[ 69.501222] EEH: [0000000021927c95] ret_from_kernel_thread+0x5c/0x6c
[ 69.501224] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 69.501225] EEH: Notify device drivers to shutdown
[ 69.501228] EEH: Beginning: 'error_detected(IO frozen)'
[ 69.516456] [drm] use_doorbell being set to: [true]
[ 69.516536] [drm] use_doorbell being set to: [true]
[ 69.516639] [drm] use_doorbell being set to: [true]
[ 69.516739] [drm] use_doorbell being set to: [true]
[ 69.518119] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
[ 69.518135] [drm] PSP loading VCN firmware
[ 69.784609] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[ 69.784671] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 69.784725] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 69.784727] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 69.784738] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
[ 69.785890] amdgpu: probe of 0001:03:00.0 failed with error -22
[ 69.785920] PCI 0001:03:00.0#0000: EEH: no driver
[ 69.785923] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
[ 69.785926] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 69.785931] EEH: Collect temporary log
[ 69.785972] EEH: of node=0001:03:00.0
[ 69.785976] EEH: PCI device/vendor: 73bf1002
[ 69.785979] EEH: PCI cmd/status register: 00100542
[ 69.785980] EEH: PCI-E capabilities and status follow:
[ 69.785991] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 69.786000] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 69.786002] EEH: PCI-E 20: 00000000
[ 69.786003] EEH: PCI-E AER capability register set follows:
[ 69.786014] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030
[ 69.786023] EEH: PCI-E AER 10: 00000000 00002000 000001f4 40008001
[ 69.786033] EEH: PCI-E AER 20: 0000000f 8007f000 00000000 00000000
[ 69.786036] EEH: PCI-E AER 30: 00000000 00000000
[ 69.786039] EEH: of node=0001:03:00.1
[ 69.786042] EEH: PCI device/vendor: ab281002
[ 69.786045] EEH: PCI cmd/status register: 00100546
[ 69.786046] EEH: PCI-E capabilities and status follow:
[ 69.786057] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 69.786065] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 69.786067] EEH: PCI-E 20: 00000000
[ 69.786070] EEH: PCI-E AER capability register set follows:
[ 69.786080] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030
[ 69.786089] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 69.786097] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 69.786101] EEH: PCI-E AER 30: 00000000 00000000
[ 69.786103] PHB4 PHB#1 Diag-data (Version: 1)
[ 69.786105] brdgCtl: 00000002
[ 69.786107] RootSts: 00000020 00402000 a0440008 00100107 00004000
[ 69.786109] RootErrSts: 00000024 00000000 00000000
[ 69.786110] sourceId: 03000000
[ 69.786112] PhbSts: 0000001c00000000 0000001c00000000
[ 69.786114] Lem: 0000000004000000 0000000000000000 0000000004000000
[ 69.786116] PhbErr: 0000080000000000 0000080000000000 2148000098000240 a008400000000000
[ 69.786120] RxeArbErr: 0000000000000020 0000000000000020 4000030000000000 0000000000000000
[ 69.786122] PcieDlp: 0000000000000000 0000000000000000 7000000000000000
[ 69.786126] PE[000] A/B: 8720002503000000 8000000000000000
[ 69.786128] EEH: Reset with hotplug activity
[ 69.930197] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
[ 70.400246] snd_hda_intel 0001:03:00.1: CORB reset timeout#2, CORBRP = 65535
[ 70.825252] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
[ 70.825264] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 70.825275] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 70.825283] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 70.825291] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 70.825299] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 70.825307] snd_hda_codec_hdmi hdaudioC0D0: HDMI ATI/AMD: no speaker allocation for ELD
[ 71.335457] pci 0001:03:00.1: Removing from iommu group 1
[ 71.335661] pci 0001:03:00.0: Removing from iommu group 1
[ 73.513323] EEH: Sleep 5s ahead of complete hotplug
[ 78.547139] pci 0001:03:00.0: [1002:73bf] type 00 class 0x030000
[ 78.547163] pci 0001:03:00.0: reg 0x10: [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 78.547175] pci 0001:03:00.0: reg 0x18: [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 78.547184] pci 0001:03:00.0: reg 0x20: [io 0x0000-0x00ff]
[ 78.547191] pci 0001:03:00.0: reg 0x24: [mem 0x600c080000000-0x600c0800fffff]
[ 78.547199] pci 0001:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[ 78.547330] pci 0001:03:00.0: PME# supported from D1 D2 D3hot D3cold
[ 78.547423] pci 0001:03:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0001:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 78.547495] pci 0001:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 78.547991] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
[ 78.548006] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
[ 78.548118] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
[ 78.548638] pci 0001:02:00.0: ASPM: current common clock configuration is inconsistent, reconfiguring
[ 78.548679] pci 0001:02:00.0: BAR 13: no space for [io size 0x1000]
[ 78.548681] pci 0001:02:00.0: BAR 13: failed to assign [io size 0x1000]
[ 78.548686] pci 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 78.548696] pci 0001:03:00.0: BAR 2: assigned [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 78.548706] pci 0001:03:00.0: BAR 5: assigned [mem 0x600c080000000-0x600c0800fffff]
[ 78.548711] pci 0001:03:00.0: BAR 6: assigned [mem 0x600c080100000-0x600c08011ffff pref]
[ 78.548713] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
[ 78.548718] pci 0001:03:00.0: BAR 4: no space for [io size 0x0100]
[ 78.548720] pci 0001:03:00.0: BAR 4: failed to assign [io size 0x0100]
[ 78.548724] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 78.548728] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 78.548732] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.548736] PCI: No. 2 try to assign unassigned res
[ 78.548740] pci 0001:02:00.0: BAR 13: no space for [io size 0x1000]
[ 78.548743] pci 0001:02:00.0: BAR 13: failed to assign [io size 0x1000]
[ 78.548745] pci 0001:03:00.0: BAR 4: no space for [io size 0x0100]
[ 78.548748] pci 0001:03:00.0: BAR 4: failed to assign [io size 0x0100]
[ 78.548750] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 78.548755] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 78.548758] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.548770] pci 0001:03:00.0: Added to existing PE#0
[ 78.548776] pci 0001:03:00.0: Adding to iommu group 1
[ 78.548914] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
[ 78.548921] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
[ 78.548925] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 78.548937] [drm] register mmio base: 0x80000000
[ 78.548939] [drm] register mmio size: 1048576
[ 78.548940] [drm] PCI I/O BAR is not found.
[ 78.548947] [drm] PCIE atomic ops is not supported
[ 78.551169] [drm] add ip block number 0 <nv_common>
[ 78.551171] [drm] add ip block number 1 <gmc_v10_0>
[ 78.551173] [drm] add ip block number 2 <navi10_ih>
[ 78.551174] [drm] add ip block number 3 <psp>
[ 78.551176] [drm] add ip block number 4 <smu>
[ 78.551178] [drm] add ip block number 5 <gfx_v10_0>
[ 78.551180] [drm] add ip block number 6 <sdma_v5_2>
[ 78.551181] [drm] add ip block number 7 <vcn_v3_0>
[ 78.551183] [drm] add ip block number 8 <jpeg_v3_0>
[ 78.582437] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 78.582440] amdgpu: ATOM BIOS: 113-E438XTX-UO2
[ 78.582453] [drm] VCN(0) decode is enabled in VM mode
[ 78.582455] [drm] VCN(1) decode is enabled in VM mode
[ 78.582456] [drm] VCN(0) encode is enabled in VM mode
[ 78.582458] [drm] VCN(1) encode is enabled in VM mode
[ 78.582459] [drm] JPEG decode is enabled in VM mode
[ 78.582489] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
[ 78.582491] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
[ 78.582497] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 78.582522] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 78.582525] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 78.582552] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.582555] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 78.582558] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 78.582565] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.582568] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.582571] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.582574] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
[ 78.582584] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
[ 78.582593] pci 0001:00:00.0: PCI bridge to [bus 01-03]
[ 78.582597] pci 0001:00:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 78.582601] pci 0001:00:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 78.582606] pci 0001:01:00.0: PCI bridge to [bus 02-03]
[ 78.582611] pci 0001:01:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 78.582615] pci 0001:01:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 78.582620] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 78.582624] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 78.582628] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 78.582639] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 78.582642] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 78.582645] [drm] Detected VRAM RAM=16368M, BAR=16384M
[ 78.582647] [drm] RAM width 256bits GDDR6
[ 78.582826] [drm] amdgpu: 16368M of VRAM memory ready
[ 78.582831] [drm] amdgpu: 16368M of GTT memory ready.
[ 78.582839] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 78.589574] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 78.596296] [drm] use_doorbell being set to: [true]
[ 78.596663] [drm] use_doorbell being set to: [true]
[ 78.597025] [drm] use_doorbell being set to: [true]
[ 78.597450] [drm] use_doorbell being set to: [true]
[ 78.597861] [drm] Found VCN firmware Version ENC: 1.3 DEC: 2 VEP: 0 Revision: 17
[ 78.597869] [drm] PSP loading VCN firmware
[ 78.853223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[ 78.853269] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 78.853306] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 78.853309] amdgpu 0001:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 78.853319] amdgpu 0001:03:00.0: amdgpu: Fatal error during GPU init
[ 78.853350] amdgpu: probe of 0001:03:00.0 failed with error -22
[ 78.853354] pci 0001:03:00.1: Added to existing PE#0
[ 78.853359] pci 0001:03:00.1: Adding to iommu group 1
[ 78.853444] pci 0001:03:00.1: D0 power state depends on 0001:03:00.0
[ 78.853479] snd_hda_intel 0001:03:00.1: enabling device (0140 -> 0142)
[ 78.853484] snd_hda_intel 0001:03:00.1: Force to snoop mode by module option
[ 78.853504] EEH: Notify device driver to resume
[ 78.853506] EEH: Beginning: 'resume'
[ 78.853508] PCI 0001:03:00.0#0000: EEH: no driver
[ 78.853509] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
[ 78.853510] EEH: Finished:'resume'
[ 78.853511] EEH: Recovery successful.
[ 78.853514] EEH: Recovering PHB#1-PE#0
[ 78.853516] EEH: PE location: UOPWR.D100020-Node0-SLOT1 PCIE 4.0 X16, PHB location: N/A
[ 78.853517] EEH: Frozen PHB#1-PE#0 detected
[ 78.853518] EEH: Call Trace:
[ 78.853522] EEH: [00000000d9e7d323] __eeh_send_failure_event+0x7c/0x160
[ 78.853524] EEH: [00000000d61ba426] eeh_dev_check_failure.part.0+0x254/0x5e0
[ 78.853561] EEH: [0000000061d1df81] amdgpu_device_rreg+0x180/0x210 [amdgpu]
[ 78.853606] EEH: [00000000ed1fb3ed] gfxhub_v2_1_set_fault_enable_default+0x68/0x150 [amdgpu]
[ 78.853651] EEH: [000000001cce1aab] gmc_v10_0_hw_init+0x198/0x290 [amdgpu]
[ 78.853688] EEH: [0000000009744e54] amdgpu_device_init+0x1a74/0x1fc0 [amdgpu]
[ 78.853725] EEH: [000000005aac3e93] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
[ 78.853762] EEH: [0000000044cf3143] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
[ 78.853764] EEH: [00000000827393ff] local_pci_probe+0x68/0x110
[ 78.853766] EEH: [00000000e5937af3] work_for_cpu_fn+0x38/0x60
[ 78.853768] EEH: [0000000027a7f486] process_one_work+0x300/0x5d0
[ 78.853769] EEH: [0000000041c5aee3] worker_thread+0x360/0x780
[ 78.853770] EEH: [00000000787f3030] kthread+0x1e4/0x1f0
[ 78.853772] EEH: [0000000021927c95] ret_from_kernel_thread+0x5c/0x6c
[ 78.853773] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[ 78.853774] EEH: Notify device drivers to shutdown
[ 78.853775] EEH: Beginning: 'error_detected(IO frozen)'
[ 78.853777] PCI 0001:03:00.0#0000: EEH: no driver
[ 78.853778] PCI 0001:03:00.1#0000: EEH: driver not EEH aware
[ 78.853779] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 78.853782] EEH: Collect temporary log
[ 78.853812] EEH: of node=0001:03:00.0
[ 78.853814] EEH: PCI device/vendor: 73bf1002
[ 78.853816] EEH: PCI cmd/status register: 00100542
[ 78.853817] EEH: PCI-E capabilities and status follow:
[ 78.853824] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 78.853830] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 78.853831] EEH: PCI-E 20: 00000000
[ 78.853832] EEH: PCI-E AER capability register set follows:
[ 78.853839] EEH: PCI-E AER 00: 20020001 00000000 00000000 00462030
[ 78.853845] EEH: PCI-E AER 10: 00000000 00002000 000001f4 40008001
[ 78.853851] EEH: PCI-E AER 20: 0000000f 8007f000 00000000 00000000
[ 78.853853] EEH: PCI-E AER 30: 00000000 00000000
[ 78.853854] EEH: of node=0001:03:00.1
[ 78.853856] EEH: PCI device/vendor: ab281002
[ 78.853858] EEH: PCI cmd/status register: 00100142
[ 78.853859] EEH: PCI-E capabilities and status follow:
[ 78.853866] EEH: PCI-E 00: 0012a010 00008fa1 00002930 00440d04
[ 78.853871] EEH: PCI-E 10: 11040040 00000000 00000000 00000000
[ 78.853872] EEH: PCI-E 20: 00000000
[ 78.853873] EEH: PCI-E AER capability register set follows:
[ 78.853880] EEH: PCI-E AER 00: 2a020001 00000000 00000000 00462030
[ 78.853886] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 78.853891] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 78.853894] EEH: PCI-E AER 30: 00000000 00000000
[ 78.853895] PHB4 PHB#1 Diag-data (Version: 1)
[ 78.853896] brdgCtl: 00000002
[ 78.853897] RootSts: 00000020 00402000 a0440008 00100107 00004000
[ 78.853898] RootErrSts: 00000024 00000000 00000000
[ 78.853899] sourceId: 03000000
[ 78.853900] PhbSts: 0000001c00000000 0000001c00000000
[ 78.853901] Lem: 0000000004000000 0000000000000000 0000000004000000
[ 78.853903] PhbErr: 0000080000000000 0000080000000000 2148000098000240 a008400000000000
[ 78.853904] RxeArbErr: 0000000000000020 0000000000000020 4000030000000000 0000000000000000
[ 78.853905] PcieDlp: 0000000000000000 0000000000000000 7000000000000000
[ 78.853906] PE[000] A/B: 8720002503000000 8000000000000000
[ 78.853908] EEH: Reset with hotplug activity
[ 78.853919] Attempt to iounmap early bolted mapping at 0x0000000000000000
[ 78.853983] pci 0001:03:00.1: Removing from iommu group 1
[ 78.854055] pci 0001:03:00.0: Removing from iommu group 1
[ 80.954155] EEH: Sleep 5s ahead of complete hotplug
[ 85.987779] ------------[ cut here ]------------
[ 85.987788] WARNING: CPU: 0 PID: 177 at arch/powerpc/kernel/eeh_pe.c:438 eeh_pe_tree_remove+0xb8/0x260
[ 85.987789] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
[ 85.987888] sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
[ 85.987907] CPU: 0 PID: 177 Comm: eehd Not tainted 5.10.21-200.4kpagesize.fc33.ppc64le #1
[ 85.987909] NIP: c00000000004b778 LR: c00000000004b710 CTR: c00000000004ce90
[ 85.987912] REGS: c00000000d14f840 TRAP: 0700 Not tainted (5.10.21-200.4kpagesize.fc33.ppc64le)
[ 85.987913] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002842 XER: 00000000
[ 85.987926] CFAR: c00000000004b7b0 IRQMASK: 0
GPR00: c00000000004cee8 c00000000d14fad0 c000000002310900 0000000000000001
GPR04: c000000003ec94b0 c000000003ec94b0 0000000028008844 0000000000000100
GPR08: c00000000d7d4068 0000000000000000 0000000000000008 0000000000000000
GPR12: c00000000004ce90 c0000000024f1000 c0000000001a3be8 c00000000d04fcc0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000045
GPR24: 0000000000000002 0000000000000000 0000000000000000 c00000000d7a1800
GPR28: 5deadbeef0000100 5deadbeef0000122 c00000000d7d0000 c00000000d7d4000
[ 85.987975] NIP [c00000000004b778] eeh_pe_tree_remove+0xb8/0x260
[ 85.987977] LR [c00000000004b710] eeh_pe_tree_remove+0x50/0x260
[ 85.987979] Call Trace:
[ 85.987982] [c00000000d14fad0] [0000000000000027] 0x27 (unreliable)
[ 85.987987] [c00000000d14fb50] [c00000000004cee8] eeh_pe_detach_dev+0x58/0xc0
[ 85.987990] [c00000000d14fb80] [c00000000004afbc] eeh_pe_traverse+0x6c/0xf0
[ 85.987994] [c00000000d14fbc0] [c00000000004fb54] eeh_reset_device+0x21c/0x2c8
[ 85.987998] [c00000000d14fc70] [c00000000004ebd0] eeh_handle_normal_event+0x7e0/0xa40
[ 85.988001] [c00000000d14fd50] [c00000000004fd18] eeh_event_handler+0x118/0x1a0
[ 85.988005] [c00000000d14fdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
[ 85.988009] [c00000000d14fe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
[ 85.988011] Instruction dump:
[ 85.988013] 67bdf000 639c0100 63bd0122 fb9e0070 fbbe0078 e95f0002 ebdf0038 71490002
[ 85.988023] 41820038 480000c4 2c290000 40820008 <0fe00000> e93f0068 7c294040 418200dc
[ 85.988033] ---[ end trace c7c7bf27e0e1201f ]---
[ 85.988035] ------------[ cut here ]------------
[ 85.988039] WARNING: CPU: 0 PID: 177 at arch/powerpc/kernel/eeh_pe.c:438 eeh_pe_tree_remove+0xb8/0x260
[ 85.988040] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
[ 85.988131] sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
[ 85.988148] CPU: 0 PID: 177 Comm: eehd Tainted: G W 5.10.21-200.4kpagesize.fc33.ppc64le #1
[ 85.988150] NIP: c00000000004b778 LR: c00000000004b710 CTR: c00000000004ce90
[ 85.988152] REGS: c00000000d14f840 TRAP: 0700 Tainted: G W (5.10.21-200.4kpagesize.fc33.ppc64le)
[ 85.988153] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002842 XER: 00000000
[ 85.988166] CFAR: c00000000004b7b0 IRQMASK: 0
GPR00: c00000000004cee8 c00000000d14fad0 c000000002310900 0000000000000001
GPR04: c000000003ec9e70 c000000003ec9e70 0000000028008844 0000000000000100
GPR08: c00000000d7d4068 0000000000000000 0000000000000008 0000000000000000
GPR12: c00000000004ce90 c0000000024f1000 c0000000001a3be8 c00000000d04fcc0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000045
GPR24: 0000000000000002 0000000000000000 0000000000000000 c00000000d7a1800
GPR28: 5deadbeef0000100 5deadbeef0000122 c00000000d7d0000 c00000000d7d4000
[ 85.988213] NIP [c00000000004b778] eeh_pe_tree_remove+0xb8/0x260
[ 85.988216] LR [c00000000004b710] eeh_pe_tree_remove+0x50/0x260
[ 85.988217] Call Trace:
[ 85.988219] [c00000000d14fad0] [0000000000000027] 0x27 (unreliable)
[ 85.988223] [c00000000d14fb50] [c00000000004cee8] eeh_pe_detach_dev+0x58/0xc0
[ 85.988227] [c00000000d14fb80] [c00000000004afbc] eeh_pe_traverse+0x6c/0xf0
[ 85.988230] [c00000000d14fbc0] [c00000000004fb54] eeh_reset_device+0x21c/0x2c8
[ 85.988234] [c00000000d14fc70] [c00000000004ebd0] eeh_handle_normal_event+0x7e0/0xa40
[ 85.988237] [c00000000d14fd50] [c00000000004fd18] eeh_event_handler+0x118/0x1a0
[ 85.988240] [c00000000d14fdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
[ 85.988244] [c00000000d14fe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
[ 85.988246] Instruction dump:
[ 85.988248] 67bdf000 639c0100 63bd0122 fb9e0070 fbbe0078 e95f0002 ebdf0038 71490002
[ 85.988258] 41820038 480000c4 2c290000 40820008 <0fe00000> e93f0068 7c294040 418200dc
[ 85.988268] ---[ end trace c7c7bf27e0e12020 ]---
[ 85.988318] pci 0001:03:00.0: [1002:73bf] type 00 class 0x030000
[ 85.988340] pci 0001:03:00.0: reg 0x10: [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 85.988352] pci 0001:03:00.0: reg 0x18: [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 85.988359] pci 0001:03:00.0: reg 0x20: [io 0x0000-0x00ff]
[ 85.988367] pci 0001:03:00.0: reg 0x24: [mem 0x600c080000000-0x600c0800fffff]
[ 85.988375] pci 0001:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[ 85.988505] pci 0001:03:00.0: PME# supported from D1 D2 D3hot D3cold
[ 85.988598] pci 0001:03:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0001:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 85.988667] pci 0001:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 85.989164] pci 0001:03:00.1: [1002:ab28] type 00 class 0x040300
[ 85.989178] pci 0001:03:00.1: reg 0x10: [mem 0x600c080120000-0x600c080123fff]
[ 85.989290] pci 0001:03:00.1: PME# supported from D1 D2 D3hot D3cold
[ 85.989808] pci 0001:02:00.0: ASPM: current common clock configuration is inconsistent, reconfiguring
[ 85.989849] pci 0001:02:00.0: BAR 13: no space for [io size 0x1000]
[ 85.989851] pci 0001:02:00.0: BAR 13: failed to assign [io size 0x1000]
[ 85.989856] pci 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 85.989866] pci 0001:03:00.0: BAR 2: assigned [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 85.989875] pci 0001:03:00.0: BAR 5: assigned [mem 0x600c080000000-0x600c0800fffff]
[ 85.989880] pci 0001:03:00.0: BAR 6: assigned [mem 0x600c080100000-0x600c08011ffff pref]
[ 85.989883] pci 0001:03:00.1: BAR 0: assigned [mem 0x600c080120000-0x600c080123fff]
[ 85.989887] pci 0001:03:00.0: BAR 4: no space for [io size 0x0100]
[ 85.989890] pci 0001:03:00.0: BAR 4: failed to assign [io size 0x0100]
[ 85.989893] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 85.989898] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 85.989902] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 85.989906] PCI: No. 2 try to assign unassigned res
[ 85.989910] pci 0001:02:00.0: BAR 13: no space for [io size 0x1000]
[ 85.989912] pci 0001:02:00.0: BAR 13: failed to assign [io size 0x1000]
[ 85.989915] pci 0001:03:00.0: BAR 4: no space for [io size 0x0100]
[ 85.989917] pci 0001:03:00.0: BAR 4: failed to assign [io size 0x0100]
[ 85.989920] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 85.989925] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 85.989928] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 85.989940] pci 0001:03:00.0: Added to existing PE#0
[ 85.989946] pci 0001:03:00.0: Adding to iommu group 1
[ 85.990081] amdgpu 0001:03:00.0: enabling device (0140 -> 0142)
[ 85.990088] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xE438 0xC0).
[ 85.990092] amdgpu 0001:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 85.990104] [drm] register mmio base: 0x80000000
[ 85.990105] [drm] register mmio size: 1048576
[ 85.990107] [drm] PCI I/O BAR is not found.
[ 85.990113] [drm] PCIE atomic ops is not supported
[ 85.992344] [drm] add ip block number 0 <nv_common>
[ 85.992346] [drm] add ip block number 1 <gmc_v10_0>
[ 85.992347] [drm] add ip block number 2 <navi10_ih>
[ 85.992349] [drm] add ip block number 3 <psp>
[ 85.992351] [drm] add ip block number 4 <smu>
[ 85.992353] [drm] add ip block number 5 <gfx_v10_0>
[ 85.992354] [drm] add ip block number 6 <sdma_v5_2>
[ 85.992356] [drm] add ip block number 7 <vcn_v3_0>
[ 85.992357] [drm] add ip block number 8 <jpeg_v3_0>
[ 86.023918] amdgpu 0001:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 86.023926] amdgpu: ATOM BIOS: 113-E438XTX-UO2
[ 86.023949] [drm] VCN(0) decode is enabled in VM mode
[ 86.023952] [drm] VCN(1) decode is enabled in VM mode
[ 86.023955] [drm] VCN(0) encode is enabled in VM mode
[ 86.023958] [drm] VCN(1) encode is enabled in VM mode
[ 86.023962] [drm] JPEG decode is enabled in VM mode
[ 86.024021] amdgpu 0001:03:00.0: amdgpu: HBM ECC is not presented.
[ 86.024024] amdgpu 0001:03:00.0: amdgpu: SRAM ECC is not presented.
[ 86.024033] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 86.024071] amdgpu 0001:03:00.0: BAR 2: releasing [mem 0x6004010000000-0x60040101fffff 64bit pref]
[ 86.024075] amdgpu 0001:03:00.0: BAR 0: releasing [mem 0x6004000000000-0x600400fffffff 64bit pref]
[ 86.024112] pci 0001:02:00.0: BAR 15: releasing [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 86.024116] pci 0001:01:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 86.024120] pci 0001:00:00.0: BAR 15: releasing [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 86.024132] pci 0001:00:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 86.024137] pci 0001:01:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 86.024142] pci 0001:02:00.0: BAR 15: assigned [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 86.024147] amdgpu 0001:03:00.0: BAR 0: assigned [mem 0x6004000000000-0x60043ffffffff 64bit pref]
[ 86.024160] amdgpu 0001:03:00.0: BAR 2: assigned [mem 0x6004400000000-0x60044001fffff 64bit pref]
[ 86.024174] pci 0001:00:00.0: PCI bridge to [bus 01-03]
[ 86.024180] pci 0001:00:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 86.024185] pci 0001:00:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 86.024192] pci 0001:01:00.0: PCI bridge to [bus 02-03]
[ 86.024200] pci 0001:01:00.0: bridge window [mem 0x600c080000000-0x600c0ffefffff]
[ 86.024205] pci 0001:01:00.0: bridge window [mem 0x6004000000000-0x6007f7ff0ffff 64bit pref]
[ 86.024213] pci 0001:02:00.0: PCI bridge to [bus 03]
[ 86.024219] pci 0001:02:00.0: bridge window [mem 0x600c080000000-0x600c0807fffff]
[ 86.024225] pci 0001:02:00.0: bridge window [mem 0x6004000000000-0x60045ffffffff 64bit pref]
[ 86.024240] amdgpu 0001:03:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 86.024244] amdgpu 0001:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 86.024248] [drm] Detected VRAM RAM=16368M, BAR=16384M
[ 86.024251] [drm] RAM width 256bits GDDR6
[ 86.024256] list_add corruption. prev->next should be next (c00800000067e970), but was 0000000000000000. (prev=c0000000685455b8).
[ 86.024282] ------------[ cut here ]------------
[ 86.024284] kernel BUG at lib/list_debug.c:26!
[ 86.024291] Oops: Exception in kernel mode, sig: 5 [#1]
[ 86.024296] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 86.024300] Modules linked in: amdgpu mfd_core gpu_sched xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink rfkill ip6table_filter ip6_tables iptable_filter sunrpc snd_hda_codec_hdmi snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_hda_codec at24 regmap_i2c snd_hda_core snd_usbmidi_lib snd_rawmidi snd_hwdep snd_seq joydev snd_seq_device crct10dif_vpmsum snd_pcm mc ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd snd_timer rtc_opal opal_prd snd i2c_opal soundcore zram ip_tables ast drm_vram_helper drm_ttm_helper ttm i2c_algo_bit drm_kms_helper syscopyarea
[ 86.024426] sysfillrect sysimgblt fb_sys_fops cec drm vmx_crypto crc32c_vpmsum tg3 i2c_core drm_panel_orientation_quirks nvme nvme_core fuse
[ 86.024454] CPU: 0 PID: 189 Comm: kworker/0:2 Tainted: G W 5.10.21-200.4kpagesize.fc33.ppc64le #1
[ 86.024461] Workqueue: events work_for_cpu_fn
[ 86.024466] NIP: c000000000a4a424 LR: c000000000a4a420 CTR: 0000000000000000
[ 86.024470] REGS: c00000000e0fb380 TRAP: 0700 Tainted: G W (5.10.21-200.4kpagesize.fc33.ppc64le)
[ 86.024474] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002444 XER: 20040000
[ 86.024492] CFAR: c000000000216098 IRQMASK: 0
GPR00: c000000000a4a420 c00000000e0fb610 c000000002310900 0000000000000075
GPR04: ffffffffffffffea c000000002099a88 0000000000000001 0000000000000027
GPR08: c000000ffc6dcf90 ffffffffffffffd8 0000000000000023 3030303038303063
GPR12: 0000000000002000 c0000000024f1000 c00000000d14f7b0 c0000000686e5b78
GPR16: c0000000686e5b80 c0000000686e5b70 c0000000686f6d90 c0000000686e5b90
GPR20: c0000000686e5b98 c0000000686e5b88 0000000000000001 c00800000067e970
GPR24: c0080000034ae4c0 0000000000000000 c00000000cf66c58 c0000000686e55d0
GPR28: c00800000067d998 c0000000685455b8 c00800000067e920 c0000000686e55b8
[ 86.024564] NIP [c000000000a4a424] __list_add_valid+0xb4/0xc0
[ 86.024569] LR [c000000000a4a420] __list_add_valid+0xb0/0xc0
[ 86.024572] Call Trace:
[ 86.024577] [c00000000e0fb610] [c000000000a4a420] __list_add_valid+0xb0/0xc0 (unreliable)
[ 86.024592] [c00000000e0fb670] [c00800000066bf80] ttm_bo_device_init+0x158/0x2d0 [ttm]
[ 86.024728] [c00000000e0fb720] [c008000002ef4214] amdgpu_ttm_init+0xcc/0x620 [amdgpu]
[ 86.024874] [c00000000e0fb830] [c0080000033326d0] amdgpu_bo_init+0x80/0xa0 [amdgpu]
[ 86.025020] [c00000000e0fb8a0] [c008000002f9e750] gmc_v10_0_sw_init+0x338/0x480 [amdgpu]
[ 86.025158] [c00000000e0fb940] [c008000002edb3f8] amdgpu_device_init+0x1670/0x1fc0 [amdgpu]
[ 86.025294] [c00000000e0fba90] [c008000002edf108] amdgpu_driver_load_kms+0x30/0x520 [amdgpu]
[ 86.025431] [c00000000e0fbb10] [c008000002ed2a84] amdgpu_pci_probe+0x18c/0x340 [amdgpu]
[ 86.025439] [c00000000e0fbbb0] [c000000000b2d978] local_pci_probe+0x68/0x110
[ 86.025446] [c00000000e0fbc30] [c000000000192ac8] work_for_cpu_fn+0x38/0x60
[ 86.025453] [c00000000e0fbc60] [c000000000197c40] process_one_work+0x300/0x5d0
[ 86.025459] [c00000000e0fbd00] [c000000000198270] worker_thread+0x360/0x780
[ 86.025465] [c00000000e0fbdb0] [c0000000001a3dc4] kthread+0x1e4/0x1f0
[ 86.025472] [c00000000e0fbe20] [c00000000000d4f0] ret_from_kernel_thread+0x5c/0x6c
[ 86.025476] Instruction dump:
[ 86.025480] f8010070 4b7cbc59 60000000 0fe00000 7c0802a6 3c62ff34 7d465378 7d244b78
[ 86.025494] 38638bd0 f8010070 4b7cbc35 60000000 <0fe00000> 60000000 60420000 3c4c018c
[ 86.025512] ---[ end trace c7c7bf27e0e12021 ]---
Edited by Trung Lê