[navi] 5.14.x regression - GPU fails to init (RX 5500/5700 XT)
Card: RX 5500 XT (or 5700 XT)
CPU: IBM POWER9 (ppc64le)
kernel: 5.14.10 (vanilla, 4k pages)
distro: Void Linux
Starting with 5.14.x, the GPU fails to initialize:
[ 2.405216] [drm] amdgpu kernel modesetting enabled.
[ 2.405616] amdgpu: CRAT table disabled by module option
[ 2.405618] amdgpu: DSDT table not found for OEM information
[ 2.405619] amdgpu: IO link not available for non x86 platforms
[ 2.405620] amdgpu: Virtual CRAT table created for CPU
[ 2.405629] amdgpu: Topology: Add CPU node
[ 2.405717] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[ 2.405779] amdgpu 0000:03:00.0: enabling device (0140 -> 0142)
[ 2.405795] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 2.440112] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 2.440115] amdgpu: ATOM BIOS: 113-E4210MB-U0D
[ 2.440220] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x6000010000000-0x60000101fffff 64bit pref]
[ 2.440225] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x6000000000000-0x600000fffffff 64bit pref]
[ 2.440286] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x6000000000000-0x60001ffffffff 64bit pref]
[ 2.440297] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x6000200000000-0x60002001fffff 64bit pref]
[ 2.440361] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 2.440366] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 2.440370] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 2.440426] [drm] amdgpu: 8176M of VRAM memory ready
[ 2.440429] [drm] amdgpu: 8176M of GTT memory ready.
[ 2.440838] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[ 2.449161] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[ 2.466838] EEH: [(____ptrval____)] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[ 2.467066] EEH: [(____ptrval____)] nbio_v2_3_program_aspm+0x7c4/0x9a0 [amdgpu]
[ 2.467319] EEH: [(____ptrval____)] nv_common_hw_init+0x154/0x170 [amdgpu]
[ 2.467574] EEH: [(____ptrval____)] amdgpu_device_init+0x1dc0/0x21c0 [amdgpu]
[ 2.467802] EEH: [(____ptrval____)] amdgpu_driver_load_kms+0x48/0x370 [amdgpu]
[ 2.468040] EEH: [(____ptrval____)] amdgpu_pci_probe+0x174/0x330 [amdgpu]
[ 2.767564] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[ 2.767800] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 2.768026] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 2.768229] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 2.768254] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[ 2.768293] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[ 2.773132] amdgpu: probe of 0000:03:00.0 failed with error -22
[ 2.773268] Modules linked in: sd_mod amdgpu(+) gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec xhci_pci xhci_pci_renesas rc_core ahci xhci_hcd libahci drm libata usbcore vmx_crypto gf128mul scsi_mod drm_panel_orientation_quirks agpgart dm_mirror dm_region_hash dm_log dm_mod btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_vpmsum
[ 2.774004] NIP [c00800000279d608] vcn_v2_0_sw_fini+0x30/0x80 [amdgpu]
[ 2.774268] LR [c00800000260643c] amdgpu_device_fini_sw+0x124/0x3f0 [amdgpu]
[ 2.774543] [c000000005d67540] [c0080000027c5d04] jpeg_v2_0_sw_fini+0x3c/0x60 [amdgpu] (unreliable)
[ 2.774814] [c000000005d67570] [c00800000260643c] amdgpu_device_fini_sw+0x124/0x3f0 [amdgpu]
[ 2.775075] [c000000005d67620] [c00800000260e574] amdgpu_driver_release_kms+0x2c/0x60 [amdgpu]
[ 2.775908] [c000000005d67ac0] [c008000002aaeefc] amdgpu_init+0xa8/0xd0 [amdgpu]
The system first takes a long time to boot (it hangs for a while at the probe), and by the time it finishes booting, the driver has failed to load. The system is responsive on the serial console but has no graphics output. 5.13.19 works fine, as does 5.12.19 (with months of uptime), and earlier kernels had worked for a while before that.
I am not sure whether the same issue also happens on x86_64 machines, but this is definitely a regression introduced somewhere in 5.14.x.
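For anyone who wants to check whether another machine (x86_64 or otherwise) hits the same failure, here is a rough sketch of a script that pulls the EEH backtrace symbols and the *ERROR* messages out of dmesg-style text. The function name summarize_amdgpu_failure and the SAMPLE excerpt are just illustrative; the sample lines are copied from the log above. Note that EEH is a POWER feature, so on x86_64 only the *ERROR* lines would be expected to show up, if anything does.

```python
import re

# Matches a kernel log timestamp prefix such as "[ 2.466838] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

# Short excerpt copied from the log in this report, used as demo input.
SAMPLE = """\
[ 2.466838] EEH: [(____ptrval____)] amdgpu_device_rreg.part.0+0x160/0x1f0 [amdgpu]
[ 2.467066] EEH: [(____ptrval____)] nbio_v2_3_program_aspm+0x7c4/0x9a0 [amdgpu]
[ 2.767564] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
"""


def summarize_amdgpu_failure(log_text):
    """Collect EEH backtrace symbols and *ERROR* messages from dmesg text."""
    eeh_frames = []
    errors = []
    for line in log_text.splitlines():
        body = TIMESTAMP.sub("", line.strip())
        if body.startswith("EEH:") and "[amdgpu]" in body:
            # Keep only the symbol name, e.g. "nbio_v2_3_program_aspm".
            match = re.search(r"\]\s+(\w[\w.]*)\+0x", body)
            if match:
                eeh_frames.append(match.group(1))
        elif "*ERROR*" in body:
            # Keep the message text that follows the *ERROR* marker.
            errors.append(body.split("*ERROR*", 1)[1].strip())
    return {"eeh_frames": eeh_frames, "errors": errors}


if __name__ == "__main__":
    summary = summarize_amdgpu_failure(SAMPLE)
    print("EEH frames:", summary["eeh_frames"])
    print("Errors:", summary["errors"])
```

To use it on a real system, save the output of dmesg to a file and pass the file contents to summarize_amdgpu_failure() instead of SAMPLE. If the EEH frames on another ppc64le box also include nbio_v2_3_program_aspm, it is probably the same problem.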