checking error from amdgpu_ucode_validate in gfx_v10_0_init_microcode breaks firmware loading on dimgrey_cavefish
Commit 1c18f3b3c08d234d236f522b4c21787e607a817e introduced a check of the return value of amdgpu_ucode_validate in gfx_v10_0_init_microcode, which leads to the following error:
Sep 26 01:18:28 lisa kernel: [ 3.168473] amdgpu 0000:03:00.0: amdgpu: STB initialized to 2048 entries
Sep 26 01:18:28 lisa kernel: [ 3.168527] [drm] Loading DMUB firmware via PSP: version=0x02020013
Sep 26 01:18:28 lisa kernel: [ 3.168645] amdgpu 0000:03:00.0: amdgpu: gfx10: Failed to load firmware "amdgpu/dimgrey_cavefish_rlc.bin"
Sep 26 01:18:28 lisa kernel: [ 3.168655] [drm:gfx_v10_0_sw_init.cold [amdgpu]] *ERROR* Failed to load gfx firmware!
Sep 26 01:18:28 lisa kernel: [ 3.168795] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <gfx_v10_0> failed -22
Sep 26 01:18:28 lisa kernel: [ 3.168918] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
Sep 26 01:18:28 lisa kernel: [ 3.168920] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
Sep 26 01:18:28 lisa kernel: [ 3.174942] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
Sep 26 01:18:28 lisa kernel: [ 3.175240] amdgpu: probe of 0000:03:00.0 failed with error -22
Disabling this error check (with some debug output added) makes the device work again:
if (!amdgpu_sriov_vf(adev)) {
	snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", chip_name);
	err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
	printk(KERN_INFO "gfx_v10_0_init_microcode: request_firmware returned %d\n", err);
	if (err)
		goto out;
	err = amdgpu_ucode_validate(adev->gfx.rlc_fw);
	printk(KERN_INFO "gfx_v10_0_init_microcode: amdgpu_ucode_validate returned %d\n", err);
	if (err) {
		//goto out;
		printk(KERN_INFO "ignoring error from amdgpu_ucode_validate!\n");
	}
	rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
	version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
	version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
	err = amdgpu_gfx_rlc_init_microcode(adev, version_major, version_minor);
	printk(KERN_INFO "gfx_v10_0_init_microcode: amdgpu_gfx_rlc_init_microcode returned %d\n", err);
	if (err)
		goto out;
}
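For context, the -22 in the log is -EINVAL. As far as I can tell, amdgpu_ucode_validate only compares the size of the loaded firmware image with the little-endian 32-bit size_bytes field at the start of the firmware header. Assuming that is correct, the same comparison can be sketched in userspace to test a firmware image. The helper names read_le32 and fw_check_size below are made up for illustration and are not kernel functions:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value independent of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical userspace helper mirroring the size check that
 * amdgpu_ucode_validate is assumed to perform: the first 32-bit
 * little-endian word of the image (size_bytes) must equal the
 * total image size. Returns 0 on match, -EINVAL otherwise.
 */
static int fw_check_size(const unsigned char *data, size_t size)
{
	if (size < 4)
		return -EINVAL; /* too short to contain size_bytes */

	return read_le32(data) == size ? 0 : -EINVAL;
}
```

To use it, read the whole of dimgrey_cavefish_rlc.bin into a buffer and pass the buffer and its length to fw_check_size. If the file on disk is compressed (.xz or .zst), it has to be decompressed first, since the check applies to the decompressed image.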
Edit: This may not be a proper solution, because I've experienced several hard lockups with it (flashing Caps Lock LED, no traces in the logs).
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 500c (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
The firmware is from linux-firmware-20220913.tar.gz.
Edit: I just experienced another lockup with a kernel built from 2fe205008e9b70c67a9f3502831074ff36b00093, so this might be caused by an improper mitigation of this issue: https://bugzilla.kernel.org/show_bug.cgi?id=216528