[amdgpu] VBIOS from ROM BAR causes list_add corruption during init of gmc_v11_0 IP block
Brief summary of the problem:
With certain motherboard firmware settings, GPU initialization fails on my system. From the user's standpoint, the boot appears to hang on the TTY while systemd is printing its "STARTED" messages. In reality the system boots up fully (e.g. you can ssh into it); it's just the display output that hangs (at the moment when amdgpu takes over the EFI-initialized framebuffer, I imagine). The notable difference between the working and non-working cases seems to be where the VBIOS is loaded from. In the working configuration it comes from VFCT. Every time the VBIOS comes from ROM BAR, initialization fails.
Hardware description:
- CPU: 5700X
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] Navi 32 [Radeon RX 7700 XT / 7800 XT] [1002:747e] (rev c8)
- System Memory: 32GiB
- Display(s): AW3423DWF
- Type of Display Connection: DP
- Motherboard: GIGABYTE B550 GAMING X V2
System information:
- Distro name and Version: NixOS
- Kernel version: 6.1.68
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
- Set up the system so that the VBIOS comes from ROM BAR instead of VFCT (or other similar sources?);
- In my case this seems to depend on certain BIOS options, for example CSM support or perhaps the Secure Boot enable/disable option;
- Boot the kernel;
- Load the amdgpu module.
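To confirm which case a given boot falls into, the kernel log can be checked for the line that names the VBIOS source. Below is a minimal sketch in Python, assuming the message always has the form "amdgpu: Fetched VBIOS from <source>" as it does in the log in this report; the exact wording for the VFCT case is an assumption, and the helper name vbios_source is made up for illustration.

```python
import re

# Matches the amdgpu message that names where the VBIOS image came from,
# e.g. "amdgpu: Fetched VBIOS from ROM BAR". The message format is taken
# from the log in this report; other sources are assumed to follow it.
VBIOS_RE = re.compile(r"amdgpu: Fetched VBIOS from (.+?)\s*$")


def vbios_source(log_text):
    """Return the VBIOS source named in the first matching line, or None."""
    for line in log_text.splitlines():
        match = VBIOS_RE.search(line)
        if match:
            return match.group(1)
    return None


sample = (
    "Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: No more image in the PCI ROM\n"
    "Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: Fetched VBIOS from ROM BAR\n"
)
print(vbios_source(sample))
```

The function takes the log as a string, so it can be fed saved kernel log text from either configuration and the two results compared.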
Log files (for system lockups / game freezes / crashes)
Jan 01 22:36:19 kernel: [drm] amdgpu kernel modesetting enabled.
Jan 01 22:36:19 kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Jan 01 22:36:19 kernel: amdgpu: Virtual CRAT table created for CPU
Jan 01 22:36:19 kernel: amdgpu: Topology: Add CPU node
Jan 01 22:36:19 kernel: [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x747E 0x1EAE:0x7801 0xC8).
Jan 01 22:36:19 kernel: [drm] register mmio base: 0xFC900000
Jan 01 22:36:19 kernel: [drm] register mmio size: 1048576
Jan 01 22:36:19 kernel: [drm] add ip block number 0 <soc21_common>
Jan 01 22:36:19 kernel: [drm] add ip block number 1 <gmc_v11_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 2 <ih_v6_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 3 <psp>
Jan 01 22:36:19 kernel: [drm] add ip block number 4 <smu>
Jan 01 22:36:19 kernel: [drm] add ip block number 5 <dm>
Jan 01 22:36:19 kernel: [drm] add ip block number 6 <gfx_v11_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 7 <sdma_v6_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 8 <vcn_v4_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 9 <jpeg_v4_0>
Jan 01 22:36:19 kernel: [drm] add ip block number 10 <mes_v11_0>
Jan 01 22:36:19 kernel: [drm] BIOS signature incorrect 0 0
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: No more image in the PCI ROM
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: Fetched VBIOS from ROM BAR
Jan 01 22:36:19 kernel: amdgpu: ATOM BIOS: 113-EXT90440-100
Jan 01 22:36:19 kernel: [drm] VCN(0) encode/decode are enabled in VM mode
Jan 01 22:36:19 kernel: [drm] VCN(1) encode/decode are enabled in VM mode
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: vgaarb: deactivate vga console
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: MEM ECC is not presented.
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: SRAM ECC is not presented.
Jan 01 22:36:19 kernel: [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Jan 01 22:36:19 kernel: [drm] Detected VRAM RAM=16368M, BAR=256M
Jan 01 22:36:19 kernel: [drm] RAM width 256bits GDDR6
Jan 01 22:36:19 kernel: ------------[ cut here ]------------
Jan 01 22:36:19 kernel: list_add corruption. prev->next should be next (ffffffffc16b7068), but was 0000000000000000. (prev=ffff8d85a98a5508).
Jan 01 22:36:19 kernel: WARNING: CPU: 6 PID: 2357 at lib/list_debug.c:30 __list_add_valid+0x7f/0xa0
Jan 01 22:36:19 kernel: Modules linked in: amdgpu(+) xt_MASQUERADE xt_mark nft_chain_nat af_packet ip6table_nat nf_nat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched drm_buddy video edac_mce_amd drm_ttm_helper edac_core snd_hda_intel intel_rapl_msr ttm intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi crc32_pclmul drm_display_helper snd_hda_codec polyval_clmulni polyval_generic snd_hda_core ghash_clmulni_intel drm_kms_helper r8169 sha512_ssse3 snd_hwdep sha512_generic snd_pcm atlantic sha256_ssse3 sha1_ssse3 sp5100_tco agpgart snd_timer aesni_intel i2c_algo_bit macsec fb_sys_fops snd syscopyarea ptp realtek watchdog crypto_simd wmi_bmof joydev serio_raw gigabyte_wmi sysfillrect input_leds mdio_devres mousedev cryptd psmouse rapl libphy sysimgblt ccp i2c_piix4 soundcore zenpower(O) pps_core evdev mac_hid thermal uas wmi tpm_crb tiny_power_button gpio_amdpt tpm_tis acpi_cpufreq gpio_generic tpm_tis_core button ip6_tables xt_conntrack nf_conntrack
Jan 01 22:36:19 kernel: nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables libcrc32c crc32c_generic nfnetlink sch_fq_codel loop tun tap macvlan bridge drm stp llc fuse deflate backlight efi_pstore configfs zstd zstd_compress zram zsmalloc efivarfs tpm rng_core dmi_sysfs ip_tables x_tables autofs4 atkbd libps2 vivaldi_fmap crc32c_intel hid_generic i8042 rtc_cmos serio dm_mod dax nls_iso8859_1 nls_cp437 vfat fat zfs(PO) spl(O) sd_mod usb_storage usbhid hid ahci libahci libata scsi_mod scsi_common xhci_pci xhci_pci_renesas xhci_hcd usbcore usb_common nvme nvme_core t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common [last unloaded: amdgpu]
Jan 01 22:36:19 kernel: CPU: 6 PID: 2357 Comm: insmod Tainted: P W O 6.1.68 #1-NixOS
Jan 01 22:36:19 kernel: Hardware name: Gigabyte Technology Co., Ltd. B550 GAMING X V2/B550 GAMING X V2, BIOS FDc 09/20/2023
Jan 01 22:36:19 kernel: RIP: 0010:__list_add_valid+0x7f/0xa0
Jan 01 22:36:19 kernel: Code: eb e9 48 89 c1 48 c7 c7 c8 d7 36 bd e8 da 7d b9 ff 0f 0b eb d6 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 18 d8 36 bd e8 c1 7d b9 ff <0f> 0b eb bd 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 68 d8 36 bd e8 a8
Jan 01 22:36:19 kernel: RSP: 0018:ffff9756064cb9f0 EFLAGS: 00010286
Jan 01 22:36:19 kernel: RAX: 0000000000000000 RBX: ffff8d85a98a5508 RCX: 0000000000000027
Jan 01 22:36:19 kernel: RDX: ffff8d8c9eba15a8 RSI: 0000000000000001 RDI: ffff8d8c9eba15a0
Jan 01 22:36:19 kernel: RBP: ffff8d85a9905508 R08: 0000000000000000 R09: ffff9756064cb880
Jan 01 22:36:19 kernel: R10: 0000000000000003 R11: ffffffffbdb4e808 R12: ffff8d859dc243f0
Jan 01 22:36:19 kernel: R13: 0000000000000000 R14: ffffffffc1f6c4a0 R15: 0000000000000000
Jan 01 22:36:19 kernel: FS: 00007fb8206621c0(0000) GS:ffff8d8c9eb80000(0000) knlGS:0000000000000000
Jan 01 22:36:19 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 01 22:36:19 kernel: CR2: 00005626d1a15378 CR3: 000000013c614000 CR4: 0000000000750ee0
Jan 01 22:36:19 kernel: PKRU: 55555554
Jan 01 22:36:19 kernel: Call Trace:
Jan 01 22:36:19 kernel: <TASK>
Jan 01 22:36:19 kernel: ? __warn+0x7d/0xc0
Jan 01 22:36:19 kernel: ? __list_add_valid+0x7f/0xa0
Jan 01 22:36:19 kernel: ? report_bug+0xe2/0x150
Jan 01 22:36:19 kernel: ? handle_bug+0x41/0x70
Jan 01 22:36:19 kernel: ? exc_invalid_op+0x13/0x60
Jan 01 22:36:19 kernel: ? asm_exc_invalid_op+0x16/0x20
Jan 01 22:36:19 kernel: ? __list_add_valid+0x7f/0xa0
Jan 01 22:36:19 kernel: ttm_device_init+0x132/0x170 [ttm]
Jan 01 22:36:19 kernel: amdgpu_ttm_init+0xb8/0x440 [amdgpu]
Jan 01 22:36:19 kernel: ? _printk+0x68/0x83
Jan 01 22:36:19 kernel: gmc_v11_0_sw_init+0x294/0x3c0 [amdgpu]
Jan 01 22:36:19 kernel: amdgpu_device_init.cold+0x1304/0x1ede [amdgpu]
Jan 01 22:36:19 kernel: amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
Jan 01 22:36:19 kernel: amdgpu_pci_probe+0x13f/0x360 [amdgpu]
Jan 01 22:36:19 kernel: local_pci_probe+0x3e/0x80
Jan 01 22:36:19 kernel: pci_device_probe+0xbf/0x230
Jan 01 22:36:19 kernel: ? sysfs_do_create_link_sd+0x6e/0xe0
Jan 01 22:36:19 kernel: really_probe+0xde/0x380
Jan 01 22:36:19 kernel: ? pm_runtime_barrier+0x50/0x90
Jan 01 22:36:19 kernel: __driver_probe_device+0x78/0x120
Jan 01 22:36:19 kernel: driver_probe_device+0x1f/0x90
Jan 01 22:36:19 kernel: __driver_attach+0xce/0x1c0
Jan 01 22:36:19 kernel: ? __device_attach_driver+0x110/0x110
Jan 01 22:36:19 kernel: bus_for_each_dev+0x87/0xd0
Jan 01 22:36:19 kernel: bus_add_driver+0x1ae/0x200
Jan 01 22:36:19 kernel: driver_register+0x89/0xe0
Jan 01 22:36:19 kernel: ? 0xffffffffc20e9000
Jan 01 22:36:19 kernel: do_one_initcall+0x59/0x220
Jan 01 22:36:19 kernel: do_init_module+0x4a/0x1e0
Jan 01 22:36:19 kernel: __do_sys_init_module+0x17f/0x1b0
Jan 01 22:36:19 kernel: do_syscall_64+0x3a/0x90
Jan 01 22:36:19 kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Jan 01 22:36:19 kernel: RIP: 0033:0x7fb8207767be
Jan 01 22:36:19 kernel: Code: 48 8b 0d 75 a6 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 a6 0c 00 f7 d8 64 89 01 48
Jan 01 22:36:19 kernel: RSP: 002b:00007ffc542205c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Jan 01 22:36:19 kernel: RAX: ffffffffffffffda RBX: 00007fb81e70c010 RCX: 00007fb8207767be
Jan 01 22:36:19 kernel: RDX: 00000000007992a0 RSI: 000000000125a848 RDI: 00007fb81e70c010
Jan 01 22:36:19 kernel: RBP: 00000000007997a0 R08: 0000000000000001 R09: 0000000000000000
Jan 01 22:36:19 kernel: R10: 0000000000000071 R11: 0000000000000246 R12: 00000000007992a0
Jan 01 22:36:19 kernel: R13: 0000000000000000 R14: 0000000000799730 R15: 0000000000000002
Jan 01 22:36:19 kernel: </TASK>
Jan 01 22:36:19 kernel: ---[ end trace 0000000000000000 ]---
Jan 01 22:36:19 kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <gmc_v11_0> failed -22
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: amdgpu_device_ip_init failed
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: Fatal error during GPU init
Jan 01 22:36:19 kernel: amdgpu 0000:09:00.0: amdgpu: amdgpu: finishing device.
Jan 01 22:36:19 kernel: amdgpu: probe of 0000:09:00.0 failed with error -22
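For comparing logs across BIOS settings, the failing IP block and error code can be pulled out of the "*ERROR* ... of IP block <...> failed" line. This is a rough sketch only: the regular expression is based on the single message seen above, and the helper name ip_block_failure is made up for illustration.

```python
import errno
import re

# Based on the line seen in this report:
#   "*ERROR* sw_init of IP block <gmc_v11_0> failed -22"
IP_FAIL_RE = re.compile(r"\*ERROR\* (\w+) of IP block <(\w+)> failed (-?\d+)")


def ip_block_failure(log_text):
    """Return stage, block, numeric code and errno name of the first failure."""
    for line in log_text.splitlines():
        match = IP_FAIL_RE.search(line)
        if match:
            code = int(match.group(3))
            return {
                "stage": match.group(1),
                "block": match.group(2),
                "code": code,
                # Kernel return values are negated errno numbers.
                "name": errno.errorcode.get(abs(code), "unknown"),
            }
    return None


sample_line = (
    "Jan 01 22:36:19 kernel: [drm:amdgpu_device_init.cold [amdgpu]] "
    "*ERROR* sw_init of IP block <gmc_v11_0> failed -22"
)
print(ip_block_failure(sample_line))
```

For the log above this reports the sw_init stage of gmc_v11_0 failing with -22, which is EINVAL.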