[bisected] System crash with new IP discovery code with Green Sardine & Navy Flounder (Asus ROG G513QY)
Both the drm branch and agd5f's drm-next branch fail to boot with amdgpu built in
03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] [1002:73df] (rev c3)
08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c4)
The crash happens in early boot: the ROG boot screen stays on the screen and the machine periodically reboots. No logs are available.
I bisected to:
commit 75aa18415a4c56d1aacc07cac00f813fdd5d8799
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Wed Jul 28 11:16:12 2021 -0400

    drm/amdgpu: drive all navi asics from the IP discovery table

    Rather than hardcoding based on asic_type, use the IP
    discovery table to configure the driver.

    v2: rebase

    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
I've since switched to compiling amdgpu as a module, which allows the system to start. Running modprobe then shows the following error:
[ 46.938606] amdgpu: unknown parameter 'resize_bar' ignored
[ 46.940250] [drm] amdgpu kernel modesetting enabled.
[ 46.941738] vga_switcheroo: detected switching method \_SB_.PCI0.GP17.VGA_.ATPX handle
[ 46.943496] ATPX version 1, functions 0x00000201
[ 46.945147] ATPX Hybrid Graphics
[ 46.949454] amdgpu: Virtual CRAT table created for CPU
[ 46.951071] amdgpu: Topology: Add CPU node
[ 46.952657] checking generic (fc20000000 e10000) vs hw (f800000000 400000000)
[ 46.952658] checking generic (fc20000000 e10000) vs hw (fc00000000 10000000)
[ 46.952659] checking generic (fc20000000 e10000) vs hw (fca00000 100000)
[ 46.952690] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[ 46.954398] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1043:0x16C2 0xC3).
[ 46.956041] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 46.957845] [drm] register mmio base: 0xFCA00000
[ 46.959467] [drm] register mmio size: 1048576
[ 46.962330] [drm] add ip block number 0 <nv_common>
[ 46.963951] [drm] add ip block number 1 <gmc_v10_0>
[ 46.965598] [drm] add ip block number 2 <navi10_ih>
[ 46.967196] [drm] add ip block number 3 <psp>
[ 46.968775] [drm] add ip block number 4 <smu>
[ 46.970326] [drm] add ip block number 5 <dm>
[ 46.971876] [drm] add ip block number 6 <gfx_v10_0>
[ 46.973427] [drm] add ip block number 7 <sdma_v5_2>
[ 46.974964] [drm] add ip block number 8 <vcn_v3_0>
[ 46.976486] [drm] add ip block number 9 <jpeg_v3_0>
[ 46.978921] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
[ 46.980417] amdgpu: ATOM BIOS: SWBRT77321.001
[ 46.981922] [drm] VCN(0) decode is enabled in VM mode
[ 46.983364] [drm] VCN(0) encode is enabled in VM mode
[ 46.984802] [drm] JPEG decode is enabled in VM mode
[ 46.986251] [drm] GPU posting now...
[ 46.987662] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 46.989077] amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[ 46.990509] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 46.991871] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 46.993212] [drm] Detected VRAM RAM=12272M, BAR=16384M
[ 46.994449] [drm] RAM width 192bits GDDR6
[ 46.995717] [drm] amdgpu: 12272M of VRAM memory ready
[ 46.996914] [drm] amdgpu: 12272M of GTT memory ready.
[ 46.998073] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 46.999390] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 47.000730] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[ 48.988424] [drm] Loading DMUB firmware via PSP: version=0x02020003
[ 48.991702] [drm] use_doorbell being set to: [true]
[ 48.992919] [drm] use_doorbell being set to: [true]
[ 48.994076] [drm] Found VCN firmware Version ENC: 1.16 DEC: 2 VEP: 0 Revision: 1
[ 48.995192] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[ 49.176113] [drm] reserve 0xa00000 from 0x82fe000000 for PSP TMR
[ 51.258871] [drm] failed to load ucode DMCUB(0x22)
[ 51.258874] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[ 51.261543] [drm:psp_hw_init.llvm.2454167530214584832 [amdgpu]] *ERROR* PSP firmware loading failed
[ 51.262747] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 51.263942] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 51.265087] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[ 51.269457] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[ 51.272071] amdgpu: probe of 0000:03:00.0 failed with error -22
[ 51.273225] BUG: unable to handle page fault for address: ffffc9033ecfe000
[ 51.274322] #PF: supervisor write access in kernel mode
[ 51.275391] #PF: error_code(0x0002) - not-present page
[ 51.276482] PGD 100000067 P4D 100000067 PUD 111f42067 PMD 0
[ 51.277572] Oops: 0002 [#1] SMP NOPTI
[ 51.278652] CPU: 10 PID: 497 Comm: modprobe Not tainted 5.15.0-rc2-drm+ #960
[ 51.279789] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.315 09/03/2021
[ 51.280992] RIP: 0010:vcn_v3_0_sw_fini.llvm.5082569627238247841+0x60/0xb0 [amdgpu]
[ 51.282289] Code: 0f 1f 84 00 00 00 00 00 66 90 48 ff c2 0f b6 f0 48 81 c1 90 0c 00 00 48 39 f2 73 21 8b b3 9c 24 01 00 0f a3 d6 72 e3 48 8b 01 <c7> 00 00 00 00 00 c6 40 41 00 0f b6 83 01 0b 01 00 eb cd 8b 7c 24
[ 51.284947] RSP: 0018:ffff88810d44b950 EFLAGS: 00010246
[ 51.286296] RAX: ffffc9033ecfe000 RBX: ffff8881091e0000 RCX: ffff8881091f1788
[ 51.287674] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff8317ac90
[ 51.289092] RBP: ffff8881091e5ff8 R08: ffffea0000000000 R09: 0000000000000001
[ 51.290478] R10: 0000000000000001 R11: ffffffff00000000 R12: 0000000000000008
[ 51.291870] R13: 0000000000000080 R14: ffff8881091f5a78 R15: ffff8881091e0000
[ 51.293266] FS: 00007f3242be2380(0000) GS:ffff888fde680000(0000) knlGS:0000000000000000
[ 51.294671] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.296096] CR2: ffffc9033ecfe000 CR3: 000000010ef00000 CR4: 0000000000150ee0
[ 51.297545] Call Trace:
[ 51.298952] ? amdgpu_device_fini_sw+0x140/0x300 [amdgpu]
[ 51.300433] ? amdgpu_driver_release_kms+0xd/0x20 [amdgpu]
[ 51.301899] ? devm_drm_dev_init_release+0x2a/0x60
[ 51.303291] ? devres_release_all+0x9b/0x110
[ 51.304686] ? really_probe+0x11c/0x360
[ 51.306048] ? __driver_probe_device+0xe4/0x150
[ 51.307408] ? driver_probe_device+0x1a/0x190
[ 51.308805] ? __driver_attach.llvm.10157869819964109194+0xf7/0x240
[ 51.310185] ? bus_add_driver+0x20e/0x2a0
[ 51.311560] ? driver_register+0x81/0x120
[ 51.312936] ? 0xffffffffa082d000
[ 51.314254] ? do_one_initcall+0xfb/0x260
[ 51.315578] ? prep_compound_gigantic_page+0x290/0x2e0
[ 51.316945] ? do_init_module+0x1f/0x210
[ 51.318272] ? do_init_module+0x55/0x210
[ 51.319572] ? load_module+0x20c1/0x24f0
[ 51.320878] ? __do_sys_init_module+0x110/0x170
[ 51.322146] ? do_syscall_64+0x70/0xa0
[ 51.323390] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 51.324642] Modules linked in: amdgpu(+) drm_ttm_helper ttm gpu_sched
[ 51.325875] CR2: ffffc9033ecfe000
[ 51.327070] ---[ end trace 80be1ec562f38dfa ]---
[ 51.448866] RIP: 0010:vcn_v3_0_sw_fini.llvm.5082569627238247841+0x60/0xb0 [amdgpu]
[ 51.450247] Code: 0f 1f 84 00 00 00 00 00 66 90 48 ff c2 0f b6 f0 48 81 c1 90 0c 00 00 48 39 f2 73 21 8b b3 9c 24 01 00 0f a3 d6 72 e3 48 8b 01 <c7> 00 00 00 00 00 c6 40 41 00 0f b6 83 01 0b 01 00 eb cd 8b 7c 24
[ 51.452904] RSP: 0018:ffff88810d44b950 EFLAGS: 00010246
[ 51.454236] RAX: ffffc9033ecfe000 RBX: ffff8881091e0000 RCX: ffff8881091f1788
[ 51.455561] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff8317ac90
[ 51.456888] RBP: ffff8881091e5ff8 R08: ffffea0000000000 R09: 0000000000000001
[ 51.458239] R10: 0000000000000001 R11: ffffffff00000000 R12: 0000000000000008
[ 51.459546] R13: 0000000000000080 R14: ffff8881091f5a78 R15: ffff8881091e0000
[ 51.460821] FS: 00007f3242be2380(0000) GS:ffff888fde680000(0000) knlGS:0000000000000000
[ 51.462114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.463338] CR2: ffffc9033ecfe000 CR3: 000000010ef00000 CR4: 0000000000150ee0
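
For anyone reading the oops above: the faulting address in CR2 is the same value as RAX, and the bytes at the `<c7>` marker (`c7 00 00 00 00 00`) decode to a 32-bit store of zero through RAX, which fits the "supervisor write access" line. The snippet below is only a rough sketch for checking that kind of match in a pasted oops; it is plain Python, not part of any kernel tooling, and the function name `registers_matching_cr2` is made up for illustration.

```python
import re


def registers_matching_cr2(oops_text):
    """Return names of general-purpose registers whose value equals CR2.

    Hypothetical helper for reading a pasted x86-64 oops; it only looks at
    "NAME: <16 hex digits>" pairs such as "RAX: ffffc9033ecfe000".
    """
    cr2 = re.search(r"\bCR2: ([0-9a-f]{16})", oops_text)
    if cr2 is None:
        return []
    fault_addr = cr2.group(1)

    # Matches RAX, RBX, ..., R08..R15. The leading \b keeps CR0/CR2/CR3/CR4
    # out, and RSP/RIP are skipped because they are printed with a segment
    # prefix (e.g. "RSP: 0018:...") rather than a bare 16-digit value.
    pairs = re.findall(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})", oops_text)

    matches = []
    for name, value in pairs:
        # The register dump is printed twice in the log, so de-duplicate.
        if value == fault_addr and name not in matches:
            matches.append(name)
    return matches
```

Feeding it the register dump from this report should single out RAX, which suggests the pointer loaded just before the store was already stale or never mapped when `vcn_v3_0_sw_fini` ran after the failed probe.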
The full dmesg: