drm_atomic_private_obj_fini list corruption on driver init since 5.19 with RX 5700 XT
On an RX 5700 XT on ppc64le, ever since kernel 5.19, I'm getting a kernel list corruption BUG during
amdgpu driver init when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=y.
Full backtrace:
[ 3.121785] list_del corruption, c000000003a953f0->next is NULL
[ 3.121820] kernel BUG at lib/list_debug.c:49!
[ 3.121835] Oops: Exception in kernel mode, sig: 5 [#1]
[ 3.121849] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 3.121875] Modules linked in: zfs(PO) amdgpu(+) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) sd_mod gpu_sched drm_buddy i2c_algo_bit drm_display_helper cec rc_core drm_ttm_helper ttm drm_kms_helper xhci_pci ahci xhci_pci_renesas libahci syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_hcd libata drm usbcore vmx_crypto gf128mul scsi_mod drm_panel_orientation_quirks usb_common scsi_common agpgart dm_mirror dm_region_hash dm_log dm_mod btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_vpmsum
[ 3.122094] CPU: 0 PID: 229 Comm: kworker/0:3 Tainted: P O 6.0.8_1 #1
[ 3.122134] Workqueue: events work_for_cpu_fn
[ 3.122162] NIP: c0000000007d9f3c LR: c0000000007d9f38 CTR: c00000000091f060
[ 3.122201] REGS: c00000000aa9f360 TRAP: 0700 Tainted: P O (6.0.8_1)
[ 3.122248] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002242 XER: 00000000
[ 3.122287] CFAR: c0000000001fa15c IRQMASK: 0
[ 3.122287] GPR00: c0000000007d9f38 c00000000aa9f600 c000000001969100 0000000000000033
[ 3.122287] GPR04: 00000000ffffbfff c00000000aa9f410 c00000000aa9f408 0000000000000000
[ 3.122287] GPR08: c00000000166e0c0 0000000000000001 0000000000000000 0000000000000001
[ 3.122287] GPR12: 0000000000002000 c000000001b50000 c000000003a90000 c000000003a80000
[ 3.122287] GPR16: c000000003a86068 c000000003a98330 c000000003a86088 c000000003a86090
[ 3.122287] GPR20: c000000003a86080 c008000002a73acc 0000000000000100 0000000000000001
[ 3.122287] GPR24: 0000000000000001 c008000002a33c40 c000000003a90000 c008000002a7437c
[ 3.122287] GPR28: c008000002a33bd0 0000000000000000 c000000003a96458 c000000003a953f0
[ 3.122537] NIP [c0000000007d9f3c] __list_del_entry_valid+0x9c/0x150
[ 3.122565] LR [c0000000007d9f38] __list_del_entry_valid+0x98/0x150
[ 3.122601] Call Trace:
[ 3.122620] [c00000000aa9f600] [c0000000007d9f38] __list_del_entry_valid+0x98/0x150 (unreliable)
[ 3.122675] [c00000000aa9f660] [c0080000015a1ec0] drm_atomic_private_obj_fini+0x28/0xd0 [drm]
[ 3.122725] [c00000000aa9f690] [c0080000027e58cc] amdgpu_dm_fini+0x94/0x200 [amdgpu]
[ 3.122996] [c00000000aa9f6d0] [c0080000027f2248] amdgpu_dm_init.isra.0+0x4c0/0x1e50 [amdgpu]
[ 3.123266] [c00000000aa9f930] [c0080000027f3c00] dm_hw_init+0x28/0x60 [amdgpu]
[ 3.123533] [c00000000aa9f960] [c0080000024a57f4] amdgpu_device_init+0x1c4c/0x2300 [amdgpu]
[ 3.123744] [c00000000aa9fac0] [c0080000024a7748] amdgpu_driver_load_kms+0x30/0x1e0 [amdgpu]
[ 3.123965] [c00000000aa9fb40] [c00800000249be68] amdgpu_pci_probe+0x1f0/0x540 [amdgpu]
[ 3.124171] [c00000000aa9fbe0] [c0000000008d4db8] local_pci_probe+0x68/0x110
[ 3.124223] [c00000000aa9fc60] [c00000000017f4e8] work_for_cpu_fn+0x38/0x60
[ 3.124261] [c00000000aa9fc90] [c000000000184e14] process_one_work+0x2a4/0x570
[ 3.124310] [c00000000aa9fd30] [c000000000185960] worker_thread+0x280/0x5b0
[ 3.124347] [c00000000aa9fdc0] [c0000000001919a0] kthread+0x120/0x130
[ 3.124397] [c00000000aa9fe10] [c00000000000cecc] ret_from_kernel_thread+0x5c/0x64
[ 3.124437] Instruction dump:
[ 3.124457] 7c252040 40820030 38600001 38210060 4e800020 7c0802a6 7c641b78 3c62ff73
[ 3.124497] 38636ac8 f8010070 4ba201e1 60000000 <0fe00000> 7c0802a6 3c62ff73 7d264b78
[ 3.124544] ---[ end trace 0000000000000000 ]---
I have not been able to narrow it down so far. With CONFIG_BUG_ON_DATA_CORRUPTION disabled
the problem does not seem to be hit, but the report still indicates a bug in the driver.
On 5.18 and earlier kernels, this problem does not happen.
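For reference, the BUG comes from the list debug check that runs before an entry is unlinked. The trace shows amdgpu_dm_fini being called from inside amdgpu_dm_init, which would fit drm_atomic_private_obj_fini running on an object that was never added to a list. That is only a guess from the trace, not something confirmed here. Below is a minimal userspace C sketch of that kind of check. The struct and helper names are simplified stand-ins written for illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head; head must already be initialised. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    assert(head->next != NULL && head->prev != NULL);
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Illustrative model of the validity check done before unlinking.
 * Returns false where a debug kernel would report list corruption.
 */
static bool list_del_entry_valid(const struct list_head *entry)
{
    if (entry->next == NULL) {
        fprintf(stderr, "list_del corruption, %p->next is NULL\n",
                (const void *)entry);
        return false;
    }
    if (entry->prev == NULL) {
        fprintf(stderr, "list_del corruption, %p->prev is NULL\n",
                (const void *)entry);
        return false;
    }
    /* Both neighbours must point back at the entry. */
    return entry->prev->next == entry && entry->next->prev == entry;
}

/* Unlink entry only if the check passes; report whether it was removed. */
static bool list_del_checked(struct list_head *entry)
{
    if (!list_del_entry_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    /* This sketch clears the pointers; the kernel writes poison values. */
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

In this sketch, calling list_del_checked() on a zero-initialised entry that was never passed to list_add() takes the first branch and prints a "->next is NULL" message like the one at the top of the backtrace, while deleting an entry that was properly added succeeds.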