WARNING in kernel/workqueue.c:3094 __flush_work.isra.0+0x20e/0x220 when accessing /dev/kfd
Brief summary of the problem:
Any access to /dev/kfd results in a WARNING at kernel/workqueue.c:3094 (__flush_work.isra.0+0x20e/0x220). I first noticed this while testing the OpenCL and ROCm 5.0 stacks, but while collecting information for this report I found that it happens whenever anything tries to access the device.
No graphical interface is in use; the driver is used only by the framebuffer console, and unbinding fbcon still leads to the same issue.
Hardware description:
- CPU: AMD Ryzen 7 3700X
- GPU: 07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7)
- System Memory: 64GB
- Display(s): ASUS VG248
- Type of Display Connection: DVI
System information:
- Distro name and Version: Debian bookworm/sid
- Kernel version: Linux labrador 5.16.0-1-amd64 #1 SMP PREEMPT Debian 5.16.7-2 (2022-02-09) x86_64 GNU/Linux
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
I first experienced it when trying to run clinfo after installing the ROCm 5.0 stack, but something as simple as od -N1 /dev/kfd is sufficient to trigger the issue.
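The same access can be sketched in Python for anyone who wants a scripted reproducer. This is a minimal sketch, not something I ran as part of the original testing. It assumes that opening the device and then letting the process exit is all that matters. The call trace in the log goes through do_exit, which suggests this, but that is an inference. The name touch_device is made up for illustration.

```python
import os


def touch_device(path="/dev/kfd", nbytes=1):
    """Open `path` read-only, attempt to read `nbytes`, then close it.

    Returns a short status string instead of raising, so the caller can
    log what happened and then go look at dmesg.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return f"open failed: {exc.strerror}"
    try:
        try:
            data = os.read(fd, nbytes)
        except OSError as exc:
            # The read itself may be rejected by the driver; the open
            # already happened, which is the access being tested here.
            return f"opened, read failed: {exc.strerror}"
        return f"read {len(data)} byte(s)"
    finally:
        os.close(fd)
```

Calling touch_device() from a short-lived script and then checking dmesg after the interpreter exits should be roughly equivalent to the od invocation above, under the assumption stated.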
Attached files:
Log files (for system lockups / game freezes / crashes)
Here is a sample log from modprobing amdgpu and then running od -N1 /dev/kfd:
[Feb11 15:31] [drm] amdgpu kernel modesetting enabled.
[ +0.000107] amdgpu: Ignoring ACPI CRAT on non-APU system
[ +0.000005] amdgpu: Virtual CRAT table created for CPU
[ +0.000007] amdgpu: Topology: Add CPU node
[ +0.000073] amdgpu 0000:07:00.0: vgaarb: deactivate vga console
[ +0.000104] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE353 0xE7).
[ +0.000005] amdgpu 0000:07:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ +0.000011] [drm] register mmio base: 0xFCF00000
[ +0.000002] [drm] register mmio size: 262144
[ +0.000007] [drm] add ip block number 0 <vi_common>
[ +0.000003] [drm] add ip block number 1 <gmc_v8_0>
[ +0.000003] [drm] add ip block number 2 <tonga_ih>
[ +0.000003] [drm] add ip block number 3 <gfx_v8_0>
[ +0.000002] [drm] add ip block number 4 <sdma_v3_0>
[ +0.000003] [drm] add ip block number 5 <powerplay>
[ +0.000002] [drm] add ip block number 6 <dm>
[ +0.000003] [drm] add ip block number 7 <uvd_v6_0>
[ +0.000003] [drm] add ip block number 8 <vce_v3_0>
[ +0.000016] amdgpu 0000:07:00.0: amdgpu: Fetched VBIOS from VFCT
[ +0.000004] amdgpu: ATOM BIOS: 113-4E353BU-O6B
[ +0.000015] [drm] UVD is enabled in VM mode
[ +0.000003] [drm] UVD ENC is enabled in VM mode
[ +0.000004] [drm] VCE enabled in VM mode
[ +0.000012] amdgpu 0000:07:00.0: amdgpu: PCI CONFIG reset
[ +0.000115] [drm] GPU posting now...
[ +0.109349] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ +0.000037] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_mc.bin
[ +0.000014] amdgpu 0000:07:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[ +0.000008] amdgpu 0000:07:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[ +0.000013] [drm] Detected VRAM RAM=8192M, BAR=256M
[ +0.000005] [drm] RAM width 256bits GDDR5
[ +0.000033] [drm] amdgpu: 8192M of VRAM memory ready
[ +0.000005] [drm] amdgpu: 8192M of GTT memory ready.
[ +0.000007] [drm] GART: num cpu pages 65536, num gpu pages 65536
[ +0.002527] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ +0.000121] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_pfp_2.bin
[ +0.000022] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_me_2.bin
[ +0.000017] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_ce_2.bin
[ +0.000006] [drm] Chained IB support enabled!
[ +0.000023] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_rlc.bin
[ +0.000092] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_mec_2.bin
[ +0.000091] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_mec2_2.bin
[ +0.000918] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_sdma.bin
[ +0.000021] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_sdma1.bin
[ +0.000060] amdgpu: hwmgr_sw_init smu backed is polaris10_smu
[ +0.000131] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_uvd.bin
[ +0.000008] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[ +0.001153] amdgpu 0000:07:00.0: firmware: direct-loading firmware amdgpu/polaris10_vce.bin
[ +0.000008] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[ +0.272622] [drm] Display Core initialized with v3.2.160!
[ +0.000732] snd_hda_intel 0000:07:00.1: bound 0000:07:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ +0.052830] [drm] UVD and UVD ENC initialized successfully.
[ +0.099952] [drm] VCE initialized successfully.
[ +0.001314] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ +0.000142] amdgpu: SRAT table not found
[ +0.000005] amdgpu: Virtual CRAT table created for GPU
[ +0.000111] amdgpu: Topology: Add dGPU node [0x67df:0x1002]
[ +0.000009] kfd kfd: amdgpu: added device 1002:67df
[ +0.000021] amdgpu 0000:07:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 9, active_cu_number 36
[ +0.002687] [drm] fb mappable at 0xE0530000
[ +0.000006] [drm] vram apper at 0xE0000000
[ +0.000004] [drm] size 8294400
[ +0.000003] [drm] fb depth is 24
[ +0.000004] [drm] pitch is 7680
[ +0.000075] fbcon: amdgpudrmfb (fb0) is primary device
[ +0.004998] Console: switching to colour frame buffer device 240x67
[ +0.018943] amdgpu 0000:07:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ +0.022046] amdgpu 0000:07:00.0: amdgpu: Using BACO for runtime pm
[ +0.000641] [drm] Initialized amdgpu 3.44.0 20150101 for 0000:07:00.0 on minor 0
[Feb11 15:32] ------------[ cut here ]------------
[ +0.000037] WARNING: CPU: 14 PID: 51379 at kernel/workqueue.c:3094 __flush_work.isra.0+0x20e/0x220
[ +0.000044] Modules linked in: amdgpu gpu_sched drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit snd_seq_dummy snd_hrtimer snd_seq snd_seq_device xt_CHECKSUM xt_MASQUERADE xt_conntrack xt_tcpudp nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_ondemand cpufreq_powersave ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat nft_counter nf_tables nfnetlink uinput nct6775 hwmon_vid drivetemp nls_ascii nls_cp437 vfat binfmt_misc fat intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio snd_hda_codec_hdmi snd_hda_intel kvm_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm snd_hda_core irqbypass ghash_clmulni_intel aesni_intel eeepc_wmi asus_wmi crypto_simd cryptd snd_hwdep platform_profile battery rapl sparse_keymap snd_pcm evdev rfkill snd_timer video serio_raw efi_pstore ccp wmi_bmof pcspkr snd sp5100_tco k10temp watchdog rng_core soundcore sg button
[ +0.000062] acpi_cpufreq nfsd netconsole loop firewire_sbp2 firewire_core crc_itu_t auth_rpcgss ipmi_devintf ipmi_msghandler nfs_acl msr lockd parport_pc ppdev grace lp parport drm fuse sunrpc configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic zstd_compress efivarfs raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid1 raid6_pq libcrc32c crc32c_generic hid_generic usbhid hid md_mod sd_mod t10_pi crc_t10dif crct10dif_generic ahci xhci_pci libahci xhci_hcd libata r8169 realtek crct10dif_pclmul usbcore scsi_mod crct10dif_common crc32_pclmul mdio_devres crc32c_intel i2c_piix4 libphy scsi_common usb_common wmi [last unloaded: amdgpu]
[ +0.000623] CPU: 14 PID: 51379 Comm: od Tainted: G W 5.16.0-1-amd64 #1 Debian 5.16.7-2
[ +0.000041] Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1404 11/08/2019
[ +0.000039] RIP: 0010:__flush_work.isra.0+0x20e/0x220
[ +0.000025] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 29 ff ff ff 0f 0b e9 52 ff ff ff <0f> 0b 45 31 ed e9 48 ff ff ff e8 93 cb 88 00 0f 1f 00 0f 1f 44 00
[ +0.000075] RSP: 0018:ffffa8074114fc60 EFLAGS: 00010246
[ +0.000025] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ +0.000031] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff91b3b779f728
[ +0.000031] RBP: ffff91b3b779f728 R08: 0000000000000000 R09: 0000000000000000
[ +0.000030] R10: 0000000000000000 R11: 000000000000000e R12: ffff91b3b779f728
[ +0.000031] R13: 0000000000000001 R14: 0000000000000001 R15: ffff91b28bb17bb8
[ +0.000031] FS: 0000000000000000(0000) GS:ffff91c15ed80000(0000) knlGS:0000000000000000
[ +0.000036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000026] CR2: 00007ffece6c2e10 CR3: 000000012ba74000 CR4: 0000000000350ee0
[ +0.000031] Call Trace:
[ +0.000015] <TASK>
[ +0.000012] ? preempt_count_add+0x68/0xa0
[ +0.000023] ? _raw_spin_lock_irq+0x1a/0x40
[ +0.000024] ? wait_for_completion+0x33/0xe0
[ +0.000022] __cancel_work_timer+0x105/0x190
[ +0.000024] kfd_process_notifier_release+0x8b/0x150 [amdgpu]
[ +0.000248] __mmu_notifier_release+0x73/0x200
[ +0.000025] exit_mmap+0x1a9/0x1f0
[ +0.000019] ? netlink_unicast+0x309/0x350
[ +0.000023] ? preempt_count_add+0x68/0xa0
[ +0.000021] ? mutex_lock+0xe/0x30
[ +0.000018] mmput+0x56/0x140
[ +0.000019] do_exit+0x319/0xb30
[ +0.001907] ? vfs_write+0x209/0x2a0
[ +0.001904] do_group_exit+0x33/0xa0
[ +0.001933] __x64_sys_exit_group+0x14/0x20
[ +0.001880] do_syscall_64+0x3b/0xc0
[ +0.001848] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ +0.001816] RIP: 0033:0x7faaaa7626b9
[ +0.001827] Code: Unable to access opcode bytes at RIP 0x7faaaa76268f.
[ +0.001827] RSP: 002b:00007ffece6c57d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ +0.001844] RAX: ffffffffffffffda RBX: 00007faaaa856610 RCX: 00007faaaa7626b9
[ +0.001842] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
[ +0.001824] RBP: 0000000000000001 R08: ffffffffffffff88 R09: 0000000000000001
[ +0.001831] R10: fffffffffffffb85 R11: 0000000000000246 R12: 00007faaaa856610
[ +0.001825] R13: 0000000000000002 R14: 00007faaaa856ae8 R15: 0000000000000000
[ +0.001838] </TASK>
[ +0.001815] ---[ end trace 974f9df1dc68ef3f ]---
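To check whether a given run triggered the warning, the WARNING line can be pulled out of the kernel log text. The following is a minimal sketch that assumes the line format shown in the log above. The function find_warnings and its regular expression are illustrative only, not an existing tool.

```python
import re

# Matches e.g.
#   WARNING: CPU: 14 PID: 51379 at kernel/workqueue.c:3094 __flush_work.isra.0+0x20e/0x220
WARN_RE = re.compile(r"WARNING: CPU: (\d+) PID: (\d+) at (\S+):(\d+) (\S+)")


def find_warnings(log_text):
    """Return one dict per kernel WARNING line found in `log_text`."""
    hits = []
    for line in log_text.splitlines():
        m = WARN_RE.search(line)
        if m:
            cpu, pid, src, lineno, symbol = m.groups()
            hits.append({
                "cpu": int(cpu),
                "pid": int(pid),
                "file": src,
                "line": int(lineno),
                "symbol": symbol,
            })
    return hits
```

Passing it the captured output of dmesg after accessing /dev/kfd should yield an entry whose file is kernel/workqueue.c if the problem reproduces.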