Kernel BUG: occasional invalid opcode in amdgpu_vm_fini starting from 6.5.6
Brief summary of the problem:
This issue was originally reported in #3007 (closed)
The kernel usually crashes at an arbitrary place due to corruption of kernel data structures. The invalid opcode in amdgpu_vm_fini is the most frequent symptom.
I narrowed down the faulty kernel version:
6.5.5 works fine for days
https://koji.fedoraproject.org/koji/buildinfo?buildID=2294111 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdkfd?id=2309983b0ac063045af3b01b0251dfd118d45449
6.5.6 crashes within a couple of hours
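To narrow this down further than the release level, one option is to list the commits that touch the AMD GPU driver between the two versions. The following is only a sketch: it assumes a local clone of the upstream stable kernel tree that carries the v6.5.5 and v6.5.6 tags, the restriction to drivers/gpu/drm/amd is my own choice, and the helper names are hypothetical. Fedora kernels may carry extra patches, so the list is only a starting point.

```python
import subprocess


def amd_commits_between(good_tag, bad_tag, path="drivers/gpu/drm/amd"):
    """Build the git command listing commits that touch `path` in good_tag..bad_tag."""
    return ["git", "log", "--oneline", f"{good_tag}..{bad_tag}", "--", path]


def list_commits(repo_dir, good_tag="v6.5.5", bad_tag="v6.5.6"):
    """Run the command in a local kernel clone (repo_dir is an assumed path)."""
    cmd = amd_commits_between(good_tag, bad_tag)
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling list_commits with the path of the clone returns one line per candidate commit; each of them could then be reverted or bisected individually.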
Hardware description:
inxi -C -m -G
Memory:
System RAM: total: 32 GiB available: 31.26 GiB used: 4.11 GiB (13.1%)
Message: For most reliable report, use superuser + dmidecode.
Array-1: capacity: 128 GiB slots: 8 modules: 4 EC: Multi-bit ECC
Device-1: DIMM1 type: DDR4 size: 8 GiB speed: 2133 MT/s
Device-2: DIMM5 type: no module installed
Device-3: DIMM3 type: DDR4 size: 8 GiB speed: 2133 MT/s
Device-4: DIMM7 type: no module installed
Device-5: DIMM2 type: DDR4 size: 8 GiB speed: 2133 MT/s
Device-6: DIMM6 type: no module installed
Device-7: DIMM4 type: DDR4 size: 8 GiB speed: 2133 MT/s
Device-8: DIMM8 type: no module installed
CPU:
Info: 16-core model: Intel Xeon E5-2698 v3 bits: 64 type: MT MCP cache: L2: 4 MiB
Speed (MHz): avg: 2295 min/max: 1200/3600 cores: 1: 2295 2: 2295 3: 2295 4: 2295 5: 2295
6: 2295 7: 2295 8: 2295 9: 2295 10: 2295 11: 2295 12: 2295 13: 2295 14: 2295 15: 2295 16: 2295
17: 2295 18: 2295 19: 2295 20: 2295 21: 2295 22: 2295 23: 2295 24: 2295 25: 2295 26: 2295
27: 2295 28: 2295 29: 2295 30: 2295 31: 2295 32: 2295
Graphics:
Device-1: NVIDIA GK107GL [Quadro K2000] driver: N/A
Device-2: AMD Vega 20 [Radeon VII] driver: amdgpu v: kernel
Display: server: X.org v: 1.20.14 with: Xwayland v: 23.2.6 driver: X:
loaded: amdgpu,radeon,vesa unloaded: fbdev,modesetting gpu: amdgpu tty: 254x61
API: OpenGL Message: GL data unavailable in console. Try -G --display
API: EGL Message: EGL data unavailable in console, eglinfo missing.
System information:
- Fedora 39/40
- Kernel version: 6.5.6, 6.8.7, 6.8.10
Here is the latest kernel that I tried out:
Linux localhost 6.8.10-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 17 21:20:54 UTC 2024 x86_64 GNU/Linux
How to reproduce the issue:
Run an OpenCL job (e.g. from https://einsteinathome.org)
Attached files:
Log files (for system lockups / game freezes / crashes)
dmesg BUG report with 6.8.10
[ 8337.166430] ------------[ cut here ]------------
[ 8337.166437] kernel BUG at mm/slub.c:553!
[ 8337.166450] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 8337.166456] CPU: 22 PID: 817 Comm: kworker/22:2 Tainted: P S OE 6.8.10-300.fc40.x86_64 #1
[ 8337.166461] Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A32 09/24/2019
[ 8337.166464] Workqueue: events delayed_fput
[ 8337.166475] RIP: 0010:__slab_free+0x152/0x310
[ 8337.166483] Code: 00 4c 89 ff e8 9f e2 d2 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff ff ff <0f> 0b 66 41 f7 44 24 09 0d 21 75 b3 eb a9 66 41 f7 44 24 09 0d 21
[ 8337.166487] RSP: 0000:ffffaac640967bf0 EFLAGS: 00010246
[ 8337.166492] RAX: ffff9582881cf280 RBX: fffff20d04207380 RCX: 000000008020001f
[ 8337.166496] RDX: ffff9582881cf200 RSI: fffff20d04207380 RDI: ffffaac640967c60
[ 8337.166498] RBP: ffffaac640967c90 R08: 0000000000000001 R09: ffffffffc079a5f9
[ 8337.166501] R10: ffff9589af922018 R11: 0000000000000007 R12: ffff958280046a00
[ 8337.166504] R13: ffff9582881cf200 R14: ffffffffc079a5f9 R15: ffff958445b60000
[ 8337.166507] FS: 0000000000000000(0000) GS:ffff9589af900000(0000) knlGS:0000000000000000
[ 8337.166510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8337.166514] CR2: 00007f3e885c1000 CR3: 0000000111876001 CR4: 00000000001706f0
[ 8337.166518] Call Trace:
[ 8337.166522] <TASK>
[ 8337.166526] ? die+0x36/0x90
[ 8337.166532] ? do_trap+0xda/0x100
[ 8337.166539] ? __slab_free+0x152/0x310
[ 8337.166544] ? do_error_trap+0x6a/0x90
[ 8337.166549] ? __slab_free+0x152/0x310
[ 8337.166553] ? exc_invalid_op+0x50/0x70
[ 8337.166558] ? __slab_free+0x152/0x310
[ 8337.166563] ? asm_exc_invalid_op+0x1a/0x20
[ 8337.166571] ? amdgpu_vm_fini+0x49/0x500 [amdgpu]
[ 8337.167234] ? amdgpu_vm_fini+0x49/0x500 [amdgpu]
[ 8337.167807] ? __slab_free+0x152/0x310
[ 8337.167815] ? __cancel_work_timer+0x103/0x1a0
[ 8337.167822] ? amdgpu_vm_fini+0x49/0x500 [amdgpu]
[ 8337.168399] kfree+0x2c6/0x2f0
[ 8337.168406] amdgpu_vm_fini+0x49/0x500 [amdgpu]
[ 8337.168978] amdgpu_driver_postclose_kms+0x199/0x290 [amdgpu]
[ 8337.169536] drm_file_free+0x21c/0x270
[ 8337.169543] drm_release+0x64/0xc0
[ 8337.169548] __fput+0x9a/0x2c0
[ 8337.169557] delayed_fput+0x23/0x30
[ 8337.169561] process_one_work+0x172/0x330
[ 8337.169567] worker_thread+0x273/0x3c0
[ 8337.169573] ? __pfx_worker_thread+0x10/0x10
[ 8337.169578] kthread+0xe8/0x120
[ 8337.169582] ? __pfx_kthread+0x10/0x10
[ 8337.169586] ret_from_fork+0x34/0x50
[ 8337.169593] ? __pfx_kthread+0x10/0x10
[ 8337.169597] ret_from_fork_asm+0x1b/0x30
[ 8337.169604] </TASK>
[ 8337.169607] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nvidia(POE) nf_tables sunrpc intel_rapl_msr intel_rapl_common sb_edac vfat fat x86_pkg_temp_thermal intel_powerclamp pktcdvd coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel kvm snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq irqbypass mei_wdt snd_seq_device dell_smbios rapl snd_pcm iTCO_wdt dcdbas mei_me intel_pmc_bxt dell_smm_hwmon iTCO_vendor_support intel_cstate dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd_timer e1000e intel_uncore tg3 pcspkr mei snd i2c_i801 i2c_smbus lpc_ich soundcore loop nfnetlink zram amdgpu video amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_display_helper crct10dif_pclmul crc32_pclmul
[ 8337.169710] crc32c_intel polyval_clmulni polyval_generic mxm_wmi ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 cec ata_generic pata_acpi wmi fuse
[ 8337.169751] ---[ end trace 0000000000000000 ]---
[ 8337.169756] RIP: 0010:__slab_free+0x152/0x310
[ 8337.169762] Code: 00 4c 89 ff e8 9f e2 d2 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff ff ff <0f> 0b 66 41 f7 44 24 09 0d 21 75 b3 eb a9 66 41 f7 44 24 09 0d 21
[ 8337.169766] RSP: 0000:ffffaac640967bf0 EFLAGS: 00010246
[ 8337.169771] RAX: ffff9582881cf280 RBX: fffff20d04207380 RCX: 000000008020001f
[ 8337.169775] RDX: ffff9582881cf200 RSI: fffff20d04207380 RDI: ffffaac640967c60
[ 8337.169778] RBP: ffffaac640967c90 R08: 0000000000000001 R09: ffffffffc079a5f9
[ 8337.169781] R10: ffff9589af922018 R11: 0000000000000007 R12: ffff958280046a00
[ 8337.169784] R13: ffff9582881cf200 R14: ffffffffc079a5f9 R15: ffff958445b60000
[ 8337.169788] FS: 0000000000000000(0000) GS:ffff9589af900000(0000) knlGS:0000000000000000
[ 8337.169792] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8337.169795] CR2: 00007f3e885c1000 CR3: 0000000111876001 CR4: 00000000001706f0
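Entries prefixed with "?" in the call trace are addresses that the unwinder found on the stack but could not confirm as part of the call chain. To make the confirmed chain easier to read, here is a small sketch that keeps only the frames without the "?" prefix. The regular expression and the function name are my own and only cover the line format seen in this log.

```python
import re

# Matches a frame such as "[ 8337.168406] amdgpu_vm_fini+0x49/0x500 [amdgpu]".
# The optional "? " prefix marks an unconfirmed stack entry.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def reliable_frames(dmesg_text):
    """Return (symbol, module) pairs for confirmed frames between <TASK> and </TASK>."""
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if "</TASK>" in line:
            break
        if "<TASK>" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line.strip())
        if match and not match.group("guess"):
            frames.append((match.group("sym"), match.group("mod")))
    return frames
```

Applied to the log above, the confirmed chain starts at kfree called from amdgpu_vm_fini, which is reached through amdgpu_driver_postclose_kms, drm_file_free, drm_release and __fput from the delayed_fput work item.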