BUG: KASAN: slab-use-after-free in amdgpu_amdkfd_gpuvm_acquire_process_vm+0x10fe
slab-use-after-free happens when I launch BlackmagicRAWPlayer.
To reproduce it:
- Install rocm-opencl
dnf install rocm-opencl
- Download and install DaVinci Resolve https://www.blackmagicdesign.com/products/davinciresolve
Downloads/DaVinci_Resolve_18.6.4_Linux.run
- Launch BlackmagicRAWPlayer
/opt/resolve/BlackmagicRAWPlayer/BlackmagicRAWPlayer
Demonstration:
Screencast_from_2023-12-17_16-07-23
Kernel log: dmesg.zip
distro: Fedora Rawhide
kernel: 6.7.0-0.rc5 3f7168591ebf
mesa: 24.0.0 ddf2ca4faffd
GPU: 7900XTX
CPU: 7950x
- Mikhail Gavrilov changed the description
- Alex Deucher added 7000 dGPU series amdkfd labels
The backtraces before the slab-use-after-free are similar to those reported by kmemleak in #3094, and the one in dmesg-libreoffice log in #2991 (closed), and the one in dmesg.txt in #3097 (closed).
[ 11.926843] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306
[ 11.926859] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 655, name: (udev-worker)
[ 11.926874] preempt_count: 2, expected: 0
[ 11.926882] RCU nest depth: 0, expected: 0
[ 11.926890] 1 lock held by (udev-worker)/655:
[ 11.926898] #0: ffff888112d301b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x1da/0x4a0
[ 11.926923] Preemption disabled at:
[ 11.926924] [<ffffffffc153b151>] dc_fpu_begin+0x31/0x180 [amdgpu]
[ 11.927266] CPU: 13 PID: 655 Comm: (udev-worker) Tainted: G L ------- --- 6.7.0-0.rc5.20231215git3f7168591ebf.45.fc40.x86_64+debug #1
[ 11.927288] Hardware name: Micro-Star International Co., Ltd. MS-7D73/MPG B650I EDGE WIFI (MS-7D73), BIOS 1.60 11/07/2023
[ 11.927306] Call Trace:
[ 11.927312] <TASK>
[ 11.927317] dump_stack_lvl+0xb1/0xd0
[ 11.927328] __might_resched+0x3d9/0x600
[ 11.927338] ? local_clock_noinstr+0xd/0xc0
[ 11.927348] ? __pfx___might_resched+0x10/0x10
[ 11.927360] __kmem_cache_alloc_node+0x36a/0x390
[ 11.927371] ? __pfx___up_read+0x10/0x10
[ 11.927380] ? dcn32_clock_source_create+0x53/0x120 [amdgpu]
[ 11.927651] kmalloc_trace+0x2a/0xc0
[ 11.927660] dcn32_clock_source_create+0x53/0x120 [amdgpu]
[ 11.927915] dcn32_create_resource_pool+0x2df0/0xd830 [amdgpu]
[ 11.928168] ? __pfx_dcn32_create_resource_pool+0x10/0x10 [amdgpu]
[ 11.928420] ? rcu_is_watching+0x15/0xb0
[ 11.928430] ? __kmalloc+0xe4/0x160
[ 11.928440] dc_create_resource_pool+0x41b/0x7f0 [amdgpu]
[ 11.928702] dc_create+0x614/0x1b40 [amdgpu]
[ 11.928953] ? __pfx_dc_create+0x10/0x10 [amdgpu]
[ 11.929195] ? dmi_matches+0xa8/0x220
[ 11.929205] ? kasan_set_track+0x25/0x30
[ 11.929216] amdgpu_dm_init.isra.0+0x6ce/0x5f90 [amdgpu]
[ 11.929489] ? local_clock_noinstr+0xd/0xc0
[ 11.929501] ? dev_printk_emit+0xf9/0x140
[ 11.929511] ? __pfx_amdgpu_dm_init.isra.0+0x10/0x10 [amdgpu]
[ 11.929765] ? amdgpu_device_wreg.part.0+0x2b7/0x350 [amdgpu]
[ 11.929998] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 11.930009] ? lockdep_hardirqs_on+0x81/0x110
[ 11.930029] ? smu_hw_init+0x522/0x830 [amdgpu]
[ 11.930292] ? __pfx_smu_hw_init+0x10/0x10 [amdgpu]
[ 11.930540] dm_hw_init+0x12/0x30 [amdgpu]
[ 11.930803] amdgpu_device_init+0x5afd/0x8870 [amdgpu]
[ 11.931044] ? __pfx_amdgpu_device_init+0x10/0x10 [amdgpu]
[ 11.931276] ? __pfx_pci_bus_read_config_word+0x10/0x10
[ 11.931291] ? do_pci_enable_device+0x22d/0x2a0
[ 11.931303] ? pci_wait_for_pending+0xd1/0x110
[ 11.931316] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[ 11.931547] amdgpu_pci_probe+0x2d5/0xc00 [amdgpu]
[ 11.931776] ? _raw_spin_unlock_irqrestore+0x4f/0x80
[ 11.931788] ? __pfx_amdgpu_pci_probe+0x10/0x10 [amdgpu]
[ 11.932016] local_pci_probe+0xda/0x190
[ 11.932026] pci_device_probe+0x23a/0x780
[ 11.932034] ? kernfs_add_one+0x326/0x490
[ 11.932044] ? kernfs_get.part.0+0x4c/0x70
[ 11.932053] ? __pfx_pci_device_probe+0x10/0x10
[ 11.932064] ? kernfs_create_link+0x16b/0x230
[ 11.932074] ? kernfs_put+0x1c/0x40
[ 11.932082] ? sysfs_do_create_link_sd+0x8e/0x100
[ 11.932095] really_probe+0x3df/0xb80
[ 11.932105] __driver_probe_device+0x18c/0x450
[ 11.932116] driver_probe_device+0x4a/0x120
[ 11.932125] __driver_attach+0x1e5/0x4a0
[ 11.932134] ? __pfx___driver_attach+0x10/0x10
[ 11.932143] bus_for_each_dev+0x106/0x190
[ 11.932153] ? __pfx_bus_for_each_dev+0x10/0x10
[ 11.932167] bus_add_driver+0x2a1/0x570
[ 11.932178] driver_register+0x134/0x460
[ 11.932187] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[ 11.932419] do_one_initcall+0xd3/0x430
[ 11.932429] ? __pfx_do_one_initcall+0x10/0x10
[ 11.932442] ? kasan_unpoison+0x44/0x70
[ 11.932453] do_init_module+0x238/0x770
[ 11.932466] load_module+0x5581/0x6f10
[ 11.932486] ? __pfx_load_module+0x10/0x10
[ 11.932499] ? find_held_lock+0x34/0x120
[ 11.932509] ? local_clock_noinstr+0xd/0xc0
[ 11.932523] ? __might_fault+0xc6/0x180
[ 11.932532] ? __pfx___might_resched+0x10/0x10
[ 11.932544] ? __do_sys_init_module+0x1f2/0x220
[ 11.932553] __do_sys_init_module+0x1f2/0x220
[ 11.932563] ? __pfx___do_sys_init_module+0x10/0x10
[ 11.932582] do_syscall_64+0x61/0xe0
[ 11.932590] ? do_syscall_64+0x70/0xe0
[ 11.932602] ? asm_exc_page_fault+0x26/0x30
[ 11.932612] ? lockdep_hardirqs_on+0x81/0x110
[ 11.932621] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 11.932631] RIP: 0033:0x7efe0b03919e
[ 11.932643] Code: 48 8b 0d 7d 8c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 8c 0c 00 f7 d8 64 89 01 48
[ 11.932672] RSP: 002b:00007ffdd042cce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 11.932686] RAX: ffffffffffffffda RBX: 0000564656496e20 RCX: 00007efe0b03919e
[ 11.932699] RDX: 00005646564930e0 RSI: 00000000041fc65e RDI: 00007efe04b02010
[ 11.932711] RBP: 00007ffdd042cda0 R08: 000056465648f010 R09: 0000000000000007
[ 11.932723] R10: 0000000000000006 R11: 0000000000000246 R12: 00005646564930e0
[ 11.932735] R13: 0000000000020000 R14: 00005646564ce830 R15: 00005646564c15a0
[ 11.932753] </TASK>
[ 180.401188] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306
[ 180.401246] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 10658, name: KMS thread
[ 180.401251] preempt_count: 2, expected: 0
[ 180.401255] RCU nest depth: 0, expected: 0
[ 180.401259] 3 locks held by KMS thread/10658:
[ 180.401263] #0: ffffc9000d20fab8 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_mode_atomic_ioctl+0x322/0x22f0
[ 180.401283] #1: ffff8881366005a0 (crtc_ww_class_mutex){+.+.}-{3:3}, at: modeset_lock+0xe7/0x550
[ 180.401299] #2: ffff88813664cb90 (&adev->dm.dc_lock){+.+.}-{3:3}, at: amdgpu_dm_atomic_commit_tail+0x48c9/0xfb00 [amdgpu]
[ 180.401742] Preemption disabled at:
[ 180.401744] [<ffffffffc153b151>] dc_fpu_begin+0x31/0x180 [amdgpu]
[ 180.402160] CPU: 23 PID: 10658 Comm: KMS thread Tainted: G W OEL ------- --- 6.7.0-0.rc5.20231215git3f7168591ebf.45.fc40.x86_64+debug #1
[ 180.402166] Hardware name: Micro-Star International Co., Ltd. MS-7D73/MPG B650I EDGE WIFI (MS-7D73), BIOS 1.60 11/07/2023
[ 180.402170] Call Trace:
[ 180.402174] <TASK>
[ 180.402178] dump_stack_lvl+0xb1/0xd0
[ 180.402186] __might_resched+0x3d9/0x600
[ 180.402195] ? __pfx___might_resched+0x10/0x10
[ 180.402206] __kmem_cache_alloc_node+0x36a/0x390
[ 180.402213] ? dc_create_stream_for_sink+0x58/0xf40 [amdgpu]
[ 180.402620] kmalloc_trace+0x2a/0xc0
[ 180.402628] dc_create_stream_for_sink+0x58/0xf40 [amdgpu]
[ 180.403029] dcn32_add_phantom_pipes+0x91/0xc90 [amdgpu]
[ 180.403443] dcn32_internal_validate_bw+0x3ab1/0x6d70 [amdgpu]
[ 180.403853] ? amdgpu_dm_atomic_commit_tail+0x4dfa/0xfb00 [amdgpu]
[ 180.404266] ? srso_alias_untrain_ret+0x1/0x10
[ 180.404291] ? __pfx_dcn32_internal_validate_bw+0x10/0x10 [amdgpu]
[ 180.404700] ? kernel_fpu_begin_mask+0xff/0x200
[ 180.404710] ? kasan_set_track+0x25/0x30
[ 180.404717] ? rcu_is_watching+0x15/0xb0
[ 180.404727] dml1_validate+0x21b/0x9a0 [amdgpu]
[ 180.405132] ? __pfx_dml1_validate+0x10/0x10 [amdgpu]
[ 180.405543] ? dc_resource_state_copy_construct+0x4b7/0x760 [amdgpu]
[ 180.405946] dc_update_planes_and_stream+0x11d8/0x31b0 [amdgpu]
[ 180.406352] ? __pfx_dc_update_planes_and_stream+0x10/0x10 [amdgpu]
[ 180.406746] ? __flush_workqueue+0x40e/0x12e0
[ 180.406759] ? find_held_lock+0x34/0x120
[ 180.406775] ? __pfx___might_resched+0x10/0x10
[ 180.406784] ? rcu_is_watching+0x15/0xb0
[ 180.406790] ? __mutex_lock+0x536/0x18b0
[ 180.406814] ? mark_held_locks+0x96/0xe0
[ 180.406828] amdgpu_dm_atomic_commit_tail+0x4dfa/0xfb00 [amdgpu]
[ 180.407237] ? unwind_get_return_address+0x5e/0xa0
[ 180.407246] ? __pfx_mark_lock+0x10/0x10
[ 180.407279] ? __pfx_amdgpu_dm_atomic_commit_tail+0x10/0x10 [amdgpu]
[ 180.407683] ? lock_acquire+0x1a6/0x4f0
[ 180.407688] ? find_held_lock+0x34/0x120
[ 180.407726] ? __pfx_lock_release+0x10/0x10
[ 180.407738] ? drm_crtc_commit_wait+0x32/0x160
[ 180.407743] ? drm_atomic_helper_wait_for_dependencies+0x48a/0x7c0
[ 180.407757] commit_tail+0x1ad/0x310
[ 180.407766] drm_atomic_helper_commit+0x229/0x2a0
[ 180.407773] ? __pfx_drm_atomic_helper_commit+0x10/0x10
[ 180.407777] drm_atomic_commit+0x1d4/0x2a0
[ 180.407784] ? __pfx_drm_atomic_commit+0x10/0x10
[ 180.407789] ? __pfx___drm_printfn_info+0x10/0x10
[ 180.407796] ? _raw_spin_unlock_irqrestore+0x4f/0x80
[ 180.407802] ? drm_event_reserve_init+0x1c3/0x240
[ 180.407811] drm_mode_atomic_ioctl+0x161e/0x22f0
[ 180.407833] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[ 180.407838] ? lock_acquire+0x1a6/0x4f0
[ 180.407843] ? find_held_lock+0x34/0x120
[ 180.407866] ? do_raw_spin_unlock+0x58/0x1f0
[ 180.407874] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[ 180.407880] drm_ioctl_kernel+0x202/0x3e0
[ 180.407888] ? __pfx_drm_ioctl_kernel+0x10/0x10
[ 180.407893] ? __might_fault+0xc6/0x180
[ 180.407906] drm_ioctl+0x4ce/0xab0
[ 180.407916] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[ 180.407923] ? __pfx_drm_ioctl+0x10/0x10
[ 180.407942] ? _raw_spin_unlock_irqrestore+0x66/0x80
[ 180.407947] ? lockdep_hardirqs_on+0x81/0x110
[ 180.407953] ? _raw_spin_unlock_irqrestore+0x4f/0x80
[ 180.407962] amdgpu_drm_ioctl+0xd8/0x1c0 [amdgpu]
[ 180.408317] __x64_sys_ioctl+0x134/0x1b0
[ 180.408326] do_syscall_64+0x61/0xe0
[ 180.408331] ? audit_reset_context+0x8c5/0xee0
[ 180.408344] ? do_syscall_64+0x70/0xe0
[ 180.408348] ? lockdep_hardirqs_on+0x81/0x110
[ 180.408354] ? do_syscall_64+0x70/0xe0
[ 180.408362] ? asm_exc_page_fault+0x26/0x30
[ 180.408367] ? lockdep_hardirqs_on+0x81/0x110
[ 180.408372] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 180.408377] RIP: 0033:0x7ff20d1279ed
[ 180.408397] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 180.408401] RSP: 002b:00007ff1f25fd9c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 180.408407] RAX: ffffffffffffffda RBX: 00007ff1e0028fb0 RCX: 00007ff20d1279ed
[ 180.408411] RDX: 00007ff1f25fda60 RSI: 00000000c03864bc RDI: 000000000000000c
[ 180.408414] RBP: 00007ff1f25fda10 R08: 0000000000000180 R09: 0000000000000001
[ 180.408418] R10: 000000000000000e R11: 0000000000000246 R12: 00007ff1f25fda60
[ 180.408421] R13: 00000000c03864bc R14: 000000000000000c R15: 00007ff1e0002450
[ 180.408438] </TASK>
[ 237.248544] amdgpu 0000:03:00.0: amdgpu: bo 00000000f64ad033 va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000840
[ 237.248597] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
[ 237.248600] amdgpu: Failed to map bo to gpuvm
[ 237.302481] ==================================================================
[ 237.302488] BUG: KASAN: slab-use-after-free in amdgpu_amdkfd_gpuvm_acquire_process_vm+0x10fe/0x1120 [amdgpu]
[ 237.302730] Read of size 8 at addr ffff888159a688e8 by task BlackmagicRAWPl/12009
These "sleeping function called from invalid context" errors don't look right. Sebastian Andrzej Siewior recently fixed a similar problem in amdgpu - https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg97401.html - the commit explains:
dcn21_validate_bandwidth_fp() is invoked while FPU access has been enabled. FPU access requires disabling preemption even on PREEMPT_RT. It is not possible to allocate memory with disabled preemption even with GFP_ATOMIC on PREEMPT_RT.
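The rule in that commit message can be written down as a small decision function. This is only a userspace sketch: `model_gfp`, `model_ctx` and `model_alloc_is_safe` are hypothetical names invented for illustration, not kernel API, and the model only captures the three cases discussed here (preemptible context, preemption disabled on a non-RT kernel, preemption disabled on PREEMPT_RT).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GFP_KERNEL / GFP_ATOMIC. */
enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_ATOMIC };

/* Hypothetical description of the calling context. */
struct model_ctx {
	bool preempt_disabled;	/* e.g. inside DC_FP_START() ... DC_FP_END() */
	bool preempt_rt;	/* kernel built with PREEMPT_RT */
};

/* Returns true if an allocation with these flags is allowed in this context. */
static bool model_alloc_is_safe(enum model_gfp flags, struct model_ctx ctx)
{
	if (!ctx.preempt_disabled)
		return true;		/* preemptible: the allocator may sleep */
	if (ctx.preempt_rt)
		return false;		/* per the quoted commit: not even GFP_ATOMIC */
	return flags == MODEL_GFP_ATOMIC;	/* non-RT atomic context */
}
```

Under this model, a GFP_KERNEL allocation inside the FP region is never safe, and switching to GFP_ATOMIC only helps on non-RT kernels.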
Looking at the amdgpu code, it appears this kind of error is all over the place (?), e.g.
dcn32_resource_construct
- DC_FP_START();
- dcn32_clock_source_create
- dcn32_*_create (like dcn32_opp_create, dcn32_timing_generator_create, ...)
- DC_FP_END();
The dcn32_*_create functions are calling kzalloc with arg GFP_KERNEL:
dcn32_clock_source_create
- struct dce110_clk_src *clk_src = kzalloc(sizeof(struct dce110_clk_src), GFP_KERNEL);
dml1_validate
- DC_FP_START
- dcn32_internal_validate_bw
- calls functions that end up calling kref_put/kfree e.g. dc_plane_state_release
- DC_FP_END
"Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for a page when called in low-memory situations. A function that allocates memory using GFP_KERNEL must, therefore, be reentrant and cannot be running in atomic context."
This might explain why #3094 is related to high memory pressure, as kmalloc will be more likely to sleep under high memory pressure.
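To make the shape of the problem concrete, here is a userspace sketch of the pattern. Everything prefixed `model_` is a hypothetical stand-in that I made up (a counter instead of the real preempt count, `calloc` instead of `kzalloc`); it is not amdgpu code. It compares allocating inside the FP region with one possible restructuring that allocates before entering it.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins for kernel state; none of this is real kernel API. */
static int model_preempt_count;		/* > 0 means atomic: sleeping is forbidden */
static int model_invalid_sleeps;	/* sleeping allocations made while atomic */

enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_ATOMIC };

/* Stand-ins for DC_FP_START() / DC_FP_END(). */
static void model_fp_start(void) { model_preempt_count++; }
static void model_fp_end(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Mimics the might_sleep() check done by a GFP_KERNEL allocation. */
static void *model_kzalloc(size_t size, enum model_gfp flags)
{
	if (flags == MODEL_GFP_KERNEL && model_preempt_count > 0)
		model_invalid_sleeps++;	/* the kernel would print the BUG splat here */
	return calloc(1, size);
}

struct model_clk_src { int id; };

/* Shape of the reported problem: allocation inside the FP region. */
static struct model_clk_src *construct_alloc_inside(void)
{
	struct model_clk_src *clk;

	model_fp_start();
	clk = model_kzalloc(sizeof(*clk), MODEL_GFP_KERNEL);
	model_fp_end();
	return clk;
}

/* One possible restructuring: allocate first, keep the FP region for math only. */
static struct model_clk_src *construct_alloc_outside(void)
{
	struct model_clk_src *clk = model_kzalloc(sizeof(*clk), MODEL_GFP_KERNEL);

	model_fp_start();
	/* floating-point-only work would go here */
	model_fp_end();
	return clk;
}
```

In this model only `construct_alloc_inside()` bumps `model_invalid_sleeps`. Whether the real dcn32 code can be split this way is something I have not checked.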
Edited by Chris Bainbridge
- Chris Bainbridge mentioned in issue #2832
Hi @siqueira @ckoenig , regarding these functions:
void dc_fpu_begin(const char *function_name, const int line)
{
	int depth;

	WARN_ON_ONCE(!in_task());
	preempt_disable();
	depth = __this_cpu_inc_return(fpu_recursion_depth);

	if (depth == 1) {
#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
		kernel_fpu_begin();
#elif defined(CONFIG_PPC64)
		if (cpu_has_feature(CPU_FTR_VSX_COMP))
			enable_kernel_vsx();
		else if (cpu_has_feature(CPU_FTR_ALTIVEC_COMP))
			enable_kernel_altivec();
		else if (!cpu_has_feature(CPU_FTR_FPU_UNAVAILABLE))
			enable_kernel_fp();
#elif defined(CONFIG_ARM64)
		kernel_neon_begin();
#endif
	}

	TRACE_DCN_FPU(true, function_name, line, depth);
}

void dc_fpu_end(const char *function_name, const int line)
{
	int depth;

	depth = __this_cpu_dec_return(fpu_recursion_depth);
	if (depth == 0) {
#if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
		kernel_fpu_end();
#elif defined(CONFIG_PPC64)
		if (cpu_has_feature(CPU_FTR_VSX_COMP))
			disable_kernel_vsx();
		else if (cpu_has_feature(CPU_FTR_ALTIVEC_COMP))
			disable_kernel_altivec();
		else if (!cpu_has_feature(CPU_FTR_FPU_UNAVAILABLE))
			disable_kernel_fp();
#elif defined(CONFIG_ARM64)
		kernel_neon_end();
#endif
	} else {
		WARN_ON_ONCE(depth < 0);
	}

	TRACE_DCN_FPU(false, function_name, line, depth);
	preempt_enable();
}
Why are these functions directly calling preempt_enable() and preempt_disable()? Why is this required as turning preemption off/on is already done by kernel_fpu_begin() and kernel_fpu_end()?
When DC_FP_START is called in a nested way (dc_fpu_begin, dc_fpu_begin, dc_fpu_end, dc_fpu_end), the first call to dc_fpu_end is going to call preempt_enable(). Will preemption then be wrongly enabled even though the FPU is still in use?
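For the nesting question it helps to trace the counters by hand. The sketch below is a userspace model and rests on one assumption I am confident of: preempt_disable()/preempt_enable() increment and decrement a per-task count, and the task is preemptible only when that count is zero. The `model_` names are hypothetical, and the model leaves out whatever kernel_fpu_begin() does to the count on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins; none of this is real kernel API. */
static int model_preempt_count;	/* models the preempt count */
static int model_fpu_depth;	/* models fpu_recursion_depth */
static bool model_fpu_in_use;	/* true between kernel_fpu_begin()/end() */

static void model_preempt_disable(void) { model_preempt_count++; }
static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}
static bool model_preemptible(void) { return model_preempt_count == 0; }

/* Same ordering as dc_fpu_begin(): disable preemption, then bump the depth. */
static void model_dc_fpu_begin(void)
{
	model_preempt_disable();
	if (++model_fpu_depth == 1)
		model_fpu_in_use = true;	/* stands in for kernel_fpu_begin() */
}

/* Same ordering as dc_fpu_end(): drop the depth, then enable preemption. */
static void model_dc_fpu_end(void)
{
	if (--model_fpu_depth == 0)
		model_fpu_in_use = false;	/* stands in for kernel_fpu_end() */
	model_preempt_enable();
}
```

In this model the count goes 1, 2, 1, 0 across begin, begin, end, end, so the task only becomes preemptible again after the outermost dc_fpu_end. That suggests the extra preempt_disable() is redundant rather than unbalanced, but it is only a model.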
Also, many of the functions called while the FPU is in use call kzalloc(sizeof(*x), GFP_KERNEL), which can sleep and schedule another task. kzalloc calls should use GFP_ATOMIC when the FPU is in use. I got as far as verifying that the allocations in the following files are made after DC_FP_START; this is not a complete list:
drivers/gpu/drm/amd/display/dc/dce/dce_audio.c
drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dccg.c
drivers/gpu/drm/amd/display/dc/irq/dcn30/irq_service_dcn30.c
drivers/gpu/drm/amd/display/dc/resource/dcn30/dcn30_resource.c
drivers/gpu/drm/amd/display/dc/resource/dcn31/dcn31_resource.c
drivers/gpu/drm/amd/display/dc/resource/dcn314/dcn314_resource.c
This is not just a theoretical problem, as you can see from the backtrace at #3058 (comment 2258160).