Possible memory leaks in `amdgpu` driver and/or `drm` kernel subsystem
Brief summary of the problem:
For the past few months I have noticed my system's memory usage continually climbing over the course of a few hours/days to unreasonable amounts. The memory does not get freed, even when I close all applications, log out, and shut down the Wayland compositor (leaving nothing but a TTY to log in to).
None of my monitoring programs show any process taking up the used memory either, so I've essentially ruled out any userspace process.
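One rough way to double-check that the missing memory is not attributed to userspace or the page cache is to add up the usual `/proc/meminfo` categories and see what is left over. The sketch below is only an approximation: the helper name `meminfo_unaccounted` is made up for illustration, and the choice of counters is an assumption rather than an exact accounting of every kernel allocation.

```shell
# Hypothetical helper: sums the common /proc/meminfo categories and prints
# the remainder. Takes an optional path so a saved snapshot can be analysed.
meminfo_unaccounted() {
    awk '
        { key = $1; sub(/:$/, "", key); kb[key] = $2 }   # values are in kB
        END {
            accounted = kb["MemFree"] + kb["Buffers"] + kb["Cached"]
            accounted += kb["AnonPages"] + kb["Slab"]
            accounted += kb["PageTables"] + kb["KernelStack"]
            unaccounted = kb["MemTotal"] - accounted
            printf "total_kB=%d accounted_kB=%d unaccounted_kB=%d\n", kb["MemTotal"], accounted, unaccounted
        }
    ' "${1:-/proc/meminfo}"
}
```

Calling `meminfo_unaccounted` with no argument reads the live `/proc/meminfo`. A remainder that keeps growing over hours would be consistent with allocations made inside the kernel, though it does not say which subsystem made them.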
Running `sync` followed by `sysctl vm.drop_caches=3` several times in a row does not reduce the used memory, so I'm ruling out the page cache (which `free` reports as a negligible amount anyway).
I have not ruled out the TTM cache, based on information I found in this kernel report. But running `cat /sys/kernel/debug/dri/0/amdgpu_evict_vram`, `cat /sys/kernel/debug/dri/0/amdgpu_evict_gtt`, and the "horrible incantation" of `for i in {1..1000}; do cat /sys/kernel/debug/ttm/page_pool_shrink; done` does not seem to affect memory usage either. If anyone has a better method of checking this, please let me know and I'll give it a shot.
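To put numbers on "does not seem to affect memory usage", the commands above can be wrapped in a small before/after measurement. This is a minimal sketch: `mem_available_kb` and `measure_effect` are hypothetical helper names, and using `MemAvailable` as the metric is an assumption (any other `/proc/meminfo` field could be substituted).

```shell
# Path is overridable so the helpers can be tried on a saved snapshot.
MEMINFO="${MEMINFO:-/proc/meminfo}"

# Print the MemAvailable value (in kB) from the meminfo file.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2 }' "$MEMINFO"
}

# Run the given command and report MemAvailable before and after it.
measure_effect() {
    before=$(mem_available_kb)
    "$@" >/dev/null 2>&1
    after=$(mem_available_kb)
    echo "before_kB=$before after_kB=$after freed_kB=$((after - before))"
}
```

Run as root, for example `measure_effect sh -c 'sync; sysctl vm.drop_caches=3'` or `measure_effect cat /sys/kernel/debug/dri/0/amdgpu_evict_vram`. A `freed_kB` value near zero for every command matches what I am seeing.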
In an attempt to debug, I have compiled a kernel with `kmemleak` enabled and have found a very large number of reported memory leaks. These might be false positives, but they all seem to be coming from the `amdgpu` driver and `drm` subsystem. The primary one seems to be located in `dcn32_*` functions called from `amdgpu_dm_atomic_commit_tail`. Here's an example:
unreferenced object 0xffff8881699f0000 (size 24624):
comm "kworker/u66:1", pid 198, jiffies 4294694718 (age 25021.159s)
hex dump (first 32 bytes):
01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff9a1040b5>] __kmalloc_large_node+0xc5/0x150
[<ffffffff9a1048d6>] __kmalloc_node+0xc6/0x150
[<ffffffff9a0f6373>] kvmalloc_node+0x43/0xd0
[<ffffffffc09be00d>] dc_create_transfer_func+0x1d/0x30 [amdgpu]
[<ffffffffc09bee23>] dc_create_stream_for_sink+0x233/0x2a0 [amdgpu]
[<ffffffffc091d85e>] dcn32_add_phantom_pipes+0x4e/0x470 [amdgpu]
[<ffffffffc0826238>] dcn32_internal_validate_bw+0x1288/0x1ca0 [amdgpu]
[<ffffffffc0826f24>] dcn32_calculate_wm_and_dlg_fpu+0x144/0x14d0 [amdgpu]
[<ffffffffc091df05>] dcn32_calculate_wm_and_dlg+0x45/0x60 [amdgpu]
[<ffffffffc092a7a5>] dml1_validate+0x135/0x350 [amdgpu]
[<ffffffffc09b2718>] dc_update_planes_and_stream+0x7e8/0x1230 [amdgpu]
[<ffffffffc073d0ec>] amdgpu_dm_atomic_commit_tail+0x198c/0x3a90 [amdgpu]
[<ffffffffc0326ca4>] commit_tail+0x94/0x130 [drm_kms_helper]
[<ffffffff99ecc006>] process_one_work+0x176/0x340
[<ffffffff99ecc45b>] worker_thread+0x27b/0x3a0
[<ffffffff99ed5ff7>] kthread+0xd7/0x100
I will attach a log of a handful of these messages further down.
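With this many reports, it can help to group them by the first `[amdgpu]` frame in each backtrace and add up the reported sizes. The sketch below assumes the report layout matches the example above (an `unreferenced object ... (size N):` header followed by `function+0xOFF/0xLEN [amdgpu]` frames); `kmemleak_summary` is a hypothetical helper name, and reports without any `[amdgpu]` frame are skipped.

```shell
# Hypothetical helper: prints "function count total_bytes" per innermost
# [amdgpu] frame, largest total first. Defaults to the live kmemleak file.
kmemleak_summary() {
    awk '
        /unreferenced object/ {
            size = $NF; sub(/\):$/, "", size)   # "24624):" -> "24624"
            seen = 0
        }
        /\[amdgpu\]/ && !seen {
            fn = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /\+0x/) { fn = $i; break }
            sub(/\+.*/, "", fn)                 # drop "+0x1d/0x30"
            if (fn != "") { count[fn]++; bytes[fn] += size; seen = 1 }
        }
        END { for (fn in count) printf "%s %d %d\n", fn, count[fn], bytes[fn] }
    ' "${1:-/sys/kernel/debug/kmemleak}" | sort -k3,3nr
}
```

As root, `echo scan > /sys/kernel/debug/kmemleak` triggers a fresh scan, after which `kmemleak_summary` reads the current report. Passing a saved log file as the first argument works as well.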
Hardware description:
- CPU: AMD Ryzen 7 3800X 8-Core Processor
- GPU: `0d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)`
- System Memory: 64GB DDR4
- Display(s):
  - QX2710LED, 2560x1440, DP->DVI-D convertor
  - Alienware AW3423DWF, 3440x1440, DP
System information:
- Distro name and Version: I am using NixOS, and I'm running on the `nixpkgs-unstable` channel.
- Kernel version: 6.7.0-rc7 (I have had this issue on 6.6 and 6.5 as well, not sure before that)
- Custom kernel: N/A
- AMD official driver version:
  - Mainline `amdgpu` driver
  - Mesa 23.1.9
How to reproduce the issue:
Unfortunately I'm not sure how to reproduce other than having the same hardware/software setup indicated above.