Possible memory leaks in `amdgpu` driver and/or `drm` kernel subsystem

I've been having the same issue for the past few months. I finally got around to searching for it and ended up here. Here's what I've noticed: Running lsmod | sort -rnk 2 | head shows that amdgpu module is using 12,316,672 bytes of memory after 4 days of uptime.

$ cat /proc/meminfo
#...
Slab:           16684132 kB
SReclaimable:     404464 kB
SUnreclaim:     16279668 kB
#...

SUnreclaim is very high (about 16 GB), yet slabtop only shows 774,375k worth of slab objects, so most of the unreclaimable memory does not show up there.
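The comparison above can be repeated with a small script. This is only a rough sketch: the function names `sunreclaim_kb` and `slabinfo_total_kb` are made up for illustration, and it assumes the usual `/proc/slabinfo` column layout (name, active_objs, num_objs, objsize, ...). It sums num_objs * objsize, which is not exactly how slabtop computes its totals, and it also counts reclaimable caches, so treat the result as an approximate upper bound on what slabinfo accounts for.

```shell
#!/bin/sh
# Read the SUnreclaim value (in kB) from a meminfo-format file.
# Defaults to /proc/meminfo when no file is given.
sunreclaim_kb() {
    awk '/^SUnreclaim:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Sum num_objs * objsize over all caches in a slabinfo-format file, in kB.
# Assumes column 3 is num_objs and column 4 is objsize (bytes);
# the two header lines start with "slabinfo" and "#" and are skipped.
slabinfo_total_kb() {
    awk '/^slabinfo/ || /^#/ { next }
         { bytes += $3 * $4 }
         END { printf "%d\n", bytes / 1024 }' "${1:-/proc/slabinfo}"
}

# Only runs when both files are readable; /proc/slabinfo usually needs root.
if [ -r /proc/meminfo ] && [ -r /proc/slabinfo ]; then
    total=$(sunreclaim_kb)
    accounted=$(slabinfo_total_kb)
    echo "SUnreclaim: ${total} kB, visible in slabinfo: ${accounted} kB"
fi
```

A large difference between the two numbers would match what is described above.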

I have a 12700K, a 7900 XTX, and 2 monitors (LG 32UD59-B 3840x2160@60Hz, Dell U2711 2560x1440@60Hz, both connected through DisplayPort). I am running Arch on the 6.5.9 kernel with Mesa 23.2.1, using Xorg and bspwm.

I have not tried compiling the kernel with kmemleak and I do not get any error messages related to atomics. I am happy to help diagnose this if needed.

> Running lsmod | sort -rnk 2 | head shows that amdgpu module is using 12,316,672 bytes of memory after 4 days of uptime.

Same for me after 4 days, but that's only ~12MB, right?

$ lsmod | sort -rnk 2 | head
amdgpu              12439552  248
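For context, the Size column in lsmod comes from /proc/modules and covers the loaded module image itself (code and static data), not memory the driver allocates at runtime. A ~12 MB figure is therefore not in conflict with a much larger leak. A minimal sketch for converting that column to MiB, where the helper name `module_size_mib` is hypothetical:

```shell
#!/bin/sh
# Print the size of one module in MiB from a /proc/modules-format file.
# $1 = module name, $2 = optional file (defaults to /proc/modules).
# Field 2 of each line is the module size in bytes.
module_size_mib() {
    LC_ALL=C awk -v mod="$1" \
        '$1 == mod { printf "%.1f\n", $2 / 1048576 }' \
        "${2:-/proc/modules}"
}

# Only runs on systems that expose /proc/modules.
if [ -r /proc/modules ]; then
    module_size_mib amdgpu
fi
```

If amdgpu is not loaded, the function prints nothing.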

After 4 days, nearly 50% of my system's 64GB of memory has been eaten up, which is far more than the reported 12MB (and this is after clearing page cache + TTM cache):

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        28Gi        32Gi       699Mi       1.5Gi        32Gi
Swap:             0B          0B          0B

This might be a red herring though, as others have reported the same issue here.

Yeah, that's probably a sway/wlroots issue.

Yeah, the only reason I brought that up is that kmemleak indicates the leak seems to happen during a DRM atomic commit. I'm not sure whether the userspace issue in wlroots is what triggers the kernel memory leak or whether it's completely unrelated.

FWIW I'm also seeing the same thing on 6.5.13 and Navi3x: free memory (as reported by free -h) decreases over time and kswapd0 uses more and more CPU.
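To put numbers on "decreases over time", one option is to log a couple of /proc/meminfo fields at regular intervals. The helper below is a hypothetical sketch (the name `mem_sample` and the choice of fields are assumptions, not something anyone in this thread used):

```shell
#!/bin/sh
# Print "MemAvailable SUnreclaim" (both in kB) from a meminfo-format file.
# Defaults to /proc/meminfo when no file is given.
mem_sample() {
    awk '/^MemAvailable:/ { avail = $2 }
         /^SUnreclaim:/   { unrec = $2 }
         END { print avail, unrec }' "${1:-/proc/meminfo}"
}

# Take a single sample when running on a system with /proc/meminfo.
if [ -r /proc/meminfo ]; then
    echo "$(date +%s) $(mem_sample)"
fi
```

Running it in a loop, for example `while sleep 600; do echo "$(date +%s) $(mem_sample)"; done >> mem.log`, gives a timestamped series that shows whether SUnreclaim keeps growing while MemAvailable shrinks.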

Would it be possible that such a memory leak in the kernel could lead to issues with suspend and resume?

I ask because I have intermittent issues with that (#2626), where amdgpu sometimes crashes due to a page allocation failure, but only when the system has been running for a long time under high memory pressure.

Newer kernels should block suspend while under memory pressure.

mentioned in issue #3058

added TTM label
