AMDGPU overheating and triggering shutdown on Kernel 5.15.15
Brief summary of the problem:
Hello! I hope I am posting this in the correct place. After upgrading to Kernel 5.15.15 (or possibly to the latest Mesa drivers, 21.3.6), I have encountered an issue where, under load, my GPU reaches its emergency junction temperature (115C), causing a system shutdown without warning.
Feb 11 23:02:23 pop-os kernel: amdgpu 0000:07:00.0: amdgpu: ERROR: GPU over temperature range(SW CTF) detected!
Feb 11 23:02:23 pop-os kernel: amdgpu 0000:07:00.0: amdgpu: ERROR: System is going to shutdown due to GPU SW CTF!
Now, normally, I believe the system should throttle the GPU before it reaches this temperature, but I could be mistaken. In any case, this issue appears to be new, as running the system under identical circumstances with the old kernel (5.13) and Mesa driver (unfortunately, I do not know the previous version number) never resulted in a system shutdown, nor does the GPU cause a shutdown when running under Windows. But since the recent round of system updates, I have experienced this issue three times in a span of 24 hours.
Hardware description:
- CPU: AMD Ryzen 7 5800X
- GPU 1: AMD Radeon RX 6800 XT
- GPU 2 (Disabled on Linux): Nvidia GeForce RTX 3060 Ti
- System Memory: 32 GB DDR4
- Display(s): 2 x 1920x1080@144Hz
- Type of Display Connection: 2 x DP
System information:
- Distro name and Version: Pop!_OS 21.10 x86_64
- Kernel version: 5.15.15-76051515-generic #202201160435~1642693824~21.10~97db1bb
- Mesa Driver Version: 21.3.6
How to reproduce the issue:
- Start system.
- Launch a GPU-intensive OpenGL application (in this particular case, X-Plane 11).
- Run the application for an indefinite period, and monitor the output of `watch -n 2 sensors`.
- Eventually, the GPU junction temperature reaches the `emerg` temperature of 115C, and the system triggers a shutdown.
- The time until this happens appears to vary: under load, the GPU junction temperature sits between 109C and 112C and appears to spike.
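
To get a timestamped record of how close the junction temperature gets to the limit before a shutdown, the same readings that `sensors` displays can be logged directly from sysfs. The following is only a minimal sketch, not something taken from the original report. It assumes the amdgpu driver exposes a hwmon directory whose `name` file reads `amdgpu`, with `temp*_label`, `temp*_input` and `temp*_emergency` files holding millidegrees Celsius. The function names are made up for illustration.

```python
import time
from pathlib import Path


def read_celsius(path):
    """Read a hwmon temperature file (millidegrees Celsius) as degrees Celsius."""
    return int(Path(path).read_text().strip()) / 1000.0


def find_amdgpu_hwmon(root="/sys/class/hwmon"):
    """Return the first hwmon directory whose 'name' file reads 'amdgpu', or None."""
    for candidate in sorted(Path(root).glob("hwmon*")):
        name_file = candidate / "name"
        if name_file.is_file() and name_file.read_text().strip() == "amdgpu":
            return candidate
    return None


def find_junction_prefix(hwmon_dir):
    """Return the sensor prefix (e.g. 'temp2') whose label is 'junction', or None."""
    for label in sorted(Path(hwmon_dir).glob("temp*_label")):
        if label.read_text().strip() == "junction":
            return label.name[: -len("_label")]
    return None


def junction_margin(hwmon_dir):
    """Return (junction temperature, emergency limit, remaining margin) in Celsius."""
    hwmon_dir = Path(hwmon_dir)
    prefix = find_junction_prefix(hwmon_dir)
    if prefix is None:
        raise FileNotFoundError("no sensor labelled 'junction' found")
    temp = read_celsius(hwmon_dir / f"{prefix}_input")
    limit = read_celsius(hwmon_dir / f"{prefix}_emergency")
    return temp, limit, limit - temp


def monitor(hwmon_dir, samples, interval=2.0):
    """Print one timestamped reading every `interval` seconds, `samples` times."""
    for _ in range(samples):
        temp, limit, margin = junction_margin(hwmon_dir)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} junction={temp:.1f}C emerg={limit:.1f}C margin={margin:.1f}C",
              flush=True)
        time.sleep(interval)
```

Calling `monitor(find_amdgpu_hwmon(), samples=1800)` and redirecting the output to a file would cover roughly an hour at the same 2-second interval used with `watch` above. The last lines written before a shutdown would then show how quickly the spike happened.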
Attached files:
N/A
Screenshots/video files
N/A
Log files (for system lockups / game freezes / crashes)
- Systemd/journald log (overheat shown at line 256)
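
For locating the overheat entries in a saved journal without scrolling to a specific line, a small filter can help. This is a rough sketch that assumes the kernel messages look like the two `SW CTF` lines quoted in the summary. The function name is hypothetical.

```python
def find_ctf_events(lines):
    """Return (line number, text) for every amdgpu line that mentions CTF."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "amdgpu" in line and "CTF" in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```

The input can be any iterable of log lines, for example an open file containing the output of `journalctl -k -b -1` (kernel messages from the previous boot).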