Critical Thermal Fault (CTF) automatic shutdown triggers too easily
Brief summary of the problem:
As of kernel 5.8, amdgpu includes a Critical Thermal Fault (CTF) automatic shutdown function. I have a Vega 64 GPU that, when running intensive 3D applications or benchmarks, quickly reaches 85C on the "edge" temperature sensor, which has a "crit" value of 85C.
Reaching this 85C temperature triggers the new CTF automatic shutdown process, which is sudden and unexpected. I have also observed the same shutdown when the junction sensor hits its crit value (105C) while restricting airflow as a test. I assume that if ANY of the sensors reaches its listed crit temperature, the automatic software shutdown is triggered.
Running the card in identical thermal conditions with an equivalent workload under Microsoft Windows 10 with the current Adrenalin 2020 Edition 20.8.2 shows that the card reaches the same temperatures, 85C (or 105C for junction). Aside from occasionally overshooting, it simply throttles and bounces off the temperature limit continuously. There is no software-initiated shutdown, graphics artifacting, hardware shutdown, or any other issue, even after hours of continuous use.
The new software shutdown is appreciated for protecting the hardware, but it seems to trigger too easily in its current state. The way the Windows driver manages the thermal limits seems to me to be the expected behavior.
Either the implementation of the new CTF automatic shutdown function is using incorrect values ("crit" instead of "emerg"), or the trigger needs "softening" so that it doesn't initiate a shutdown until the temperature has lingered above the threshold for too long.
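To illustrate the "softening" idea, here is a minimal sketch of a time-based trigger that only requests a shutdown once the temperature has stayed at or above the limit for a set dwell time. This is not how amdgpu implements CTF handling; the class name, the 85C limit, and the 5 second dwell are all hypothetical values chosen for illustration.

```python
class DebouncedThermalTrip:
    """Hypothetical trigger: trip only after a sustained over-limit excursion."""

    def __init__(self, limit_c, dwell_s):
        self.limit_c = limit_c    # temperature limit in degrees C (assumed)
        self.dwell_s = dwell_s    # how long the limit must be held (assumed)
        self._over_since = None   # timestamp when the current excursion began

    def update(self, temp_c, now_s):
        """Feed one sample; return True when a shutdown should be requested."""
        if temp_c < self.limit_c:
            # Temperature dropped back below the limit: reset the timer.
            self._over_since = None
            return False
        if self._over_since is None:
            self._over_since = now_s
        return (now_s - self._over_since) >= self.dwell_s
```

With a trigger like this, a card that throttles and bounces off the limit would keep resetting the timer, while a card that stays above the limit would still be shut down.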
I do not know whether this affects any other graphics chips, as I have none to test with.
Hardware description:
- CPU: Ryzen 7 1800x
- GPU: Radeon Vega 64
- System Memory: 16GB
- Display(s): Single 3840x2160
- Type of Display Connection: DP
System information:
- Distro name and Version: Arch Linux
- Kernel version: 5.8.1 (5.8.1-arch1-1)
- AMD package version: No package
How to reproduce the issue:
- Use Kernel 5.8 and Radeon Vega 64 GPU (unknown if other GPUs are affected as well)
- Run any intensive 3D application or other program that utilizes GPU resources.
- Monitor graphics chip temperatures.
- Observe sudden and immediate trigger of Critical Thermal Fault shutdown process as any sensor (edge, junction, memory) hits the temperature indicated as "crit".
- Review log, find record of CTF-triggered shutdown.
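For the monitoring step, a small script along these lines can be used instead of the sensors tool. It is only a sketch: it assumes the amdgpu hwmon directory exposes temp*_label, temp*_input, temp*_crit and temp*_emergency files with values in millidegrees Celsius, and the function names are my own.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_dir):
    """Return {label: {"input": C, "crit": C, "emergency": C}} for one hwmon dir."""
    hwmon = Path(hwmon_dir)
    readings = {}
    for label_file in sorted(hwmon.glob("temp*_label")):
        prefix = label_file.name[: -len("_label")]  # e.g. "temp1"
        label = label_file.read_text().strip()      # e.g. "edge"
        entry = {}
        for field in ("input", "crit", "emergency"):
            path = hwmon / f"{prefix}_{field}"
            if path.exists():
                # hwmon reports temperatures in millidegrees Celsius.
                entry[field] = int(path.read_text().strip()) / 1000.0
        readings[label] = entry
    return readings


def sensors_at_crit(readings):
    """List the sensor labels whose current reading is at or above crit."""
    return sorted(
        label
        for label, entry in readings.items()
        if "input" in entry and "crit" in entry and entry["input"] >= entry["crit"]
    )
```

The directory to pass in is typically something like /sys/class/drm/card0/device/hwmon/hwmonN, where the card index and N vary per system. Polling read_hwmon_temps in a loop and printing sensors_at_crit shows which sensor reached its crit value just before the shutdown.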
Attached files:
- Dmesg log (from journalctl) dmesg.txt
- Xorg log Xorg.0.log
- Sample GPU sensors output sensors.txt