Diagnosing issues with Radeon VII
Submitted by L.S.S.
Assigned to Default DRI bug account
I'm diagnosing issues with a Radeon VII that I may have damaged while trying to improve its cooling. Before that, the card had no major issues; it just ran too hot while mining (around 80-90 °C even with the fan maxed out via Radeon Profile). The temperature and the fan noise were both beyond acceptable, which is why I wanted to improve the cooling in the first place.
The GPU now automatically drops to some kind of "safe clock" of 700/350 (as observed in Radeon Profile) under heavy load such as mining (using the ROCm backend on Manjaro/Arch), and it cannot return to normal clocks on its own. I can force the default clocks back using Radeon Profile, but if the card is still under load, the screen immediately becomes corrupted and a few seconds later the system hard resets. The GPU is then not detected on subsequent boots (the display gets routed to the BMC on the motherboard instead of the video card) until I power cycle the machine (manually or via IPMI).
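For watching the drop to "safe clock" without Radeon Profile, the amdgpu driver exposes the shader clock levels in sysfs, with the active level marked by an asterisk. Below is a minimal sketch of a parser for that text. The function name `parse_dpm_levels` is hypothetical, and the path in the usage note assumes the card is `card0`, which may differ on a system with a BMC.

```python
import re

# One level per line, e.g. "1: 700Mhz *" (the asterisk marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Parse pp_dpm_sclk-style text.

    Returns (levels, active): levels maps level index -> clock in MHz,
    active is the index of the starred level, or None if no line is starred.
    """
    levels = {}
    active = None
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        index = int(match.group(1))
        levels[index] = int(match.group(2))
        if match.group(3):
            active = index
    return levels, active
```

A possible use is to read `/sys/class/drm/card0/device/pp_dpm_sclk` (assumed path) once per second with `pathlib.Path(...).read_text()`, pass the text to `parse_dpm_levels`, and log a timestamp whenever the active level changes, so the moment of the clock drop can be matched against the kernel log.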
After some failed attempts to mod the stock cooler (during which the symptoms began), I eventually replaced it altogether with an Alphacool Eiswolf for this card. The thermals have improved greatly (the card can still run Unigine Heaven at full clocks for a short while without issues, at an acceptable 60 °C), but the "safe clock" problem while mining has not gone away.
After some testing I was able to get a usable under-load GPU clock of 1150 MHz with Radeon Profile (the card runs at around 40 °C under load), but the condition has only gotten worse: now I can only hold a stable clock of around 1000 MHz without entering "safe clock" too quickly. The "safe clock" can still kick in when I'm doing something else while mining, but as long as the clocks stay at these reduced values, I do not get system lockups/resets when I force the clock back (by reapplying).
I don't have any detailed logs yet, as I haven't switched on debug parameters for amdgpu, but I recently captured one occurrence where the log ended with "ring timeout" and "GPU reset begin" before the system hard reset.
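To pull those two messages out of a saved kernel log, a small filter like the following could help. This is only a sketch: the patterns are built from the two phrases quoted above, and the exact wording the driver prints (for example a ring name between "ring" and "timeout") is an assumption. The name `find_gpu_events` is hypothetical.

```python
import re

# Patterns based on the phrases seen in the captured log; the exact driver
# wording may differ, so both are matched loosely and case-insensitively.
PATTERNS = {
    "ring_timeout": re.compile(r"ring\b.*\btimeout", re.IGNORECASE),
    "gpu_reset": re.compile(r"GPU reset begin", re.IGNORECASE),
}


def find_gpu_events(log_text):
    """Return a list of (line_number, event_name, line) for matching log lines."""
    events = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((lineno, name, line.strip()))
    return events
```

The input can be any text saved from `dmesg` or `journalctl -k`. Since the system hard resets, the messages from the failing boot are only available if the journal is stored persistently or the log was captured remotely before the reset.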
I don't know where to start the investigation: what causes the "safe clock" to trigger and, in case the card really is damaged, which CUs are faulty and need to be disabled (I just found out that CUs can be disabled using boot parameters). I'm also not sure which debug parameters would give me the information I need.
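For reference, the boot parameter in question is amdgpu's `disable_cu`, which, as far as I understand the module parameter description, takes a comma-separated list of `se.sh.cu` triples (shader engine, shader array, compute unit). A minimal sketch of building that string while testing different CUs is below. The helper name `build_disable_cu` and the indices in the usage note are hypothetical; which triples correspond to a faulty CU on this card is exactly what is still unknown.

```python
def build_disable_cu(cus):
    """Build an 'amdgpu.disable_cu=' kernel command line fragment.

    cus is an iterable of (se, sh, cu) tuples of non-negative integers.
    The se.sh.cu format is taken from the amdgpu module parameter
    description; valid index ranges depend on the GPU and are not checked here.
    """
    parts = []
    for se, sh, cu in cus:
        for value in (se, sh, cu):
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"invalid CU index: {value!r}")
        parts.append(f"{se}.{sh}.{cu}")
    if not parts:
        raise ValueError("at least one (se, sh, cu) triple is required")
    return "amdgpu.disable_cu=" + ",".join(parts)
```

For example, `build_disable_cu([(0, 0, 3), (2, 0, 7)])` produces a fragment that can be appended to the kernel command line in the bootloader configuration. Disabling one group of CUs per boot and re-running the mining load would be one way to narrow down a bad unit, assuming the fault is localized to CUs at all.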
The PSU currently installed in the system is an EVGA Supernova 750 P2 (750W 80+ Platinum), and both power connectors on the video card are connected. The power supply should be sufficient and shouldn't be the problem.
All in all, the experience with this card has raised a lot of questions that I had previously neglected, especially regarding cooling: which kind of thermal compound/pads to use, where and how to apply them, and so on. Personally, cooling was never this hard to get right, even with some of the very power-hungry CPUs I currently have.