[RX 5600XT] Random GPU hang while playing games, GPU reset
While playing any kind of game, there is a chance that my GPU just hangs and tries to restart itself.
The result is a frozen image, and the only thing I can do is restart the whole PC from a TTY.
This problem is not new; I have had it for more than a year now.
Just play some games (Cyberpunk, Stellaris, Star Trek Online, World War Z). Often the hang occurs quite soon, but sometimes it occurs after more than an hour of normal gaming.
I ran all the usual stability tests and the system is rock stable. I also tested the card under Windows just to be sure. Everything is stable.
Today, I was able to reproduce this problem on a second PC, also with Manjaro KDE and also with an RX 5600XT (not the same card!). So this problem is not related to one particular VGA card or PC hardware. So I'm quite sure it is software related. This PC with an older RX570 card had no problem at all.
The type of game is not relevant. Also, I got this problem once during browsing with Firefox.
There were some kernel and firmware package combinations that did not have this hang... Maybe around kernel 5.15, but I'm not sure.
Attached files:
I have a bunch of dmesg outputs... See n9 for the latest one.
I can confirm having this problem too. I wasn't sure if it was related to my desktop manager or the GPU, but I'm going with the safe bet that it's the GPU. However, I did not have this problem until I recently upgraded from 5.19.11 to 6.0.2 due to #2189 (closed).
What might be interesting to note is we both use KDE. It is possible it is a KDE 5.26 bug, but since you've been having this for a while I suspect that's not the case
Well, this was fast. I got a crash after 1 minute of gameplay. Then the GPU reset actually triggered a complete PC restart (yes, this was usual for me at the time of this old kernel).
Now I'm moving to kernel 6.1RC. That may work better, or not. Update: gaming does not work on 6.1RC at all, so it is 6.0.2 then. I have no better idea.
I don't know how to test it. At the moment, I can install kernel 6.1 RC04 from Manjaro, but that's the latest kernel in our repos.
Afaik there is a kernel source git repo for the latest graphics patches, but I'm not sure. It's a bit hard to find info about these things.
If you are not comfortable compiling the kernel on your own, better wait until the patches go downstream. That may take time though. Unfortunately the kernel maintainers have a very clumsy development cycle.
I am not familiar with Manjaro (or Arch) but there are various results on the Arch wiki and Manjaro forums on how to compile your custom kernel. On Fedora which I use, all I need is: first clone the kernel repo, check out the latest RC then apply any patches I need, then:
# Configure the kernel according to the currently running distro kernel
make olddefconfig
# Compile everything on 16 threads
make -j16
# Install all modules and the bootloader
sudo make modules_install install
I am not sure if the same will work on Manjaro or not, please seek advice from the Manjaro community.
Ok, I was able to compile the 6.0.7 kernel, with 2 files modified according to this commit: agd5f/linux@8a1a7d74
Afaik that commit is the patch for the "timed out fences" problem. It made no difference, I got the same problem running gputest: problem_tamaskernel_2.txt
Unfortunately this is not really a useful list of steps to reproduce. I used to play some games on the same GPU (Navi 10 - RX 5700XT) which didn't exhibit this problem. From what you described thus far, this could either be a kernel issue (related to power management or whatever else), or a userspace driver issue with the specific game you are playing.
It seems that on one occasion the hang occurred on the GFX ring (graphics queue) and on another occasion on the compute queue. Judging by how unstable the machine is, I recommend to first try to rule out power management related issues.
Games:
Cyberpunk, Stellaris, Star Trek Online, World War Z.
"Judging by how unstable the machine is, I recommend to first try to rule out power management related issues." - what do you recommend? I did lengthy stress tests under both Linux and Windows, but the machine(s) remained rock solid. I even tested the situation where both CPU and GPU were loaded to maximum.
Okay, so on Linux you ran Unigine Superposition and on Windows you ran the same and various others, right? Making sure things are stable on Windows is useful because it rules out any hardware defect. Was there anything else you tried to run on Linux? And how long did you run Unigine Superposition for?
Yes, that tool. I just let gputest 0.7.0 (with the default tricolour triangle) run while I used Firefox to read the news. Why do you think it was a kernel crash? I see the same / similar issue there:
[ 7528.559420] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=21355393, emitted seq=21355395
[ 7528.559655] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1016 thread Xorg:cs0 pid 1220
[ 7528.559842] amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
[ 7532.559894] amdgpu 0000:08:00.0: amdgpu: failed to suspend display audio
[ 7532.882661] amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 7532.882777] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 7533.119190] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 7533.134841] [drm] free PSP TMR buffer
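When comparing many dmesg captures, the two lines that matter most in a log like the one above are the ring timeout and the process information. A minimal sketch of a filter that pulls out just those two facts follows. The function name summarize_ring_timeouts is made up for illustration, and the patterns assume the message wording shown in this thread, with one kernel message per line.

```shell
#!/bin/sh
# Sketch only: summarize amdgpu ring timeouts from kernel log text on stdin.
# Assumes the message wording seen in this thread; other kernel versions
# may phrase these lines differently.

summarize_ring_timeouts() {
    # First pattern: which ring timed out (ignores "ring ... test failed").
    # Second pattern: which process and pid the kernel blamed.
    sed -n \
        -e 's/.*\*ERROR\* ring \([^ ]*\) timeout.*/ring=\1/p' \
        -e 's/.*Process information: process \([^ ]*\) pid \([0-9]*\).*/process=\1 pid=\2/p'
}

# Usage (reading the kernel log usually needs root):
#   sudo dmesg | summarize_ring_timeouts
```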
Meanwhile, I bought the paid version of Unigine Superposition, and with it I was able to run its stress test for more than an hour. I got no crash so far.
@Venemo Yeah, but this is quite random. Right now gputest has been running fine for about half an hour...
Sometimes I can play for hours and everything is OK, but other times it just crashes whenever it wants.
Okay. I do recommend to try to rule out power management related problems. Disable any overclocking you may have on your system and set the GPU to one of the fixed-clock profiles. Then see if that improves it or not. That is unless @agd5f has a better suggestion.
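For reference, a minimal sketch of how a fixed performance level can be selected through the amdgpu sysfs interface. The helper name set_perf_level is made up for illustration, and the default path assumes the GPU is card0, which may not be true on every system. Writing to the real sysfs file requires root.

```shell
#!/bin/sh
# Sketch only: write a level to amdgpu's power_dpm_force_performance_level.
# Assumption: the GPU is card0; pass another device directory if it is not.

set_perf_level() {
    # $1 = level name, $2 = device directory (optional)
    level="$1"
    dev="${2:-/sys/class/drm/card0/device}"
    case "$level" in
        auto|low|high|manual|profile_standard|profile_min_sclk|profile_min_mclk|profile_peak) ;;
        *) echo "unknown level: $level" >&2; return 1 ;;
    esac
    printf '%s\n' "$level" > "$dev/power_dpm_force_performance_level"
}

# Usage (as root):
#   set_perf_level profile_standard   # fixed-clock profile
#   set_perf_level auto               # back to the default behaviour
```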
Well, I was able to cause this crash 5 times in a row using the gputest tool. It usually happens within 5 minutes of starting the test. The gputest tool also caused this crash on my other PC (with the same mainboard, CPU, GPU and OS). So this tool is a good way to reproduce the problem.
Woah, the default "triangles" test of GpuTest is broken. I tried it on Windows too and that test crashed the GPU driver in under 2 minutes. It crashed it so hard that I had to reinstall the whole driver o.O
Back to Linux... Now I'm running the Furmark test from GpuTest. That has not crashed the driver so far.
Stellaris did crash in the "usual" way, without the illegal register thing:
[ 3764.769063] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=516115, emitted seq=516117
[ 3764.769279] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process stellaris pid 5362 thread stellaris:cs0 pid 5367
[ 3764.769466] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ 3768.769490] amdgpu 0000:09:00.0: amdgpu: failed to suspend display audio
[ 3769.116068] amdgpu 0000:09:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 3769.116182] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 3769.369121] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 3769.384853] [drm] free PSP TMR buffer
[ 3769.418457] amdgpu 0000:09:00.0: amdgpu: BACO reset
[ 3771.538431] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 3771.538570] [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
Stellaris is a native Linux game, but STO runs with Proton (Proton-experimental in my case). Proton could be the source of the illegal register problem... I will try to test this with Cyberpunk. And will check out umr too.
But I did discover something today while playing with CoreCtrl and the GPU settings. There were a couple of people who had similar ring timeout problems, and some of them fixed it by running the card in the fixed low power mode. https://bugzilla.kernel.org/show_bug.cgi?id=201957
By default, the card can draw 135W of power, with 1780MHz core and 1500MHz RAM clocks. In reality, the core clock is around 1550MHz while the card is stressed with Furmark. With these default settings, STO crashes within minutes.
But with power limit set to 115W, I was able to play STO for more than 2 hours, it was crash-free. With this setting, the core clock is around 1400MHz at full load, that means I still get around 90% of the card's maximum performance. I can live with this if it fixes the problem in return...
This mode will need some more testing.
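For anyone who wants to try the same power limit without CoreCtrl, here is a minimal sketch using the amdgpu hwmon files. power1_cap takes a value in microwatts, so 115 W becomes 115000000. The helper name set_power_cap is made up for illustration, and the hwmon path in the usage comment assumes the GPU is card0 with a single hwmon directory.

```shell
#!/bin/sh
# Sketch only: set the amdgpu power cap through hwmon.
# power1_cap is expressed in microwatts.

set_power_cap() {
    # $1 = limit in whole watts, $2 = hwmon directory of the GPU
    watts="$1"
    hwmon="$2"
    uw=$(( watts * 1000000 ))
    # Refuse values above the board maximum when the driver exposes it.
    if [ -r "$hwmon/power1_cap_max" ]; then
        max=$(cat "$hwmon/power1_cap_max")
        if [ "$uw" -gt "$max" ]; then
            echo "requested ${watts}W exceeds power1_cap_max ($max uW)" >&2
            return 1
        fi
    fi
    printf '%s\n' "$uw" > "$hwmon/power1_cap"
}

# Usage (as root; path is an assumption, check which hwmonN belongs to the GPU):
#   set_power_cap 115 /sys/class/drm/card0/device/hwmon/hwmon*
```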
I still think this is a pure software issue, because this card runs fine under Windows at the default settings, with 135W power draw. There are no overheating or power supply related problems.
Great, thanks for the testing. Please include some information here about what exactly you did. If your GPU works correctly with those settings, then this is 99% a power management issue, likely a bug in the kernel driver.
Btw, testing this will take time. For example, STO with a 125W power limit still crashed, but not as soon as with the default settings. This is why I'm testing a 115W power limit for now.
Yes, in both of my PCs, I have this card: Sapphire Pulse 5600XT - this is the "normal" version. They have a "Black edition" variant, but that's just cheap junk that is as loud as a hair dryer.
I did experience one crash with Stellaris using 115W as the power limit... Now I'm testing 100W.
I have a suspicion that something goes wrong when booting... In many cases, such a crash is preceded by some video stuttering, but not so much that you have to restart the machine because of it.
It is difficult to decide whether there is a driver reason for this, or e.g. the game itself is problematic. And it doesn't even show up before every crash. So this is just a suspicion on my part.
I did some cross-checking and may have found the reason for all of these errors.
I ran GpuTest with the Furmark stress test on both Linux and Windows. The GPU voltages should be similar under load, but they were not.
Under Windows, the GPU voltage stayed in the 850-900mV range, but under Linux it started at 850mV and dropped even below 834mV, at the same clock speeds.
Until now, I thought that the frequency-voltage curve settings were the same under all operating systems, because they are defined in the VGA BIOS, at least until someone changes them in the driver.
GPUs are pushed to the limit, so if they don't get the required voltage, it can easily cause freezing.
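To make the Linux side of this comparison repeatable, the voltage and core clock can be sampled from the amdgpu hwmon files while the stress test runs. A minimal sketch follows; in0_input reports the graphics voltage in millivolts and freq1_input the shader clock in hertz. The helper name read_gpu_sample is made up for illustration, and the path in the usage comment assumes the GPU is card0 with a single hwmon directory.

```shell
#!/bin/sh
# Sketch only: print one voltage/clock sample from an amdgpu hwmon directory.
# in0_input   = vddgfx in millivolts
# freq1_input = sclk in hertz

read_gpu_sample() {
    # $1 = hwmon directory of the GPU
    hwmon="$1"
    mv=$(cat "$hwmon/in0_input")
    hz=$(cat "$hwmon/freq1_input")
    echo "vddgfx_mV=$mv sclk_MHz=$(( hz / 1000000 ))"
}

# Usage: log one sample per second while the stress test is running.
#   while sleep 1; do
#       read_gpu_sample /sys/class/drm/card0/device/hwmon/hwmon*
#   done
```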