GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]
Submitted by Rémi Verschelde
Assigned to Default DRI bug account
Link to original bug (#110413)
Description
Created attachment 143950
lspci -vvv output for HP Spectre 360x
My HP Spectre x360 laptop, bought in March 2019, comes with Kaby Lake G HD Graphics 630 and a discrete AMD Radeon RX Vega M GL GPU.
I only enable the Radeon GPU when needed to play graphics-intensive games with `DRI_PRIME=1`, and so far I have experienced a lot of GPU deadlocks with the following symptoms:
- Temperatures rise and the CPUs are throttled. The framerate drops when this happens.
- Later on, GPU faults are reported in dmesg, the game's rendering freezes (but music continues playing). I am still able to alt+tab back to desktop or open a terminal, but the game's process can't be killed. If I'm monitoring temperatures, lm_sensors always reports a bogus 511°C temperature for the AMD dGPU at this point, before breaking.
- Any subsequent attempt at using the AMD GPU will cause a system deadlock, and I need to force shutdown with the power button.
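The bogus temperature reading described above could be logged with a small monitor. This is only a minimal sketch, assuming the dGPU exposes a hwmon device whose `name` file reads `amdgpu` and whose `temp1_input` reports millidegrees Celsius. The function names and the 150°C plausibility threshold are assumptions for illustration, not values taken from the driver.

```python
import glob
import os
import time

# Assumption: any reading at or above this is treated as implausible.
BOGUS_THRESHOLD_C = 150.0


def read_amdgpu_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir: temperature in Celsius} for hwmon devices named 'amdgpu'."""
    temps = {}
    for name_path in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*", "name"))):
        try:
            with open(name_path) as f:
                if f.read().strip() != "amdgpu":
                    continue
            hwmon_dir = os.path.dirname(name_path)
            # temp1_input is expressed in millidegrees Celsius.
            with open(os.path.join(hwmon_dir, "temp1_input")) as f:
                temps[hwmon_dir] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # The read can fail, e.g. while the dGPU is powered down; skip it.
            continue
    return temps


def is_bogus(temp_c, threshold_c=BOGUS_THRESHOLD_C):
    """True when a reading is too high to be a real temperature."""
    return temp_c >= threshold_c


def poll(samples=60, interval_s=1.0, hwmon_root="/sys/class/hwmon"):
    """Print one timestamped line per device and sample, flagging implausible values."""
    for _ in range(samples):
        for dev, temp in read_amdgpu_temps(hwmon_root).items():
            flag = "  <-- implausible reading" if is_bogus(temp) else ""
            print(f"{time.strftime('%H:%M:%S')} {dev}: {temp:.1f} C{flag}", flush=True)
        time.sleep(interval_s)
```

Calling `poll()` in a terminal while the game runs would leave a timestamped trace that can be lined up against the dmesg faults. Note that reading the sensor may itself wake a powered-down dGPU.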
My testing so far has covered:
- Unity3D games like For The King or StarCrawlers. The crash happens mid-game, not in a strictly reproducible manner, but seems related to CPU temperature/throttling.
  * I could also reproduce the crash with SuperTuxKart, not in-game but when alt-tabbing back to desktop.
  * I have not yet been able to trigger the crash with glmark2. With For The King, I can reliably get a crash within 1 to 10 minutes in-game when playing with "High" or "Dream" graphics quality.
- Kernel 5.0.x (up to 5.0.7) from Mageia 7 (Cauldron), e.g. 5.0.7-desktop-4.mga7.
  * I also tried `git://people.freedesktop.org/~agd5f/linux -b amd-staging-drm-next` at b07c394a327fc9e435ee03288584c111fa73d963, but I still got the same symptoms. The dmesg output was partly different, though, and more verbose.
  * Following the discussion in bug 109692, I tried the patches provided by Andrey Grodzovsky in comment 34 of that bug, but they did not solve the issue for me.
- Mesa 19.0.0 to 19.0.2 built against LLVM 7.0.1.
- Suspecting the CPU temperature/throttling as a trigger, I'm using https://github.com/kitsunyan/intel-undervolt to undervolt the CPU cache by -100 mV and set the CPU temperature limit to 80°C instead of 100°C. This has helped with throttling issues I had during code compilation, but it has made no visible difference to the GPU crashes as far as I can tell. I can disable this undervolting when doing tests if required.
I found various bug reports which might well be duplicates, but I'm opening my own to avoid hijacking discussions on what may or may not be the same root cause: bug 109461, bug 109466, bug 109692 (I installed Shadow of the Tomb Raider but haven't checked if I can reproduce this one's symptoms yet), bug 109819.
I am attaching some relevant logs about the system and the bug. Please ask for anything else you may need.
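If a trimmed kernel log is more convenient than a full dmesg dump, the GPU-related lines could be extracted along these lines. This is a rough sketch: the match pattern (`amdgpu` or `[drm`) and both function names are assumptions chosen for illustration, and `dmesg` may require root depending on the `kernel.dmesg_restrict` setting.

```python
import re
import subprocess

# Assumption: these substrings are enough to pick out GPU-related kernel messages.
GPU_PATTERN = re.compile(r"amdgpu|\[drm", re.IGNORECASE)


def filter_gpu_lines(log_text):
    """Keep only the log lines that match the GPU pattern, in original order."""
    return [line for line in log_text.splitlines() if GPU_PATTERN.search(line)]


def collect_gpu_dmesg():
    """Run dmesg and return its GPU-related lines (may need elevated privileges)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return filter_gpu_lines(result.stdout)
```

For example, `"\n".join(collect_gpu_dmesg())` could be written to a file right after a hang, before any further attempt to use the dGPU locks up the system.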
**Attachment 143950**, "lspci -vvv output for HP Spectre 360x":
lspci-vvv.log