RX 560 gfx timeout
I am seeing amdgpu crashes on an AMD RX 560 video card.
I can now reproduce this immediately with the "nextcloud" client or "steam". In general, any OpenGL application can trigger it, including FreeCAD, Blender, etc. Nextcloud crashes after a few seconds, before opening its first window. Steam takes a while to start up but crashes partway through opening its first window (while logging in to the Steam network).
$ dmesg | grep drm:amdgpu_job_timedout
[ 64.482881] [drm:amdgpu_job_timedout] ERROR ring gfx timeout, signaled seq=1115, emitted seq=1117
[ 64.482883] [drm:amdgpu_job_timedout] ERROR Process information: process nextcloud pid 4318 thread nextcloud:cs0 pid 4420
The dmesg error is always the same ring gfx timeout.
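For anyone comparing logs across runs, here is a minimal sketch of a filter that reduces each timeout line to its ring name and sequence numbers. The function name summarise_timeouts is hypothetical, and the pattern assumes the message layout shown above.

```shell
# Summarise amdgpu ring timeouts from kernel log text on stdin.
# Prints one line per timeout: ring name, signaled seq, emitted seq.
# Assumes the message layout shown in the dmesg excerpt above.
summarise_timeouts() {
    sed -n 's/.*amdgpu_job_timedout.*ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/ring=\1 signaled=\2 emitted=\3/p'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | summarise_timeouts
```

A gap between the signaled and emitted values (1115 and 1117 in the excerpt above) shows how many submitted jobs had not yet been signaled when the timeout fired.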
When the crash occurs, the screen becomes corrupted: colours are scrambled and the image freezes, although the mouse cursor is still overlaid and drags a square of shifted colours with it. After a short period (1-2 sec), once the GPU reset is underway, the keyboard stops responding, so Ctrl-Alt-F1 will only get you to a console if you are quick enough. From a console, "/etc/init.d/xdm restart" will often recover to a login screen.
If the amdgpu driver has "recovery" enabled (this depends on the kernel version; it is disabled in earlier kernels), the X server sometimes recovers on its own by restarting, and you are returned to the X login screen. Even when it does not, it is still possible to SSH in and reboot.
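To see which recovery setting a running kernel is using, the sketch below reads the amdgpu gpu_recovery module parameter from sysfs. The function name describe_gpu_recovery is hypothetical, and the mapping (-1 auto, 0 disabled, 1 enabled) follows the kernel's amdgpu module parameter documentation as I understand it; other values may exist on newer kernels.

```shell
# Report the amdgpu GPU recovery setting.
# The optional first argument overrides the sysfs path (useful for testing).
describe_gpu_recovery() {
    param_file="${1:-/sys/module/amdgpu/parameters/gpu_recovery}"
    if [ ! -r "$param_file" ]; then
        # amdgpu not loaded, or the parameter is not exposed by this kernel
        echo "unavailable"
        return 0
    fi
    value=$(cat "$param_file")
    case "$value" in
        -1) echo "auto" ;;
        0)  echo "disabled" ;;
        1)  echo "enabled" ;;
        *)  echo "other ($value)" ;;
    esac
}

describe_gpu_recovery
```

If recovery is reported as disabled or auto, booting with amdgpu.gpu_recovery=1 on the kernel command line is the usual way to force it on.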
There are more details on what I've previously tried at https://bugs.gentoo.org/720044
I have tried this on both Gentoo linux and Ubuntu 18.04.4 on the same hardware (dual booting). The only working setup has been with Ubuntu 18.04.4 using the amdgpu-pro 20.10 drivers.
Installing the amdgpu-pro drivers with full OpenCL support in Ubuntu results in a working system (Ubuntu 18.04.4, last updated as of May 4, 2020):
$ tar -Jxvf amdgpu-pro-20.10-1048554-ubuntu-18.04.tar.xz
$ cd amdgpu-pro-20.10-1048554-ubuntu-18.04
$ ./amdgpu-install --pro -y --opencl=legacy,pal
With the amdgpu-pro drivers in Ubuntu, there are still some graphics corruption artifacts (small regions of ~10x10 pixels are scrambled), but the card no longer crashes, the system is usable, and I can successfully run opengl software and launch steam. Running clgpustress gives correct results.
As a result of the working Ubuntu amdgpu-pro install, I believe I do not have broken hardware.
On Gentoo I have tried the following, all with no change in the symptoms:
Kernels:
- linux-4.14.166-gentoo, (hangs since gpu_recovery is disabled)
- linux-4.19.97-gentoo,
- linux-5.4.36-gentoo,
- linux-5.5.11-gentoo,
- linux-5.5.19-gentoo,
- linux-5.6.8-gentoo,
- drm-next (git: bb0b6c08974d),
- amd-staging-drm-next (git: 5004d907789e)
Mesa:
- media-libs/mesa-19.3.5,
- media-libs/mesa-20.0.2
- media-libs/mesa-9999 (git master May 11)
LIBDRM:
- x11-libs/libdrm-2.4.100
- x11-libs/libdrm-9999 (git master May 11)
LLVM:
- sys-devel/llvm-9.0.1
- sys-devel/llvm-10.0.0
X:
- x11-base/xorg-server-1.20.7
On Ubuntu, the default configuration without amdgpu-pro (stock amdgpu kernel driver, Mesa, etc.) gives the same crash outcome.
On Ubuntu, using the amdgpu-pro "Open Stack" gives the same crash outcome.
Using GALLIUM_DDEBUG sometimes seems to allow nextcloud to start up the first time, but a second run will crash. Without GALLIUM_DDEBUG enabled, it always crashes immediately.
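For reference, a minimal sketch of launching a program with GALLIUM_DDEBUG set for that one process only. The wrapper name run_with_ddebug is hypothetical, and the value "always" is an assumption taken from Mesa's documented options; the exact setting used for the runs above is in the attached scripts.

```shell
# Run a command with GALLIUM_DDEBUG set for that command only.
# Usage: run_with_ddebug MODE COMMAND [ARGS...]
run_with_ddebug() {
    mode="$1"
    shift
    GALLIUM_DDEBUG="$mode" "$@"
}

# Example (assumed mode value; adjust to match the attached scripts):
#   run_with_ddebug always nextcloud
```

Setting the variable per command leaves the rest of the session running without the debug layer.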
I have attached three .tar.gz archives for crashes on Gentoo, which include UMR output and the GALLIUM_DDEBUG output when running nextcloud. The scripts used to produce the output are included. The most detailed are the two drm-next archives. The file *-drm_debug_0x1ff.tar.gz was captured with drm.debug=0x1ff on the kernel command line.
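To confirm that a boot actually picked up the drm.debug setting, the sketch below pulls the value from the kernel command line. The function name drm_debug_from_cmdline is hypothetical.

```shell
# Print the drm.debug value from a kernel command line, if one is set.
# The optional first argument overrides the file to read (default /proc/cmdline).
drm_debug_from_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each boot parameter on its own line, then keep only drm.debug's value.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^drm\.debug=//p'
}

# Example:
#   drm_debug_from_cmdline
```

No output means drm.debug was not passed on the command line for the current boot.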
I have also included a .tar.gz of the successful Ubuntu configuration for reference.
The nextcloud client which triggers the crash is www-apps/nextcloud-18.0.2.
gentoo-amdgpu-debuginfo-20200510-5.6.0-drm-next+.tar.gz
gentoo-amdgpu-debuginfo-20200510-5.6.0-drm-next+-drm_debug_0x1ff.tar.gz
gentoo-amdgpu-debuginfo-20200510-5.6.8-gentoo.tar.gz
ubuntu-amdgpu-debuginfo-20200425.tar.gz
Thanks for your time. Hope there is something in here that can help you narrow down what has gone wrong. If there is anything further I can provide, please ask.