[amdgpu] Navi / RX 5700 XT: system lockup brings down all PCIe devices
Brief summary of the problem:
Running kernels 5.8.10-5.8.14 and 5.4.70, the system locks up during OpenGL games/benchmarks and brings down all devices on PCIe, including the Ethernet, chipset, USB, etc. This makes getting dmesg output difficult, because even when I SSH into the system I get nothing, as the network connection goes down too. I've been chasing this issue for weeks now trying to figure out why. This does not happen with anything Vulkan, but Vulkan games do occasionally stop for a moment and then continue. During one crash I managed to see dmesg, before the screen froze, showing all the PCIe devices resetting.
So far I've found this does not happen on kernel version 5.4.60 and prior, but I haven't tested further yet as my time is limited. On those kernels I can leave Unigine Heaven running for 8-12 hours easily with no issue. None of the hardware has been overclocked.
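Since the SSH session dies together with the network, one possible way to keep the last kernel messages is to stream them to a local file and force every line to disk as it arrives. The following is only a minimal sketch under assumptions: the helper name `follow_to_disk` and the script name `klog_sync.py` are hypothetical, and it only helps if the storage device stays reachable long enough for the final lines to be written, which is not guaranteed when all PCIe devices reset.

```python
import os
import sys


def follow_to_disk(source, out_path):
    """Append every line from `source` to `out_path`, syncing after each line.

    `source` is any iterable of text lines (for example sys.stdin fed by
    `dmesg --follow`). Returns the number of lines written.
    """
    count = 0
    with open(out_path, "a", encoding="utf-8", errors="replace") as out:
        for line in source:
            out.write(line)
            out.flush()             # push Python's buffer to the OS
            os.fsync(out.fileno())  # ask the OS to commit the data to disk
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: the output path is given as the only argument.
    follow_to_disk(sys.stdin, sys.argv[1])
```

It could be started before launching the benchmark with something like `sudo dmesg --follow | python3 klog_sync.py /var/tmp/kmsg.log` (the output path is an arbitrary example), and the file inspected after the reboot.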
Hardware description:
CPU: AMD Ryzen 7 3600
GPU: AMD Radeon RX 5700 XT Sapphire Pulse
MOBO: Gigabyte X470 AORUS GAMING 7 WIFI (rev. 1.1)
System Memory: 16GB Crucial Ballistix DDR4 3600 MHz
Display(s): Dell U2713H with an edited EDID to force RGB mode, as YPbPr is used otherwise
Type of Display Connection: HDMI
System information:
Distro name and Version: Ubuntu 20.04, EndeavourOS, Arch, and Manjaro
Kernel version: 5.8.10-5.8.14 and 5.4.70
Custom kernel: none
AMD package version: Mesa 20.2
How to reproduce the issue:
Boot with kernel 5.8.14
Launch Unigine Heaven
Let the benchmark loop for 15-90 minutes
The system will lock up