amdgpu [RX Vega 64] system freeze while gaming (VSYNC enabled)
Submitted by Mauro Gaspari
Assigned to Default DRI bug account
Link to original bug (#109955)
Description
Symptoms:
During gaming sessions, the system locks up and freezes completely. Audio keeps playing for a few more seconds, but the whole desktop is frozen and mouse and keyboard input stop responding. A hard reset is the only possible action on the local PC. I have not tried to SSH into the PC from another machine.
Sometimes I can play for 20 minutes, sometimes for a few hours. The freezes seem unrelated to any specific in-game activity. All system temperatures are within normal range.
Outside of 3D gaming, the system is very stable, including video playback, video encoding and regular desktop usage.
Further testing done:
- Installed Windows 10 on the same hardware with the same BIOS settings. Running the same games causes no issues at all: no hangs, no problems.
- Ran the same games on my NVIDIA+Intel based laptop with the same distributions and kernels. No issues at all: no hangs, no problems.
Additional information:
This issue has been going on for a while now. It comes and goes with Mesa versions (or Mesa+kernel combinations). Sometimes an update arrives and I have no freezes for weeks; then the next update gets installed and the issue comes back.
I have tested this mainly on openSUSE Tumbleweed, Ubuntu 18.04 and Ubuntu 18.10.
-- Ubuntu testing:
Ubuntu 18.04 was running well for months, until the latest Mesa updates, which arrived 2 weeks ago, reintroduced the issue and the system started freezing again. I tried upgrading to 18.10 but had the same issue. I enabled the oibaf PPA for video drivers and the issue disappeared. Then, after a few days, a new Mesa arrived and the issue came back. I am now running the Padoka unstable PPA with Mesa 19 and LLVM 9. The issue still happens.
-- Tumbleweed testing:
Below is the previous bug report I filed for Tumbleweed, with a couple of occurrences and their system logs. I will post more as I collect them.
OS: openSUSE Tumbleweed x86_64 updated (2018 04 21)
Kernel: 4.16.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.0 Mesa 18.0.0
GPU: AMD Radeon RX Vega 64 8GB
System Logs:
Apr 21 17:08:34 STUDIO kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Apr 21 17:08:34 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
Apr 21 17:08:44 STUDIO kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=128859, last emitted seq=128861
Apr 21 17:08:44 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?
-- Reboot --
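To compare timeouts across occurrences, the sequence numbers can be pulled out of the `ring gfx timeout` lines. The sketch below is a hypothetical helper (`parse_timeouts` is not part of amdgpu or any distribution tooling); it assumes the message format shown in these logs and reports the difference between the last emitted and last signaled sequence numbers, i.e. roughly how many submissions on that ring had not completed when the timeout fired.

```python
import re

# Matches the amdgpu timeout message as it appears in the logs above, e.g.
# "... *ERROR* ring gfx timeout, last signaled seq=128859, last emitted seq=128861"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_timeouts(lines):
    """Return (ring, signaled, emitted, pending) for each timeout line found."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a ring timeout message
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        # Submissions emitted to the ring whose fence never signaled.
        results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results


if __name__ == "__main__":
    sample = [
        "Apr 21 17:08:44 STUDIO kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring gfx timeout, last signaled seq=128859, last emitted seq=128861",
        "Apr 21 17:08:44 STUDIO kernel: [drm] No hardware hang detected. Did some blocks stall?",
    ]
    for ring, signaled, emitted, pending in parse_timeouts(sample):
        print(f"ring={ring} signaled={signaled} emitted={emitted} pending={pending}")
```

On the sample above this prints `ring=gfx signaled=128859 emitted=128861 pending=2`. Kernel messages from the previous boot (if the journal is persistent) can be obtained with `journalctl -k -b -1` and passed to the same function line by line.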
dmesg lines related to amdgpu:
[ 3.407020] [drm] amdgpu kernel modesetting enabled.
[ 3.411462] fb: switching to amdgpudrmfb from VESA VGA
[ 3.426163] amdgpu 0000:04:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 3.426261] amdgpu 0000:04:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[ 3.426263] amdgpu 0000:04:00.0: GTT: 256M 0x000000F600000000 - 0x000000F60FFFFFFF
[ 3.426371] [drm] amdgpu: 8176M of VRAM memory ready
[ 3.426372] [drm] amdgpu: 8176M of GTT memory ready.
[ 4.031665] fbcon: amdgpudrmfb (fb0) is primary device
[ 4.083803] amdgpu 0000:04:00.0: fb0: amdgpudrmfb frame buffer device
[ 4.096086] amdgpu 0000:04:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[ 4.096088] amdgpu 0000:04:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[ 4.096089] amdgpu 0000:04:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[ 4.096090] amdgpu 0000:04:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[ 4.096091] amdgpu 0000:04:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[ 4.096093] amdgpu 0000:04:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[ 4.096094] amdgpu 0000:04:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[ 4.096095] amdgpu 0000:04:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[ 4.096096] amdgpu 0000:04:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[ 4.096098] amdgpu 0000:04:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[ 4.096099] amdgpu 0000:04:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[ 4.096100] amdgpu 0000:04:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[ 4.096101] amdgpu 0000:04:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[ 4.096103] amdgpu 0000:04:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[ 4.096104] amdgpu 0000:04:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[ 4.096105] amdgpu 0000:04:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[ 4.096107] amdgpu 0000:04:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[ 4.096108] amdgpu 0000:04:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[ 4.096662] [drm] Initialized amdgpu 3.23.0 20150101 for 0000:04:00.0 on minor 0
The issue was later identified here https://bugs.freedesktop.org/show_bug.cgi?id=105317 and fixed with Mesa 18.0.1.
Then, a few months later, the issue appeared again:
OS: openSUSE Tumbleweed x86_64 updated (2018 08 10)
Kernel: 4.17.2-1-default
Desktop Environment: KDE Plasma (x11)
OpenGL version string: 3.1 Mesa 18.1.5
GPU: AMD Radeon RX Vega 64 8GB
Relevant log lines I found during the freeze:
2018-08-09T23:16:53.103775+08:00 MGDT-Tumbleweed kernel: [ 6305.852703] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1745163, last emitted seq=1745165
2018-08-09T23:16:53.103795+08:00 MGDT-Tumbleweed kernel: [ 6305.852704] [drm] No hardware hang detected. Did some blocks stall?
dmesg lines related to amdgpu:
[ 3.130759] [drm] amdgpu kernel modesetting enabled.
[ 3.135770] fb: switching to amdgpudrmfb from EFI VGA
[ 3.136106] amdgpu 0000:03:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 3.136171] amdgpu 0000:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[ 3.136173] amdgpu 0000:03:00.0: GTT: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[ 3.136494] [drm] amdgpu: 8176M of VRAM memory ready
[ 3.136495] [drm] amdgpu: 8176M of GTT memory ready.
[ 4.114469] fbcon: amdgpudrmfb (fb0) is primary device
[ 4.141179] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[ 4.164072] amdgpu 0000:03:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[ 4.164074] amdgpu 0000:03:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[ 4.164075] amdgpu 0000:03:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[ 4.164075] amdgpu 0000:03:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[ 4.164076] amdgpu 0000:03:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[ 4.164077] amdgpu 0000:03:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[ 4.164078] amdgpu 0000:03:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[ 4.164079] amdgpu 0000:03:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[ 4.164079] amdgpu 0000:03:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[ 4.164080] amdgpu 0000:03:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[ 4.164081] amdgpu 0000:03:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[ 4.164082] amdgpu 0000:03:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[ 4.164083] amdgpu 0000:03:00.0: ring 12(uvd) uses VM inv eng 6 on hub 1
[ 4.164084] amdgpu 0000:03:00.0: ring 13(uvd_enc0) uses VM inv eng 7 on hub 1
[ 4.164085] amdgpu 0000:03:00.0: ring 14(uvd_enc1) uses VM inv eng 8 on hub 1
[ 4.164085] amdgpu 0000:03:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[ 4.164086] amdgpu 0000:03:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[ 4.164087] amdgpu 0000:03:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[ 4.164553] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:03:00.0 on minor 0