Computer locks up when a game is started on the second of two identical GPUs
System information
System: Host: mercury Kernel: 5.7.6-arch1-1 x86_64 bits: 64 compiler: gcc v: 10.1.0 Desktop: Gnome 3.36.3 wm: gnome-shell
dm: GDM Distro: Arch Linux
CPU: Topology: Quad Core model: Intel Core i7-6700K bits: 64 type: MT MCP arch: Skylake-S rev: 3 L2 cache: 8192 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 64026
Speed: 800 MHz min/max: 800/4200 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 802 5: 800 6: 800 7: 800 8: 800
Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
vendor: Micro-Star MSI driver: amdgpu v: kernel bus ID: 01:00.0 chip ID: 1002:67df
Device-2: Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
vendor: Micro-Star MSI driver: amdgpu v: kernel bus ID: 02:00.0 chip ID: 1002:67df
Display: x11 server: X.Org 1.20.8 compositor: gnome-shell driver: modesetting unloaded: fbdev,vesa alternate: ati
resolution: 1: 2560x1440~60Hz 2: 2560x1440~60Hz s-dpi: 96
OpenGL: renderer: AMD Radeon RX 480 Graphics (POLARIS10 DRM 3.37.0 5.7.6-arch1-1 LLVM 10.0.0) v: 4.6 Mesa 20.1.2
direct render: Yes
Describe the issue
My computer has two seats, with one GPU assigned to each seat through loginctl. A few games start by default on the GPU assigned to the other seat. I can work around this more or less easily (I understand that this is not a Mesa issue, and it is not what I am reporting here). The main problem is that it triggers a partial lock-up of the computer. The lock-up is not immediate, but it is triggered within seconds, 100% of the time, on some GPU-heavy scenes. When it happens, the display freezes, but the music continues playing and I can still log into the computer through ssh. If I then try to kill the process and reboot, the lock-up becomes total and the computer has to be power cycled manually. The Total War games appear to be particularly affected, notably Rome 2 (through Wine) and Three Kingdoms (native).
I understand that starting the game on the other GPU is undesirable for many reasons, but I think it should not lead to a lock-up of the computer.
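For anyone trying to reproduce this, here is a minimal sketch of one way to pin a program to a specific GPU by PCI address, so the "wrong GPU" case can be triggered or avoided on purpose. It is not necessarily the workaround I use. It assumes the program goes through Mesa and honors the `DRI_PRIME` environment variable in its `pci-<domain>_<bus>_<device>_<function>` form; Vulkan titles may need a different mechanism. The helper names `dri_prime_tag` and `launch_on_gpu` are made up for this example.

```python
import os
import re
import subprocess


def dri_prime_tag(bus_id: str) -> str:
    """Convert a PCI address ('01:00.0' or '0000:01:00.0') into the
    udev-style tag accepted by DRI_PRIME, e.g. 'pci-0000_01_00_0'."""
    # inxi prints the short form without the PCI domain; assume domain 0000.
    if re.fullmatch(r"[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]", bus_id):
        bus_id = "0000:" + bus_id
    if not re.fullmatch(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]", bus_id):
        raise ValueError(f"not a PCI address: {bus_id!r}")
    return "pci-" + bus_id.lower().replace(":", "_").replace(".", "_")


def launch_on_gpu(command, bus_id):
    """Start `command` (a list of arguments) with DRI_PRIME pointing at
    the GPU identified by `bus_id`. Hypothetical helper, not a Mesa API."""
    env = dict(os.environ, DRI_PRIME=dri_prime_tag(bus_id))
    return subprocess.Popen(command, env=env)
```

With the bus IDs from the system information above, `dri_prime_tag("01:00.0")` names the first card and `dri_prime_tag("02:00.0")` the second one, which is the card that gets reset in the log.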
Log files as attachment
[ 281.868252] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=13089, emitted seq=13091
[ 281.868301] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ThreeKingdoms pid 5082 thread WebViewRenderer pid 5163
[ 281.868304] amdgpu 0000:02:00.0: GPU reset begin!
[ 281.868664] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868665] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868665] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868666] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868667] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868667] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868668] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868668] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868669] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868670] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868670] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868671] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868672] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868672] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868673] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868674] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868674] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868675] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868676] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868676] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868677] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868677] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868678] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868679] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868679] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868680] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868680] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868681] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868682] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868682] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868683] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868684] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.868684] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.868685] amdgpu: [powerplay]
failed to send message 261 ret is 65535
[ 281.877051] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
[ 281.890870] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890872] amdgpu: [powerplay]
failed to send message 133 ret is 65535
[ 281.890874] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890875] amdgpu: [powerplay]
failed to send message 306 ret is 65535
[ 281.890875] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890876] amdgpu: [powerplay]
failed to send message 5e ret is 65535
[ 281.890877] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890877] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[ 281.890878] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890878] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[ 281.890879] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890880] amdgpu: [powerplay]
failed to send message 148 ret is 65535
[ 281.890881] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890881] amdgpu: [powerplay]
failed to send message 145 ret is 65535
[ 281.890882] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890882] amdgpu: [powerplay]
failed to send message 146 ret is 65535
[ 281.890883] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890884] amdgpu: [powerplay]
failed to send message 16a ret is 65535
[ 281.890885] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890885] amdgpu: [powerplay]
failed to send message 186 ret is 65535
[ 281.890886] amdgpu: [powerplay]
last message was failed ret is 65535
[ 281.890886] amdgpu: [powerplay]
failed to send message 54 ret is 65535
[ 282.125246] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125247] amdgpu: [powerplay]
failed to send message 26b ret is 65535
[ 282.125248] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125249] amdgpu: [powerplay]
failed to send message 13d ret is 65535
[ 282.125250] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125250] amdgpu: [powerplay]
failed to send message 14f ret is 65535
[ 282.125251] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125251] amdgpu: [powerplay]
failed to send message 151 ret is 65535
[ 282.125252] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125253] amdgpu: [powerplay]
failed to send message 135 ret is 65535
[ 282.125253] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125254] amdgpu: [powerplay]
failed to send message 190 ret is 65535
[ 282.125254] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125255] amdgpu: [powerplay]
failed to send message 63 ret is 65535
[ 282.125257] amdgpu: [powerplay]
last message was failed ret is 65535
[ 282.125257] amdgpu: [powerplay]
failed to send message 84 ret is 65535
[ 282.125258] amdgpu: [powerplay] Failed to force to switch arbf0!
[ 282.125259] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[ 282.125309] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[ 282.241984] amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 282.242043] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 282.476103] cp is busy, skip halt cp
[ 282.592763] rlc is busy, skip halt rlc
[ 282.710841] amdgpu 0000:02:00.0: GPU BACO reset
[ 343.399630] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 10secs aborting
[ 343.399678] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C120 (len 62, WS 0, PS 0) @ 0xC13C
[ 343.399704] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing AB4C (len 133, WS 0, PS 8) @ 0xAB67
[ 343.399705] [drm] asic atom init failed!
[ 343.399706] amdgpu 0000:02:00.0: GPU reset succeeded, trying to resume
[ 343.516180] amdgpu 0000:02:00.0: Wait for MC idle timedout !
[ 343.633199] amdgpu 0000:02:00.0: Wait for MC idle timedout !
[ 343.633619] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 343.633623] [drm] VRAM is lost due to GPU reset!
[ 343.636082] amdgpu: [powerplay] Failed to send Message.
[ 343.636085] amdgpu: [powerplay] SMC address must be 4 byte aligned.
[ 343.636085] amdgpu: [powerplay] [AVFS][Polaris10_SetupGfxLvlStruct] Problems copying VRConfig value over to SMC
[ 343.636086] amdgpu: [powerplay] [AVFS][Polaris10_AVFSEventMgr] Could not Copy Graphics Level table over to SMU
[ 343.636087] amdgpu: [powerplay]
last message was failed ret is 65535
[ 343.636088] amdgpu: [powerplay]
failed to send message 252 ret is 65535
[ 343.636089] amdgpu: [powerplay]
last message was failed ret is 65535
[ 343.636089] amdgpu: [powerplay]
failed to send message 253 ret is 65535
[ 343.636090] amdgpu: [powerplay]
last message was failed ret is 65535
[ 343.636091] amdgpu: [powerplay]
failed to send message 250 ret is 65535
[ 343.636091] amdgpu: [powerplay]
last message was failed ret is 65535
[ 343.636092] amdgpu: [powerplay]
failed to send message 251 ret is 65535
[ 343.636093] amdgpu: [powerplay]
last message was failed ret is 65535
[ 343.636093] amdgpu: [powerplay]
failed to send message 254 ret is 65535
[ 343.753681] [drm] Timeout wait for RLC serdes 0,0
[ 343.871811] amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 343.871847] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[ 343.871866] amdgpu 0000:02:00.0: GPU reset(2) failed
[ 343.871889] amdgpu 0000:02:00.0: GPU reset end with ret = -110
[ 346.168931] GpuWatchdog[2140]: segfault at 0 ip 000055c23549bdfd sp 00007f94b5fa4470 error 6 in electron[55c231ebd000+5aa7000]
[ 346.168935] Code: 48 39 c7 74 06 ff 15 f2 14 8e 02 c7 45 b0 aa aa aa aa 0f ae f0 41 8b 84 24 e8 00 00 00 89 45 b0 48 8d 7d b0 ff 15 5b 90 8f 02 <c7> 04 25 00 00 00 00 37 13 00 00 64 48 8b 04 25 28 00 00 00 48 3b
[ 353.969810] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 363.993132] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
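Since most of the log above is repeated powerplay noise, a small script can help when scanning it. The following is only a sketch that assumes the log has been saved as plain text in the same format as above; the function name `summarize` and the regular expressions are my own and not part of any amdgpu tooling. It counts the failed SMC message IDs and collects the lines marked `*ERROR*` with their timestamps.

```python
import re
from collections import Counter

# "failed to send message <id> ret is <code>"; IDs such as 5e or 16a look hexadecimal.
SEND_FAIL = re.compile(r"failed to send message ([0-9a-fA-F]+) ret is (\d+)")
# A timestamped kernel line that contains the *ERROR* marker.
ERROR_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*\*ERROR\*.*)$", re.MULTILINE)


def summarize(log_text):
    """Return (Counter of failed message IDs, list of (timestamp, error line))."""
    failed = Counter(m.group(1) for m in SEND_FAIL.finditer(log_text))
    errors = [(float(ts), msg) for ts, msg in ERROR_LINE.findall(log_text)]
    return failed, errors


if __name__ == "__main__":
    import sys

    failed, errors = summarize(sys.stdin.read())
    for msg_id, count in failed.most_common():
        print(f"message {msg_id}: failed {count} time(s)")
    for ts, msg in errors:
        print(f"{ts:12.6f}  {msg}")
```

If the block is saved under a name of your choice, it can be fed the log on standard input, for example by piping `dmesg` into it.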