[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout VM fault / GPU fault detected
I'm unsure if my bug is related to #3501 (closed) but I SUSPECT the patch for that bug will fix my issue and maybe 3+ others.
More details can be found in mpv's bug tracker: https://github.com/mpv-player/mpv/issues/14600. However, based on the string of VM / GPU faults, I think that patch might solve this issue as well.
Some other bugs matching the keywords 6.10 + drm:amdgpu_job_timedout, amdgpu_job_timedout, or ring gfx timeout might be related:
- #3470 ( the reply #3470 (comment 2477133) seems related )
- #3437 (closed)
- I'm unsure if #3440 is related, it seems too old, but it IS from SteamDeck so it might have used some custom tweaks / cherry picks.
Hardware description:
- CPU: Intel Xeon E3-1240
- GPU: R9 285 (2GB) 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939]
- System Memory: 32GB ECC
- Display(s): 2
- Type of Display Connection: DP and DVI
- resources: irq:51 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:e000(size=256) memory:f7e00000-f7e3ffff memory:c0000-dffff
System information:
- Distro name and Version: ArchLinux (rolling release)
- Kernel version (affected): Linux 6.10.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 24 Jul 2024 22:25:43 +0000 x86_64 GNU/Linux
- AMD official driver version: amdgpu + OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.1.4-arch1.2
- plasma-desktop 6.1.3-1
- ArchLinux current stable builds
How to reproduce the issue:
The crash happens during video playback with mpv. Seeks and out-of-order decode / placement are all crash risks (including initial startup). It does not happen with all files, and I don't have any examples I can share, as every crash is super painful (full X session crash).
[ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802
[ 1766.321171] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101F44
[ 1766.321174] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002
[ 1766.321238] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321239] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010120C
[ 1766.321240] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32)
[ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002
[ 1766.321245] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101237
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002
[ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16)
[ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002
[ 1766.321256] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321257] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101200
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160)
[ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002
[ 1766.321263] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101232
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160)
[ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002
[ 1766.321269] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010123A
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80)
[ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002
[ 1766.321276] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321277] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001012AB
[ 1766.321278] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32)
[ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002
[ 1766.321283] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321284] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010124C
[ 1766.321285] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002
[ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224)
[ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002
[ 1766.321290] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321291] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101223
[ 1766.321292] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
[ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
[ 1766.321297] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
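For anyone comparing this log against the other reports, here is a rough Python sketch that pulls the useful fields out of the "VM fault" and "ring gfx timeout" lines above. The regular expressions are written against the line format seen in this report only, and the function names (parse_vm_fault, parse_ring_timeout) are made up for illustration; nothing here is part of amdgpu or any existing tool. It also shows that the decimal "page" value is the same number as VM_CONTEXT1_PROTECTION_FAULT_ADDR (1056580 == 0x101F44) and that the hex word after the client name is just its ASCII bytes (0x54433300 == "TC3").

```python
import re

# Line formats assumed from the dmesg excerpt in this report; other kernels
# or GPU generations may print these messages differently.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)
RING_TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def parse_vm_fault(line):
    """Return the fields of one 'VM fault' line as a dict, or None if no match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group(4))
    raw_client = int(m.group(7), 16)
    # The hex word holds the client name as ASCII bytes, NUL padded on the right.
    client_ascii = (
        raw_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", errors="replace")
    )
    return {
        "vmid": int(m.group(2)),
        "pasid": int(m.group(3)),
        "page": page,
        "page_hex": hex(page),  # comparable to VM_CONTEXT1_PROTECTION_FAULT_ADDR
        "access": m.group(5),
        "client": m.group(6),
        "client_from_hex": client_ascii,
        "client_id": int(m.group(8)),
    }


def parse_ring_timeout(line):
    """Return ring name and sequence numbers from a 'ring ... timeout' line."""
    m = RING_TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled, emitted = int(m.group(2)), int(m.group(3))
    return {
        "ring": m.group(1),
        "signaled": signaled,
        "emitted": emitted,
        "gap": emitted - signaled,  # plain difference between the two counters
    }
```

Feeding it the saved output of dmesg line by line and counting the "client" values makes it easy to see whether the faults in two reports come from the same blocks (here mostly CB*, plus one TC3).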
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp
Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000).
Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset!
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor.
-- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully