[renoir] instability & graphics stack freeze
Brief summary of the problem:
When using a Renoir Ryzen 4750G APU with a wide range of kernel versions, intensive use of the GPU functionality will result in a lockup of the graphics stack.
Hardware description:
- CPU: Ryzen 4750G
- GPU: no dGPU
- System Memory: 16GB
- Display(s): Dell U2415
- Type of Display Connection: DisplayPort
System information:
- Distro name and Version: Ubuntu 20.04.1
- Kernel version: 5.4.0-48-generic, 5.6.0-1028-oem
- Custom kernel: 5.9.0-050900rc7drmtip20201001-generic has been tested as well.
- AMD package version: no package (amdgpu firmware 20.30 from linux-firmware)
- MESA version: tested both default 20.0.8-0ubuntu1~20.04.1 and 20.3~git2010011930.237f4d~oibaf~f
- Environment variables: AMD_DEBUG=nodma and AMD_DEBUG=nongg both tested to no effect
- Kernel parameters: exp_hw_support=1 and amd_iommu=off both tested
How to reproduce the issue:
Run a graphics-intensive game (e.g. EVE Online, World of Warcraft). Notably, Glamor acceleration works just fine, as do Chromium, glxgears, etc. -- it's only when DX9 or DX12/Vulkan calls are made that things start hanging. Sometimes recovery succeeds; in other cases the system stays locked and has to be recovered with Ctrl-Alt-F7 followed by Ctrl-Alt-Del, or hard reset with the power button.
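To tell the two outcomes apart (soft recovery versus a full GPU reset), the kernel log can be scanned for the relevant amdgpu messages. The following is a minimal sketch, not a definitive tool: the message substrings are copied from the log excerpt in this report, while the category names and the `classify`/`summarize` helpers are made up for illustration.

```python
import re

# Substrings taken from the amdgpu messages in this report's log excerpt.
# The category names on the left are hypothetical labels, not kernel terms.
PATTERNS = {
    "page_fault": re.compile(r"retry page fault"),
    "fence_timeout": re.compile(r"Waiting for fences timed out"),
    "soft_recovered": re.compile(r"ring \S+ timeout, but soft recovered"),
    "ring_timeout": re.compile(r"ring \S+ timeout, signaled seq=\d+, emitted seq=\d+"),
    "reset_begin": re.compile(r"GPU reset begin"),
    "reset_succeeded": re.compile(r"GPU reset\(\d+\) succeeded"),
}


def classify(line):
    """Return the first category whose pattern matches the line, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts


# A few lines from the excerpt in this report, used as sample input.
SAMPLE = [
    "[drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered",
    "amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!",
    "amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!",
]

for name, count in summarize(SAMPLE).items():
    print(f"{name}: {count}")
```

To run it on a real log, pass an open file (or any iterable of lines) to `summarize` instead of `SAMPLE`. A run with `soft_recovered` hits but no `reset_begin` corresponds to the case where the desktop comes back on its own.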
Attached files:
Oct 1 21:15:45 foxglove kernel: [ 220.126969] amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32771, for process exefile.exe pid 6623 thread exefile.ex:cs0 pid 6653)
Oct 1 21:15:45 foxglove kernel: [ 220.126977] amdgpu 0000:0c:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 27
Oct 1 21:15:45 foxglove kernel: [ 220.126981] amdgpu 0000:0c:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601431
Oct 1 21:15:45 foxglove kernel: [ 220.126984] amdgpu 0000:0c:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
Oct 1 21:15:45 foxglove kernel: [ 220.126986] amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x1
Oct 1 21:15:45 foxglove kernel: [ 220.126989] amdgpu 0000:0c:00.0: amdgpu: WALKER_ERROR: 0x0
Oct 1 21:15:45 foxglove kernel: [ 220.126991] amdgpu 0000:0c:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Oct 1 21:15:45 foxglove kernel: [ 220.126993] amdgpu 0000:0c:00.0: amdgpu: MAPPING_ERROR: 0x0
Oct 1 21:15:45 foxglove kernel: [ 220.126995] amdgpu 0000:0c:00.0: amdgpu: RW: 0x0
...
Oct 1 21:15:55 foxglove kernel: [ 225.322312] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
...
Oct 1 21:15:55 foxglove kernel: [ 230.186377] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
...
Oct 1 21:19:45 foxglove kernel: [ 166.643252] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 1 21:19:45 foxglove kernel: [ 171.783430] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18164, emitted seq=18166
Oct 1 21:19:45 foxglove kernel: [ 171.783479] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Battle.net.exe pid 5271 thread Battle.net.exe pid 5717
Oct 1 21:19:45 foxglove kernel: [ 171.783482] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Oct 1 21:19:45 foxglove kernel: [ 171.909600] [drm] free PSP TMR buffer
Oct 1 21:19:45 foxglove kernel: [ 171.942524] amdgpu 0000:0c:00.0: amdgpu: MODE2 reset
Oct 1 21:19:45 foxglove kernel: [ 171.942731] amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 1 21:19:45 foxglove kernel: [ 171.942854] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
Oct 1 21:19:45 foxglove kernel: [ 171.943005] [drm] PSP is resuming...
Oct 1 21:19:45 foxglove kernel: [ 171.962871] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
Oct 1 21:19:46 foxglove kernel: [ 172.338338] amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
Oct 1 21:19:46 foxglove kernel: [ 172.338561] amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
Oct 1 21:19:46 foxglove kernel: [ 172.517862] [drm] kiq ring mec 2 pipe 1 q 0
Oct 1 21:19:46 foxglove kernel: [ 172.531014] [drm] DMUB hardware initialized: version=0x01000000
Oct 1 21:19:46 foxglove kernel: [ 172.581268] [drm] Failed to add display topology, DTM TA is not initialized.
Oct 1 21:19:46 foxglove kernel: [ 172.599979] [drm] VCN decode and encode initialized successfully(under DPG Mode).
Oct 1 21:19:46 foxglove kernel: [ 172.600277] [drm] JPEG decode initialized successfully.
Oct 1 21:19:46 foxglove kernel: [ 172.600280] amdgpu 0000:0c:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600281] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600282] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600283] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600283] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600284] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600285] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600286] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600286] amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600287] amdgpu 0000:0c:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Oct 1 21:19:46 foxglove kernel: [ 172.600288] amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
...
Oct 1 21:19:46 foxglove kernel: [ 172.605079] amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
Oct 1 21:19:46 foxglove kernel: [ 172.605080] amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
Oct 1 21:19:46 foxglove kernel: [ 172.605100] amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!
See bug.tar.gz for full logs requested in https://amdgpu-install.readthedocs.io/en/latest/install-bugrep.html
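For anyone cross-checking the page fault lines above, here is a small sketch that splits the raw VM_L2_PROTECTION_FAULT_STATUS value into its fields. The bit positions are an assumption on my part (not taken from the log or from any document linked here); I only checked that they reproduce the field values the kernel printed for 0x00601431, including the vmid and the client ID.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS: (name, shift, width).
# This layout is an assumption; it is only validated against the values
# the kernel printed in the log excerpt of this report.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),    # UTCL2 client ID, printed as "SQC (data) (0xa)"
    ("RW", 18, 1),
    ("VMID", 20, 4),  # printed as "vmid:6" in the page fault line
]


def decode(status):
    """Extract each assumed field from the raw 32-bit status value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


# Raw value from the log line "VM_L2_PROTECTION_FAULT_STATUS:0x00601431".
fields = decode(0x00601431)
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

With the value from this report, the decoded fields agree with the log: MORE_FAULTS 0x1, WALKER_ERROR 0x0, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x0, RW 0x0, client ID 0xa and VMID 6.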