[RadeonSI/HadesCanyon] GPU faults with all(?) 3D programs
System information
- OS: Ubuntu 20.04 LTS
- HW: Intel HadesCanyon NUC
- GPU: Polaris 22 XT [Radeon RX Vega M GH] (rev c0)
- Desktop manager and compositor: Unity / Compiz
- Mesa version: Mesa 20.2.0-devel git (see below)
- Kernel version: drm-git 5.6.0 ("drm-tip: 2020y-03m-30d-16h-57m-53s UTC integration manifest"), or newer
- Xserver version: 1.20.99.1 (b56e501092, 2020-03-23, "glx: fixup symbol name for get_extensions function"), or newer
Describe the issue
Dmesg is full of GPU fault messages, starting at the (automatic) LightDM desktop login:
[ 6.657673] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0b41650c for process compiz pid 1356 thread Compiz:cs0 pid 1411
[ 6.657677] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00105D68
[ 6.657678] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C16500C
[ 6.657681] amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 32773) at page 1072488, read from 'DBH3' (0x44424833) (357)
[ 6.657688] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0b41150c for process compiz pid 1356 thread Compiz:cs0 pid 1411
[ 6.657690] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00105D68
[ 6.657691] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D155014
[ 6.657693] amdgpu 0000:01:00.0: amdgpu: VM fault (0x14, vmid 6, pasid 32773) at page 1072488, write from 'DBH5' (0x44424835) (341)
[ 6.657778] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0b48550c for process compiz pid 1356 thread Compiz:cs0 pid 1411
[ 6.657780] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 6.657782] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C0E500C
[ 6.657784] amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 32773) at page 0, read from 'DBH1' (0x44424831) (229)
[ 6.657791] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0b48650c for process compiz pid 1356 thread Compiz:cs0 pid 1411
...
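To compare faults across boots and Mesa builds, the "VM fault" summary lines can be reduced to a few fields. Below is a minimal sketch, assuming the message format is exactly as in the excerpt above (other kernel versions may print it differently). The helper name `parse_vm_faults` and the dictionary keys are made up for illustration and are not part of any amdgpu tooling.

```python
import re

# Pattern derived only from the "VM fault" lines quoted in this report;
# it is an assumption that other kernels use the same layout.
VM_FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) "
    r"at page (\d+), (read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def parse_vm_faults(dmesg_text):
    """Return one dict per 'VM fault' line found in the given dmesg text."""
    faults = []
    for line in dmesg_text.splitlines():
        m = VM_FAULT_RE.search(line)
        if m is None:
            continue  # not a VM fault summary line
        page = int(m.group(4))
        faults.append({
            "vmid": int(m.group(2)),
            "pasid": int(m.group(3)),
            "page": page,
            # Hex form, for comparison with VM_CONTEXT1_PROTECTION_FAULT_ADDR
            "page_hex": hex(page),
            "access": m.group(5),
            "client": m.group(6),
            # The hex tag is the client name as ASCII bytes; decode it to cross-check
            "client_from_tag": bytes.fromhex(m.group(7)[2:]).decode("ascii", "replace"),
        })
    return faults


# Three summary lines copied from the excerpt above
SAMPLE = """\
[ 6.657681] amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 32773) at page 1072488, read from 'DBH3' (0x44424833) (357)
[ 6.657693] amdgpu 0000:01:00.0: amdgpu: VM fault (0x14, vmid 6, pasid 32773) at page 1072488, write from 'DBH5' (0x44424835) (341)
[ 6.657784] amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 6, pasid 32773) at page 0, read from 'DBH1' (0x44424831) (229)
"""

faults = parse_vm_faults(SAMPLE)
for f in faults:
    print(f["access"], f["client"], f["page_hex"], "vmid", f["vmid"])
```

In the excerpt, the decimal page number in each summary line is the same value as the hex VM_CONTEXT1_PROTECTION_FAULT_ADDR printed two lines before it (1072488 is 0x105D68), so grouping by `page_hex` should be enough to tell whether the same addresses fault on every boot.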
Regression
I have test data only with drm-tip kernel 5.6.0 or newer, so I don't know whether something in the latest drm-tip kernels is needed to trigger it, but this regression is definitely due to a Mesa change. It seems to have been introduced about a week ago.
Unfortunately it doesn't seem to happen with every boot or Mesa/drm-tip combo, but from the data I have, I'm pretty sure the regression was introduced within the one day between the following Mesa commits:
- 2020-04-28 523e9603 radv: enable FMASK for color attachments only
- 2020-04-29 e581ddee intel/fs: Don't delete coalesced MOVs if they have a cmod
Whether Ubuntu 20.04 + LLVM v9 or Ubuntu 18.04 + LLVM v8 is used doesn't seem to affect it, nor does whether the drm-git kernel is 5.6.0 or 5.7.0-rc3.