SDMA Errors
Brief summary of the problem:
Running OpenCL compute tasks results in a flood of SDMA errors that eventually leads to a kernel lockup, forcing a hard restart.
Hardware description:
- CPU: Threadripper 3960X
- GPU: 4x Radeon VII
- System Memory: 128 GB G.Skill Ripjaws
- Display(s): 3
- Type of Display Connection: 2x DP, 1x HDMI
System information:
- Distro name and Version: Arch Linux
- Kernel version: 5.11.15-arch1-1
- Custom kernel: amd-staging-drm-next 5.12.987234.b54280b32ebb-1-x86_64
How to reproduce the issue:
Run OpenCL compute tasks with ROCm 4.1 and the latest amd-staging-drm-next kernel. As soon as the OpenCL kernel is loaded and the compute task starts, the journal is flooded with SDMA errors, and the system eventually locks up.
Error:
amdgpu: SDMA gets an Register Write SRBM_WRITE command in non-privilege command buffer
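To gauge how fast the errors accumulate before the lockup, the kernel log can be scanned for the message above. The following is a minimal sketch, not part of the original report: it assumes a systemd system where `journalctl -k -b --no-pager` prints the kernel messages of the current boot, and the names `SDMA_ERROR`, `count_sdma_errors`, and `read_kernel_log` are hypothetical helpers introduced for illustration.

```python
import subprocess

# Substring copied from the error line in this report.
SDMA_ERROR = "SDMA gets an Register Write SRBM_WRITE command"


def count_sdma_errors(lines):
    """Return how many of the given log lines contain the SDMA error."""
    return sum(1 for line in lines if SDMA_ERROR in line)


def read_kernel_log():
    """Read the current boot's kernel messages (assumes systemd's journalctl)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `count_sdma_errors(read_kernel_log())` shortly after the compute task starts gives the number of SDMA errors logged so far in the current boot.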
Running this kernel does fix the issue with running a Radeon VII on ROCm 4.1, and it also fixes a memory leak that I previously had on kernels > 5.9.14.