AMD Ryzen 7 4800H iGPU reset when any GPGPU (ROCm) process exits
Brief summary of the problem:
GPGPU applications built on top of ROCm (HIP or OpenCL) appear to run correctly, but the moment the process exits the iGPU goes through a GPU reset. It does not matter which application is using GPU compute. However, the problem seems to be related to kernel submission: it happens with a simple "clpeak" run (which works fine up until the exit), but "rocm-bandwidth-test" does not trigger a GPU reset.
This issue is easy to reproduce and persists across many kernel and ROCm versions (up to ROCm 4.3.1 and kernel 5.14.11 as packaged in OpenSUSE Tumbleweed). The ROCm installation uses the upstream AMDGPU driver.
The relevant part of the system log appears to be (full log attached below):
Oct 22 21:03:45 Lenovo-Legion-5 kernel: Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
Oct 22 21:03:45 Lenovo-Legion-5 kernel: amdgpu_bo_unref+0x1a/0x30 [amdgpu 6ab19f992d715cc3ec06ce19987cd61f7e1fbad3]
Oct 22 21:03:45 Lenovo-Legion-5 kernel: amdgpu_gem_object_free+0x30/0x50 [amdgpu 6ab19f992d715cc3ec06ce19987cd61f7e1fbad3]
Oct 22 21:03:45 Lenovo-Legion-5 kernel: amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x326/0x390 [amdgpu 6ab19f992d715cc3ec06ce19987cd61f7e1fbad3]
Oct 22 21:03:45 Lenovo-Legion-5 kernel: kfd_process_device_free_bos+0x9d/0xe0 [amdgpu 6ab19f992d715cc3ec06ce19987cd61f7e1fbad3]
Oct 22 21:03:45 Lenovo-Legion-5 kernel: kfd_process_wq_release+0x20d/0x2e0 [amdgpu 6ab19f992d715cc3ec06ce19987cd61f7e1fbad3]
Oct 22 21:08:01 Lenovo-Legion-5 kernel: amdgpu: qcm fence wait loop timeout expired
Oct 22 21:08:01 Lenovo-Legion-5 kernel: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
Oct 22 21:08:01 Lenovo-Legion-5 kernel: amdgpu: Failed to evict process queues
Oct 22 21:08:01 Lenovo-Legion-5 kernel: amdgpu: Failed to quiesce KFD
Oct 22 21:08:01 Lenovo-Legion-5 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
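To make it easier to spot this signature in a long log, here is a minimal Python sketch that scans log lines for the messages quoted above. The marker strings are copied from this excerpt; matching on them is only a heuristic, and the function name `find_reset_events` is made up for illustration. It assumes the short journalctl timestamp format seen in the excerpt.

```python
import re

# Message fragments copied from the log excerpt in this report.
# Matching on them is a heuristic, not an official amdgpu interface.
RESET_MARKERS = (
    "qcm fence wait loop timeout expired",
    "Failed to evict process queues",
    "Failed to quiesce KFD",
    "GPU reset begin!",
)

# Timestamp prefix as printed in the excerpt, e.g. "Oct 22 21:08:01".
TIMESTAMP_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})")


def find_reset_events(log_lines):
    """Return (timestamp, marker) pairs for lines containing a known marker."""
    events = []
    for line in log_lines:
        for marker in RESET_MARKERS:
            if marker in line:
                match = TIMESTAMP_RE.match(line)
                # Timestamp is None if the line does not use the assumed format.
                events.append((match.group(1) if match else None, marker))
                break
    return events
```

Feeding it the output of the kernel log, split into lines, gives a compact list of when each stage of the failure was logged.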
Hardware description:
- CPU: AMD Ryzen 7 4800H
- GPU: (gfx902) 06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir [1002:1636] (rev c6)
- System Memory: 64GB
System information:
- Distro name and Version: OpenSUSE Tumbleweed
- Kernel version: Linux Lenovo-Legion-5 5.14.11-2-default #1 SMP Sun Oct 10 08:34:34 UTC 2021 (834dddd) x86_64 x86_64 x86_64 GNU/Linux
- Custom kernel: N/A (upstream/default from OpenSUSE Tumbleweed)
- AMD official driver version: N/A
- ROCm 4.3.1 (from AMD repositories without DKMS driver)
How to reproduce the issue:
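Run any GPGPU application built on ROCm that submits kernels (for example "clpeak") and let it exit. The GPU reset is triggered when the process exits.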
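The reproduction can be wrapped in a small Python sketch that runs a command to completion and checks whether new "GPU reset begin!" lines appear in the kernel log afterwards. This is only an illustration under a few assumptions: the kernel log is readable via `journalctl -k` (which may need root or membership in the `systemd-journal` group), the helper names are hypothetical, and the wait after exit (`settle_seconds`) is a guess, since the delay between process exit and the reset message may vary.

```python
import subprocess
import time

RESET_MARKER = "GPU reset begin!"  # string taken from the log in this report


def read_kernel_log():
    """Read the current boot's kernel log; assumes systemd's journalctl exists."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()


def count_resets(lines):
    """Count log lines that contain the reset marker."""
    return sum(RESET_MARKER in line for line in lines)


def run_and_check(command, read_log=read_kernel_log, settle_seconds=0):
    """Run command to completion; return (exit code, number of new reset lines)."""
    before = count_resets(read_log())
    exit_code = subprocess.run(command, check=False).returncode
    # Give the kernel some time to log the reset after the process has exited.
    time.sleep(settle_seconds)
    after = count_resets(read_log())
    return exit_code, after - before
```

For example, `run_and_check(["clpeak"], settle_seconds=10)` would be expected to report at least one new reset line on an affected system, while `run_and_check(["rocm-bandwidth-test"], settle_seconds=10)` should report none, according to the behaviour described in the summary.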