Can reliably trigger GPU reset with a small OpenCL program
Brief summary of the problem:
I'm toying with OpenCL on my desktop, and can reliably cause a GPU reset when my CL kernel runs a loop for too long (a few seconds, about 25 million iterations) on the GPU. The same GPU also drives Xorg, so I lose my desktop session each time.
Signature line in dmesg:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=11628, emitted seq=11630
I'm using the AMD OpenCL libs installed with amdgpu-install --usecase=opencl on Debian stable with the stock 6.1 kernel. Maybe that's a problem?
Could be a duplicate of #2643.
Hardware description:
- CPU: Intel i5-3570K
- GPU: RX 6700XT
- System Memory: 16 GB
- Display(s): 1
- Type of Display Connection: HDMI
System information:
- Distro name and Version: Debian 12.5
- Kernel version: Linux erda 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
- Custom kernel: N/A
- AMD official driver version: Mesa 22.3.6 from Debian, OpenCL libs 6.0.60001-1710620.20.04 from AMD
How to reproduce the issue:
Here is the program I reproduce with: mandel.tar.gz
Compile and run with make. The key line is #define MAXITER ..., where large values cause a GPU reset. With small values, the program runs fine and gives correct results.
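A rough illustration of why MAXITER matters, assuming the kernel is a standard Mandelbrot escape-time loop (as the archive name suggests). This is a CPU-side C sketch, not the code from mandel.tar.gz: the function names, the grid mapping, and the MAXITER value are all illustrative. The point is that any pixel inside the set runs the loop for the full MAXITER iterations, so the time a single dispatch spends on the GPU grows roughly linearly with MAXITER.

```c
#include <assert.h>

/* Illustrative value only; the real one lives in the attached program. */
#define MAXITER 1000

/* Escape-time iteration for one point c = (cr, ci).
 * Returns the number of iterations performed, capped at MAXITER.
 * Points inside the set never escape and cost the full MAXITER. */
static int escape_iterations(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    int i;

    for (i = 0; i < MAXITER; i++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;

        if (zr2 + zi2 > 4.0)        /* |z| > 2: the point has escaped */
            break;

        zi = 2.0 * zr * zi + ci;    /* imaginary part of z*z + c */
        zr = zr2 - zi2 + cr;        /* real part of z*z + c */
    }
    return i;
}

/* Total loop iterations over a width x height grid covering
 * [-2, 1] x [-1, 1]. On a GPU each pixel would be one work-item,
 * so this approximates the work done by a single kernel launch. */
static long long total_iterations(int width, int height)
{
    long long total = 0;

    assert(width > 0 && height > 0);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            double cr = -2.0 + 3.0 * (x + 0.5) / width;
            double ci = -1.0 + 2.0 * (y + 0.5) / height;
            total += escape_iterations(cr, ci);
        }
    }
    return total;
}
```

Calling escape_iterations(0.0, 0.0) returns MAXITER because the origin never escapes, while a point far outside the set such as (2.0, 2.0) returns after a single iteration. Raising MAXITER therefore lengthens only the in-set work-items, which is enough to stretch one launch past whatever timeout the driver applies.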