kernel memory corruption when running a lot of OpenCL tests in parallel
steps to reproduce:
- compile rusticl from https://gitlab.freedesktop.org/karolherbst/mesa/-/commits/rusticl/wip_next (Intel's OpenCL runtime might trigger the same bug; I can't get any kernel logs with it, but it crashes my entire machine as well)
- get the OpenCL CTS https://github.com/KhronosGroup/OpenCL-CTS
- download my OpenCL CTS runner (it has some hardcoded paths for the OpenCL CTS though): https://gitlab.freedesktop.org/karolherbst/opencl_cts_runner
- run it using rusticl or Intel's CL runtime:
OCL_ICD_VENDORS=....icd CLOVER_ENABLE_CL=1 run_local_mesa ../opencl_cts_runner/clctsrunner.py -w -j 24 > /dev/null
set OCL_ICD_VENDORS and whatever else is needed for a local mesa copy
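The invocation above can be sketched as a small Python helper, in case someone wants to script repeated runs while trying to reproduce this. It is only a sketch under assumptions: `build_invocation` and `run` are hypothetical names that are not part of the runner, `run_local_mesa` is my own wrapper script and has to be on PATH, and the ICD path is left as a parameter because it depends on the local setup. The `-w` and `-j 24` flags are copied verbatim from the command line above.

```python
import os
import subprocess


def build_invocation(icd_path,
                     runner="../opencl_cts_runner/clctsrunner.py",
                     jobs=24,
                     wrapper="run_local_mesa"):
    """Return (env, argv) mirroring the command line from the report.

    icd_path is supplied by the caller; the report does not give a
    concrete path, so none is assumed here.
    """
    env = dict(os.environ)
    env["OCL_ICD_VENDORS"] = icd_path   # which ICD file the loader should use
    env["CLOVER_ENABLE_CL"] = "1"       # copied from the report's command line
    argv = [wrapper, runner, "-w", "-j", str(jobs)]
    return env, argv


def run(icd_path, **kwargs):
    """Run the CTS runner once, discarding stdout like '> /dev/null'."""
    env, argv = build_invocation(icd_path, **kwargs)
    result = subprocess.run(argv, env=env, stdout=subprocess.DEVNULL)
    return result.returncode
```

Calling `run("/path/to/your.icd")` in a loop would repeat the stress run; lowering `jobs` might help to check whether the crash depends on the amount of parallelism, though I have not verified that.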
- CPU: i7-12700
- GPU: 00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)
- Arch: x86_64
- Linux: drm-tip
- Distribution: Fedora 35
- Kernel Log (via serial console): dmesg
Some say there is a kernel race condition related to context creation and execbuffer submission, which could be the cause here: with the script above I submit a lot of jobs and create tons of contexts.
Edited by Karol Herbst