Kernel memory corruption when running piglit with OA enabled
When GPU metrics (OA) are enabled, a small set of piglit tests can quickly corrupt kernel memory and/or hang the machine unrecoverably.
dmesg will often point to a backtrace that includes i915_oa_stream_destroy, but sometimes does not reference OA (see attached logs). I have not been able to reproduce the same error when OA is disabled. Any root execution of piglit will reproduce the bug, because OA is always enabled for the root user.
This issue hampers Mesa development: Mesa CI hardware is constantly taken offline because our test systems are corrupted and left in a state where the kernel cannot shut down or reboot. Over the span of a week, we have to remove power from, or hold the power button on, about 40% of our fleet.
Based on CI data, this bug appears to affect at least gen9 through gen12. It may affect hardware as old as HSW, but we don't have clear data. Kernels from the last 12 months are affected, at least up through the current 5.17.9. The issue should be easy to bisect in the kernel.
To reproduce:
- build this tree in piglit, which isolates a small number of tests: https://gitlab.freedesktop.org/majanes/piglit/-/tree/panic
- enable OA for non-root users:
sudo sysctl dev.i915.perf_stream_paranoid=0
- capture dmesg in a way that will survive a system hang:
sudo dmesg -w | tee panic.log
- run the 'panic' suite in a loop:
for ((;;)) do piglit run panic /tmp/panic; done
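The steps above can be combined into one script. This is only a sketch, not part of the piglit tree or the CI tooling: the function names and the PARANOID_FILE and LOG variables are made up for illustration. It assumes that piglit from the 'panic' tree is on PATH and that the user has sudo rights.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. All function and variable names here
# are illustrative assumptions, not part of piglit or i915.

# procfs view of the dev.i915.perf_stream_paranoid sysctl (overridable).
PARANOID_FILE="${PARANOID_FILE:-/proc/sys/dev/i915/perf_stream_paranoid}"
LOG="${LOG:-panic.log}"

# Succeeds if non-root users may open OA streams (paranoid == 0).
oa_open_to_users() {
    [ -r "$PARANOID_FILE" ] && [ "$(cat "$PARANOID_FILE")" = "0" ]
}

# Succeeds if a captured log contains the backtrace symbol named in this report.
log_mentions_oa() {
    grep -q 'i915_oa_stream_destroy' "$1"
}

reproduce() {
    sudo -v   # cache sudo credentials before anything runs in the background
    oa_open_to_users || sudo sysctl dev.i915.perf_stream_paranoid=0
    # Follow the kernel log; it is echoed to the terminal and written to $LOG.
    sudo dmesg -w | tee "$LOG" &
    # Run the 'panic' suite until the machine hangs or the user interrupts.
    while true; do
        piglit run panic /tmp/panic
    done
}
```

To use it, source the file and call reproduce. After a forced reboot, log_mentions_oa panic.log reports whether the captured log references OA. Some failures do not, so a negative result does not rule out this bug.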
If piglit execution is a barrier for kernel devs, Mesa CI's docker containers can be used to set up an environment: https://gitlab.freedesktop.org/Mesa_CI/mesa_jenkins/-/tree/master/docker
To run piglit under docker, the steps are:
$ git clone https://gitlab.freedesktop.org/Mesa_CI/mesa_jenkins.git
$ cd mesa_jenkins
mesa_jenkins$ python3 fetch_sources.py --project piglit-test piglit=97cbd16527c2c95e54716e1e86fb0ff687b42808
Ignore any errors about access to internal sources; they are not needed. Wait for all sources to be cloned. Then:
mesa_jenkins$ cd docker
mesa_jenkins/docker$ docker-compose build conformance_m64
mesa_jenkins/docker$ docker-compose run --rm conformance_m64 /bin/bash
/sources# python3 scripts/build_local.py --project piglit-test --action build
/sources# export LD_LIBRARY_PATH=/tmp/build_root/m64/lib:/tmp/build_root/m64/lib/dri:/tmp/build_root/m64/lib/x86_64-linux-gnu:/tmp/build_root/m64/lib/x86_64-linux-gnu/dri:/tmp/build_root/m64/lib64:/tmp/build_root/m64/lib64/dri:/usr/lib:/usr/lib/dri:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/dri:/usr/lib64:/usr/lib64/dri
/sources# export LIBGL_DRIVERS_PATH=/tmp/build_root/m64/lib64/dri:/tmp/build_root/m64/lib/x86_64-linux-gnu/dri:/tmp/build_root/m64/lib/dri:/usr/lib64/dri:/usr/lib/x86_64-linux-gnu/dri:/usr/lib/dri
/sources# export DISPLAY=:0
/sources# export PATH=$PATH:/tmp/build_root/m64/bin/
/sources# for ((;;)) do piglit run panic /tmp/panic; done