Lax gem cache policy triggers system reset when running VulkanCTS
This issue comes out of an investigation into why Mesa CI test systems (BDW-TGL) were spontaneously rebooting when running VulkanCTS.
Polling /sys/kernel/debug/dri/0/i915_gem_objects reveals that resident graphics memory climbs incrementally during a conformance test, eventually consuming several GB over the course of minutes to hours. On systems that are also under CPU memory pressure, combined CPU and GPU memory use exceeds what is available and triggers an immediate host reset. Instrumenting VulkanCTS execution reveals that some tests leave as much as 1GB of gem objects resident on the system (see mesa/mesa#6896 (closed)).
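The polling described above can be sketched as a small shell helper. This is a minimal illustration, not the script used in the investigation: the function names sample_gem_objects and poll_gem_objects and the 5-second default interval are assumptions, and the helper only echoes the first line of the file rather than interpreting its format.

```shell
#!/bin/sh
# Hypothetical helper: print a timestamp followed by the first line of the
# GEM object summary. The path defaults to the debugfs file named in this
# issue; reading it normally requires root and a mounted debugfs.
sample_gem_objects() {
    path="${1:-/sys/kernel/debug/dri/0/i915_gem_objects}"
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$(head -n 1 "$path")"
}

# Hypothetical helper: sample repeatedly until interrupted or until the file
# can no longer be read. $1 is the path, $2 the interval in seconds
# (5 by default, an arbitrary choice).
poll_gem_objects() {
    while sample_gem_objects "$1"; do
        sleep "${2:-5}"
    done
}
```

Running poll_gem_objects as root in a second terminal while the CTS executes should make the growth visible as a steadily increasing byte count.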
Forcing the kernel to drop cached gem objects (echo 0x3FF > /sys/kernel/debug/dri/0/i915_gem_drop_caches) recovers most of the available memory and prevents the system reset.
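As a rough way to confirm the effect of the workaround, the drop can be wrapped so that the object summary is printed before and after. This is a sketch under stated assumptions: the function name drop_gem_caches_and_report is hypothetical, the two debugfs paths and the 0x3FF mask are the ones quoted in this issue, and the script must run as root.

```shell
#!/bin/sh
# Hypothetical helper: show the first line of the GEM object summary, ask
# the kernel to drop cached GEM objects, then show the summary again.
# $1: object summary file, $2: drop_caches file, $3: mask to write.
drop_gem_caches_and_report() {
    objects="${1:-/sys/kernel/debug/dri/0/i915_gem_objects}"
    drop="${2:-/sys/kernel/debug/dri/0/i915_gem_drop_caches}"
    mask="${3:-0x3FF}"

    echo "before: $(head -n 1 "$objects")"
    # Same write as the manual command quoted in this issue.
    echo "$mask" > "$drop" || return 1
    echo "after:  $(head -n 1 "$objects")"
}
```

Calling drop_gem_caches_and_report between CTS test groups is one possible stopgap for CI machines until cache cleanup in the kernel improves.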
The tests listed in mesa issue 6896 (see attached csv) were identified on Linux 5.18. On 5.16, the CTS instrumentation reports fewer tests, presumably because gem cache cleanup is more prompt in the older kernel.
There are several critical issues here:
- Objects cached in GPU memory should never accumulate to such an extent, over such a long duration. The garbage collection process responsible for this cache is wholly inadequate.
- When a system is under CPU + GPU memory pressure, the result should never be an immediate system reset. No warnings of any kind (oom, etc) are printed to dmesg; the system resets directly. We are fortunate to have even found the cause of the reset.
- Apart from improved incremental garbage collection, caches should be dropped as part of the error path when memory is not available for GPU allocation.
- System memory allocated to the GPU is not accounted for in any tool that developers are likely to know of (top, meminfo, etc).