
Use new ringMonitoring feature to detect renderer crashes

Ryan Neph requested to merge ryanneph/mesa:handle-crash into main

When a Venus renderer advertises the new ringMonitoring feature, the driver may configure periodic ring health monitoring that works robustly regardless of whether the ring(s)/renderer command streams are currently blocked by an async-wait barrier on the renderer side (see !21716 (merged)).

It works as follows:

  1. During ring creation, the driver checks the ringMonitoring feature. If it is supported, the driver chains VkRingMonitorInfoMESA to VkRingCreateInfoMESA::pNext with a uint32_t maxReportingPeriodMicroseconds.
    • maxReportingPeriodMicroseconds is the longest interval the renderer is permitted to wait between successive ring "alive" reports.
    • the driver must wait at least as long as maxReportingPeriodMicroseconds before checking the most recent report. In practice we ensure this with an extra margin of 0.25s.
    • the actual timing of the driver's report check is dictated by the timing of vn_relax()'s "warn_order" (hardcoded params and a runtime assert ensure this is >= maxReportingPeriodMicroseconds).
  2. Every driver-side ring wait that reaches a "warn_order" iteration checks the ring's "alive" status in addition to the existing FATAL status check.
    • only one guest thread currently in a ring-wait tests the shared ALIVE status bit directly, setting an internal atomic_bool alive to match the last confirmed status. It also clears the ALIVE status bit, which the renderer must set again before the next test by this thread.
    • all other waiting guest threads indirectly check the ring health by testing the atomic_bool alive instead.
  3. If the renderer fails to report by setting the ALIVE status bit, the driver will call abort() during the next "warn_order" iteration performed by the single monitoring guest thread.
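
The steps above can be sketched roughly as follows. This is a minimal illustration only, not the code in this MR: the struct, the function names, the bit positions, and the 3-second reporting period are all hypothetical, and only the 0.25s margin comes from the description. The sketch returns false where the real driver would call abort(), so the caller decides what to do.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions in the ring's shared status word. */
#define RING_STATUS_FATAL (1u << 0)
#define RING_STATUS_ALIVE (1u << 1)

/* Hypothetical reporting period; the 0.25s margin is from the description. */
#define MAX_REPORTING_PERIOD_US 3000000u
#define CHECK_MARGIN_US 250000u

struct ring_sketch {
   _Atomic uint32_t shared_status; /* written by both renderer and driver */
   atomic_bool alive;              /* driver-internal, last confirmed state */
};

/* Stand-in for the runtime assert: the driver must not check more often
 * than the renderer is required to report, plus the safety margin. */
static bool
ring_check_period_is_safe(uint32_t check_period_us)
{
   return check_period_us >= MAX_REPORTING_PERIOD_US + CHECK_MARGIN_US;
}

/* Run by the single monitoring thread on each "warn_order" iteration.
 * Atomically clears ALIVE and inspects the previous value, so the renderer
 * has to set the bit again before the next test. */
static bool
ring_monitor_test_alive(struct ring_sketch *ring)
{
   uint32_t prev =
      atomic_fetch_and(&ring->shared_status, ~RING_STATUS_ALIVE);
   bool alive = (prev & RING_STATUS_ALIVE) != 0;

   /* Publish the result for every other waiting thread. */
   atomic_store(&ring->alive, alive);
   return alive;
}

/* Run by all other waiting threads: they never touch the shared bit. */
static bool
ring_waiter_test_alive(struct ring_sketch *ring)
{
   return atomic_load(&ring->alive);
}
```

Using a single read-and-clear operation in the monitoring thread means a report that arrives between the test and the clear cannot be lost, which is one plausible reason to restrict direct access to the shared bit to one thread.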

See !21542 (932d80f3, comment 1807535) for earlier design discussion.

Related Changes:

Edited by Ryan Neph
