Draft: gitlab-ci: Timeout detection for LAVA jobs

Guilherme Gallo requested to merge gallo/mesa:ci-timeout-detection into main

Motivation

When a device running a LAVA job hangs, the CI waits for the full job timeout (30 minutes) before terminating the job.

Empirically, when a device produces no output for 5 minutes, it has usually hung and will not finish the job.

So, to optimize the CI workflow, it would be good to wrap each LAVA job execution in a shorter timeout with some retries. If a job attempt times out, cancel it and submit a new one. If all attempts time out, consider the job failed.

In addition, we discovered that the scheduler.jobs.logs RPC calls don't block, even when there is no new output to get. The fix for this, a sleep(1) after every RPC call made in follow_job_execution, was addressed separately in !12797 (merged).

Timeout Parameters

  • Timeout duration: 5 minutes
  • Number of retries: 2
  • Timeout reset action: whenever a new job log is fetched
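
The parameters above can be illustrated with a minimal sketch. This is not the code from this MR: the names `follow_job`, `run_with_retries`, `JobHangError`, and the job methods `fetch_new_logs`, `is_finished`, and `cancel` are hypothetical, and the sketch assumes that "2 retries" means one initial attempt plus two more. It only shows the idea of resetting the timeout whenever new log output arrives, by comparing log timings instead of relying on process signals.

```python
import time

TIMEOUT_SECONDS = 5 * 60  # from the parameters above: 5 minutes without output
RETRIES = 2               # from the parameters above (assumed: extra attempts)


class JobHangError(Exception):
    """Raised when a job produces no log output for longer than the timeout."""


def follow_job(fetch_new_logs, is_finished, timeout=TIMEOUT_SECONDS,
               poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Collect log lines until the job finishes or stays silent for too long."""
    lines = []
    last_output = clock()
    while not is_finished():
        new_lines = fetch_new_logs()
        if new_lines:
            lines.extend(new_lines)
            last_output = clock()  # new log output resets the timeout
        elif clock() - last_output > timeout:
            raise JobHangError(f"no output for more than {timeout} seconds")
        sleep(poll_interval)  # the log call does not block, so pause between polls
    lines.extend(fetch_new_logs())  # pick up anything logged right at the end
    return lines


def run_with_retries(submit_job, retries=RETRIES, **follow_kwargs):
    """Submit a job; on a hang, cancel it and submit a fresh one."""
    for _ in range(retries + 1):
        job = submit_job()  # hypothetical: returns an object for one job
        try:
            return follow_job(job.fetch_new_logs, job.is_finished,
                              **follow_kwargs)
        except JobHangError:
            job.cancel()
    raise JobHangError(f"job hung in all {retries + 1} attempts")
```

Passing `clock` and `sleep` as parameters is a design choice of this sketch: it lets the timeout logic be tested with a fake clock, without waiting 5 minutes of real time.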

TODO

- [ ] Increase JWT timeout to 8h. No need for this, as it is possible to run the main pieces of the Mesa CI locally without a JWT token.

  • Simplify timeout logic by checking log output timings, instead of using the Linux process signaling.
    • Implemented
    • Reviewed
  • Add a step and dependencies on Mesa CI for testing internal CI modules with pytest.
    • Implemented
    • Reviewed
  • Wait for LAVA job to start before checking for timeouts.
    • Implemented
    • Reviewed
Edited by Guilherme Gallo
