Draft: gitlab-ci: Timeout detection for LAVA jobs
Motivation
When a device running a LAVA job hangs, the CI waits for the full job timeout (30 minutes) before terminating the job.
Empirically, when a device produces no output for 5 minutes, it has generally hung and will not finish the job.
So, to optimize the CI workflow, it would be good to wrap each LAVA job execution in a timeout with some retries. When an attempt times out, cancel the job and submit a new one; if every attempt fails, consider the job failed.
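The retry idea above can be sketched roughly as follows. This is only an illustration, not the code in this MR: `JobHangError`, `run_with_retries`, `submit_and_follow` and `cancel` are hypothetical names, and the sketch assumes that "2 retries" means one initial attempt plus two more.

```python
from typing import Callable


class JobHangError(Exception):
    """Hypothetical error raised when a job produces no output in time."""

    def __init__(self, job_id: int) -> None:
        super().__init__(f"job {job_id} produced no output in time")
        self.job_id = job_id


def run_with_retries(
    submit_and_follow: Callable[[], bool],
    cancel: Callable[[int], None],
    retries: int = 2,
) -> bool:
    """Run a job, resubmitting it when an attempt hangs.

    Assumption: `submit_and_follow` submits a fresh job on every call and
    either returns the job result or raises JobHangError.
    """
    for attempt in range(1, retries + 2):  # initial attempt + retries
        try:
            return submit_and_follow()
        except JobHangError as err:
            # The device is assumed to be stuck: free it before retrying.
            cancel(err.job_id)
            print(f"attempt {attempt} hung; job {err.job_id} cancelled")
    # Every attempt hung, so the job is considered failed.
    return False
```

Passing the submit and cancel steps in as callables keeps the retry policy independent of the LAVA client, which also makes it easy to exercise with fakes.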
In addition, we discovered that the `scheduler.jobs.logs` RPC calls don't block, even when there is no new output to get. So this MR adds a `sleep(1)` for every RPC call made in `follow_job_execution`. (Addressed in !12797 (merged).)
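A minimal sketch of polling a non-blocking log call with a pause between requests might look like the block below. The `fetch_logs` callable is a hypothetical stand-in for the log RPC, assumed here to take a start line and return a `(finished, lines)` pair; `follow_logs` and `poll_interval` are likewise illustrative names, not the actual `follow_job_execution` code.

```python
import time
from typing import Callable, Iterator, List, Tuple


def follow_logs(
    fetch_logs: Callable[[int], Tuple[bool, List[str]]],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield log lines as they appear, pausing between polls.

    Assumption: `fetch_logs(start)` returns immediately with
    (finished, new_lines), even when there is nothing new.
    """
    next_line = 0
    while True:
        finished, lines = fetch_logs(next_line)
        yield from lines
        next_line += len(lines)
        if finished:
            return
        # The call does not block, so wait here to avoid a busy loop
        # that floods the server with requests.
        sleep(poll_interval)
```

Making `sleep` injectable is a choice made for this sketch so that the loop can be tested without real delays.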
Timeout Parameters
- Timeout duration: 5 minutes
- Number of retries: 2
- Timeout reset action: whenever a new job log is fetched
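As an illustration of these parameters, the inactivity check could be tracked with a small timer that is reset each time new log output arrives. This is a sketch under stated assumptions: `InactivityTimer` and its `clock` parameter are hypothetical names, and it compares timestamps rather than relying on process signals.

```python
import time
from datetime import timedelta
from typing import Callable


class InactivityTimer:
    """Tracks how long it has been since the last log output."""

    def __init__(
        self,
        timeout: timedelta = timedelta(minutes=5),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout.total_seconds()
        self._clock = clock
        self._last_activity = clock()

    def reset(self) -> None:
        """Call whenever a new job log is fetched."""
        self._last_activity = self._clock()

    def expired(self) -> bool:
        """True when no output has been seen for longer than the timeout."""
        return self._clock() - self._last_activity > self._timeout
```

`time.monotonic` is used as the default clock because it is not affected by system clock adjustments; the injectable `clock` exists only to make the timer testable.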
TODO
- [ ] Increase JWT timeout to 8h
  No need for this, as it is possible to run the main pieces of the Mesa CI locally without a JWT token.
- Simplify timeout logic by checking log output timings, instead of using Linux process signaling.
  - Implemented
  - Reviewed
- Add a step and dependencies on Mesa CI for testing internal CI modules with pytest.
  - Implemented
  - Reviewed
- Wait for the LAVA job to start before checking for timeouts.
  - Implemented
  - Reviewed