ci/lava: Enhance error handling and Job submission logic

Guilherme Gallo requested to merge gallo/mesa:ci-lava-smart-wait into main

Overview

This merge request makes the wait-for-job mechanism smarter by adding a stop condition based on the remaining execution time. A new environment variable, EXPECTED_JOB_DURATION_SEC, can be set in the job definition; if the time remaining is insufficient for a job of that duration, the script stops waiting and fails the job. Jobs that cannot finish in time therefore fail sooner, so the merge queue is updated faster.
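The stop condition described above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: only EXPECTED_JOB_DURATION_SEC comes from the description, while `NotEnoughTimeError`, `expected_duration_from_env`, `wait_for_job_start`, the `deadline` parameter, and the default of 0 seconds are hypothetical names and assumptions chosen for the example.

```python
import os
import time


class NotEnoughTimeError(Exception):
    """Hypothetical error: too little time remains to run the job."""


def expected_duration_from_env(env=os.environ) -> float:
    # EXPECTED_JOB_DURATION_SEC is named in the description; the default
    # of 0 (meaning "never give up early") is an assumption of this sketch.
    return float(env.get("EXPECTED_JOB_DURATION_SEC", "0"))


def wait_for_job_start(
    has_started,
    deadline: float,
    expected_duration_sec: float,
    poll_interval_sec: float = 1.0,
    clock=time.monotonic,
    sleep=time.sleep,
) -> None:
    """Poll until the job starts, or fail once the time left before
    `deadline` is shorter than the expected job duration."""
    while not has_started():
        remaining = deadline - clock()
        if remaining < expected_duration_sec:
            raise NotEnoughTimeError(
                f"{remaining:.0f}s left, but the job is expected to take "
                f"{expected_duration_sec:.0f}s"
            )
        sleep(poll_interval_sec)
```

Passing `clock` and `sleep` as parameters is only a convenience so the loop can be exercised with a fake clock instead of real waiting.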

Improved Job Failure Timing

Job Link | Original Issue | New Behavior
Job 57451349 & Job 57452819 | The LAVA job didn't even start. | The job would fail 10 minutes earlier.
Job 56981488 | The LAVA job had less than 9 minutes to run. | The job would fail 10 minutes earlier.

Additional Enhancements

  • Refactored Exception Hierarchy: The exception structure has been overhauled to clearly differentiate between errors that can be retried and those that are fatal.

    • MesaCIRetriableException: New base class for exceptions that should trigger a job retry.
    • MesaCIFatalException: Introduced for unrecoverable errors that require an immediate halt.
  • Improved Error Logging: Structured logs now record the full exception message instead of only the exception type.

  • Clearer Script Interruption Messages: Interruption messages are now more explicit, so job submission failures are quicker to understand and resolve.
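As an illustration of the retriable/fatal split and of logging the full message, here is a minimal sketch. Only the class names MesaCIRetriableException and MesaCIFatalException come from this description; the shared `MesaCIException` base, the `structured_log_entry` and `run_with_retries` helpers, and the choice to escalate to a fatal error once retries are exhausted are assumptions made for the example, not the actual implementation.

```python
class MesaCIException(Exception):
    """Assumed common base class for this sketch."""


class MesaCIRetriableException(MesaCIException):
    """Errors that should trigger a job retry."""


class MesaCIFatalException(MesaCIException):
    """Unrecoverable errors that should halt immediately."""


def structured_log_entry(exc: Exception) -> dict:
    # Record the full message alongside the type, not the type alone.
    return {"type": type(exc).__name__, "message": str(exc)}


def run_with_retries(attempt, max_attempts: int = 3, log=None):
    """Call `attempt` until it succeeds, retrying only retriable errors.

    Fatal errors are not caught here, so they stop the loop at once.
    """
    log = log if log is not None else []
    last_error = None
    for _ in range(max_attempts):
        try:
            return attempt()
        except MesaCIRetriableException as exc:
            last_error = exc
            log.append(structured_log_entry(exc))
    # Assumption of this sketch: running out of retries is itself fatal.
    raise MesaCIFatalException(
        f"gave up after {max_attempts} attempts: {last_error}"
    )
```

With this shape, callers need a single `except MesaCIRetriableException` clause to decide whether to resubmit, and anything else propagates unchanged.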
