
ci/lava: Implement Hard Reset Detection in LAVA Test Phase

Guilherme Gallo requested to merge gallo/mesa:ci-lava-detect-reboots into main

What does this MR do and why?

Overview

This merge request adds detection of hard resets during the test phase of LAVA jobs. The goal is to identify and handle unexpected hard resets promptly, so that an affected job fails quickly instead of sitting idle until it times out.

Details

  • Detects hard resets in real time during the test phase.
  • Uses specific log messages to identify that a hard reset has occurred.
  • Raises an exception as soon as a hard reset is detected. This prevents the test job from idling on GitLab and LAVA until it hits a timeout.
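
The detection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the regular expressions, `HardResetDetected`, and `check_for_hard_reset` are hypothetical names, since the description only states that "specific log messages" are matched.

```python
import re
from typing import Iterable

# Hypothetical patterns: the MR only says "specific log messages" are used,
# so these regexes are placeholders, not the ones in the actual change.
HARD_RESET_PATTERNS = (
    re.compile(r"hard reset", re.IGNORECASE),
    re.compile(r"forced reboot", re.IGNORECASE),
)


class HardResetDetected(Exception):
    """Raised when a log line seen during the test phase looks like a hard reset."""


def check_for_hard_reset(lines: Iterable[str], in_test_phase: bool = True) -> int:
    """Scan log lines; raise on the first match, otherwise return lines scanned."""
    scanned = 0
    for line in lines:
        scanned += 1
        # Only treat a match as fatal while the test phase is running.
        if in_test_phase and any(p.search(line) for p in HARD_RESET_PATTERNS):
            raise HardResetDetected(
                "Known issue: Forced reboot detected during test phase, "
                "failing the job..."
            )
    return scanned
```

Raising an exception, rather than only logging a warning, lets the caller stop following the job immediately and decide whether to retry.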

Impact

Hard resets during the test phase are generally unexpected and can indicate underlying issues. Detecting them promptly shortens the time to respond, which matters for keeping the continuous integration pipeline efficient. This change reduces the risk of long, unproductive test runs and helps identify problems with the tested systems or the testing process itself sooner.

Rationale

The need for this enhancement was identified through specific cases where hard resets occurred but were not detected efficiently, leading to prolonged test runs. Implementing this feature ensures that similar situations are handled more effectively in the future.

Real Case Example

Comparison

  • Before (source): Took almost 27 minutes to react to the hard reset. The job timed out, leaving no chance for retries.
    2024-01-09 23:56:15.298044: Pass: 4437, ExpectedFail: 24, Warn: 1, Skip: 117, Duration: 56, Remaining: 2:50
    2024-01-09 23:56:15.298062: Pass: 4550, ExpectedFail: 24, Warn: 1, Skip: 123, Duration: 58, Remaining: 2:48
    2024-01-09 23:56:25.436115: Pass: 4672, ExpectedFail: 25, Warn: 1, Skip: 128, Duration: 1:00, Remaining: 2:46
    2024-01-10 00:23:14.396337: Marking unfinished test run as failed
    2024-01-10 00:23:14.396476: Disconnecting from shell: Finalise
    2024-01-10 00:23:14.396487: lava-docker-test-shell timed out after 1744 seconds
    2024-01-10 00:23:14.396493: lava-docker-test-shell timed out after 1744 seconds
  • After (reproduced locally via integration test): Took 30 seconds to react, leaving a chance to retry.
    2024-01-09 23:56:29.364394: Pass: 4437, ExpectedFail: 24, Warn: 1, Skip: 117, Duration: 56, Remaining: 2:50
    2024-01-09 23:56:29.364394: Pass: 4550, ExpectedFail: 24, Warn: 1, Skip: 123, Duration: 58, Remaining: 2:48
    2024-01-09 23:56:29.364394: Pass: 4672, ExpectedFail: 25, Warn: 1, Skip: 128, Duration: 1:00, Remaining: 2:46
    
    2024-01-09 23:56:59.383091: Known issue: Forced reboot detected during test phase, failing the job...
    2024-01-09 23:56:59.383091: LAVA Job finished with status: canceled
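
The reaction times quoted in the comparison can be checked from the log timestamps themselves. The sketch below takes the last progress line and the first reaction line from each excerpt; the helper name `reaction_seconds` is illustrative only.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def reaction_seconds(last_progress: str, first_reaction: str) -> float:
    """Return the seconds elapsed between two log timestamps."""
    start = datetime.strptime(last_progress, LOG_TS_FORMAT)
    end = datetime.strptime(first_reaction, LOG_TS_FORMAT)
    return (end - start).total_seconds()


# Before: last progress line vs. "Marking unfinished test run as failed".
before = reaction_seconds("2024-01-09 23:56:25.436115", "2024-01-10 00:23:14.396337")
# After: last progress line vs. "Forced reboot detected during test phase".
after = reaction_seconds("2024-01-09 23:56:29.364394", "2024-01-09 23:56:59.383091")

print(f"before: {before:.0f} s ({before / 60:.1f} min)")  # about 1609 s, 26.8 min
print(f"after:  {after:.0f} s")  # about 30 s
```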
