
executor/jobs: allow marking machines unfit for service + add user watchdogs

Martin Roukala requested to merge watchdog_support into master

Machine unfit for service

I added a "machine_unfit_for_service" console pattern that allows jobs to mark machines as broken.

This can be used to drop

Watchdogs

Given the following job description:

timeouts:
  overall:
    hours: 1
    retries: 0
    # no retries possible here
  watchdogs:
    custom:
      seconds: 30
      retries: 1

console_patterns:
  session_end:
    regex: "^.*It's now safe to turn off your computer\r$"
  watchdogs:
    custom:
      start:
        regex: "CUSTOM START"
      reset:
        regex: "CUSTOM RESET"
      stop:
        regex: "CUSTOM STOP"

[...]
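
One way to read this configuration, as a rough Python sketch rather than the executor's actual code: the `start` pattern arms a 30-second deadline, `reset` pushes it back, and `stop` disarms it. The `Watchdog` class, `feed()` and `expired()` are hypothetical names made up for this illustration. Whether `reset` does anything before `start` is an assumption (here it does not), and retries are not modelled.

```python
import re
from datetime import timedelta


class Watchdog:
    """Toy model of a user watchdog (hypothetical, not the executor's class)."""

    def __init__(self, name, timeout, patterns):
        self.name = name
        self.timeout = timeout.total_seconds()
        # One compiled regex per event: "start", "reset", "stop".
        self.patterns = {event: re.compile(rx) for event, rx in patterns.items()}
        self.deadline = None  # None means the watchdog is not armed

    def feed(self, line, now):
        """Match one console line seen at `now` seconds; return the events hit."""
        events = [e for e, rx in self.patterns.items() if rx.search(line)]
        for event in events:
            if event == "start":
                self.deadline = now + self.timeout
            elif event == "reset" and self.deadline is not None:
                # Assumption: a reset only matters once the watchdog is armed.
                self.deadline = now + self.timeout
            elif event == "stop":
                self.deadline = None
        return events

    def expired(self, now):
        """True if the watchdog is armed and its deadline has passed."""
        return self.deadline is not None and now >= self.deadline


custom = Watchdog(
    "custom",
    timedelta(seconds=30),
    {"start": "CUSTOM START", "reset": "CUSTOM RESET", "stop": "CUSTOM STOP"},
)

# Timestamps borrowed from the session log in this description.
custom.feed('echo "CUSTOM START"', now=213.257)
print(custom.expired(now=243.274))  # True: armed for 30 s, no reset or stop

custom.feed('echo "CUSTOM START"', now=572.804)
custom.feed('echo "CUSTOM RESET"', now=581.698)
custom.feed('echo "CUSTOM STOP"', now=587.173)
print(custom.expired(now=700.0))  # False: stopped, so it cannot expire
```

Passing the timestamp in explicitly, instead of reading a clock inside the class, keeps the sketch deterministic and easy to replay against a recorded log.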

We got the following log while running this job in an interactive session:

root@boot2container:/app# echo "CUSTOM START"
CUSTOM +213.257s: Matched the following patterns: custom.start
START
root@boot2container:/app# +243.274s: Hit the timeout <Timeout custom: value=0:00:30, retries=1/1> --> Try again!

[...]

root@boot2container:/app# echo "CUSTOM START"
CUSTO+572.804s: Matched the following patterns: custom.start
M START
root@boot2container:/app# echo "CUSTOM RESET"
+581.698s: Matched the following patterns: custom.reset
CUSTOM RESET
root@boot2container:/app# echo "CUSTOM STOP+587.173s: Matched the following patterns: custom.reset, custom.stop

CUSTOM STOP

As you can see, when the stop pattern is typed, the reset pattern is matched as well. This is because bash does something funky with backspaces to erase characters, which SALAD does not interpret, so the erased text is presumably still in the line the patterns are matched against. I guess I can live with this, as it should only happen in interactive sessions.
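
To illustrate the double match, here is a minimal Python sketch. It assumes the prompt line was recalled from history as the reset command, erased with backspace-space-backspace sequences, and retyped as the stop command. The real bytes bash sent are not in the log, so `RAW_LINE` is invented, and `render()` is a hypothetical helper, not part of SALAD or the executor.

```python
import re

PATTERNS = {
    "custom.reset": re.compile("CUSTOM RESET"),
    "custom.stop": re.compile("CUSTOM STOP"),
}

# Invented byte stream (assumption): the recalled command, then 14 erase
# sequences wiping '"CUSTOM RESET"', then the replacement text.
RAW_LINE = 'echo "CUSTOM RESET"' + "\b \b" * 14 + '"CUSTOM STOP"'


def render(raw):
    """Replay backspaces the way a terminal would: move left, then overwrite."""
    cells = []
    cursor = 0
    for ch in raw:
        if ch == "\b":
            cursor = max(0, cursor - 1)
        elif cursor < len(cells):
            cells[cursor] = ch
            cursor += 1
        else:
            cells.append(ch)
            cursor += 1
    return "".join(cells).rstrip()


def matched(line):
    """Names of all patterns found anywhere in the line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]


print(matched(RAW_LINE))          # ['custom.reset', 'custom.stop']
print(matched(render(RAW_LINE)))  # ['custom.stop']
```

Matching on the raw stream sees both strings, while matching on what a terminal would display sees only the stop command. Replaying control characters before matching would be one possible way to avoid the spurious reset.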
