executor: split job execution to a separate process
By moving the execution of jobs to their own process, we tie the
resources used by a job more effectively to a process's lifetime.
This should make tracking of the DUT state more resilient, and will
enable updating the executor at runtime without affecting any
currently-executing job \o/.
This commit is relatively big, as I wanted to keep the series
bisectable. Here is the list of changes:
- executor.py:
- Run a flask server over a unix socket: this prevents multiple
instances from running concurrently, as only one process can listen
on the unix socket. Conversely, being able to connect to this
socket indicates that the machine is busy.
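The "connect() succeeds iff a job process is listening" check can be
sketched as follows; the socket path and function name are illustrative,
not taken from the actual executor code:

```python
import socket

# Hypothetical socket path; the real executor picks its own location.
SOCKET_PATH = "/tmp/executor.sock"

def machine_is_busy(path=SOCKET_PATH):
    """Return True if a job process is currently listening on `path`.

    Only one process can listen on a given unix socket, so a successful
    connect() proves a job process is alive; a refused connection or a
    missing socket file means the machine is free."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return True
    except OSError:  # FileNotFoundError, ConnectionRefusedError, ...
        return False
    finally:
        s.close()
```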
- __init__.py:
- Introduce the "executor run-job" command
- dut.py:
- Add a dependency on requests_unixsocket, used to make REST queries
to the job process (get its state, cancel the job, ...)
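requests_unixsocket addresses a unix socket by percent-encoding its path
into the host part of an http+unix:// URL. A small helper, with an
endpoint name that is purely illustrative:

```python
from urllib.parse import quote

def unix_url(socket_path, endpoint):
    """Build a requests_unixsocket-style URL by percent-encoding the
    socket path into the host component of an http+unix:// URL."""
    return "http+unix://" + quote(socket_path, safe="") + endpoint
```

A query would then look like
`requests_unixsocket.Session().get(unix_url("/tmp/executor.sock", "/state"))`,
assuming a hypothetical /state endpoint on the job process.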
- Start the per-job process, passing the job bucket initial tarball
by first writing it to a temporary file, then passing the fd to
the new process.
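A minimal sketch of the tempfile-then-fd handover; the function name is
hypothetical and the child command line stands in for the real
"executor run-job" invocation:

```python
import subprocess
import tempfile

def start_job_process(tarball: bytes, argv):
    """Spawn the per-job process, handing over the job bucket's initial
    tarball through an inherited file descriptor.

    `argv` is the child command line; the fd number is appended to it
    as a string so the child knows where to read the tarball from."""
    tmp = tempfile.TemporaryFile()  # unlinked: vanishes with the fd
    tmp.write(tarball)
    tmp.flush()
    tmp.seek(0)
    fd = tmp.fileno()
    # pass_fds keeps the descriptor open (and inheritable) in the child
    proc = subprocess.Popen(argv + [str(fd)], pass_fds=(fd,))
    return proc, tmp  # keep tmp referenced so the fd stays valid
```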
- To prevent a race condition where the job process is still starting
up when another job comes in, and the machine would be considered
free because the job process is not yet listening on the unix socket,
we wait for up to 5 seconds after the job got queued for the unix
socket to become active; otherwise we kill the process and fail the
call.
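The wait-or-kill logic amounts to polling the socket until the deadline;
this is a sketch with illustrative names, not the executor's actual code:

```python
import socket
import time

def wait_for_socket(path, proc, timeout=5.0, poll=0.1):
    """Wait up to `timeout` seconds for the freshly-spawned job process
    `proc` to start listening on the unix socket at `path`; on timeout,
    kill the process and report failure."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return True          # the job process is up: machine is busy
        except OSError:
            time.sleep(poll)     # not listening yet, retry
        finally:
            s.close()
    proc.kill()                  # never came up: fail the call
    return False
```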
- Move MachineState to dut.py, and rename it to DUTState
- Sergent Hartman:
- Move to dut.py, as we cannot have cross-dependencies between dut.py
and executor.py
- Introduce execute_next_task(), which queues the next training
task, executes it, then reports the result back. It thus had to
implement a minimal client.
- Prefix the next_task() and report() functions with an underscore
- Run the different tasks in a per-DUT thread
Fixes: #64