executor: address many ways for the executor job process to fail to connect to the client before its timeout

Martin Roukala requested to merge executor_job_process_reliability into master

One big reliability issue we are having in our infra is that the job process fails to connect to the client before the 5s timeout expires. This leads to many failed CI jobs.

This series addresses a few of the reasons this happens:

  1. Slow PDU instantiation
  2. The machine being marked as IDLE while it was actually busy running

References: #120 (closed)
