    io-wq: backoff when retrying worker creation · 13918315
    Uday Shankar authored
    
    When io_uring submission goes async for the first time on a given task,
    we'll try to create a worker thread to handle the submission. Creating
    this worker thread can fail due to various transient conditions, such as
    an outstanding signal in the forking thread, so we have retry logic with
    a limit of 3 retries. However, this retry logic appears to be too
    aggressive/fast - we've observed a thread blowing through the retry
    limit while having the same outstanding signal the whole time. Here's an
    excerpt of some tracing that demonstrates the issue:
    
    First, signal 26 is generated for the process. It ends up getting routed
    to thread 92942.
    
     0)   cbd-92284    /* signal_generate: sig=26 errno=0 code=-2 comm=psblkdASD pid=92934 grp=1 res=0 */
    
    This causes create_io_thread in the signalled thread to fail with
    ERESTARTNOINTR, and thus a retry is queued.
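
For context, the retry decision in io-wq only treats a small set of errors
as transient, and caps the number of attempts per worker; roughly (a
paraphrased sketch of io-wq.c, details may differ between kernel versions):

#define WORKER_INIT_LIMIT	3

static bool io_should_retry_thread(struct io_worker *worker, long err)
{
	/* don't keep retrying if the task (or its group) is exiting */
	if (fatal_signal_pending(current))
		return false;
	/* cap the number of creation attempts for this worker */
	if (worker->init_retries++ >= WORKER_INIT_LIMIT)
		return false;

	/* only these errors are considered transient and worth retrying */
	switch (err) {
	case -EAGAIN:
	case -ERESTARTSYS:
	case -ERESTARTNOINTR:
	case -ERESTARTNOHAND:
		return true;
	default:
		return false;
	}
}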
    
    13) task_th-92942  /* io_uring_queue_async_work: ring 000000007325c9ae, request 0000000080c96d8e, user_data 0x0, opcode URING_CMD, flags 0x8240001, normal queue, work 000000006e96dd3f */
    13) task_th-92942  io_wq_enqueue() {
    13) task_th-92942    _raw_spin_lock();
    13) task_th-92942    io_wq_activate_free_worker();
    13) task_th-92942    _raw_spin_lock();
    13) task_th-92942    create_io_worker() {
    13) task_th-92942      __kmalloc_cache_noprof();
    13) task_th-92942      __init_swait_queue_head();
    13) task_th-92942      kprobe_ftrace_handler() {
    13) task_th-92942        get_kprobe();
    13) task_th-92942        aggr_pre_handler() {
    13) task_th-92942          pre_handler_kretprobe();
    13) task_th-92942          /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
    13) task_th-92942        } /* aggr_pre_handler */
    ...
    13) task_th-92942        } /* copy_process */
    13) task_th-92942      } /* create_io_thread */
    13) task_th-92942      kretprobe_rethook_handler() {
    13) task_th-92942        /* create_exit: (create_io_worker+0x8a/0x1a0 <- create_io_thread) arg1=0xfffffffffffffdff */
    13) task_th-92942      } /* kretprobe_rethook_handler */
    13) task_th-92942    queue_work_on() {
    ...
    
    The CPU is then handed to a kworker to process the queued retry:
    
    ------------------------------------------
     13) task_th-92942  => kworker-54154
    ------------------------------------------
    13) kworker-54154  io_workqueue_create() {
    13) kworker-54154    io_queue_worker_create() {
    13) kworker-54154      task_work_add() {
    13) kworker-54154        wake_up_state() {
    13) kworker-54154          try_to_wake_up() {
    13) kworker-54154            _raw_spin_lock_irqsave();
    13) kworker-54154            _raw_spin_unlock_irqrestore();
    13) kworker-54154          } /* try_to_wake_up */
    13) kworker-54154        } /* wake_up_state */
    13) kworker-54154        kick_process();
    13) kworker-54154      } /* task_work_add */
    13) kworker-54154    } /* io_queue_worker_create */
    13) kworker-54154  } /* io_workqueue_create */
    
    And then we immediately switch back to the original task to try creating
    a worker again. This fails, because the original task still hasn't
    handled its signal.
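
The failure is deterministic while the signal stays pending: late in
copy_process(), after recalc_sigpending(), a pending signal in the forking
task aborts the fork with -ERESTARTNOINTR so the signal can be handled
before the clone is retried (roughly, paraphrasing kernel/fork.c; exact
helper and label names vary by version). That is the value seen in the
create_exit trace events: arg1=0xfffffffffffffdff is -513, i.e.
-ERESTARTNOINTR.

	recalc_sigpending();
	if (task_sigpending(current)) {
		retval = -ERESTARTNOINTR;
		goto bad_fork_cancel_cgroup;	/* unwind and fail the fork */
	}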
    
    -----------------------------------------
     13) kworker-54154  => task_th-92942
    ------------------------------------------
    13) task_th-92942  create_worker_cont() {
    13) task_th-92942    kprobe_ftrace_handler() {
    13) task_th-92942      get_kprobe();
    13) task_th-92942      aggr_pre_handler() {
    13) task_th-92942        pre_handler_kretprobe();
    13) task_th-92942        /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
    13) task_th-92942      } /* aggr_pre_handler */
    13) task_th-92942    } /* kprobe_ftrace_handler */
    13) task_th-92942    create_io_thread() {
    13) task_th-92942      copy_process() {
    13) task_th-92942        task_active_pid_ns();
    13) task_th-92942        _raw_spin_lock_irq();
    13) task_th-92942        recalc_sigpending();
    13) task_th-92942        _raw_spin_lock_irq();
    13) task_th-92942      } /* copy_process */
    13) task_th-92942    } /* create_io_thread */
    13) task_th-92942    kretprobe_rethook_handler() {
    13) task_th-92942      /* create_exit: (create_worker_cont+0x35/0x1b0 <- create_io_thread) arg1=0xfffffffffffffdff */
    13) task_th-92942    } /* kretprobe_rethook_handler */
    13) task_th-92942    io_worker_release();
    13) task_th-92942    queue_work_on() {
    13) task_th-92942      clear_pending_if_disabled();
    13) task_th-92942      __queue_work() {
    13) task_th-92942      } /* __queue_work */
    13) task_th-92942    } /* queue_work_on */
    13) task_th-92942  } /* create_worker_cont */
    
The pattern repeats a couple more times until we blow through the retry
counter, at which point we give up. All outstanding work is canceled,
and the io_uring command which triggered all this is failed with
ECANCELED (result -125 in the io_uring_complete event below):
    
    13) task_th-92942  io_acct_cancel_pending_work() {
    ...
    13) task_th-92942  /* io_uring_complete: ring 000000007325c9ae, req 0000000080c96d8e, user_data 0x0, result -125, cflags 0x0 extra1 0 extra2 0  */
    
    Finally, the task gets around to processing its outstanding signal 26,
    but it's too late.
    
    13) task_th-92942  /* signal_deliver: sig=26 errno=0 code=-2 sa_handler=59566a0 sa_flags=14000000 */
    
    Try to address this issue by adding a small scaling delay when retrying
    worker creation. This should give the forking thread time to handle its
    signal in the above case. This isn't a particularly satisfying solution,
    as sufficiently paradoxical scheduling would still have us hitting the
    same issue, and I'm open to suggestions for something better. But this
    is likely to prevent this (already rare) issue from hitting in practice.
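
A sketch of the idea (the helper name and the 5 ms-per-retry scaling below
are illustrative assumptions, not necessarily the exact final patch):
convert the per-worker retry work item from a work_struct to a
delayed_work (INIT_DELAYED_WORK(&worker->work, io_workqueue_create)) and
scale the delay with the worker's retry count, so each successive attempt
gives the forking task a bit more time to deal with its signal:

static void queue_create_worker_retry(struct io_worker *worker)
{
	/*
	 * The creation failure is likely due to a transient condition in
	 * the forking task (e.g. an outstanding signal); back off a bit
	 * more on each attempt to give the task time to clear it.
	 */
	schedule_delayed_work(&worker->work,
			      msecs_to_jiffies(worker->init_retries * 5));
}

The existing retry sites would then call this helper instead of queueing
the work immediately with schedule_work()/queue_work().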
    
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
    Link: https://lore.kernel.org/r/20250208-wq_retry-v2-1-4f6f5041d303@purestorage.com
    
    
Signed-off-by: Jens Axboe <axboe@kernel.dk>