  1. Nov 28, 2023
    • io_uring: use fget/fput consistently · 73363c26
      Jens Axboe authored
      Normally within a syscall it's fine to use fdget/fdput for grabbing a
      file from the file table, and it's fine within io_uring as well. We do
      that via io_uring_enter(2), io_uring_register(2), and then also for
      cancel, which is invoked from the latter. io_uring cannot close its own
      file descriptors as that is explicitly rejected, and for the cancel
      side of things, the file itself is just used as a lookup cookie.
      
      However, it is more prudent to ensure that full references are always
      grabbed. For anything threaded, either explicitly in the application
      itself or through use of the io-wq worker threads, this is what happens
      anyway. Generalize it and use fget/fput throughout.
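
      As an illustration only (not part of the patch), here is a sketch of the
      two kernel-internal patterns being contrasted; 'ring_fd' is a
      placeholder descriptor:

          struct file *f;

          /* fdget(): lightweight; may elide the refcount bump when the file
           * table isn't shared, so the reference is only valid within the
           * current syscall. */
          struct fd fd = fdget(ring_fd);
          if (fd.file) {
                  /* ... use fd.file inside this syscall only ... */
                  fdput(fd);
          }

          /* fget(): always grabs a full reference, safe to hand to other
           * threads such as io-wq workers. */
          f = fget(ring_fd);
          if (f) {
                  /* ... f holds a full reference ... */
                  fput(f);
          }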
      
      Also see the link below for more details.
      
      Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/
      
      
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Sep 29, 2023
    • io_uring: add support for futex wake and wait · 194bb58c
      Jens Axboe authored
      
      Add support for FUTEX_WAKE/WAIT primitives.
      
      IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
      it supports passing in a bitset.

      Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
      FUTEX_WAIT_BITSET.

      Both of them use the futex2 interface.
      
      FUTEX_WAKE is straightforward, as wakes can always be done directly from
      the io_uring submission without needing async handling. For FUTEX_WAIT,
      things are a bit more complicated. If the futex isn't ready, then we
      rely on a callback via futex_queue->wake() when someone wakes up the
      futex. From that callback, we queue up task_work with the original task,
      which will post a CQE and wake it, if necessary.
      
      Cancelations are supported, both from the application's point of view
      and to be able to cancel pending waits if the ring exits before all
      events have occurred. The return value of futex_unqueue() is used to
      gate who wins the potential race between cancelation and futex wakeup:
      whoever gets a 'ret == 1' return from it claims ownership of the
      io_uring futex request.
      
      This is just the barebones wait/wake support. PI and REQUEUE support
      are not added at this point; it is unclear whether we will look into
      that later.
      
      Likewise, explicit timeouts are not supported either. Users that need
      timeouts are expected to set them via the usual io_uring mechanism of
      linked timeouts.
      
      The SQE format is as follows:
      
      `addr`		Address of futex
      `fd`		futex2(2) FUTEX2_* flags
      `futex_flags`	io_uring-specific command flags. None are valid right now.
      `addr2`		Value of futex
      `addr3`		Mask to wake/wait
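
      As a usage sketch (not part of the commit), here is a wait and a wake on
      the same futex word issued through liburing's io_uring_prep_futex_wait()
      and io_uring_prep_futex_wake() helpers; this assumes a liburing recent
      enough to carry them (2.5+) and a kernel with these opcodes:

          #include <liburing.h>
          #include <linux/futex.h>
          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  struct io_uring ring;
                  struct io_uring_sqe *sqe;
                  struct io_uring_cqe *cqe;
                  uint32_t futex_word = 0;
                  int i;

                  if (io_uring_queue_init(8, &ring, 0) < 0)
                          return 1;

                  /* Async wait: arms while futex_word still holds the
                   * expected value 0, completes when the word is woken. */
                  sqe = io_uring_get_sqe(&ring);
                  io_uring_prep_futex_wait(sqe, &futex_word, 0,
                                           FUTEX_BITSET_MATCH_ANY,
                                           FUTEX2_SIZE_U32, 0);
                  sqe->user_data = 1;

                  /* Wake up to one waiter on the same word. Normally the
                   * waker is another thread; both are submitted together
                   * here only to keep the sketch self-contained. */
                  sqe = io_uring_get_sqe(&ring);
                  io_uring_prep_futex_wake(sqe, &futex_word, 1,
                                           FUTEX_BITSET_MATCH_ANY,
                                           FUTEX2_SIZE_U32, 0);
                  sqe->user_data = 2;

                  io_uring_submit(&ring);

                  for (i = 0; i < 2; i++) {
                          if (io_uring_wait_cqe(&ring, &cqe))
                                  break;
                          printf("op %llu: res=%d\n",
                                 (unsigned long long) cqe->user_data, cqe->res);
                          io_uring_cqe_seen(&ring, cqe);
                  }
                  io_uring_queue_exit(&ring);
                  return 0;
          }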
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Sep 21, 2023
    • io_uring: add IORING_OP_WAITID support · f31ecf67
      Jens Axboe authored
      
      This adds support for a fully async version of waitid(2). If an event
      isn't immediately available, we wait for a callback to trigger a retry.
      
      The format of the sqe is as follows:
      
      sqe->len		The 'which', the idtype being queried/waited for.
      sqe->fd			The 'pid' (or id) being waited for.
      sqe->file_index		The 'options' being set.
      sqe->addr2		A pointer to siginfo_t, if any, being filled in.
      
      buf_index, addr3, and waitid_flags are reserved/unused for now.
      waitid_flags will be used for options for this request type. One
      interesting use case may be to add multi-shot support, so that the
      request stays armed and posts a notification every time a monitored
      process state change occurs.
      
      Note that this does not support rusage, on Arnd's recommendation.
      
      See the waitid(2) man page for details on the arguments.
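
      As a usage sketch (not part of the commit), here is how a child's exit
      might be awaited with liburing's io_uring_prep_waitid() helper; this
      assumes a liburing recent enough to carry it and a kernel with
      IORING_OP_WAITID:

          #include <liburing.h>
          #include <sys/wait.h>
          #include <signal.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  struct io_uring ring;
                  struct io_uring_sqe *sqe;
                  struct io_uring_cqe *cqe;
                  siginfo_t info;
                  pid_t pid;

                  if (io_uring_queue_init(8, &ring, 0) < 0)
                          return 1;

                  pid = fork();
                  if (pid == 0) {
                          usleep(10000);  /* child exits shortly */
                          _exit(42);
                  }

                  /* idtype/id land in sqe->len/sqe->fd, options in
                   * sqe->file_index, and &info in sqe->addr2, matching the
                   * layout described above. */
                  sqe = io_uring_get_sqe(&ring);
                  io_uring_prep_waitid(sqe, P_PID, pid, &info, WEXITED, 0);
                  io_uring_submit(&ring);

                  io_uring_wait_cqe(&ring, &cqe);
                  printf("res=%d, child exit status=%d\n",
                         cqe->res, info.si_status);
                  io_uring_cqe_seen(&ring, cqe);
                  io_uring_queue_exit(&ring);
                  return 0;
          }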
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Sep 21, 2022
    • io_uring: add IORING_SETUP_DEFER_TASKRUN · c0e0d6ba
      Dylan Yudaken authored
      Allow deferring async tasks until the user calls io_uring_enter(2) with
      the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
      io_uring_setup time. This functionality requires that the later
      io_uring_enter(2) be called from the same submission task; the flag is
      therefore restricted to work only when IORING_SETUP_SINGLE_ISSUER is
      also set.
      
      Being able to hand-pick when task work is run prevents the problem
      where task work runs anyway, even though there is current work to be
      done.
      
      For example, a common workload obtains a batch of CQEs and processes
      each one. Interrupting this to run additional task work would add
      latency but not gain anything. If instead task work is deferred until
      just before more CQEs are obtained, then no additional latency is added.
      
      The way this is implemented is by trying to keep task work local to an
      io_ring_ctx, rather than to the submission task. This is required, as the
      application will want to wake up only a single io_ring_ctx at a time to
      process work, and so the lists of work have to be kept separate.
      
      This has some other benefits, like not having to continually check the
      task in handle_tw_list (and potentially unlock/lock it), and reducing
      locking in the submit and process-completions paths.
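
      As an illustration (not part of the commit), a minimal liburing setup
      enabling this mode; it assumes liburing 2.3+ and a 6.1+ kernel:

          #include <liburing.h>
          #include <stdio.h>

          int main(void)
          {
                  struct io_uring ring;
                  int ret;

                  /* DEFER_TASKRUN requires SINGLE_ISSUER; task work is then
                   * only run when this task waits for completions. */
                  ret = io_uring_queue_init(64, &ring,
                                            IORING_SETUP_SINGLE_ISSUER |
                                            IORING_SETUP_DEFER_TASKRUN);
                  if (ret < 0) {
                          fprintf(stderr, "queue_init: %d\n", ret);
                          return 1;
                  }

                  /* ... submit requests; deferred task work is processed
                   * inside io_uring_submit_and_wait()/io_uring_wait_cqe() ... */

                  io_uring_queue_exit(&ring);
                  return 0;
          }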
      
      There are networking cases where using this option can reduce request
      latency by 50%. For example, a contrived benchmark using [1], where the
      client sends 2k of data and receives the same data back while doing some
      system calls (to trigger task work), shows this reduction. The reason
      ends up being that if sending responses is delayed by processing task
      work, the client side sits idle, whereas reordering the sends first
      means that the client runs its workload in parallel with the local task
      work.
      
      [1]:
      Using https://github.com/DylanZA/netbench/tree/defer_run
      
      
      Client:
      ./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
      Server:
      ./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"
      
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>