  1. May 09, 2024
    • io_uring/net: add IORING_ACCEPT_POLL_FIRST flag · d3da8e98
      Jens Axboe authored
      
      Similarly to how polling first is supported for receive, it makes sense
      to provide the same for accept. An accept operation does a lot of
      expensive setup, like allocating an fd, a socket/inode, etc. If no
      connection request is already pending, this is wasted and will just be
      cleaned up and freed, only to retry via the usual poll trigger.
      
      Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the
      accept request if poll says we have something to accept.
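
      As a hedged illustration (not part of the commit), an application using
      liburing could request this behaviour roughly as below; passing the flag
      via sqe->ioprio mirrors how the other IORING_ACCEPT_* flags are carried
      and is an assumption here:

        #include <liburing.h>

        /* Sketch: only start the accept (fd/socket/inode setup) once poll
         * reports a pending connection. */
        static void queue_accept_poll_first(struct io_uring *ring, int listen_fd)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
                sqe->ioprio |= IORING_ACCEPT_POLL_FIRST;
        }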
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: add IORING_ACCEPT_DONTWAIT flag · 7dcc758c
      Jens Axboe authored
      
      This allows the caller to perform a non-blocking attempt, similarly to
      how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection
      attempt, propagate the result to userspace rather than arm poll and
      wait for a retry.
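
      A similar hedged sketch for the non-blocking variant (again assuming the
      flag travels in sqe->ioprio); a cqe->res of -EAGAIN then simply means no
      connection was pending:

        #include <liburing.h>

        /* Sketch: try the accept once; do not arm poll on -EAGAIN. */
        static void queue_accept_dontwait(struct io_uring *ring, int listen_fd)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
                sqe->ioprio |= IORING_ACCEPT_DONTWAIT;
        }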
      
      Suggested-by: Norman Maurer <norman_maurer@apple.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. May 07, 2024
    • io_uring/io-wq: Use set_bit() and test_bit() at worker->flags · 8a565304
      Breno Leitao authored
      Utilize set_bit() and test_bit() on worker->flags within io_uring/io-wq
      to address potential data races.
      
      The structure io_worker->flags may be accessed through various data
      paths, leading to concurrency issues. When KCSAN is enabled, it reveals
      data races occurring in io_worker_handle_work and
      io_wq_activate_free_worker functions.
      
      	 BUG: KCSAN: data-race in io_worker_handle_work / io_wq_activate_free_worker
      	 write to 0xffff8885c4246404 of 4 bytes by task 49071 on cpu 28:
      	 io_worker_handle_work (io_uring/io-wq.c:434 io_uring/io-wq.c:569)
      	 io_wq_worker (io_uring/io-wq.c:?)
      <snip>
      
      	 read to 0xffff8885c4246404 of 4 bytes by task 49024 on cpu 5:
      	 io_wq_activate_free_worker (io_uring/io-wq.c:? io_uring/io-wq.c:285)
      	 io_wq_enqueue (io_uring/io-wq.c:947)
      	 io_queue_iowq (io_uring/io_uring.c:524)
      	 io_req_task_submit (io_uring/io_uring.c:1511)
      	 io_handle_tw_list (io_uring/io_uring.c:1198)
      <snip>
      
      Line numbers are against commit 18daea77 ("Merge tag 'for-linus' of
      git://git.kernel.org/pub/scm/virt/kvm/kvm").
      
      These races involve writes and reads to the same memory location by
      different tasks running on different CPUs. To mitigate this, refactor
      the code to use atomic operations such as set_bit(), test_bit(), and
      clear_bit() instead of basic "and" and "or" operations. This ensures
      thread-safe manipulation of worker flags.
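
      As an illustrative sketch of the pattern (not the io-wq code itself;
      the flag names below are hypothetical):

        #include <linux/bitops.h>
        #include <linux/types.h>

        enum {
                MY_WORKER_FREE = 0,             /* hypothetical flag bits */
                MY_WORKER_RUNNING,
        };

        struct my_worker {
                unsigned long flags;            /* bitops need an unsigned long */
        };

        static void mark_running(struct my_worker *w)
        {
                set_bit(MY_WORKER_RUNNING, &w->flags);  /* was: w->flags |= ... */
                clear_bit(MY_WORKER_FREE, &w->flags);   /* was: w->flags &= ~... */
        }

        static bool worker_is_free(struct my_worker *w)
        {
                return test_bit(MY_WORKER_FREE, &w->flags);
        }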
      
      Also, move `create_index` to avoid holes in the structure.
      
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240507170002.2269003-1-leitao@debian.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. May 01, 2024
    • io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring · 59b28a6e
      Jens Axboe authored
      
      Move the posting outside the checking and locking; it's cleaner that
      way.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Require zeroed sqe->len on provided-buffers send · 79996b45
      Gabriel Krisman Bertazi authored
      
      When sending from a provided buffer, we set sr->len to be the smallest
      between the actual buffer size and sqe->len.  But, now that we
      disconnect the buffer from the submission request, we can get in a
      situation where the buffers and requests mismatch, and only part of a
      buffer gets sent.  Assume:
      
      * buf[1]->len = 128; buf[2]->len = 256
      * sqe[1]->len = 128; sqe[2]->len = 256
      
      If sqe[1] runs first, it picks buf[1] and all is good. But if sqe[2]
      runs first, it picks buf[1], leaving sqe[1] to pick buf[2]; since
      sqe[1]->len is only 128, the last half of buf[2] is never sent.
      
      While arguably the use-case of different-length sends is questionable,
      it has already raised confusion with potential users of this
      feature. Let's make the interface less tricky by forcing the length to
      only come from the buffer ring entry itself.
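
      In liburing terms, a hedged sketch of what the interface now expects
      from userspace: the sqe address and length stay zero, and the size
      comes from the buffer ring entry.

        #include <liburing.h>

        /* Sketch: provided-buffer send; addr/len in the sqe are unused. */
        static void queue_provided_send(struct io_uring *ring, int sockfd,
                                        unsigned short bgid)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
                sqe->flags |= IOSQE_BUFFER_SELECT;
                sqe->buf_group = bgid;
        }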
      
      Fixes: ac5f71a3 ("io_uring/net: add provided buffer support for IORING_OP_SEND")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Apr 26, 2024
    • io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it · a4d416dc
      linke li authored
      
      In io_msg_exec_remote(), ctx->submitter_task is read using READ_ONCE at
      the beginning of the function, checked, and then re-read from
      ctx->submitter_task, voiding all guarantees of the checks. Reuse the value
      that was read by READ_ONCE to ensure the consistency of the task struct
      throughout the function.
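
      The general pattern, sketched with hypothetical names rather than the
      actual msg_ring code:

        #include <linux/compiler.h>
        #include <linux/sched.h>

        struct my_ctx {                         /* stand-in for the real ctx */
                struct task_struct *submitter_task;
        };

        static struct task_struct *get_submitter(struct my_ctx *ctx)
        {
                struct task_struct *task = READ_ONCE(ctx->submitter_task);

                if (!task)
                        return NULL;

                /* Callers must keep using this value; re-reading
                 * ctx->submitter_task would void the NULL check above. */
                return task;
        }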
      
      Signed-off-by: linke li <lilinke99@qq.com>
      Link: https://lore.kernel.org/r/tencent_F9B2296C93928D6F68FF0C95C33475C68209@qq.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • mm: switch mm->get_unmapped_area() to a flag · 529ce23a
      Rick Edgecombe authored
      The mm_struct contains a function pointer *get_unmapped_area(), which is
      set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown()
      during the initialization of the mm.
      
      Since the function pointer only ever points to two functions that are
      named the same across all architectures, a function pointer is not really
      required.  In addition, future changes will want to add versions of the
      functions that take additional arguments.  So to save a pointer's worth of
      bytes in mm_struct, and prevent adding additional function pointers to
      mm_struct in future changes, remove it and keep the information about
      which get_unmapped_area() to use in a flag.
      
      Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by
      mmf_init_flags().  Most MM flags get clobbered on fork.  In the
      pre-existing behavior mm->get_unmapped_area() would get copied to the new
      mm in dup_mm(), so not clobbering the flag preserves the existing behavior
      around inheriting the topdown-ness.
      
      Introduce a helper, mm_get_unmapped_area(), to easily convert code that
      refers to the old function pointer to instead select and call either
      arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the
      flag.  Then drop the mm->get_unmapped_area() function pointer.  Leave the
      get_unmapped_area() pointer in struct file_operations alone.  The main
      purpose of this change is to reorganize in preparation for future changes,
      but it also converts the calls of mm->get_unmapped_area() from indirect
      branches into direct ones.
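
      A simplified sketch of the resulting shape (not the exact kernel code;
      the MMF_TOPDOWN flag name is an assumption here):

        #include <linux/mm.h>

        /* Dispatch on an mm flag instead of through a function pointer. */
        static unsigned long
        sketch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
                                 unsigned long addr, unsigned long len,
                                 unsigned long pgoff, unsigned long flags)
        {
                if (test_bit(MMF_TOPDOWN, &mm->flags))
                        return arch_get_unmapped_area_topdown(filp, addr, len,
                                                              pgoff, flags);
                return arch_get_unmapped_area(filp, addr, len, pgoff, flags);
        }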
      
      The stress-ng bigheap benchmark calls realloc a lot, which calls through
      get_unmapped_area() in the kernel.  On x86, the change yielded a ~1%
      improvement there on a retpoline config.
      
      In testing a few x86 configs, removing the pointer unfortunately didn't
      result in any actual size reductions in the compiled layout of mm_struct. 
      But depending on compiler or arch alignment requirements, the change could
      shrink the size of mm_struct.
      
      Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com
      
      
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Deepak Gupta <debug@rivosinc.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. Apr 25, 2024
    • io_uring/rw: reinstate thread check for retries · 039a2e80
      Jens Axboe authored
      
      Allowing retries for everything is arguably the right thing to do, now
      that every command type is async read from the start. But it's exposed a
      few issues around missing check for a retry (which cca65713 exposed),
      and the fixup commit for that isn't necessarily 100% sound in terms of
      iov_iter state.
      
      For now, just revert these two commits. This unfortunately then re-opens
      the fact that -EAGAIN can get bubbled to userspace for some cases where
      the kernel very well could just sanely retry them. But until we have all
      the conditions covered around that, we cannot safely enable that.
      
      This reverts commit df604d2a.
      This reverts commit cca65713.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Apr 22, 2024
    • net: extend ubuf_info callback to ops structure · 7ab4f16f
      Pavel Begunkov authored
      
      We'll need to associate additional callbacks with ubuf_info, so introduce
      a structure holding the ubuf_info callbacks. Apart from the smarter
      io_uring notification management introduced in the next patches, it can
      be used to generalise msg_zerocopy_put_abort() and also to store
      ->sg_from_iter, which is currently passed in struct msghdr.
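
      A hedged sketch of the shape of such an ops structure; the member names
      below are illustrative, not necessarily the ones used in the patch:

        #include <linux/skbuff.h>

        /* Callbacks bundled in one ops struct, so further hooks can be
         * added later without growing ubuf_info itself. */
        struct ubuf_info_ops_sketch {
                void (*complete)(struct sk_buff *skb, struct ubuf_info *uarg,
                                 bool success);
                /* room for e.g. an sg_from_iter-style hook */
        };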
      
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • io_uring/net: support bundles for recv · 2f9c9515
      Jens Axboe authored
      
      If IORING_OP_RECV is used with provided buffers, the caller may also set
      IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs the
      available buffers and receives into them, posting a single completion for
      all of it.
      
      This can be used with multishot receive as well, or without it.
      
      Now that both send and receive support bundles, add a feature flag for
      it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the
      ring, then the kernel supports bundles for recv and send.
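
      A hedged liburing-style sketch of the userspace side (flag placement in
      sqe->ioprio is assumed to match the other IORING_RECVSEND_* flags):

        #include <liburing.h>
        #include <stdbool.h>

        /* Sketch: bundled recv into a provided buffer group. */
        static void queue_bundled_recv(struct io_uring *ring, int sockfd,
                                       unsigned short bgid)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
                sqe->ioprio |= IORING_RECVSEND_BUNDLE;
                sqe->flags |= IOSQE_BUFFER_SELECT;
                sqe->buf_group = bgid;
        }

        /* Feature check after io_uring_queue_init_params(). */
        static bool have_bundles(const struct io_uring_params *p)
        {
                return p->features & IORING_FEAT_RECVSEND_BUNDLE;
        }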
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: support bundles for send · a05d1f62
      Jens Axboe authored
      
      If IORING_OP_SEND is used with provided buffers, the caller may also
      set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea
      is that an application can fill outgoing buffers in a provided buffer
      group, and then arm a single send that will service them all. Once
      there are no more buffers to send, or if the requested length has
      been sent, the request posts a single completion for all the buffers.
      
      This only enables it for IORING_OP_SEND; IORING_OP_SENDMSG is coming
      in a separate patch. However, this patch does do a lot of the prep
      work that makes wiring up the sendmsg variant pretty trivial. They
      share the prep side.
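
      A matching hedged sketch for the send side (same assumption about the
      flag living in sqe->ioprio):

        #include <liburing.h>

        /* Sketch: one bundled send drains whatever is queued in the provided
         * buffer group; the sizes come from the buffers themselves. */
        static void queue_bundled_send(struct io_uring *ring, int sockfd,
                                       unsigned short bgid)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
                sqe->ioprio |= IORING_RECVSEND_BUNDLE;
                sqe->flags |= IOSQE_BUFFER_SELECT;
                sqe->buf_group = bgid;
        }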
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: add helpers for getting/peeking multiple buffers · 35c8711c
      Jens Axboe authored
      
      Our provided buffer interface only allows selection of a single buffer.
      Add an API that allows getting/peeking multiple buffers at the same time.
      
      This is only implemented for the ring provided buffers. It could be added
      for the legacy provided buffers as well, but since it's strongly
      encouraged to use the new interface, let's keep it simpler and just
      provide it for the new API. The legacy interface will always just select
      a single buffer.
      
      There are two new main functions:
      
      io_buffers_select(), which selects as many buffers as it can. The
      caller supplies the iovec array, and io_buffers_select() may allocate a
      bigger array if the 'out_len' being passed in is non-zero and bigger
      than what fits in the provided iovec. Buffers grabbed with this helper
      are permanently assigned.
      
      io_buffers_peek(), which works like io_buffers_select(), except the
      buffers can be recycled, if needed. Callers using either of these should
      call io_put_kbufs() rather than io_put_kbuf() at completion time. The
      peek interface must be called with the ctx locked from peek to
      completion.
      
      This adds a new bit of state for the request:
      
      - REQ_F_BUFFERS_COMMIT, which means that the buffers have been
        peeked and should be committed to the buffer ring head when they are
        put as part of completion. Prior to this, req->buf_list was cleared to
        NULL when committed.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: add provided buffer support for IORING_OP_SEND · ac5f71a3
      Jens Axboe authored
      
      It's pretty trivial to wire up provided buffer support for the send
      side, just like how it's done on the receive side. This enables setting up
      a buffer ring that an application can use to push pending sends to,
      and then have a send pick a buffer from that ring.
      
      One of the challenges with async IO and networking sends is that you
      can get into reordering conditions if you have more than one inflight
      at the same time. Consider the following scenario where everything is
      fine:
      
      1) App queues sendA for socket1
      2) App queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, completes successfully, posts CQE
      5) sendB is issued, completes successfully, posts CQE
      
      All is fine. Requests are always issued in-order, and both complete
      inline as most sends do.
      
      However, if we're flooding socket1 with sends, the following could
      also result from the same sequence:
      
      1) App queues sendA for socket1
      2) App queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, socket1 is full, poll is armed for retry
      5) Space frees up in socket1, this triggers sendA retry via task_work
      6) sendB is issued, completes successfully, posts CQE
      7) sendA is retried, completes successfully, posts CQE
      
      Now we've sent sendB before sendA, which can make things unhappy. If
      both sendA and sendB had been using provided buffers, then it would look
      as follows instead:
      
      1) App queues dataA for sendA, queues sendA for socket1
      2) App queues dataB for sendB, queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, socket1 is full, poll is armed for retry
      5) Space frees up in socket1, this triggers sendA retry via task_work
      6) sendB is issued, picks first buffer (dataA), completes successfully,
         posts CQE (which says "I sent dataA")
      7) sendA is retried, picks first buffer (dataB), completes successfully,
         posts CQE (which says "I sent dataB")
      
      Now we've sent the data in order, and everybody is happy.
      
      It's worth noting that this also opens the door for supporting multishot
      sends, as provided buffers would be a prerequisite for that. Those can
      trigger either when new buffers are added to the outgoing ring, or (if
      stalled due to lack of space) when space frees up in the socket.
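
      A hedged liburing-style sketch of that flow: outgoing data is pushed
      into the provided buffer ring, and each queued send picks its buffer at
      issue time, so the data goes out in ring order:

        #include <liburing.h>

        /* Sketch: hand one outgoing buffer to the kernel, then queue a send
         * that picks a buffer from the group when it actually issues. */
        static void push_and_send(struct io_uring *ring,
                                  struct io_uring_buf_ring *br, int sockfd,
                                  unsigned short bgid, void *data,
                                  unsigned int len, unsigned short bid,
                                  int mask)
        {
                struct io_uring_sqe *sqe;

                io_uring_buf_ring_add(br, data, len, bid, mask, 0);
                io_uring_buf_ring_advance(br, 1);

                sqe = io_uring_get_sqe(ring);
                io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
                sqe->flags |= IOSQE_BUFFER_SELECT;
                sqe->buf_group = bgid;
        }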
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: add generic multishot retry helper · 3e747ded
      Jens Axboe authored
      
      This is just moving io_recv_prep_retry() higher up so it can get used
      for sends as well, and renaming it to be generically useful for both
      sends and receives.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>