  1. Feb 15, 2024
    • io_uring/net: fix multishot accept overflow handling · a37ee9e1
      Jens Axboe authored
      If we hit CQ ring overflow when attempting to post a multishot accept
      completion, we don't properly save the result or return code. This
      results in losing the accepted fd value.
      
      Instead, we return the result from the poll operation that triggered
      the accept retry. This is generally POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND
      which is 0xc3, or 195, which looks like a valid file descriptor, but it
      really has no connection to that.
      
      Handle this like we do for other multishot completions - assign the
      result, and return IOU_STOP_MULTISHOT to cancel any further completions
      from this request when overflow is hit. This preserves the result, as we
      should, and tells the application that the request needs to be re-armed.
      
      Cc: stable@vger.kernel.org
      Fixes: 515e2696 ("io_uring: revert "io_uring fix multishot accept ordering"")
      Link: https://github.com/axboe/liburing/issues/1062
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a37ee9e1
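
      From the application's point of view, the effect of the fix is that an
      overflowed multishot accept now completes with the real result and
      without IORING_CQE_F_MORE set, so the request has to be re-armed. A
      minimal userspace sketch of that handling, assuming liburing (this is
      not the kernel patch itself):

          #include <liburing.h>

          static void handle_accept_cqe(struct io_uring *ring,
                                        struct io_uring_cqe *cqe, int listen_fd)
          {
                  if (cqe->res >= 0) {
                          /* cqe->res is the newly accepted file descriptor */
                  }
                  if (!(cqe->flags & IORING_CQE_F_MORE)) {
                          /* multishot stopped (e.g. after CQ overflow): re-arm it */
                          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                          if (sqe) {
                                  io_uring_prep_multishot_accept(sqe, listen_fd,
                                                                 NULL, NULL, 0);
                                  io_uring_submit(ring);
                          }
                  }
                  io_uring_cqe_seen(ring, cqe);
          }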
  2. Feb 01, 2024
    • io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffers · 72bd8025
      Jens Axboe authored
      
      If we use IORING_OP_RECV with provided buffers and pass in '0' as the
      length of the request, the length is retrieved from the selected buffer.
      If MSG_WAITALL is also set and we get a short receive, then we may hit
      the retry path which decrements sr->len and increments the buffer for
      a retry. However, the length is still zero at this point, which means
      that sr->len now becomes huge and import_ubuf() will cap it to
      MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole.
      
      Fix this by always assigning sr->len once the buffer has been selected.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ba89d2a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      72bd8025
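
      For reference, the combination that exposes the bug looks like this from
      userspace, assuming liburing (length 0, so the length comes from the
      selected buffer):

          #include <liburing.h>
          #include <sys/socket.h>

          static int arm_recv_waitall(struct io_uring *ring, int sockfd,
                                      unsigned short buf_group)
          {
                  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                  if (!sqe)
                          return -1;
                  /* len == 0: the kernel takes the length from the provided buffer */
                  io_uring_prep_recv(sqe, sockfd, NULL, 0, MSG_WAITALL);
                  sqe->flags |= IOSQE_BUFFER_SELECT;
                  sqe->buf_group = buf_group;
                  return io_uring_submit(ring);
          }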
  3. Jan 29, 2024
    • io_uring/net: limit inline multishot retries · 76b367a2
      Jens Axboe authored
      If we have multiple clients and some/all are flooding the receives to
      such an extent that we can retry a LOT handling multishot receives, then
      we can be starving some clients and hence serving traffic in an
      imbalanced fashion.
      
      Limit multishot retry attempts to some arbitrary value, whose only
      purpose is to ensure that we don't keep serving a single connection
      for way too long. We default to 32 retries, which should be more than
      enough to provide fairness, yet not so small that we'll spend too much
      time requeuing rather than handling traffic.
      
      Cc: stable@vger.kernel.org
      Depends-on: 704ea888 ("io_uring/poll: add requeue return code from poll multishot handling")
      Depends-on: 91e5d765a82f ("io_uring/net: un-indent mshot retry path in io_recv_finish()")
      Depends-on: e84b01a8 ("io_uring/poll: move poll execution helpers higher up")
      Fixes: b3fdea6e ("io_uring: multishot recv")
      Fixes: 9bb66906 ("io_uring: support multishot in recvmsg")
      Link: https://github.com/axboe/liburing/issues/1043
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      76b367a2
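
      The gist of the cap, as a rough sketch (the counter field and constant
      are assumed names for illustration, with the value 32 taken from the
      text above):

          #define MULTISHOT_MAX_RETRY	32

          /* inside the multishot receive completion path */
          if (sr->nr_multishot_loops++ >= MULTISHOT_MAX_RETRY) {
                  sr->nr_multishot_loops = 0;
                  *ret = IOU_REQUEUE;	/* retry via task_work, not inline */
                  return true;		/* stop looping on the issue side */
          }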
    • io_uring/poll: add requeue return code from poll multishot handling · 704ea888
      Jens Axboe authored
      
      Since our poll handling is edge triggered, multishot handlers retry
      internally until they know that no more data is available. In
      preparation for limiting these retries, add an internal return code,
      IOU_REQUEUE, which can be used to inform the poll backend about the
      handler wanting to retry, but that this should happen through a normal
      task_work requeue rather than keep hammering on the issue side for this
      one request.
      
      No functional changes in this patch, nobody is using this return code
      just yet.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      704ea888
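
      Conceptually, the poll side can then turn this return code into a
      task_work requeue instead of another inline issue; a rough sketch with
      assumed helper names:

          ret = io_poll_issue(req, ts);
          if (ret == IOU_REQUEUE) {
                  /* hand the request back to task_work rather than loop here */
                  io_req_task_queue(req);
                  return;
          }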
    • io_uring/net: un-indent mshot retry path in io_recv_finish() · 91e5d765
      Jens Axboe authored
      
      In preparation for putting some retry logic in there, have the done
      path just skip straight to the end rather than have too much nesting
      in here.
      
      No functional changes in this patch.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91e5d765
    • io_uring/poll: move poll execution helpers higher up · e84b01a8
      Jens Axboe authored
      
      In preparation for calling __io_poll_execute() higher up, move the
      functions to avoid forward declarations.
      
      No functional changes in this patch.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e84b01a8
    • io_uring/rw: ensure poll based multishot read retries appropriately · c79f52f0
      Jens Axboe authored
      io_read_mshot() always relies on poll triggering retries, and this works
      fine as long as we do a retry per size of the buffer being read. The
      buffer size is given by the size of the buffer(s) in the given buffer
      group ID.
      
      But if we're reading less than what is available, then we don't always
      get to read everything that is available. For example, if the buffers
      available are 32 bytes and we have 64 bytes to read, then we'll
      correctly read the first 32 bytes and then wait for another poll trigger
      before we attempt the next read. This next poll trigger may never
      happen, in which case we just sit forever and never make progress, or it
      may trigger at some point in the future, and now we're just delivering
      the available data much later than we should have.
      
      io_read_mshot() could do retries itself, but that is wasteful as we'll
      be going through all of __io_read() again, and most likely in vain.
      Rather than do that, bump our poll reference count and have
      io_poll_check_events() do one more loop and check with vfs_poll() if we
      have more data to read. If we do, io_read_mshot() will get invoked again
      directly and we'll read the next chunk.
      
      io_poll_multishot_retry() must only get called from inside
      io_poll_issue(), which is our multishot retry handler, as we know we
      already "own" the request at this point.
      
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/1041
      
      
      Fixes: fc68fcda ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c79f52f0
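
      The retry hook itself can be tiny; a sketch along the lines described,
      assuming the poll_refs reference counter used by the io_uring poll code:

          static inline void io_poll_multishot_retry(struct io_kiocb *req)
          {
                  /* the extra reference makes io_poll_check_events() loop once
                   * more and re-check readiness via vfs_poll() */
                  atomic_inc(&req->poll_refs);
          }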
  4. Jan 23, 2024
    • io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALL · 16bae3e1
      Paul Moore authored
      
      We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL
      command to take into account the security implications of making an
      io_uring-private file descriptor generally accessible to a userspace
      task.
      
      The first change in this patch is to enable auditing of the FD_INSTALL
      operation as installing a file descriptor into a task's file descriptor
      table is a security relevant operation and something that admins/users
      may want to audit.
      
      The second change is to disable the io_uring credential override
      functionality, also known as io_uring "personalities", in the
      FD_INSTALL command.  The credential override in FD_INSTALL is
      particularly problematic as it affects the credentials used in the
      security_file_receive() LSM hook.  If a task were to request a
      credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM
      would incorrectly check to see if the overridden credentials of the
      io_uring were able to "receive" the file as opposed to the task's
      credentials.  After discussions upstream, it's difficult to imagine a
      use case where we would want to allow a credential override on a
      FD_INSTALL operation so we are simply going to block REQ_F_CREDS on
      IORING_OP_FIXED_FD_INSTALL operations.
      
      Fixes: dc18b89a ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL")
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      16bae3e1
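
      In code terms, the credential part of the fix amounts to a prep-time
      rejection along these lines (a sketch, not the literal patch):

          /* in the IORING_OP_FIXED_FD_INSTALL prep handler */
          if (req->flags & REQ_F_CREDS) {
                  /* no personality override: security_file_receive() must see
                   * the task's own credentials */
                  return -EPERM;
          }

      The auditing half is then a matter of no longer marking the opcode as
      audit-exempt in the opcode table.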
  5. Jan 11, 2024
    • io_uring/rsrc: improve code generation for fixed file assignment · 3f302388
      Jens Axboe authored
      
      For the normal read/write path, we have already locked the ring
      submission side when assigning the file. This causes branch
      mispredictions when we then check and try to lock again in
      io_req_set_rsrc_node(). As this is a very hot path, this matters.
      
      Add a basic helper that assumes we already have it locked,
      and use that in io_file_get_fixed().
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f302388
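
      The shape of the change, very roughly (all identifiers here are
      illustrative, not necessarily the in-tree ones): give the already-locked
      submission path a variant that skips the lock check and its branch.

          static inline void io_req_assign_rsrc_node_locked(struct io_kiocb *req,
                                                            struct io_ring_ctx *ctx)
          {
                  lockdep_assert_held(&ctx->uring_lock);
                  req->rsrc_node = ctx->rsrc_node;	/* no re-check, no branch */
          }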
  6. Jan 10, 2024
    • io_uring/rw: cleanup io_rw_done() · fe80eb15
      Jens Axboe authored
      
      This originally came from the aio side, and it's laid out rather oddly.
      The common case here is that we either get -EIOCBQUEUED from submitting
      an async request, or that we complete the request correctly with the
      given number of bytes. Handling the odd internal restart error codes
      is not a common operation.
      
      Lay it out a bit more optimally, in a way that better explains the normal flow,
      and switch to avoiding the indirect call completely as this is our
      kiocb and we know the completion handler can only be one of two
      possible variants. While at it, move it to where it belongs in the
      file, with fellow end IO helpers.
      
      Outside of being easier to read, this also reduces the text size of the
      function by 24 bytes for me on arm64.
      
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fe80eb15
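
      The devirtualization boils down to something like this (a sketch; the
      handler names match those used elsewhere in io_uring/rw.c, but treat the
      exact condition as illustrative):

          /* our kiocb: the completion can only be one of two known handlers */
          if (kiocb->ki_complete == io_complete_rw_iopoll)
                  io_complete_rw_iopoll(kiocb, ret);
          else
                  io_complete_rw(kiocb, ret);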
  7. Jan 04, 2024
    • io_uring: ensure local task_work is run on wait timeout · 6ff1407e
      Jens Axboe authored
      
      A previous commit added an earlier break condition here, which is fine if
      we're using non-local task_work as it'll be run on return to userspace.
      However, if DEFER_TASKRUN is used, then we could be leaving local
      task_work that is ready to process in the ctx list until the next time
      we enter the kernel to wait for events.
      
      Move the break condition to _after_ we have run task_work.
      
      Cc: stable@vger.kernel.org
      Fixes: 846072f1 ("io_uring: mimimise io_cqring_wait_schedule")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6ff1407e
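
      Simplified, the wait loop then has this shape (a sketch of the ordering
      only, with helper calls simplified, not the literal io_cqring_wait()
      code):

          do {
                  /* DEFER_TASKRUN: flush ring-local task_work first ... */
                  io_run_local_work(ctx);
                  /* ... and only then decide whether the wait can end */
                  if (io_should_wake(iowq))
                          break;
                  schedule();
          } while (!signal_pending(current));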
  8. Dec 29, 2023
    • io_uring: use mempool KASAN hook · 8ab3b097
      Andrey Konovalov authored
      Use the proper kasan_mempool_unpoison_object hook for unpoisoning cached
      objects.
      
      A future change might also update io_uring to check the return value of
      kasan_mempool_poison_object to prevent double-free and invalid-free bugs. 
      This proves to be non-trivial with the current way io_uring caches
      objects, so this is left out-of-scope of this series.
      
      Link: https://lkml.kernel.org/r/eca18d6cbf676ed784f1a1f209c386808a8087c5.1703024586.git.andreyknvl@google.com
      
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ab3b097
    • kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_object · 280ec6cc
      Andrey Konovalov authored
      Patch series "kasan: save mempool stack traces".
      
      This series updates KASAN to save alloc and free stack traces for
      secondary-level allocators that cache and reuse allocations internally
      instead of giving them back to the underlying allocator (e.g.  mempool).
      
      As a part of this change, introduce and document a set of KASAN hooks:
      
      bool kasan_mempool_poison_pages(struct page *page, unsigned int order);
      void kasan_mempool_unpoison_pages(struct page *page, unsigned int order);
      bool kasan_mempool_poison_object(void *ptr);
      void kasan_mempool_unpoison_object(void *ptr, size_t size);
      
      and use them in the mempool code.
      
      Besides mempool, skbuff and io_uring also cache allocations and already
      use KASAN hooks to poison those.  Their code is updated to use the new
      mempool hooks.
      
      The new hooks save alloc and free stack traces (for normal kmalloc and
      slab objects; stack traces for large kmalloc objects and page_alloc are
      not supported by KASAN yet), improve the readability of the users' code,
      and also allow the users to prevent double-free and invalid-free bugs; see
      the patches for the details.
      
      
      This patch (of 21):
      
      Rename kasan_slab_free_mempool to kasan_mempool_poison_object.
      
      kasan_slab_free_mempool is a slightly confusing name: it is unclear
      whether this function poisons the object when it is freed into mempool or
      does something when the object is freed from mempool to the underlying
      allocator.
      
      The new name also aligns with other mempool-related KASAN hooks added in
      the following patches in this series.
      
      Link: https://lkml.kernel.org/r/cover.1703024586.git.andreyknvl@google.com
      Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com
      
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      280ec6cc
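
      The intended pairing of the poison/unpoison hooks for a caching
      allocator looks roughly like this (a generic sketch, not the mempool
      code itself):

          #include <linux/kasan.h>
          #include <linux/kernel.h>

          struct obj_cache {
                  void		*slots[64];
                  unsigned int	nr;
                  size_t		obj_size;
          };

          static void cache_put(struct obj_cache *c, void *obj)
          {
                  /* bail if the cache is full (real code would free the object)
                   * or if KASAN flags a double-free/invalid-free */
                  if (c->nr >= ARRAY_SIZE(c->slots) ||
                      !kasan_mempool_poison_object(obj))
                          return;
                  c->slots[c->nr++] = obj;	/* cached, now poisoned */
          }

          static void *cache_get(struct obj_cache *c)
          {
                  void *obj;

                  if (!c->nr)
                          return NULL;
                  obj = c->slots[--c->nr];
                  kasan_mempool_unpoison_object(obj, c->obj_size);
                  return obj;
          }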
  9. Dec 21, 2023
    • io_uring/kbuf: add method for returning provided buffer ring head · d293b1a8
      Jens Axboe authored
      The tail of the provided ring buffer is shared between the kernel and
      the application, but the head is private to the kernel as the
      application doesn't need to see it. However, this also prevents the
      application from knowing how many buffers the kernel has consumed.
      Usually this is fine, as the information is inherently racy in that
      the kernel could be consuming buffers continually, but for cleanup
      purposes it may be relevant to know how many buffers are still left
      in the ring.
      
      Add IORING_REGISTER_PBUF_STATUS which will return status for a given
      provided buffer ring. Right now it just returns the head, but space
      is reserved for more information later on, if needed.
      
      Link: https://github.com/axboe/liburing/discussions/1020
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d293b1a8
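
      From userspace the query is a plain io_uring_register(2) call; a sketch
      using the raw syscall (the struct layout follows the uapi addition
      described above, so verify the field names against your headers):

          #include <linux/io_uring.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* on success, *head tells how far the kernel has consumed the ring */
          static int pbuf_ring_head(int ring_fd, unsigned int buf_group,
                                    unsigned int *head)
          {
                  struct io_uring_buf_status st;

                  memset(&st, 0, sizeof(st));
                  st.buf_group = buf_group;
                  if (syscall(__NR_io_uring_register, ring_fd,
                              IORING_REGISTER_PBUF_STATUS, &st, 1) < 0)
                          return -1;
                  *head = st.head;
                  return 0;
          }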
    • io_uring/rw: ensure io->bytes_done is always initialized · 0a535edd
      Jens Axboe authored
      
      If IOSQE_ASYNC is set and we fail importing an iovec for a readv or
      writev request, then we leave ->bytes_done uninitialized and hence the
      eventual failure CQE posted can potentially have a random res value
      rather than the expected -EINVAL.
      
      Setup ->bytes_done before potentially failing, so we have a consistent
      value if we fail the request early.
      
      Cc: stable@vger.kernel.org
      Reported-by: xingwei lee <xrivendell7@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0a535edd
  10. Dec 14, 2023
    • io_uring/cmd: fix breakage in SOCKET_URING_OP_SIOC* implementation · 1ba0e9d6
      Al Viro authored
      
      	In 8e9fad0e "io_uring: Add io_uring command support for sockets"
      you've got an include of asm-generic/ioctls.h done in io_uring/uring_cmd.c.
      That had been done for the sake of this chunk -
      +               ret = prot->ioctl(sk, SIOCINQ, &arg);
      +               if (ret)
      +                       return ret;
      +               return arg;
      +       case SOCKET_URING_OP_SIOCOUTQ:
      +               ret = prot->ioctl(sk, SIOCOUTQ, &arg);
      
      SIOC{IN,OUT}Q are defined to symbols (FIONREAD and TIOCOUTQ) that come from
      ioctls.h, all right, but the values vary by the architecture.
      
      FIONREAD is
      	0x467F on mips
      	0x4004667F on alpha, powerpc and sparc
      	0x8004667F on sh and xtensa
      	0x541B everywhere else
      TIOCOUTQ is
      	0x7472 on mips
      	0x40047473 on alpha, powerpc and sparc
      	0x80047473 on sh and xtensa
      	0x5411 everywhere else
      
      ->ioctl() expects the same values it would've gotten from userland; all
      places where we compare with SIOC{IN,OUT}Q are using asm/ioctls.h, so
      they pick the correct values.  io_uring_cmd_sock(), OTOH, ends up
      passing the default ones.
      
      Fixes: 8e9fad0e ("io_uring: Add io_uring command support for sockets")
      Cc:  <stable@vger.kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20231214213408.GT1674809@ZenIV
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1ba0e9d6
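
      The fix is then essentially a header swap so that the arch-specific
      SIOCINQ/SIOCOUTQ values are used; sketched in the same style as the
      quoted chunk:

          -#include <asm-generic/ioctls.h>
          +#include <asm/ioctls.h>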
  11. Dec 13, 2023
    • io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE · 595e5228
      Jens Axboe authored
      
      There are a few quirks around using lazy wake for poll unconditionally,
      and one of them is related to EPOLLEXCLUSIVE. Those may trigger
      exclusive wakeups, which wake a limited number of entries in the wait
      queue. If that wake number is less than the number of entries someone is
      waiting for (and that someone is also using DEFER_TASKRUN), then we can
      get stuck waiting for more entries while we should be processing the ones
      we already got.
      
      If we're doing exclusive poll waits, flag the request as not being
      compatible with lazy wakeups.
      
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Fixes: 6ce4a93d ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      595e5228
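
      The approach, sketched with the flag spelled out as described (treat the
      exact identifiers as illustrative):

          /* at poll arm time: remember that lazy wakeups are not allowed */
          if (poll->events & EPOLLEXCLUSIVE)
                  req->flags |= REQ_F_POLL_NO_LAZY;

          /* when queueing the completion task_work */
          unsigned flags = (req->flags & REQ_F_POLL_NO_LAZY) ?
                           0 : IOU_F_TWQ_LAZY_WAKE;
          __io_req_task_work_add(req, flags);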
  12. Nov 28, 2023
    • io_uring: use fget/fput consistently · 73363c26
      Jens Axboe authored
      Normally within a syscall it's fine to use fdget/fdput for grabbing a
      file from the file table, and it's fine within io_uring as well. We do
      that via io_uring_enter(2), io_uring_register(2), and then also for
      cancel which is invoked from the latter. io_uring cannot close its own
      file descriptors as that is explicitly rejected, and for the cancel
      side of things, the file itself is just used as a lookup cookie.
      
      However, it is more prudent to ensure that full references are always
      grabbed. For anything threaded, either explicitly in the application
      itself or through use of the io-wq worker threads, this is what happens
      anyway. Generalize it and use fget/fput throughout.
      
      Also see the below link for more details.
      
      Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/
      
      
      Suggested-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      73363c26
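
      The pattern change itself, shown generically (the general idiom, not the
      io_uring diff): take a full reference so a concurrent close from another
      thread cannot pull the file out from under the operation.

          static int use_file(unsigned int fd)
          {
                  struct file *file = fget(fd);	/* full reference, unlike fdget() */

                  if (!file)
                          return -EBADF;
                  /* the file stays valid here even if fd is closed elsewhere */
                  fput(file);
                  return 0;
          }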