  1. Feb 01, 2024
• io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffers · 72bd8025
      Jens Axboe authored
      
      If we use IORING_OP_RECV with provided buffers and pass in '0' as the
      length of the request, the length is retrieved from the selected buffer.
      If MSG_WAITALL is also set and we get a short receive, then we may hit
      the retry path which decrements sr->len and increments the buffer for
      a retry. However, the length is still zero at this point, which means
      that sr->len now becomes huge and import_ubuf() will cap it to
      MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole.
      
      Fix this by always assigning sr->len once the buffer has been selected.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ba89d2a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      72bd8025
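
For readers following along from userspace, below is a minimal liburing sketch (assuming liburing >= 2.4 for io_uring_setup_buf_ring()) of the request shape this fix is about: a zero-length IORING_OP_RECV with IOSQE_BUFFER_SELECT and MSG_WAITALL, where the kernel picks both the buffer and the effective length. The buffer group id, buffer count and sizes are arbitrary illustrative choices, not the reproducer.

#include <liburing.h>
#include <stdlib.h>
#include <sys/socket.h>

#define BGID		1
#define NBUFS		8	/* must be a power of two */
#define BUF_SIZE	4096

/* Queue a recv whose length comes from the selected provided buffer */
static int queue_recv_waitall(struct io_uring *ring, int sockfd)
{
	struct io_uring_buf_ring *br;
	struct io_uring_sqe *sqe;
	int i, ret;

	/* provided buffer ring: the kernel selects one of these at recv time */
	br = io_uring_setup_buf_ring(ring, NBUFS, BGID, 0, &ret);
	if (!br)
		return ret;
	for (i = 0; i < NBUFS; i++)
		io_uring_buf_ring_add(br, malloc(BUF_SIZE), BUF_SIZE, i,
				      io_uring_buf_ring_mask(NBUFS), i);
	io_uring_buf_ring_advance(br, NBUFS);

	sqe = io_uring_get_sqe(ring);
	/* len == 0: sr->len is taken from the selected buffer */
	io_uring_prep_recv(sqe, sockfd, NULL, 0, MSG_WAITALL);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	return io_uring_submit(ring);
}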
  2. Jan 29, 2024
• io_uring/net: limit inline multishot retries · 76b367a2
      Jens Axboe authored
If we have multiple clients and some or all of them flood receives to
such an extent that multishot receive handling retries a lot, then we
can end up starving some clients and hence serving traffic in an
imbalanced fashion.

Limit multishot retry attempts to an arbitrary value, whose only purpose
is to ensure that we don't keep serving a single connection for too
long. We default to 32 retries, which should provide fairness without
being so small that we spend more time requeuing than handling traffic.
      
      Cc: stable@vger.kernel.org
      Depends-on: 704ea888 ("io_uring/poll: add requeue return code from poll multishot handling")
Depends-on: 91e5d765a82f ("io_uring/net: un-indent mshot retry path in io_recv_finish()")
      Depends-on: e84b01a8 ("io_uring/poll: move poll execution helpers higher up")
      Fixes: b3fdea6e ("io_uring: multishot recv")
      Fixes: 9bb66906 ("io_uring: support multishot in recvmsg")
      Link: https://github.com/axboe/liburing/issues/1043
      
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      76b367a2
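
The patch itself is not shown here, so the following is only a generic sketch of the capping pattern the message describes: count inline retries and, past a threshold, hand the work back via a requeue-style return code. All names and the structure are illustrative, not the actual kernel symbols.

/* Illustrative only: cap inline retries for fairness */
enum {
	RETRY_INLINE,	/* keep retrying from the current context */
	RETRY_REQUEUE,	/* punt to task_work, in the spirit of IOU_REQUEUE */
};

#define MSHOT_MAX_RETRY	32	/* the arbitrary fairness cap */

struct mshot_state {
	unsigned int nr_retries;
};

static int mshot_next_step(struct mshot_state *st)
{
	if (++st->nr_retries > MSHOT_MAX_RETRY) {
		st->nr_retries = 0;		/* start fresh next round */
		return RETRY_REQUEUE;		/* let other requests run */
	}
	return RETRY_INLINE;
}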
• io_uring/poll: add requeue return code from poll multishot handling · 704ea888
      Jens Axboe authored
      
Since our poll handling is edge triggered, multishot handlers retry
internally until they know that no more data is available. In
preparation for limiting these retries, add an internal return code,
IOU_REQUEUE, which a handler can use to tell the poll backend that it
wants to retry, but that the retry should happen through a normal
task_work requeue rather than by hammering on the issue side for this
one request.
      
      No functional changes in this patch, nobody is using this return code
      just yet.
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      704ea888
• io_uring/net: un-indent mshot retry path in io_recv_finish() · 91e5d765
      Jens Axboe authored
      
      In preparation for putting some retry logic in there, have the done
      path just skip straight to the end rather than have too much nesting
      in here.
      
      No functional changes in this patch.
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91e5d765
• io_uring/poll: move poll execution helpers higher up · e84b01a8
      Jens Axboe authored
      
      In preparation for calling __io_poll_execute() higher up, move the
      functions to avoid forward declarations.
      
      No functional changes in this patch.
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e84b01a8
• io_uring/rw: ensure poll based multishot read retries appropriately · c79f52f0
      Jens Axboe authored
      io_read_mshot() always relies on poll triggering retries, and this works
      fine as long as we do a retry per size of the buffer being read. The
      buffer size is given by the size of the buffer(s) in the given buffer
      group ID.
      
      But if we're reading less than what is available, then we don't always
      get to read everything that is available. For example, if the buffers
      available are 32 bytes and we have 64 bytes to read, then we'll
      correctly read the first 32 bytes and then wait for another poll trigger
      before we attempt the next read. This next poll trigger may never
      happen, in which case we just sit forever and never make progress, or it
      may trigger at some point in the future, and now we're just delivering
      the available data much later than we should have.
      
      io_read_mshot() could do retries itself, but that is wasteful as we'll
      be going through all of __io_read() again, and most likely in vain.
      Rather than do that, bump our poll reference count and have
      io_poll_check_events() do one more loop and check with vfs_poll() if we
      have more data to read. If we do, io_read_mshot() will get invoked again
      directly and we'll read the next chunk.
      
      io_poll_multishot_retry() must only get called from inside
      io_poll_issue(), which is our multishot retry handler, as we know we
      already "own" the request at this point.
      
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/1041
      
      
      Fixes: fc68fcda ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c79f52f0
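
From userspace, the situation described above corresponds to a multishot read backed by provided buffers that are smaller than the data already queued up. A hedged liburing sketch of arming such a read, assuming liburing >= 2.5 for io_uring_prep_read_multishot() and an arbitrary buffer group id:

#include <errno.h>
#include <liburing.h>

#define BGID	7	/* buffer group registered elsewhere, e.g. 32-byte buffers */

static int start_mshot_read(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	/* each completion consumes one buffer from group BGID */
	io_uring_prep_read_multishot(sqe, fd, 0, 0, BGID);
	return io_uring_submit(ring);
}

With 64 bytes already available and 32-byte buffers, the first CQE delivers the first chunk; after this fix the kernel re-checks via vfs_poll() and delivers the second chunk without waiting for another edge-triggered wakeup.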
  3. Jan 23, 2024
• io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALL · 16bae3e1
      Paul Moore authored
      
      We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL
      command to take into account the security implications of making an
      io_uring-private file descriptor generally accessible to a userspace
      task.
      
      The first change in this patch is to enable auditing of the FD_INSTALL
      operation as installing a file descriptor into a task's file descriptor
      table is a security relevant operation and something that admins/users
      may want to audit.
      
      The second change is to disable the io_uring credential override
      functionality, also known as io_uring "personalities", in the
      FD_INSTALL command.  The credential override in FD_INSTALL is
      particularly problematic as it affects the credentials used in the
      security_file_receive() LSM hook.  If a task were to request a
      credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM
      would incorrectly check to see if the overridden credentials of the
      io_uring were able to "receive" the file as opposed to the task's
      credentials.  After discussions upstream, it's difficult to imagine a
      use case where we would want to allow a credential override on a
      FD_INSTALL operation so we are simply going to block REQ_F_CREDS on
      IORING_OP_FIXED_FD_INSTALL operations.
      
      Fixes: dc18b89a ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL")
Signed-off-by: Paul Moore <paul@paul-moore.com>
      Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com
      
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      16bae3e1
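
For context, a hedged userspace sketch of the operation whose credential handling is being restricted, assuming liburing >= 2.6 provides io_uring_prep_fixed_fd_install() and that 'fixed_slot' is a hypothetical index previously registered with io_uring_register_files(). The point relevant to this patch is that sqe->personality (a credential override) must now be left unset for this op.

#include <liburing.h>

static int install_fixed_fd(struct io_uring *ring, int fixed_slot)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int newfd, ret;

	io_uring_prep_fixed_fd_install(sqe, fixed_slot, 0);
	/* do NOT set sqe->personality: the override is now rejected */
	io_uring_submit(ring);

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret)
		return ret;
	newfd = cqe->res;	/* >= 0: a regular fd in the task's fd table */
	io_uring_cqe_seen(ring, cqe);
	return newfd;
}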
  4. Jan 17, 2024
  5. Jan 11, 2024
• io_uring/rsrc: improve code generation for fixed file assignment · 3f302388
      Jens Axboe authored
      
For the normal read/write path, we have already locked the ring
submission side when assigning the file. This causes branch
mispredictions when we then check and try to lock it again in
io_req_set_rsrc_node(). As this is a very hot path, this matters.

Add a basic helper that assumes the lock is already held, and use
that in io_file_get_fixed().
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f302388
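
A generic userspace sketch of the split described above, using a pthread mutex purely to illustrate the pattern of adding a *_locked helper for callers that already hold the lock; it is not the io_uring code.

#include <pthread.h>

struct table {
	pthread_mutex_t lock;
	int refs;
};

/* hot path: the caller is known to hold t->lock already */
static void table_ref_locked(struct table *t)
{
	t->refs++;
}

/* general-purpose variant: takes and drops the lock itself */
static void table_ref(struct table *t)
{
	pthread_mutex_lock(&t->lock);
	table_ref_locked(t);
	pthread_mutex_unlock(&t->lock);
}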
  6. Jan 10, 2024
• io_uring/rw: cleanup io_rw_done() · fe80eb15
      Jens Axboe authored
      
      This originally came from the aio side, and it's laid out rather oddly.
      The common case here is that we either get -EIOCBQUEUED from submitting
      an async request, or that we complete the request correctly with the
      given number of bytes. Handling the odd internal restart error codes
      is not a common operation.
      
Lay it out a bit more optimally so that it better explains the normal
flow, and avoid the indirect call completely: this is our kiocb and we
know the completion handler can only be one of two possible variants.
While at it, move it to where it belongs in the file, with its fellow
end-IO helpers.
      
      Outside of being easier to read, this also reduces the text size of the
      function by 24 bytes for me on arm64.
      
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fe80eb15
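
A generic sketch of the devirtualization the message describes, with hypothetical names: when a function pointer can only ever be one of two known handlers, comparing and calling directly replaces the indirect call.

#include <assert.h>

typedef void (*complete_fn)(long res);

static void complete_inline(long res) { (void)res; /* ... */ }
static void complete_async(long res)  { (void)res; /* ... */ }

static void rw_done(complete_fn fn, long res)
{
	/* instead of the indirect call fn(res): */
	if (fn == complete_inline) {
		complete_inline(res);
	} else {
		assert(fn == complete_async);
		complete_async(res);
	}
}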
  7. Jan 04, 2024
• io_uring: ensure local task_work is run on wait timeout · 6ff1407e
      Jens Axboe authored
      
      A previous commit added an earlier break condition here, which is fine if
      we're using non-local task_work as it'll be run on return to userspace.
However, if DEFER_TASKRUN is used, then we could leave local task_work
that is ready to process sitting in the ctx list until the next time we
enter the kernel to wait for events.
      
      Move the break condition to _after_ we have run task_work.
      
      Cc: stable@vger.kernel.org
      Fixes: 846072f1 ("io_uring: mimimise io_cqring_wait_schedule")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6ff1407e
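
A generic sketch of the ordering change described above, with hypothetical names: the loop runs any pending deferred work before it evaluates the condition that lets it stop waiting.

#include <stdbool.h>

/* stand-ins for the real helpers; declarations only, for the sketch */
bool have_pending_local_work(void);
void run_local_work(void);
bool wait_satisfied(void);
void sleep_for_events(void);

static void wait_loop(void)
{
	do {
		if (have_pending_local_work())
			run_local_work();
		/* the break now sits _after_ the work has been run */
		if (wait_satisfied())
			break;
		sleep_for_events();
	} while (1);
}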
  8. Dec 29, 2023
• io_uring: use mempool KASAN hook · 8ab3b097
      Andrey Konovalov authored
      Use the proper kasan_mempool_unpoison_object hook for unpoisoning cached
      objects.
      
A future change might also update io_uring to check the return value of
kasan_mempool_poison_object to prevent double-free and invalid-free bugs.
This proves to be non-trivial with the way io_uring currently caches
objects, so it is left out of scope for this series.
      
      Link: https://lkml.kernel.org/r/eca18d6cbf676ed784f1a1f209c386808a8087c5.1703024586.git.andreyknvl@google.com
      
      
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ab3b097
• kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_object · 280ec6cc
      Andrey Konovalov authored
      Patch series "kasan: save mempool stack traces".
      
      This series updates KASAN to save alloc and free stack traces for
      secondary-level allocators that cache and reuse allocations internally
      instead of giving them back to the underlying allocator (e.g.  mempool).
      
      As a part of this change, introduce and document a set of KASAN hooks:
      
      bool kasan_mempool_poison_pages(struct page *page, unsigned int order);
      void kasan_mempool_unpoison_pages(struct page *page, unsigned int order);
      bool kasan_mempool_poison_object(void *ptr);
      void kasan_mempool_unpoison_object(void *ptr, size_t size);
      
      and use them in the mempool code.
      
      Besides mempool, skbuff and io_uring also cache allocations and already
      use KASAN hooks to poison those.  Their code is updated to use the new
      mempool hooks.
      
      The new hooks save alloc and free stack traces (for normal kmalloc and
      slab objects; stack traces for large kmalloc objects and page_alloc are
      not supported by KASAN yet), improve the readability of the users' code,
      and also allow the users to prevent double-free and invalid-free bugs; see
      the patches for the details.
      
      
      This patch (of 21):
      
      Rename kasan_slab_free_mempool to kasan_mempool_poison_object.
      
      kasan_slab_free_mempool is a slightly confusing name: it is unclear
      whether this function poisons the object when it is freed into mempool or
      does something when the object is freed from mempool to the underlying
      allocator.
      
      The new name also aligns with other mempool-related KASAN hooks added in
      the following patches in this series.
      
      Link: https://lkml.kernel.org/r/cover.1703024586.git.andreyknvl@google.com
      Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com
      
      
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      280ec6cc
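
Given the four hook signatures listed above, here is a minimal sketch of the pairing a recycling cache (mempool, skbuff, io_uring) is expected to use: poison on insertion into the cache, unpoison on removal. The one-slot cache structure is made up for illustration; only the kasan_mempool_* calls reflect the documented API.

#include <linux/kasan.h>
#include <linux/slab.h>

struct one_slot_cache {
	void *slot;
	size_t obj_size;
};

static void cache_put(struct one_slot_cache *c, void *obj)
{
	if (c->slot) {
		kfree(obj);		/* cache full: free normally */
		return;
	}
	/* false means KASAN flagged this free as buggy: don't recycle it */
	if (!kasan_mempool_poison_object(obj))
		return;
	c->slot = obj;			/* cached and poisoned */
}

static void *cache_get(struct one_slot_cache *c)
{
	void *obj = c->slot;

	if (!obj)
		return kmalloc(c->obj_size, GFP_KERNEL);
	c->slot = NULL;
	/* unpoison before handing the cached object back to the caller */
	kasan_mempool_unpoison_object(obj, c->obj_size);
	return obj;
}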
  9. Dec 21, 2023
• io_uring/kbuf: add method for returning provided buffer ring head · d293b1a8
      Jens Axboe authored
      The tail of the provided ring buffer is shared between the kernel and
      the application, but the head is private to the kernel as the
      application doesn't need to see it. However, this also prevents the
      application from knowing how many buffers the kernel has consumed.
      Usually this is fine, as the information is inherently racy in that
      the kernel could be consuming buffers continually, but for cleanup
      purposes it may be relevant to know how many buffers are still left
      in the ring.
      
      Add IORING_REGISTER_PBUF_STATUS which will return status for a given
provided buffer ring. Right now it just returns the head, but space
is reserved for returning more information later on, if needed.
      
      Link: https://github.com/axboe/liburing/discussions/1020
      
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d293b1a8
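
A hedged userspace sketch of the new query, issued through the raw register syscall and assuming headers new enough (the uapi this commit adds) to define IORING_REGISTER_PBUF_STATUS and struct io_uring_buf_status with buf_group as input and head as output. Treat it as illustrative; newer liburing releases provide a wrapper as well.

#include <errno.h>
#include <liburing.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* returns 0 and fills *head with how far the kernel has consumed the ring */
static int pbuf_ring_head(struct io_uring *ring, unsigned int bgid,
			  unsigned int *head)
{
	struct io_uring_buf_status st;
	int ret;

	memset(&st, 0, sizeof(st));
	st.buf_group = bgid;
	ret = syscall(__NR_io_uring_register, ring->ring_fd,
		      IORING_REGISTER_PBUF_STATUS, &st, 1);
	if (ret < 0)
		return -errno;
	*head = st.head;
	return 0;
}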
• io_uring/rw: ensure io->bytes_done is always initialized · 0a535edd
      Jens Axboe authored
      
      If IOSQE_ASYNC is set and we fail importing an iovec for a readv or
      writev request, then we leave ->bytes_done uninitialized and hence the
      eventual failure CQE posted can potentially have a random res value
      rather than the expected -EINVAL.
      
Set up ->bytes_done before potentially failing, so we have a consistent
      value if we fail the request early.
      
      Cc: stable@vger.kernel.org
Reported-by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0a535edd
  10. Dec 19, 2023
  11. Dec 14, 2023
• io_uring/cmd: fix breakage in SOCKET_URING_OP_SIOC* implementation · 1ba0e9d6
      Al Viro authored
      
	In 8e9fad0e ("io_uring: Add io_uring command support for sockets"),
an include of asm-generic/ioctls.h was added to io_uring/uring_cmd.c.
That was done for the sake of this chunk:
      +               ret = prot->ioctl(sk, SIOCINQ, &arg);
      +               if (ret)
      +                       return ret;
      +               return arg;
      +       case SOCKET_URING_OP_SIOCOUTQ:
      +               ret = prot->ioctl(sk, SIOCOUTQ, &arg);
      
SIOC{IN,OUT}Q are defined to symbols (FIONREAD and TIOCOUTQ) that do come
from ioctls.h, but the values vary by architecture.
      
      FIONREAD is
      	0x467F on mips
      	0x4004667F on alpha, powerpc and sparc
      	0x8004667F on sh and xtensa
      	0x541B everywhere else
      TIOCOUTQ is
      	0x7472 on mips
      	0x40047473 on alpha, powerpc and sparc
      	0x80047473 on sh and xtensa
      	0x5411 everywhere else
      
      ->ioctl() expects the same values it would've gotten from userland; all
      places where we compare with SIOC{IN,OUT}Q are using asm/ioctls.h, so
      they pick the correct values.  io_uring_cmd_sock(), OTOH, ends up
      passing the default ones.
      
      Fixes: 8e9fad0e ("io_uring: Add io_uring command support for sockets")
      Cc:  <stable@vger.kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20231214213408.GT1674809@ZenIV
      
      
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1ba0e9d6
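
The fix itself is kernel-internal (use the per-architecture SIOCINQ/SIOCOUTQ values rather than the asm-generic ones), but for context here is a hedged sketch of how the affected interface is driven from userspace, assuming liburing >= 2.5 for io_uring_prep_cmd_sock().

#include <liburing.h>

/* returns the number of bytes queued for reading on the socket, or < 0 */
static int sock_bytes_readable(struct io_uring *ring, int sockfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	io_uring_prep_cmd_sock(sqe, SOCKET_URING_OP_SIOCINQ, sockfd, 0, 0, NULL, 0);
	io_uring_submit(ring);

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret)
		return ret;
	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	return ret;
}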
  12. Dec 13, 2023
• io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE · 595e5228
      Jens Axboe authored
      
There are a few quirks around using lazy wake for poll unconditionally,
and one of them is related to EPOLLEXCLUSIVE. Those may trigger
      exclusive wakeups, which wake a limited number of entries in the wait
      queue. If that wake number is less than the number of entries someone is
      waiting for (and that someone is also using DEFER_TASKRUN), then we can
      get stuck waiting for more entries while we should be processing the ones
      we already got.
      
      If we're doing exclusive poll waits, flag the request as not being
      compatible with lazy wakeups.
      
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Fixes: 6ce4a93d ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      595e5228
  13. Dec 12, 2023
  14. Dec 09, 2023
  15. Dec 07, 2023
  16. Dec 05, 2023
  17. Dec 04, 2023
  18. Dec 02, 2023
  19. Nov 28, 2023
• io_uring: use fget/fput consistently · 73363c26
      Jens Axboe authored
      Normally within a syscall it's fine to use fdget/fdput for grabbing a
      file from the file table, and it's fine within io_uring as well. We do
      that via io_uring_enter(2), io_uring_register(2), and then also for
      cancel which is invoked from the latter. io_uring cannot close its own
      file descriptors as that is explicitly rejected, and for the cancel
      side of things, the file itself is just used as a lookup cookie.
      
      However, it is more prudent to ensure that full references are always
      grabbed. For anything threaded, either explicitly in the application
      itself or through use of the io-wq worker threads, this is what happens
      anyway. Generalize it and use fget/fput throughout.
      
See the link below for more details.
      
      Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/
      
      
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      73363c26
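
For readers unfamiliar with the two reference styles being discussed, a small kernel-style sketch (not the io_uring code) contrasting them: fdget() can hand back a borrowed reference for single-threaded use within a syscall, while fget() always takes a full reference to the struct file, which is what the patch standardizes on.

#include <linux/errno.h>
#include <linux/file.h>

static int with_fdget(int fd)
{
	struct fd f = fdget(fd);	/* lightweight, possibly borrowed */

	if (!f.file)
		return -EBADF;
	/* ... use f.file within this syscall only ... */
	fdput(f);
	return 0;
}

static int with_fget(int fd)
{
	struct file *file = fget(fd);	/* always a full reference */

	if (!file)
		return -EBADF;
	/* ... safe even if another thread closes fd concurrently ... */
	fput(file);
	return 0;
}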
• io_uring: free io_buffer_list entries via RCU · 5cf4f52e
      Jens Axboe authored
      
      mmap_lock nests under uring_lock out of necessity, as we may be doing
      user copies with uring_lock held. However, for mmap of provided buffer
      rings, we attempt to grab uring_lock with mmap_lock already held from
      do_mmap(). This makes lockdep, rightfully, complain:
      
      WARNING: possible circular locking dependency detected
      6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted
      ------------------------------------------------------
      buf-ring.t/442 is trying to acquire lock:
      ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140
      
      but task is already holding lock:
      ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&mm->mmap_lock){++++}-{3:3}:
             __might_fault+0x90/0xbc
             io_register_pbuf_ring+0x94/0x488
             __arm64_sys_io_uring_register+0x8dc/0x1318
             invoke_syscall+0x5c/0x17c
             el0_svc_common.constprop.0+0x108/0x130
             do_el0_svc+0x2c/0x38
             el0_svc+0x4c/0x94
             el0t_64_sync_handler+0x118/0x124
             el0t_64_sync+0x168/0x16c
      
      -> #0 (&ctx->uring_lock){+.+.}-{3:3}:
             __lock_acquire+0x19a0/0x2d14
             lock_acquire+0x2e0/0x44c
             __mutex_lock+0x118/0x564
             mutex_lock_nested+0x20/0x28
             io_uring_validate_mmap_request.isra.0+0x4c/0x140
             io_uring_mmu_get_unmapped_area+0x3c/0x98
             get_unmapped_area+0xa4/0x158
             do_mmap+0xec/0x5b4
             vm_mmap_pgoff+0x158/0x264
             ksys_mmap_pgoff+0x1d4/0x254
             __arm64_sys_mmap+0x80/0x9c
             invoke_syscall+0x5c/0x17c
             el0_svc_common.constprop.0+0x108/0x130
             do_el0_svc+0x2c/0x38
             el0_svc+0x4c/0x94
             el0t_64_sync_handler+0x118/0x124
             el0t_64_sync+0x168/0x16c
      
From that mmap(2) path, we really just need to ensure that the buffer
list doesn't go away from underneath us. The lower indexed entries never
go away until the ring is freed, so we can always safely reference those
as long as the caller has a file reference. For the higher indexed ones
in our xarray, we just need to ensure that the buffer list remains valid
while we return its address.
      
      Free the higher indexed io_buffer_list entries via RCU. With that we can
      avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read
      lock around the buffer list lookup and address check.
      
To ensure that the arrayed lookup only returns a valid, fully formed
entry via RCU lookup, add an 'is_ready' flag that we access with
store-release and load-acquire memory ordering. This isn't needed for
the xarray lookups, but it doesn't hurt either. Since this isn't a fast
path, retain it across both types. Similarly, for the allocated array
inside the ctx, ensure we use the proper load-acquire, as setup could in
theory be running in parallel with mmap.
      
      While in there, add a few lockdep checks for documentation purposes.
      
      Cc: stable@vger.kernel.org
      Fixes: c56e022c ("io_uring: add support for user mapped provided buffer ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5cf4f52e
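
A kernel-style sketch of the two techniques named above, with made-up structure and names (not the io_uring kbuf code): entries are freed via kfree_rcu() so lookups only need rcu_read_lock(), and an 'is_ready' flag is published with smp_store_release() and checked with smp_load_acquire() so readers never see a partially set up entry.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct buf_list {
	struct rcu_head rcu;
	void *ring_addr;
	int is_ready;
};

struct buf_table {
	struct buf_list __rcu *slots[64];
};

static void publish(struct buf_table *t, unsigned int idx, struct buf_list *bl)
{
	rcu_assign_pointer(t->slots[idx], bl);
	/* ... finish setting up bl->ring_addr ... */
	smp_store_release(&bl->is_ready, 1);	/* now safe to hand out */
}

static void *lookup_ring_addr(struct buf_table *t, unsigned int idx)
{
	struct buf_list *bl;
	void *addr = NULL;

	rcu_read_lock();
	bl = rcu_dereference(t->slots[idx]);
	/* pairs with the smp_store_release() in publish() */
	if (bl && smp_load_acquire(&bl->is_ready))
		addr = bl->ring_addr;
	rcu_read_unlock();
	return addr;
}

static void unpublish(struct buf_table *t, unsigned int idx, struct buf_list *bl)
{
	RCU_INIT_POINTER(t->slots[idx], NULL);
	kfree_rcu(bl, rcu);	/* readers inside rcu_read_lock() stay safe */
}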