  1. Jul 07, 2023
    • io_uring: Use io_schedule* in cqring wait · 8a796565
      Andres Freund authored
      
      I observed poor performance of io_uring compared to synchronous IO. That
      turns out to be caused by deeper CPU idle states entered with io_uring,
      due to io_uring using plain schedule(), whereas synchronous IO uses
      io_schedule().
      
      The losses due to this are substantial. On my cascade lake workstation,
      t/io_uring from the fio repository e.g. yields regressions between 20%
      and 40% with the following command:
      ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
      
      This is reproducible with different filesystems, with raw block
      devices, and with different block devices.
      
      Use io_schedule_prepare() / io_schedule_finish() in
      io_cqring_wait_schedule() to address the difference.
      
      After that using io_uring is on par or surpassing synchronous IO (using
      registered files etc makes it reliably win, but arguably is a less fair
      comparison).
      
      There are other calls to schedule() in io_uring/, but none
      immediately jumps out as similarly situated, so I did not touch
      them. Similarly, it's possible that mutex_lock_io() should be used
      somewhere, but it's not clear whether there are cases where that
      matters.
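
      As a minimal sketch of the pattern (assuming the general shape of
      io_cqring_wait_schedule() rather than quoting the actual diff):

          /*
           * Hedged sketch, not the verbatim kernel change: the existing
           * sleep is bracketed with io_schedule_prepare() /
           * io_schedule_finish(), which flag the task as blocked on IO
           * the same way io_schedule() does internally, so cpuidle and
           * cpufreq governors account the sleep as iowait.
           */
          static int cqring_wait_schedule_sketch(void)
          {
                  int token;

                  token = io_schedule_prepare();  /* sets current->in_iowait */
                  schedule();                     /* the sleep itself is unchanged */
                  io_schedule_finish(token);      /* restores previous iowait state */
                  return 0;
          }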
      
      Cc: stable@vger.kernel.org # 5.10+
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: io-uring@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Andres Freund <andres@anarazel.de>
      Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de
      
      
      [axboe: minor style fixup]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Jun 28, 2023
    • io_uring: flush offloaded and delayed task_work on exit · dfbe5561
      Jens Axboe authored
      
      io_uring offloads task_work for cancelation purposes when the task is
      exiting. This is conceptually fine, but we should be nicer and actually
      wait for that work to complete before returning.
      
      Add an argument to io_fallback_tw() telling it to flush the deferred
      work once it's all queued up, and have it flush the ctx left behind
      whenever the ctx changes.
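
      As a hedged sketch of the resulting shape (simplified from the
      description above; the ctx reference pinning in the real change is
      omitted, and the details are illustrative):

          static void io_fallback_tw_sketch(struct io_uring_task *tctx, bool sync)
          {
                  struct llist_node *node = llist_del_all(&tctx->task_list);
                  struct io_ring_ctx *last_ctx = NULL;
                  struct io_kiocb *req;

                  while (node) {
                          req = container_of(node, struct io_kiocb,
                                             io_task_work.node);
                          node = node->next;
                          /* flush the ctx left behind whenever the ctx changes */
                          if (sync && last_ctx != req->ctx) {
                                  if (last_ctx)
                                          flush_delayed_work(&last_ctx->fallback_work);
                                  last_ctx = req->ctx;
                          }
                          /* queue the request on its ctx's fallback list as before */
                          if (llist_add(&req->io_task_work.node,
                                        &req->ctx->fallback_llist))
                                  schedule_delayed_work(&req->ctx->fallback_work, 1);
                  }
                  /* flush the final ctx once everything is queued up */
                  if (sync && last_ctx)
                          flush_delayed_work(&last_ctx->fallback_work);
          }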
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Jun 14, 2023
    • io_uring/io-wq: clear current->worker_private on exit · adeaa3f2
      Jens Axboe authored
      
      A recent fix stopped clearing PF_IO_WORKER from current->flags on
      exit, which means that the worker running inc/dec accounting can
      now be called on the worker after it has been removed, if it ends
      up scheduling in/out as part of exit.

      If this happens after an RCU grace period has passed, then the
      struct pointed to by current->worker_private may have been freed,
      and we may be accessing freed memory.
      
      Ensure this doesn't happen by clearing the task worker_private field.
      Both io_wq_worker_running() and io_wq_worker_sleeping() check this
      field before going any further, and no accounting or other work
      needs to be done after the worker has exited.
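
      In sketch form (assumed shape, not the verbatim diff):

          static void io_wq_worker_exit_sketch(struct io_worker *worker)
          {
                  /* ... final accounting and teardown happen above ... */
                  current->worker_private = NULL; /* hooks below now skip this task */
                  do_exit(0);
          }

          void io_wq_worker_running(struct task_struct *tsk)
          {
                  struct io_worker *worker = tsk->worker_private;

                  if (!worker) /* worker already torn down: nothing to account */
                          return;
                  /* ... inc running accounting ... */
          }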
      
      Fixes: fd37b884 ("io_uring/io-wq: don't clear PF_IO_WORKER on exit")
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: save msghdr->msg_control for retries · cac9e441
      Jens Axboe authored

      If the application sets ->msg_control and we have to later retry this
      command, or if it got queued with IOSQE_ASYNC to begin with, then we
      need to retain the original msg_control value. This is because the
      net stack overwrites this field with an in-kernel pointer in order
      to copy the data in. Hitting that path a second time will then fail
      the copy from user, as it's attempting to copy from a non-user
      address.
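
      A hedged sketch of the idea (the io_sr_msg field follows the commit
      title; the helper names here are illustrative, not the kernel's):

          /* on first prep: stash the original user-space control pointer */
          static void io_save_msg_control_sketch(struct io_sr_msg *sr,
                                                 struct msghdr *msg)
          {
                  sr->msg_control = msg->msg_control; /* still the user pointer */
          }

          /* before a retry: restore it so the copy from user works again */
          static void io_restore_msg_control_sketch(struct io_sr_msg *sr,
                                                    struct msghdr *msg)
          {
                  msg->msg_control = sr->msg_control;
          }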
      
      Cc: stable@vger.kernel.org # 5.10+
      Link: https://github.com/axboe/liburing/issues/880
      
      
      Reported-and-tested-by: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Jun 12, 2023
    • io_uring/io-wq: don't clear PF_IO_WORKER on exit · fd37b884
      Jens Axboe authored
      
      A recent commit gated the core dumping task exit logic on current->flags
      remaining consistent in terms of PF_{IO,USER}_WORKER at task exit time.
      This exposed a problem with the io-wq handling of that, which explicitly
      clears PF_IO_WORKER before calling do_exit().
      
      The reason for this manual clearing of PF_IO_WORKER is historical:
      io-wq used to potentially trigger a sleep on exit. As the io-wq
      thread is exiting, it should not participate in any further
      accounting. But these days we don't need to rely on current->flags
      anymore, so we can safely remove the PF_IO_WORKER clearing.
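
      In sketch form (illustrative, not the exact diff):

          static void io_worker_exit_sketch(void)
          {
                  /* removed: current->flags &= ~PF_IO_WORKER; */
                  do_exit(0);
          }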
      
      Reported-by: Zorro Lang <zlang@redhat.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Link: https://lore.kernel.org/all/ZIZSPyzReZkGBEFy@dread.disaster.area/
      
      
      Fixes: f9010dbd ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • io_uring: wait interruptibly for request completions on exit · 4826c594
      Jens Axboe authored

      When the ring exits, cleanup is done and the final cancelation and
      waiting on completions is performed by io_ring_exit_work(). That
      function is invoked by a kworker, which doesn't take any signals.
      Because of that, it doesn't really matter whether we wait for
      completions in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state.
      However, it does matter to the hung task detector!
      
      Normally we expect cancelations and completions to happen rather
      quickly. Some test cases, however, will exit the ring while the
      owning task is stopped (e.g. via SIGSTOP). If the owning task needs
      to run task_work to complete requests, then io_ring_exit_work won't
      make any
      progress until the task is runnable again. Hence io_ring_exit_work can
      trigger the hung task detection, which is particularly problematic if
      panic-on-hung-task is enabled.
      
      As the ring exit doesn't take signals to begin with, have it wait
      interruptibly rather than uninterruptibly. io_uring has a separate
      stuck-exit warning that triggers independently anyway, so we're not
      really missing anything by making this switch.
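
      A minimal sketch of the switch (the loop shape and interval are
      illustrative; wait_for_completion_interruptible_timeout() is the
      stock kernel primitive, and the retry helper is a hypothetical
      stand-in for the periodic cancelation attempts):

          static void ring_exit_wait_sketch(struct io_ring_ctx *ctx)
          {
                  /*
                   * TASK_INTERRUPTIBLE sleeps are not sampled by the hung
                   * task detector; kworkers take no signals, so behavior
                   * is otherwise unchanged.
                   */
                  do {
                          io_retry_cancelations_sketch(ctx); /* hypothetical */
                          cond_resched();
                  } while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp,
                                                                      HZ / 20));
          }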
      
      Cc: stable@vger.kernel.org # 5.10+
      Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fsnotify: move fsnotify_open() hook into do_dentry_open() · 7b8c9d7b
      Amir Goldstein authored and Jan Kara committed
      
      The fsnotify_open() hook is called only from the high-level system
      call context, and not from the many helpers that open files.

      This may make sense for many of the special file open cases, but it
      is inconsistent with the fsnotify_close() hook, which is called on
      every last fput() of a file object with FMODE_OPENED.
      
      As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
      without ever observing an OPEN event.
      
      Fix this inconsistency by replacing all the fsnotify_open() hooks with
      a single hook inside do_dentry_open().
      
      If there are special cases that would like to opt out of the
      possible overhead of the fsnotify() call in fsnotify_open(), they
      would probably also want to avoid the overhead of the fsnotify()
      call in the rest of the fsnotify hooks, so they should open that
      file with the __FMODE_NONOTIFY flag.

      However, in the majority of those cases, the s_fsnotify_connectors
      optimization in fsnotify_parent() would be sufficient to avoid the
      overhead of the fsnotify() call anyway.
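
      In sketch form (simplified; backend_open() is a hypothetical
      stand-in for the real open steps in do_dentry_open()):

          static int do_dentry_open_sketch(struct file *f)
          {
                  int error = backend_open(f); /* hypothetical stand-in */

                  if (error)
                          return error;
                  fsnotify_open(f); /* one OPEN event for every FMODE_OPENED file */
                  return 0;
          }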
      
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>