- Jul 07, 2023
Andres Freund authored
I observed poor performance of io_uring compared to synchronous IO. That turns out to be caused by deeper CPU idle states entered with io_uring, due to io_uring using plain schedule(), whereas synchronous IO uses io_schedule(). The losses due to this are substantial. On my cascade lake workstation, t/io_uring from the fio repository e.g. yields regressions between 20% and 40% with the following command: ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0 This is repeatable with different filesystems, using raw block devices and using different block devices. Use io_schedule_prepare() / io_schedule_finish() in io_cqring_wait_schedule() to address the difference. After that using io_uring is on par or surpassing synchronous IO (using registered files etc makes it reliably win, but arguably is a less fair comparison). There are other calls to schedule() in io_uring/, but none immediately jump out to be similarly situated, so I did not touch them. Similarly, it's possible that mutex_lock_io() should be used, but it's not clear if there are cases where that matters. Cc: stable@vger.kernel.org # 5.10+ Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by:
Andres Freund <andres@anarazel.de> Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de [axboe: minor style fixup] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
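The fix amounts to wrapping the sleep in the same helpers io_schedule() itself uses. A minimal sketch, with the timeout handling in io_cqring_wait_schedule() elided:

    /*
     * io_schedule_prepare() sets current->in_iowait (as io_schedule()
     * would), so cpufreq/cpuidle governors see this sleep as IO wait
     * and avoid the deep idle states that caused the regression.
     */
    int token = io_schedule_prepare();

    schedule();

    io_schedule_finish(token);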
-
- Jun 28, 2023
Jens Axboe authored
io_uring offloads task_work for cancelation purposes when the task is exiting. This is conceptually fine, but we should be nicer and actually wait for that work to complete before returning. Add an argument to io_fallback_tw() telling it to flush the deferred work when it's all queued up, and have it flush each ctx left behind whenever the ctx changes. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
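Roughly what that ends up looking like; this is a sketch of the reworked fallback path, not necessarily the verbatim patch:

    static void io_fallback_tw(struct io_uring_task *tctx, bool sync)
    {
        struct llist_node *node = llist_del_all(&tctx->task_list);
        struct io_ring_ctx *last_ctx = NULL;
        struct io_kiocb *req;

        while (node) {
            req = container_of(node, struct io_kiocb, io_task_work.node);
            node = node->next;
            /* a new ctx: flush the previous one before moving on */
            if (sync && last_ctx != req->ctx) {
                if (last_ctx) {
                    flush_delayed_work(&last_ctx->fallback_work);
                    percpu_ref_put(&last_ctx->refs);
                }
                last_ctx = req->ctx;
                percpu_ref_get(&last_ctx->refs);
            }
            if (llist_add(&req->io_task_work.node,
                          &req->ctx->fallback_llist))
                schedule_delayed_work(&req->ctx->fallback_work, 1);
        }
        /* everything queued up: wait for the last ctx to finish */
        if (last_ctx) {
            flush_delayed_work(&last_ctx->fallback_work);
            percpu_ref_put(&last_ctx->refs);
        }
    }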
-
- Jun 27, 2023
Jens Axboe authored
It's used just one function higher up; get rid of the forward declaration and just move the function up a bit. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
struct msghdr->msg_inq is a signed type, yet we attempt to store what is essentially an unsigned bitmask in there. We only really need to know if the field was stored or not, but let's use the proper type to avoid any misunderstandings on what is being attempted here. Link: https://lore.kernel.org/io-uring/CAHk-=wjKb24aSe6fE4zDH-eh8hr-FB9BbukObUVSMGOrsBHCRQ@mail.gmail.com/ Reported-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
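In other words: use a signed sentinel instead of wedging an unsigned mask into a signed field. A hypothetical minimal illustration (the cflags handling here is only for illustration, not the actual patch):

    struct msghdr msg;

    msg.msg_inq = -1;    /* signed sentinel: "protocol didn't fill this in" */

    /* ... receive ... */

    if (msg.msg_inq > 0)    /* a real byte count: more data is queued */
        cflags |= IORING_CQE_F_SOCK_NONEMPTY;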
-
- Jun 23, 2023
Pavel Begunkov authored
There is no reason not to use __io_cq_unlock_post_flush for intermediate aux CQE flushing; all the ->task_complete considerations apply there as well, i.e. if set, it should be the submitter task. Combine them, get rid of __io_cq_unlock_post() and rename the remaining function. This place was also taking a couple of percent of CPU, according to profiles of max-throughput net benchmarks, due to multishot recv flooding it with completions. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbed60734cbec2e833d9c7bdcf9741aada5d8aab.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it static and don't expose to other files. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
__io_cq_unlock is not very helpful, and users should be calling flush variants anyway. Open code the function. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
We do conditional locking, so __io_cq_lock() and friends don't always actually grab/release the lock; kill the misleading annotations. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
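The conditional locking is the whole point; annotations like __acquires()/__releases() claim something that is only sometimes true. Sketched:

    /* no sparse annotations: the lock is only conditionally taken */
    static inline void __io_cq_lock(struct io_ring_ctx *ctx)
    {
        /* ->task_complete: only the submitter posts CQEs, no lock needed */
        if (!ctx->task_complete)
            spin_lock(&ctx->completion_lock);
    }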
-
Pavel Begunkov authored
We're abusing the ->completion_lock helpers. io_cq_unlock() neither locks conditionally nor does CQE flushing, which means that callers must have some side reason for taking the lock and should do it directly. Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Extract a function for non-local task_work_add, and use it directly from io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL and it can be killed. As a small positive side effect we don't grab task->io_uring in io_req_normal_work_add anymore, which is not needed for io_req_local_work_add(). Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
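A sketch of the extracted helper, assuming the queueing details stay as they were; note that task->io_uring is only dereferenced here, not in the local-work variant:

    static void io_req_normal_work_add(struct io_kiocb *req)
    {
        struct io_uring_task *tctx = req->task->io_uring;
        struct io_ring_ctx *ctx = req->ctx;

        /* task_work already pending, nothing more to do */
        if (!llist_add(&req->io_task_work.node, &tctx->task_list))
            return;

        if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
            atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);

        if (likely(!task_work_add(req->task, &tctx->task_work,
                                  ctx->notify_method)))
            return;

        /* task is exiting: punt to the fallback workqueue path */
        io_fallback_tw(tctx);
    }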
-
Pavel Begunkov authored
We're trying to batch io_put_task() in io_free_batch_list(), but considering that the hot path is a simple increment, it's almost certainly faster to just do io_put_task() instead of task tracking. We don't care about io_put_task_remote(), as it's only for IOPOLL, where polling/waiting is done by a task other than the submitter. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Move io_clean_op() up in the source file and remove the forward declaration, as the function doesn't have tricky dependencies anymore. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_dismantle_req() is only used in __io_req_complete_post(), open code it there. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Request completion is a very hot path in general, but there are 3 places that can be doing it: io_free_batch_list(), io_req_complete_post() and io_free_req_tw(). io_free_req_tw() is used rather marginally and we don't care about it. Killing it can help to clean up and optimise the remaining two; do that by replacing it with io_req_task_complete(). There are two things to consider: 1) io_free_req() is called when all refs are put, so we need to reinit references. The easiest way to do that is to clear REQ_F_REFCOUNT. 2) We also don't need a CQE from it, so silence it with REQ_F_CQE_SKIP. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
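The replacement ends up looking roughly like this:

    static void io_free_req(struct io_kiocb *req)
    {
        /* all refs are put: clear REQ_F_REFCOUNT to reinit references */
        req->flags &= ~REQ_F_REFCOUNT;
        /* we only want the request freed, not a CQE posted for it */
        req->flags |= REQ_F_CQE_SKIP;
        req->io_task_work.func = io_req_task_complete;
        io_req_task_work_add(req);
    }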
-
Pavel Begunkov authored
There is only one user of io_put_req_find_next() and it doesn't make much sense to have it. Open code the function. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38b5c5e48e4adc8e6a0cd16fdd5c1531d7ff81a9.1687518903.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 21, 2023
Jens Axboe authored
Rather than assign the user pointer to msghdr->msg_control, assign it to msghdr->msg_control_user to make sparse happy. They are in a union so the end result is the same, but let's avoid new sparse warnings and squash this one. Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306210654.mDMcyMuB-lkp@intel.com/ Fixes: cac9e441 ("io_uring/net: save msghdr->msg_control for retries") Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
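The fix itself is a one-liner of this shape (variable names illustrative); both names are members of the same union in struct msghdr, so the generated code is identical:

    /* sr->msg_control holds a user pointer: use the __user union member */
    kmsg->msg.msg_control_user = sr->msg_control;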
-
Jens Axboe authored
We cannot sanely handle partial retries for recvmsg if we have cmsg attached. If we don't bail on retry, we'd just be overwriting the initial cmsg header on retries. Alternatively we could increment and handle this appropriately, but it doesn't seem worth the complication. Move the MSG_WAITALL check into the non-multishot case while at it, since MSG_WAITALL is explicitly disabled for multishot anyway. Link: https://lore.kernel.org/io-uring/0b0d4411-c8fd-4272-770b-e030af6919a0@kernel.dk/ Cc: stable@vger.kernel.org # 5.10+ Reported-by:
Stefan Metzmacher <metze@samba.org> Reviewed-by:
Stefan Metzmacher <metze@samba.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we have cmsg attached AND we have transferred at least some data, clear msg_controllen on retry so we don't attempt to send it again. Cc: stable@vger.kernel.org # 5.10+ Fixes: cac9e441 ("io_uring/net: save msghdr->msg_control for retries") Reported-by:
Stefan Metzmacher <metze@samba.org> Reviewed-by:
Stefan Metzmacher <metze@samba.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
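In the sendmsg retry path, that comes out roughly as:

    if (ret > 0 && io_net_retry(sock, flags)) {
        /* cmsg was already transferred, don't send it again on retry */
        kmsg->msg.msg_controllen = 0;
        kmsg->msg.msg_control = NULL;
        sr->done_io += ret;
        req->flags |= REQ_F_PARTIAL_IO;
        return io_setup_async_msg(req, kmsg, issue_flags);
    }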
-
- Jun 20, 2023
Christoph Hellwig authored
Remove all the open coded magic on slot->file_ptr by introducing two helpers that return the file pointer and the flags instead. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
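The helpers take roughly this shape, where FFS_* are the flag bits stashed in the low bits of file_ptr (the real io_slot_flags() may additionally translate them into request-flag form):

    static inline struct file *io_slot_file(struct io_fixed_file *slot)
    {
        /* strip the stashed flag bits to recover the file pointer */
        return (struct file *)(slot->file_ptr & FFS_MASK);
    }

    static inline unsigned int io_slot_flags(struct io_fixed_file *slot)
    {
        return slot->file_ptr & ~FFS_MASK;
    }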
-
Christoph Hellwig authored
Use io_file_from_index instead of open coding it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-8-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Use io_file_from_index instead of open coding it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-7-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Two of the three callers want them, so return the more usual format, and shift into the FFS_ form only for the fixed file table. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Just checking the flag directly makes it a lot more obvious what is going on here. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The SCM inflight mechanism has nothing to do with the fact that a file might be a regular file or not and if it supports non-blocking operations. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-4-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The variable is only used once now, so don't bother with it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-3-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that this only checks O_NONBLOCK and FMODE_NOWAIT, the helper is complete overkill, and the comments are confusing, bordering on wrong. Just inline the check into the caller. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-2-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
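What remains after inlining is just a two-flag test; roughly:

    unsigned int io_file_get_flags(struct file *file)
    {
        unsigned int res = 0;

        if (S_ISREG(file_inode(file)->i_mode))
            res |= REQ_F_ISREG;
        /* nowait is fine if the file opted in, or was opened O_NONBLOCK */
        if ((file->f_flags & O_NONBLOCK) || (file->f_mode & FMODE_NOWAIT))
            res |= REQ_F_SUPPORT_NOWAIT;
        return res;
    }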
-
- Jun 18, 2023
Jens Axboe authored
We selectively grab the ctx->uring_lock for poll update/removal, but we really should grab it from the start to fully synchronize with linked timeouts. Normally this is indeed the case, but if requests are forced async by the application, we don't fully cover removal and timer disarm within the uring_lock. Make this simpler by having consistent locking state for poll removal. Cc: stable@vger.kernel.org # 6.1+ Reported-by:
Querijn Voet <querijnqyn@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 14, 2023
Jens Axboe authored
A recent fix stopped clearing PF_IO_WORKER from current->flags on exit, which meant that we could now call the inc/dec running hooks on the worker after it has been removed, if it ends up scheduling in/out as part of exit. If this happens after an RCU grace period has passed, then the struct pointed to by current->worker_private may have been freed, and we can now be accessing memory that is freed. Ensure this doesn't happen by clearing the task's worker_private field. Both io_wq_worker_running() and io_wq_worker_sleeping() check this field before going any further, and we don't need any accounting etc. done after this worker has exited. Fixes: fd37b884 ("io_uring/io-wq: don't clear PF_IO_WORKER on exit") Reported-by:
Zorro Lang <zlang@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
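Sketched, both halves of the reasoning: the sched-in/out hooks already bail on a NULL pointer, so the exit path only has to clear it:

    void io_wq_worker_running(struct task_struct *tsk)
    {
        struct io_worker *worker = tsk->worker_private;

        if (!worker)    /* exited worker: no accounting left to do */
            return;
        /* ... running-count accounting ... */
    }

    /* in the worker exit path, after the final accounting: */
    current->worker_private = NULL;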
-
Jens Axboe authored
If the application sets ->msg_control and we have to later retry this command, or if it got queued with IOSQE_ASYNC to begin with, then we need to retain the original msg_control value. This is due to the net stack overwriting this field with an in-kernel pointer, to copy it in. Hitting that path for the second time will now fail the copy from user, as it's attempting to copy from a non-user address. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/880 Reported-and-tested-by:
Marek Majkowski <marek@cloudflare.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
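The gist, sketched: stash the user pointer at prep time, since the net stack will overwrite the msghdr copy with its in-kernel pointer:

    ret = sendmsg_copy_msghdr(&iomsg->msg, sr->umsg, sr->msg_flags,
                              &iomsg->free_iov);
    /* save msg_control as the net stack overwrites it during copy-in */
    sr->msg_control = iomsg->msg.msg_control_user;

On retry, the saved value is put back into the msghdr before issuing again.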
-
- Jun 12, 2023
Jens Axboe authored
A recent commit gated the core dumping task exit logic on current->flags remaining consistent in terms of PF_{IO,USER}_WORKER at task exit time. This exposed a problem with the io-wq handling of that, which explicitly clears PF_IO_WORKER before calling do_exit(). The reason for this manual clearing of PF_IO_WORKER is historical: io-wq used to potentially trigger a sleep on exit, and as the io-wq thread was exiting, it should not participate in any further accounting. But these days we don't need to rely on current->flags anymore, so we can safely remove the PF_IO_WORKER clearing. Reported-by:
Zorro Lang <zlang@redhat.com> Reported-by:
Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/all/ZIZSPyzReZkGBEFy@dread.disaster.area/ Fixes: f9010dbd ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jens Axboe authored
When the ring exits, cleanup is done and the final cancelation and waiting on completions is done by io_ring_exit_work(). That function is invoked by a kworker, which doesn't take any signals. Because of that, it doesn't really matter if we wait for completions in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. However, it does matter to the hung task detection checker! Normally we expect cancelations and completions to happen rather quickly. Some test cases, however, will exit the ring and leave the owning task stopped (e.g. via SIGSTOP). If the owning task needs to run task_work to complete requests, then io_ring_exit_work won't make any progress until the task is runnable again. Hence io_ring_exit_work can trigger the hung task detection, which is particularly problematic if panic-on-hung-task is enabled. As the ring exit doesn't take signals to begin with, have it wait interruptibly rather than uninterruptibly. io_uring has a separate stuck-exit warning that triggers independently anyway, so we're not really missing anything by making this switch. Cc: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk Signed-off-by:
Jens Axboe <axboe@kernel.dk>
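The switch is confined to the final wait in io_ring_exit_work(); sketched:

    /*
     * kworkers don't take signals, so INTERRUPTIBLE vs UNINTERRUPTIBLE
     * makes no functional difference here -- but only uninterruptible
     * sleeps feed the hung task detector.
     */
    do {
        /* ... cancel requests, run fallback task_work ... */
    } while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp,
                                                        interval));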
-
Amir Goldstein authored
The fsnotify_open() hook is called only from high-level system call context and not for the very many helpers used to open files. This may make sense for many of the special file open cases, but it is inconsistent with the fsnotify_close() hook, which is called on every last fput() of a file object with FMODE_OPENED. As a result, it is possible to observe ACCESS, MODIFY and CLOSE events without ever observing an OPEN event. Fix this inconsistency by replacing all the fsnotify_open() hooks with a single hook inside do_dentry_open(). If there are special cases that would like to opt out of the possible overhead of the fsnotify() call in fsnotify_open(), they would probably also want to avoid the overhead of the fsnotify() call in the rest of the fsnotify hooks, so they should be opening that file with the __FMODE_NONOTIFY flag. However, in the majority of those cases, the s_fsnotify_connectors optimization in fsnotify_parent() would be sufficient to avoid the overhead of the fsnotify() call anyway. Signed-off-by:
Amir Goldstein <amir73il@gmail.com> Signed-off-by:
Jan Kara <jack@suse.cz> Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
-
- Jun 09, 2023
Lorenzo Stoakes authored
We are now in a position where no caller of pin_user_pages() requires the vmas parameter at all, so eliminate this parameter from the function and all callers. This clears the way to removing the vmas parameter from GUP altogether. Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com Signed-off-by:
Lorenzo Stoakes <lstoakes@gmail.com> Acked-by:
David Hildenbrand <david@redhat.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> [qib] Reviewed-by:
Christoph Hellwig <hch@lst.de> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> [drivers/media] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Lorenzo Stoakes authored
Now that the GUP explicitly checks FOLL_LONGTERM pin_user_pages() for broken file-backed mappings in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings", there is no need to explicitly check VMAs for this condition, so simply remove this logic from io_uring altogether. Link: https://lkml.kernel.org/r/e4a4efbda9cd12df71e0ed81796dc630231a1ef2.1684350871.git.lstoakes@gmail.com Signed-off-by:
Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jens Axboe <axboe@kernel.dk> Reviewed-by:
David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jun 07, 2023
Jens Axboe authored
Just use the ARRAY_SIZE directly, we don't use length for anything else in this function. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Everybody is passing in the request, so get rid of the io_ring_ctx and explicit user_data pass-in. Both the ctx and user_data can be deduced from the request at hand. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 02, 2023
Jens Axboe authored
We use task_work for a variety of reasons, but doing completions or triggering retry after poll are by far the hottest two. Use the indirect function call wrappers to avoid the indirect function call if CONFIG_RETPOLINE is set. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
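With the wrappers from <linux/indirect_call_wrapper.h>, the dispatch in the task_work run loop becomes, roughly (handler names per the two hot cases named above):

    #include <linux/indirect_call_wrapper.h>

    /*
     * Compare against the two hottest handlers first; with retpolines
     * enabled this turns the common cases into direct calls.
     */
    INDIRECT_CALL_2(req->io_task_work.func,
                    io_poll_task_func, io_req_rw_complete,
                    req, ts);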
-
- May 27, 2023
Ben Noordhuis authored
Libuv recently started using it so there is at least one consumer now. Cc: stable@vger.kernel.org Fixes: 61a2732a ("io_uring: deprecate epoll_ctl support") Link: https://github.com/libuv/libuv/pull/3979 Signed-off-by:
Ben Noordhuis <info@bnoordhuis.nl> Link: https://lore.kernel.org/r/20230506095502.13401-1-info@bnoordhuis.nl Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 25, 2023
Wenwen Chen authored
The sq thread actively releases CPU resources by calling the cond_resched() and schedule() interfaces when it is idle, so that more resources are available for other threads to run. There is a problem, however: the sq thread does not unlock sqd->lock before releasing the CPU, which can leave other threads pending on sqd->lock for a long time. For example, the following interfaces all require sqd->lock: io_sq_offload_create(), io_register_iowq_max_workers() and io_ring_exit_work(). Unlocking sqd->lock before the sq thread gives up the CPU provides a better user experience, because the kernel can respond quickly to user requests. Signed-off-by:
Kanchan Joshi <joshi.k@samsung.com> Signed-off-by:
Wenwen Chen <wenwen.chen@samsung.com> Link: https://lore.kernel.org/r/20230525082626.577862-1-wenwen.chen@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
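The idle path in io_sq_thread() then drops the lock around the reschedule, along these lines:

    if (unlikely(need_resched())) {
        /*
         * Don't hold sqd->lock while giving up the CPU: waiters such
         * as io_ring_exit_work() can then make progress meanwhile.
         */
        mutex_unlock(&sqd->lock);
        cond_resched();
        mutex_lock(&sqd->lock);
    }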
-
Pavel Begunkov authored
We want to use IOU_F_TWQ_LAZY_WAKE in commands. First, introduce a new cmd tw helper accepting TWQ flags, and then add io_uring_cmd_do_in_task_lazy(), which passes IOU_F_TWQ_LAZY_WAKE and implies the "lazy" semantics, i.e. it posts no more than one CQE, and delaying execution of this tw should not prevent forward progress. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5b9f6716006df7e817f18bd555aee2f8f9c8b0c3.1684154817.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
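The two helpers end up layered roughly like so:

    static void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
                void (*task_work_cb)(struct io_uring_cmd *, unsigned),
                unsigned flags)
    {
        struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);

        ioucmd->task_work_cb = task_work_cb;
        req->io_task_work.func = io_uring_cmd_work;
        __io_req_task_work_add(req, flags);
    }

    void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
                void (*task_work_cb)(struct io_uring_cmd *, unsigned))
    {
        /* at most one CQE from this tw, so a delayed wake can't stall it */
        __io_uring_cmd_do_in_task(ioucmd, task_work_cb, IOU_F_TWQ_LAZY_WAKE);
    }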
-