- Oct 02, 2024
-
-
Al Viro authored
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
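After the move, users simply include the new header; a minimal usage sketch (illustrative, not part of the commit):

	#include <linux/unaligned.h>

	/* Read a little-endian u32 from a possibly unaligned buffer. */
	static u32 read_le32(const void *p)
	{
		return get_unaligned_le32(p);
	}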
-
Dan Carpenter authored
These are called from blkcg_print_blkgs() which already disables IRQs, so disabling them again is wrong. It means that IRQs will be enabled slightly earlier than intended; however, as far as I can see, this bug is harmless. Fixes: 35198e32 ("blk-iocost: read params inside lock in sysfs apis") Signed-off-by:
Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/Zv0kudA9xyGdaA4g@stanley.mountain Signed-off-by:
Jens Axboe <axboe@kernel.dk>
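To illustrate the bug class (a generic sketch, not the blk-iocost code itself): nesting the _irq lock variants means the inner unlock re-enables interrupts while the outer critical section is still running.

	static DEFINE_SPINLOCK(outer);
	static DEFINE_SPINLOCK(inner);

	static void broken_nesting(void)
	{
		spin_lock_irq(&outer);		/* IRQs disabled here */
		spin_lock_irq(&inner);		/* redundant: already disabled */
		spin_unlock_irq(&inner);	/* IRQs re-enabled too early */
		/* this code now runs with IRQs on, still under 'outer' */
		spin_unlock_irq(&outer);
	}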
-
Keith Busch authored
Fix the documentation to match the new function signature. Fixes: 76c313f6 ("blk-integrity: improved sg segment mapping") Signed-off-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240922141800.3622319-1-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 20, 2024
-
-
Dr. David Alan Gilbert authored
blk_limits_io_min and blk_limits_io_opt are unused since the recent commit 0a94a469 ("dm: stop using blk_limits_io_{min,opt}"). Remove them. Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240920004817.676216-1-linux@treblig.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 17, 2024
-
-
Damien Le Moal authored
Commit 734e1a86 ("block: Prevent deadlocks when switching elevators") introduced the function elv_iosched_load_module() to allow loading an elevator module outside of elv_iosched_store() with the target device queue not frozen, to avoid deadlocks. However, the "none" scheduler does not have a module and as a result, elv_iosched_load_module() always returns an error when trying to switch to this valid scheduler. Fix this by ignoring the return value of the request_module() call done by elv_iosched_load_module(). This restores the behavior before commit 734e1a86, which was to ignore the request_module() result and instead rely on elevator_change() to handle the "none" scheduler case. Reported-by:
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 734e1a86 ("block: Prevent deadlocks when switching elevators") Cc: stable@vger.kernel.org Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240917133231.134806-1-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
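The fix amounts to discarding the request_module() result; a sketch under the assumption that the helper ends up looking roughly like this (the exact signature in the tree may differ):

	/* Best-effort module load: "none" has no module, so failures are
	 * ignored and elevator_change() does the real validation later. */
	void elv_iosched_load_module(struct gendisk *disk, const char *buf,
				     size_t count)
	{
		char elevator_name[ELV_NAME_MAX];

		strscpy(elevator_name, buf, sizeof(elevator_name));
		request_module("%s-iosched", strstrip(elevator_name));
	}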
-
NeilBrown authored
bd_prepare_to_claim() waits for a var to change, not for a bit to be cleared. Change from bit_waitqueue() to __var_waitqueue() and correspondingly use wake_up_var(). This will allow a future patch which changes the "bit" functions to expect an "unsigned long *" instead of "void *". Signed-off-by:
NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20240826063659.15327-2-neilb@suse.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
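The generic wait/wake pairing on a plain variable looks like this (an illustrative sketch, not the bd_prepare_to_claim() code itself, which uses the lower-level __var_waitqueue() directly):

	static unsigned long claim_var;

	static void waiter(void)
	{
		/* sleep until the variable changes, rechecking the condition */
		wait_var_event(&claim_var, READ_ONCE(claim_var) == 0);
	}

	static void waker(void)
	{
		WRITE_ONCE(claim_var, 0);
		wake_up_var(&claim_var);	/* pairs with wait_var_event() */
	}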
-
- Sep 13, 2024
-
-
Keith Busch authored
Make the integrity mapping more like data mapping, blk_rq_map_sg. Use the request to validate the segment count, and update the callers so they don't have to. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913191746.2628196-1-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
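A hedged usage sketch, assuming the post-change signature described by the commit (request in, scatterlist out, segment count validated internally):

	static int driver_map_meta(struct request *rq,
				   struct scatterlist *meta_sgl)
	{
		/* callers no longer validate the segment count themselves;
		 * the request carries nr_integrity_segments for that */
		return blk_rq_map_integrity_sg(rq, meta_sgl);
	}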
-
Keith Busch authored
There are no external users of this. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-9-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
Provide a helper to keep the request flags and nr_integrity_segments in sync with the bio's integrity payload. This is an integrity equivalent to the normal data helper function, 'blk_rq_map_user()'. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Kanchan Joshi <joshi.k@samsung.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-6-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
If a bio is merged to a request, the entire bio list is merged, so don't temporarily detach it from its list when counting segments. In most cases, bi_next will already be NULL, so detaching is usually a no-op. But if the bio does have a list, the current code is miscounting the segments for the resulting merge. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-5-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
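In sketch form (simplified from the description, not the exact diff): the old code unlinked the bio while counting, so a bio with a populated bi_next list was undercounted, because blk_rq_count_integrity_sg() walks the whole chain via for_each_bio().

	/* before: detach the bio, count only this bio's segments */
	struct bio *next = bio->bi_next;

	bio->bi_next = NULL;
	nr_segs = blk_rq_count_integrity_sg(q, bio);
	bio->bi_next = next;

	/* after: the entire list is merged anyway, so count it as-is */
	nr_segs = blk_rq_count_integrity_sg(q, bio);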
-
Keith Busch authored
Both types of merging when integrity data is used are miscounting the segments: Merging two requests wasn't accounting for the new segment count, so add the "next" segment count to the first on a successful merge to ensure this value is accurate. Merging a bio into an existing request was double counting the bio's segments, even if the merge failed later on. Move the segment accounting to the end when the merge is successful. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-4-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
This value is used for merging considerations, so it needs to be accurate. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-3-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
Always defining the field will make using it easier and less error prone in future patches. There shouldn't be any downside to this: the field fits in what would otherwise be a 2-byte hole, so we're not saving space by conditionally leaving it out. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240913182854.2445457-2-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
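Sketch of the effect (abridged struct; the neighboring field is an assumption inferred from the "2-byte hole" remark):

	struct request {
		/* ... */
		unsigned short nr_phys_segments;
		unsigned short nr_integrity_segments;	/* now unconditional,
							 * fills a 2-byte hole */
		/* ... */
	};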
-
- Sep 12, 2024
-
-
Riyan Dhiman authored
The blk_add_partition() function initially used a single if-condition (IS_ERR(part)) to check for errors when adding a partition. This was modified to handle the specific case of -ENXIO separately, allowing the function to proceed without logging the error in this case. However, this change unintentionally left a path where md_autodetect_dev() could be called without confirming that part is a valid pointer. This commit separates the error handling logic by splitting the initial if-condition, improving code readability and handling specific error scenarios explicitly. The function now distinguishes the general error case from -ENXIO without altering the existing behavior of md_autodetect_dev() calls. Fixes: b7205307 ("block: allow partitions on host aware zone devices") Signed-off-by:
Riyan Dhiman <riyandhiman14@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240911132954.5874-1-riyandhiman14@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
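A condensed sketch of the resulting control flow (simplified; the exact bodies in block/partitions/core.c may differ):

	part = add_partition(disk, p, from, size, state->parts[p].flags);
	if (IS_ERR(part)) {
		if (PTR_ERR(part) == -ENXIO)
			return true;	/* past EOD: skip quietly */
		printk(KERN_ERR " %s: p%d could not be added: %pe\n",
		       disk->disk_name, p, part);
		return true;
	}
	/* 'part' is known valid before md_autodetect_dev() can run */
	if (IS_ENABLED(CONFIG_BLK_DEV_MD) &&
	    (state->parts[p].flags & ADDPART_FLAG_RAID))
		md_autodetect_dev(part->bd_dev);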
-
- Sep 11, 2024
-
-
Colin Ian King authored
The static array vrate_adj_pct is read-only, so make it const as well. Signed-off-by:
Colin Ian King <colin.i.king@gmail.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240911214124.197403-1-colin.i.king@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
io_uring allows implementing custom file specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, we'd retry from a blocking context by returning -EAGAIN to the core io_uring. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user. Unlike ioctl(BLKDISCARD), which has stronger guarantees against races, we only do a best effort attempt to invalidate the page cache, and it can race with any writes and reads and leave the page cache stale. It's the same kind of race we allow for direct writes. Also, apart from cases where discarding is not allowed at all, e.g. discards are not supported or the file/device is read only, the user should assume that the sector range on disk is not valid anymore, even when an error was returned to the user. Suggested-by:
Conrad Meyer <conradmeyer@meta.com> Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
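A rough shape of the issue path described above; the helper name is hypothetical, and only the -EAGAIN handoff pattern is taken from the commit text:

	static int blkdev_uring_cmd_discard(struct io_uring_cmd *cmd,
					    unsigned int issue_flags)
	{
		bool nowait = issue_flags & IO_URING_F_NONBLOCK;
		/* hypothetical helper: queue discard bios, optionally nowait */
		int ret = blkdev_queue_discard_bios(cmd, nowait);

		if (ret == -EAGAIN && nowait)
			return -EAGAIN;	/* core io_uring retries us from a
					 * blocking context */
		/* completion is posted later from the bi_end_io callback */
		return ret;
	}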
-
Pavel Begunkov authored
In preparation to further changes extract a helper function out of blk_ioctl_discard() that validates if we can do IO against the given range of disk byte addresses. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19a7779323c71e742a2f511e4cf49efcfd68cfd4.1726072086.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
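A sketch of such a helper (the name and exact checks are assumptions; the commit only says it validates a range of disk byte addresses):

	/* Validate an IO byte range against the device; illustrative only. */
	static int blk_validate_byte_range(struct block_device *bdev,
					   uint64_t start, uint64_t len)
	{
		unsigned int bs_mask = bdev_logical_block_size(bdev) - 1;
		uint64_t end;

		if ((start | len) & bs_mask)
			return -EINVAL;		/* not block aligned */
		if (check_add_overflow(start, len, &end) ||
		    end > bdev_nr_bytes(bdev))
			return -EINVAL;		/* overflow or past EOD */
		return 0;
	}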
-
Kundan Kumar authored
Use newly added mm function unpin_user_folio() to put refs by npages count. Signed-off-by:
Kundan Kumar <kundan.kumar@samsung.com> Tested-by:
Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240911064935.5630-5-kundan.kumar@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Kundan Kumar authored
Add a bigger size from folio to bio and skip merge processing for pages. Fetch the offset of the page within a folio. Depending on the size of the folio and the folio_offset, fetch a larger length. This length may consist of multiple contiguous pages if the folio is multiorder. Using the length, calculate the number of pages which will be added to the bio and increment the loop counter to skip those pages. This technique helps to avoid the overhead of merging pages which belong to the same large-order folio. Also folio-ize the functions bio_iov_add_page() and bio_iov_add_zone_append_page(). Signed-off-by:
Kundan Kumar <kundan.kumar@samsung.com> Tested-by:
Luis Chamberlain <mcgrof@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240911064935.5630-3-kundan.kumar@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
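Roughly, the per-folio step looks like this (a simplified sketch; the function and variable names are illustrative, not the patch's own):

	/* How much of 'folio' can be added in one go, starting from 'page'
	 * at byte 'offset' within that page, capped by 'remaining'. */
	static size_t folio_chunk(struct folio *folio, struct page *page,
				  size_t offset, size_t remaining,
				  unsigned int *nr_pages)
	{
		size_t folio_offset = (folio_page_idx(folio, page)
					<< PAGE_SHIFT) + offset;
		size_t len = min_t(size_t, folio_size(folio) - folio_offset,
				   remaining);

		/* pages consumed in one shot; the loop skips ahead by this */
		*nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
		return len;
	}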
-
Kundan Kumar authored
Add a new bio_add_hw_folio() function as a wrapper around bio_add_hw_page(). This is a prep patch. Signed-off-by:
Kundan Kumar <kundan.kumar@samsung.com> Tested-by:
Luis Chamberlain <mcgrof@kernel.org> Reviewed-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240911064935.5630-2-kundan.kumar@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 10, 2024
-
-
Yu Kuai authored
Make the code cleaner; there are no functional changes. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-8-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Now that 'bfqq_already_existing' is only used in one branch, it can be removed. There are no functional changes. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-7-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
The local variable is used to call bfq_bfqq_resume_state() later. Since 'bfqd->lock' is held and the bfqq status will not change between setting 'split' and calling bfq_bfqq_resume_state(), move bfq_bfqq_resume_state() forward so that 'split' can be removed. There are no functional changes. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
It's not used, hence can be removed. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-5-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Because bfq_put_cooperator() is always followed by bfq_release_process_ref(). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Original state:

     Process 1       Process 2       Process 3       Process 4
      (BIC1)          (BIC2)          (BIC3)          (BIC4)
        Λ               |               |               |
        \--------------\ \-------------\ \-------------\|
                        V               V               V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
 ref    0               1               2               4

After commit 0e456dba ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()"), if P1 issues a new IO:

Without the patch:

     Process 1       Process 2       Process 3       Process 4
      (BIC1)          (BIC2)          (BIC3)          (BIC4)
        Λ               |               |               |
        \------------------------------\ \-------------\|
                                        V               V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
 ref    0               0               2               4

bfqq3 will be used to handle IO from P1; this is not expected, IO should be redirected to bfqq4.

With the patch:

       -------------------------------------------
       |                                         |
     Process 1       Process 2       Process 3   |   Process 4
      (BIC1)          (BIC2)          (BIC3)     |    (BIC4)
                        |               |        |      |
                        \-------------\ \--------------\|
                                       V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
 ref    0               0               2               4

IO is redirected to bfqq4; however, the process reference of bfqq3 is still 2, while only P2 is using it. Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge chain. Also change bfqq_merge_bfqqs() to return the new_bfqq to simplify the code. Fixes: 0e456dba ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
After commit 42c306ed ("block, bfq: don't break merge chain in bfq_split_bfqq()"), if the current process is the last holder of bfqq, the bfqq can be freed after bfq_split_bfqq(). Hence recording the bfqq and then accessing bfqq->waker_bfqq may trigger a use-after-free. What's more, the waker_bfqq may be in the merge chain of bfqq, hence just recording waker_bfqq is still not safe. Fix the problem by adding a helper bfq_waker_bfqq() to check if bfqq->waker_bfqq is in the merge chain, and if the current process is the only holder. Fixes: 42c306ed ("block, bfq: don't break merge chain in bfq_split_bfqq()") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Currently, blk-throttle handles all IO fifo, hence if data IO is throttled and then meta IO is dispatched, the meta IO will have to wait for the data IO, causing priority inversion problems. This patch supports handling metadata first and then paying back the debt while throttling data.

Test script: use cgroup v1 to throttle the root cgroup, then create a new dir and file while writeback is throttled:

test() {
  mkdir /mnt/test/xxx
  touch /mnt/test/xxx/1
  sync /mnt/test/xxx
  sync /mnt/test/xxx
}

mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0
mount /dev/nvme0n1 /mnt/test

echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none &
sleep 4

time test
echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device

sleep 1
umount /dev/nvme0n1

Test result: time cost for creating new dir and file
before this patch: 14s
after this patch:  0.1s

Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
last_low_overflow_time is not used anymore after commit bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW"). Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240903135149.271857-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Commit af281414 ("block: freeze the queue in queue_attr_store") changed queue_attr_store() to always freeze a sysfs attribute's queue before calling the attribute store() method, to ensure that no IOs are in-flight when an attribute value is being updated. However, this change created a potential deadlock for the scheduler queue attribute, as changing the queue elevator with elv_iosched_store() can result in a call to request_module() if the requested module is not already registered. If the file of the requested module is stored on the block device of the frozen queue, a deadlock will happen, as the read operations triggered by request_module() will wait for the queue freeze to end. Solve this issue by introducing the load_module method in struct queue_sysfs_entry, and calling this method in queue_attr_store() before freezing the attribute's queue. The macro definition QUEUE_RW_LOAD_MODULE_ENTRY() is added to define a queue sysfs attribute that needs to load a module. The definition of the scheduler attribute is changed to use QUEUE_RW_LOAD_MODULE_ENTRY(), with the function elv_iosched_load_module() defined as the load_module method. elv_iosched_store() can then be simplified to remove the call to request_module(). Reported-by:
Richard W.M. Jones <rjones@redhat.com> Reported-by:
Jiri Jaburek <jjaburek@redhat.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219166 Fixes: af281414 ("block: freeze the queue in queue_attr_store") Cc: stable@vger.kernel.org Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Tested-by:
Richard W.M. Jones <rjones@redhat.com> Link: https://lore.kernel.org/r/20240908000704.414538-1-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
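In sketch form (abridged; only the load_module member and its pre-freeze call site are taken from the commit text, the show/store signatures are assumptions):

	struct queue_sysfs_entry {
		struct attribute attr;
		ssize_t (*show)(struct gendisk *disk, char *page);
		ssize_t (*store)(struct gendisk *disk, const char *page,
				 size_t count);
		/* called before the queue is frozen, so loading the module
		 * can still do IO against the device being configured */
		void (*load_module)(struct gendisk *disk, const char *page,
				    size_t count);
	};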
-
- Sep 07, 2024
-
-
Keith Busch authored
The single-queue optimized list flush doesn't have an unplug trace event to pair with the plug event. Add one. In the unlikely event an error occurs and the code falls back to the less optimized plug flush path, it's possible a 2nd unplug trace event will be logged, but it will show the remaining count that wasn't previously handled. Signed-off-by:
Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240906194540.3719642-1-kbusch@meta.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 04, 2024
-
-
Alexey Dobriyan authored
I independently rediscovered commit 22d24a54 ("block: fix overflow in blk_ioctl_discard()") but for secure erase. Same problem:

	uint64_t r[2] = {512, 18446744073709551104ULL};
	ioctl(fd, BLKSECDISCARD, r);

will enter a near-infinite loop inside blkdev_issue_secure_erase():

	a.out: attempt to access beyond end of device
	loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048
	bio_check_eod: 3286214 callbacks suppressed

Signed-off-by:
Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Alvaro Parker authored
The explanatory comment used `set_task_state` instead of `set_current_state` which is the function actually used in the code. Signed-off-by:
Alvaro Parker <alparkerdf@gmail.com> Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Mikulas Patocka authored
bio_integrity_add_page restricts the size of the integrity metadata to queue_max_hw_sectors(q). This restriction is not needed because oversized bios are split automatically. This restriction causes problems with dm-integrity 'inline' mode - if we send a large bio to dm-integrity and the bio's metadata are larger than queue_max_hw_sectors(q), bio_integrity_add_page fails and the bio is ended with BLK_STS_RESOURCE error. An example that triggers it:

	dd: error writing '/dev/mapper/in2': Cannot allocate memory
	1+0 records in
	0+0 records out
	0 bytes copied, 0.00169291 s, 0.0 kB/s

Signed-off-by:
Mikulas Patocka <mpatocka@redhat.com> Fixes: fb098768 ("dm-integrity: introduce the Inline mode") Fixes: 0ece1d64 ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Reviewed-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/e41b3b8e-16c2-70cb-97cb-881234bb200d@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
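The dropped check, roughly (simplified from the description; the exact expression in bio-integrity.c may differ):

	/* before: refuse integrity payloads past the hw sector limit */
	if (((bip->bip_iter.bi_size + len) >> SECTOR_SHIFT) >
	    queue_max_hw_sectors(q))
		return 0;
	/* after: no cap here; oversized bios are split further down
	 * the stack anyway */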
-
- Sep 03, 2024
-
-
Yu Kuai authored
Instead of open coding it, there are no functional changes. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Consider the following scenario:

     Process 1       Process 2       Process 3       Process 4
      (BIC1)          (BIC2)          (BIC3)          (BIC4)
        Λ               |               |               |
        \-------------\ \-------------\ \--------------\|
                       V               V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
 ref    0               1               2               4

If Process 1 issues a new IO and bfqq2 is found, bfq_init_rq() then decides to split bfqq2 by bfq_split_bfqq(). However, the process reference of bfqq2 is 1 and bfq_split_bfqq() just clears the coop flag, which will break the merge chain. Expected result: the caller will allocate a new bfqq for BIC1:

     Process 1       Process 2       Process 3       Process 4
      (BIC1)          (BIC2)          (BIC3)          (BIC4)
                        |               |               |
                        \-------------\ \--------------\|
                                       V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
 ref    0               0               1               3

Since the condition is only valid for the last bfqq4, when the previous bfqq2 and bfqq3 are already split, fix the problem by checking if bfqq is the last one in the merge chain as well. Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Consider the following merge chain:

     Process 1       Process 2       Process 3       Process 4
      (BIC1)          (BIC2)          (BIC3)          (BIC4)
        Λ               |               |               |
        \--------------\ \-------------\ \-------------\|
                        V               V               V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4

IO from Process 1 will get bfqq2 from BIC1 first, then bfq_setup_cooperator() will find that bfqq2 is already merged to bfqq3 and handle this IO from bfqq3. However, the merge chain can be much deeper and bfqq3 can be merged to another bfqq as well. Fix this problem by iterating to the last bfqq in bfq_setup_cooperator(). Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
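The fix boils down to a walk along the new_bfqq links; a minimal sketch of the pattern (not the verbatim diff):

	/* Follow the merge chain to its final bfqq before using it. */
	while (new_bfqq->new_bfqq)
		new_bfqq = new_bfqq->new_bfqq;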
-
Yu Kuai authored
1) initial state, three tasks:

	Process 1       Process 2       Process 3
	 (BIC1)          (BIC2)          (BIC3)
	  | Λ             | Λ             | Λ
	  | |             | |             | |
	  V |             V |             V |
	 bfqq1           bfqq2           bfqq3
process ref:  1               1               1

2) bfqq1 merged to bfqq2:

	Process 1       Process 2       Process 3
	 (BIC1)          (BIC2)          (BIC3)
	  |               | Λ             | Λ
	  \--------------\| |             | |
	                  V |             V |
	 bfqq1--------->bfqq2            bfqq3
process ref:  0               2               1

3) bfqq2 merged to bfqq3:

	Process 1       Process 2       Process 3
	 (BIC1)          (BIC2)          (BIC3)
	here -> Λ          |               |
	  \--------------\ \-------------\|
	                  V               V
	 bfqq1--------->bfqq2---------->bfqq3
process ref:  0               1               3

In this case, IO from Process 1 will get bfqq2 from BIC1 first, then get bfqq3 through the merge chain, and finally handle IO by bfqq3. However, the current code will think bfqq2 is owned by BIC1, like the initial state, and set bfqq2->bic to BIC1:

bfq_insert_request
-> by Process 1
 bfqq = bfq_init_rq(rq)
  bfqq = bfq_get_bfqq_handle_split
   bfqq = bic_to_bfqq
   -> get bfqq2 from BIC1
  bfqq->ref++
  rq->elv.priv[0] = bic
  rq->elv.priv[1] = bfqq
  if (bfqq_process_refs(bfqq) == 1)
   bfqq->bic = bic
   -> record BIC1 to bfqq2

 __bfq_insert_request
  new_bfqq = bfq_setup_cooperator
  -> get bfqq3 from bfqq2->new_bfqq
  bfqq_request_freed(bfqq)
  new_bfqq->ref++
  rq->elv.priv[1] = new_bfqq
  -> handle IO by bfqq3

Fix the problem by checking if bfqq is from the merge chain first. And this might fix the following problem reported by our syzkaller (unreproducible):

==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595

CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:364 [inline]
 print_report+0x10d/0x610 mm/kasan/report.c:475
 kasan_report+0x8e/0xc0 mm/kasan/report.c:588
 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
 bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
 bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
 bfq_init_rq block/bfq-iosched.c:6876 [inline]
 bfq_insert_request block/bfq-iosched.c:6254 [inline]
 bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
 </TASK>

Allocated by task 20776:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
 kasan_slab_alloc include/linux/kasan.h:188 [inline]
 slab_post_alloc_hook mm/slab.h:763 [inline]
 slab_alloc_node mm/slub.c:3458 [inline]
 kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
 ioc_create_icq block/blk-ioc.c:370 [inline]
 ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
 blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
 blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
 __submit_bio+0x331/0x6f0 block/blk-core.c:624
 __submit_bio_noacct_mq block/blk-core.c:703 [inline]
 submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Freed by task 946:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
 ____kasan_slab_free mm/kasan/common.c:236 [inline]
 __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
 kasan_slab_free include/linux/kasan.h:164 [inline]
 slab_free_hook mm/slub.c:1815 [inline]
 slab_free_freelist_hook mm/slub.c:1841 [inline]
 slab_free mm/slub.c:3786 [inline]
 kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553

Last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
 __call_rcu_common kernel/rcu/tree.c:2712 [inline]
 call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Second to last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
 __call_rcu_common kernel/rcu/tree.c:2712 [inline]
 call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

The buggy address belongs to the object at ffff888123839d68 which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)

The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
 ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Josef Bacik authored
In order to switch fuse over to using iomap for buffered writes we need to be able to have the struct file for the original write, in case we have to read in the page to make it uptodate. Handle this by using the existing private field in the iomap_iter, and add the argument to iomap_file_buffered_write. This will allow us to pass the file in through the iomap buffered write path, and is flexible for any other file systems needs. Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner <brauner@kernel.org>
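A hedged sketch of the resulting call shape (the ops structure name is hypothetical; only the extra argument and its routing through iomap_iter's private field are from the commit text):

	/* Caller side: thread the originating file through the write path. */
	ret = iomap_file_buffered_write(iocb, from, &my_iomap_ops, file);

	/* Callback side: retrieve it from the iterator when a page must be
	 * read in to be made uptodate. */
	struct file *file = iter->private;	/* struct iomap_iter *iter */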
-
- Aug 29, 2024
-
-
Christoph Hellwig authored
bio_split_rw is designed to split read and write bios with a payload. Currently it is called by __bio_split_to_limits for all operations not explicitly listed, which works because bio_may_need_split explicitly checks for bi_vcnt == 1 and thus skips the bypass if there is no payload, and the bio_for_each_bvec loop will never execute its body if bi_size is 0. But all this is hard to understand, fragile, and wastes pointless cycles. Switch __bio_split_to_limits to only call bio_split_rw for READ and WRITE commands and don't attempt any kind of split for operations that do not require splitting. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Tested-by:
Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by:
Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
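The resulting dispatch, sketched (simplified from the description; helper names follow block/blk-merge.c, but treat the details as an assumption):

	switch (bio_op(bio)) {
	case REQ_OP_DISCARD:
	case REQ_OP_SECURE_ERASE:
		return bio_split_discard(bio, lim, nr_segs);
	case REQ_OP_WRITE_ZEROES:
		return bio_split_write_zeroes(bio, lim, nr_segs);
	case REQ_OP_READ:
	case REQ_OP_WRITE:
		return bio_split_rw(bio, lim, nr_segs);
	default:
		return bio;	/* nothing else needs splitting */
	}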
-