- Oct 18, 2021
-
-
John Garry authored
Now that we use shared tags for shared sbitmap support, we don't require the tags sbitmap pointers, so drop them. This essentially reverts commit 222a5ae0 ("blk-mq: Use pointers for blk_mq_tags bitmap tags"). Function blk_mq_init_bitmap_tags() is removed also, since it would be only a wrappper for blk_mq_init_bitmaps(). Reviewed-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Signed-off-by:
John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
In addition to reverting commit 7b05bf77 ("Revert "block/mq-deadline: Prioritize high-priority requests""), this patch uses 'jiffies' instead of ktime_get() in the code for aging lower priority requests. This patch has been tested as follows: Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: 555 K IOPS. Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: about 380 K IOPS. Ran the following script: set -e scriptdir=$(dirname "$0") if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi modprobe scsi_debug ndelay=1000000 max_queue=16 sd='' while [ -z "$sd" ]; do sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire" if [ -e /sys/fs/cgroup/io.prio.class ]; then cd /sys/fs/cgroup echo restrict-to-be >io.prio.class echo +io > cgroup.subtree_control else cd /sys/fs/cgroup/blkio/ echo restrict-to-be >blkio.prio.class fi echo $$ >cgroup.procs mkdir -p hipri cd hipri if [ -e io.prio.class ]; then echo none-to-rt >io.prio.class else echo none-to-rt >blkio.prio.class fi { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & } echo $$ >cgroup.procs "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the prio aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org [axboe: @latest -> @latest_start] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Calculating the sum over all CPUs of per-CPU counters frequently is inefficient. Hence switch from per-CPU to individual counters. Three counters are protected by the mq-deadline spinlock since these are only accessed from contexts that already hold that spinlock. The fourth counter is atomic because protecting it with the mq-deadline spinlock would trigger lock contention. Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Check a statistics invariant at module unload time. When running blktests, the invariant is verified every time a request queue is removed and hence is verified at least once per test. Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
The scheduler .insert_requests() callback is called when a request is queued for the first time and also when it is requeued. Only count a request the first time it is queued. Additionally, since the mq-deadline scheduler only performs zone locking for requests that have been inserted, skip the zone unlock code for requests that have not been inserted into the mq-deadline scheduler. Fixes: 38ba64d1 ("block/mq-deadline: Track I/O statistics") Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Except for the features passed to blk_queue_required_elevator_features, elevator.h is only needed internally to the block layer. Move the ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to block/, dropping all the spurious includes outside of that. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 02, 2021
-
-
Geert Uytterhoeven authored
If CONFIG_BLK_DEBUG_FS=n: block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function] 274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) | ^~~~~~~~~ Fix this by moving dd_queued() just before the sole function that calls it. Fixes: 7b05bf77 ("Revert "block/mq-deadline: Prioritize high-priority requests"") Signed-off-by:
Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 38ba64d1 ("block/mq-deadline: Track I/O statistics") Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 26, 2021
-
-
Jens Axboe authored
This reverts commit fb926032. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by:
Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 24, 2021
-
-
Bart Van Assche authored
The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents that lower priority I/O is delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by:
Niklas Cassel <niklas.cassel@wdc.com> Tested-by:
Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 11, 2021
-
-
Tejun Heo authored
This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 09, 2021
-
-
Ming Lei authored
When merging one bio to request, if they are discard IO and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because both block core and related drivers(nvme, virtio-blk) doesn't handle mixed discard io merge(traditional IO merge together with discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Tested-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 2705dfb2 ("block: fix discard request merge") Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 27, 2021
-
-
Bart Van Assche authored
The purpose of the WARN_ON_ONCE() statement in dd_insert_request() is to verify that dd_prepare_request() cleared rq->elv.priv[0]. Since dd_prepare_request() is called during request initialization but not if a request is requeued, a warning is triggered if a request is requeued. Fix this by removing the WARN_ON_ONCE() statement. This patch suppresses the following kernel warning: WARNING: CPU: 28 PID: 432 at block/mq-deadline-main.c:740 dd_insert_request+0x4d4/0x5b0 Workqueue: kblockd blk_mq_requeue_work Call Trace: dd_insert_requests+0xfa/0x130 blk_mq_sched_insert_request+0x22c/0x240 blk_mq_requeue_work+0x21c/0x2d0 process_one_work+0x4c2/0xa70 worker_thread+0x2e5/0x6d0 kthread+0x21c/0x250 ret_from_fork+0x1f/0x30 Reported-by:
Sachin Sant <sachinp@linux.vnet.ibm.com> Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210627211112.12720-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 25, 2021
-
-
Jan Kara authored
Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 21, 2021
-
-
Bart Van Assche authored
While one or more requests with a certain I/O priority are pending, do not dispatch lower priority requests. Dispatch lower priority requests anyway after the "aging" time has expired. This patch has been tested as follows: modprobe scsi_debug ndelay=1000000 max_queue=16 && sd='' && while [ -z "$sd" ]; do sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done && echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire && cd /sys/fs/cgroup/blkio/ && echo $$ >cgroup.procs && echo restrict-to-be >blkio.prio.class && mkdir -p hipri && cd hipri && echo none-to-rt >blkio.prio.class && { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } && echo $$ >cgroup.procs && max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Maintain statistics per cgroup and export these to user space. These statistics are essential for verifying whether the proper I/O priorities have been assigned to requests. An example of the statistics data with this patch applied: $ cat /sys/fs/cgroup/io.stat 11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Track I/O statistics per I/O priority and export these statistics to debugfs. These statistics help developers of the deadline scheduler. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE and IDLE. Maintain statistics for each priority level. Split the debugfs attributes per priority level as follows: $ ls /sys/kernel/debug/block/.../sched/ async_depth dispatch2 read_next_rq write2_fifo_list batching read0_fifo_list starved write_next_rq dispatch0 read1_fifo_list write0_fifo_list dispatch1 read2_fifo_list write1_fifo_list Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
When dispatching the first request of a batch, the deadline_move_request() call clears .next_rq[] for the opposite data direction. .next_rq[] is not restored when changing data direction. Fix this by not clearing .next_rq[] and by keeping track of the data direction of a batch in a variable instead. This patch is a micro-optimization because: - The number of deadline_next_request() calls for the read direction is halved. - The number of times that deadline_next_request() returns NULL is reduced. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-13-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
For interactive workloads it is important that synchronous requests are not delayed. Hence reserve 25% of scheduler tags for synchronous requests. This patch still allows asynchronous requests to fill the hardware queues since blk_mq_init_sched() makes sure that the number of scheduler requests is the double of the hardware queue depth. From blk_mq_init_sched(): q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth, BLKDEV_MAX_RQ); Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-12-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Define separate macros for integers and jiffies to improve readability. Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol(). The former functions are the recommended functions. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-11-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Modern compilers complain if an out-of-range value is passed to a function argument that has an enumeration type. Let the compiler detect out-of-range data direction arguments instead of verifying the data_dir argument at runtime. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-10-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Change "queue" into "sched" to make the function names reflect better the purpose of these functions. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-9-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Make __dd_dispatch_request() easier to read by removing two local variables. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-8-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Document the locking strategy by adding two lockdep_assert_held() statements. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-7-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Make the code easier to read by adding more comments. Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-6-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 11, 2021
-
-
Omar Sandoval authored
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845 ("block: kyber: make kyber more friendly with merging") Reported-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 16, 2021
-
-
Lin Feng authored
Since commit 01e99aec 'blk-mq: insert passthrough request into hctx->dispatch directly', passthrough request should not appear in IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO schedulers is redundant. (Notes: this patch passes generic IO load test with hdds under SAS controller and hdds under AHCI controller but obviously not covers all. Not sure if passthrough request can still escape into IO scheduler from blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and has lots of indirect callers.) Signed-off-by:
Lin Feng <linf@wangsu.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 22, 2021
-
-
Chaitanya Kulkarni authored
Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jan 25, 2021
-
-
Jan Kara authored
This reverts commit b445547e. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 03, 2020
-
-
Kashyap Desai authored
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock contention is possible for mq-deadline and bfq IO schedulers when nr_hw_queues is more than one. It is because kblockd work queue can submit IO from all online CPUs (through blk_mq_run_hw_queues()) even though only one hctx has pending commands. The elevator callback .has_work for mq-deadline and bfq scheduler considers pending work if there are any IOs on request queue but it does not account hctx context. Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering the elevator even though there are no requests queued. [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap] Signed-off-by:
Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by:
Hannes Reinecke <hare@suse.de> Signed-off-by:
John Garry <john.garry@huawei.com> Tested-by:
Douglas Gilbert <dgilbert@interlog.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 29, 2020
-
-
Christoph Hellwig authored
None of the I/O schedulers actually needs it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Daniel Wagner <dwagner@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 06, 2019
-
-
Damien Le Moal authored
Introduce the definition of elevator features through the elevator_features flags in the elevator_type structure. Each flag can represent a feature supported by an elevator. The first feature defined by this patch is support for zoned block device sequential write constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented by the mq-deadline elevator using zone write locking. Other possible features are IO priorities, write hints, latency targets or single-LUN dual-actuator disks (for which the elevator could maintain one LBA ordered list per actuator). The required_elevator_features field is also added to the request_queue structure to allow a device driver to specify elevator feature flags that an elevator must support for the correct operation of the device (e.g. device drivers for zoned block devices can have the ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature). The helper function blk_queue_required_elevator_features() is defined for setting this new field. With these two new fields in place, the elevator functions elevator_match() and elevator_find() are modified to allow a user to set only an elevator with a set of features that satisfies the device required features. Elevators not matching the device requirements are not shown in the device sysfs queue/scheduler file to prevent their use. The "none" elevator can always be selected as before. Reviewed-by:
Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 03, 2019
-
-
Damien Le Moal authored
Commit 7211aef8 ("block: mq-deadline: Fix write completion handling") added a call to blk_mq_sched_mark_restart_hctx() in dd_dispatch_request() to make sure that write request dispatching does not stall when all target zones are locked. This fix left a subtle race when a write completion happens during a dispatch execution on another CPU: CPU 0: Dispatch CPU1: write completion dd_dispatch_request() lock(&dd->lock); ... lock(&dd->zone_lock); dd_finish_request() rq = find request lock(&dd->zone_lock); unlock(&dd->zone_lock); zone write unlock unlock(&dd->zone_lock); ... __blk_mq_free_request check restart flag (not set) -> queue not run ... if (!rq && have writes) blk_mq_sched_mark_restart_hctx() unlock(&dd->lock) Since the dispatch context finishes after the write request completion handling, marking the queue as needing a restart is not seen from __blk_mq_free_request() and blk_mq_sched_restart() not executed leading to the dispatch stall under 100% write workloads. Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from dd_dispatch_request() into dd_finish_request() under the zone lock to ensure full mutual exclusion between write request dispatch selection and zone unlock on write request completion. Fixes: 7211aef8 ("block: mq-deadline: Fix write completion handling") Cc: stable@vger.kernel.org Reported-by:
Hans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by:
Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 15, 2019
-
-
Mauro Carvalho Chehab authored
Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by:
Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
-
- Jun 20, 2019
-
-
Christoph Hellwig authored
We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 30, 2019
-
-
Christoph Hellwig authored
Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those. Reviewed-by:
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Dec 17, 2018
-
-
Damien Le Moal authored
For a zoned block device using mq-deadline, if a write request for a zone is received while another write was already dispatched for the same zone, dd_dispatch_request() will return NULL and the newly inserted write request is kept in the scheduler queue waiting for the ongoing zone write to complete. With this behavior, when no other request has been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to __blk_mq_free_request() call of blk_mq_sched_restart() to not run the queue when the already dispatched write request completes. The newly dispatched request stays stuck in the scheduler queue until eventually another request is submitted. This problem does not affect SCSI disk as the SCSI stack handles queue restart on request completion. However, this problem is can be triggered the nullblk driver with zoned mode enabled. Fix this by always requesting a queue restart in dd_dispatch_request() if no request was dispatched while WRITE requests are queued. Fixes: 5700f691 ("mq-deadline: Introduce zone locking support") Cc: <stable@vger.kernel.org> Signed-off-by:
Damien Le Moal <damien.lemoal@wdc.com> Add missing export of blk_mq_sched_restart() Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Nov 07, 2018
-
-
Jens Axboe authored
This is a remnant of when we had ops for both SQ and MQ schedulers. Now it's just MQ, so get rid of the union. Reviewed-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This removes a bunch of core and elevator related code. On the core front, we remove anything related to queue running, draining, initialization, plugging, and congestions. We also kill anything related to request allocation, merging, retrieval, and completion. Remove any checking for single queue IO schedulers, as they no longer exist. This means we can also delete a bunch of code related to request issue, adding, completion, etc - and all the SQ related ops and helpers. Also kill the load_default_modules(), as all that did was provide for a way to load the default single queue elevator. Tested-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 24, 2018
-
-
Joe Perches authored
Convert the S_<FOO> symbolic permissions to their octal equivalents as using octal and not symbolic permissions is preferred by many as more readable. see: https://lkml.org/lkml/2016/8/2/1945 Done with automated conversion via: $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...> Miscellanea: o Wrapped modified multi-line calls to a single line where appropriate o Realign modified multi-line calls to open parenthesis Signed-off-by:
Joe Perches <joe@perches.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-