- Jul 02, 2024
-
-
Bart Van Assche authored
The current tag reservation code is based on a misunderstanding of the meaning of data->shallow_depth. Fix the tag reservation code as follows: * By default, do not reserve any tags for synchronous requests because for certain use cases reserving tags reduces performance. See also Harshit Mogalapalli, [bug-report] Performance regression with fio sequential-write on a multipath setup, 2024-03-07 (https://lore.kernel.org/linux-block/5ce2ae5d-61e2-4ede-ad55-551112602401@oracle.com/) * Reduce min_shallow_depth to one because min_shallow_depth must be less than or equal to any shallow_depth value. * Scale dd->async_depth from the range [1, nr_requests] to [1, bits_per_sbitmap_word]. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Zhiguo Niu <zhiguo.niu@unisoc.com> Fixes: 07757588 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240509170149.7639-3-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
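The last item of the fix above, scaling a depth from the range [1, nr_requests] to [1, bits_per_sbitmap_word], can be illustrated with a small user-space sketch. The function name and the plain integer parameters are assumptions made for illustration; this is not the kernel's code. Ceiling division is used here so that a nonzero input never maps to zero.

```c
/*
 * Hypothetical helper (illustration only, not the kernel implementation):
 * map a depth expressed relative to nr_requests onto the bits of a single
 * sbitmap word. word_shift is log2 of the number of bits per word.
 */
static unsigned int scale_to_word_depth(unsigned int qdepth,
                                        unsigned int nr_requests,
                                        unsigned int word_shift)
{
    /* Round up so that any qdepth >= 1 yields a result >= 1. */
    return ((qdepth << word_shift) + nr_requests - 1) / nr_requests;
}
```

With 128 requests and 32-bit words (shift 5), a request-relative depth of 128 maps to the full word and a depth of 1 still maps to 1.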
-
- Apr 19, 2024
-
-
Jiapeng Chong authored
These functions are defined in the mq-deadline.c file, but not called elsewhere, so delete these unused functions. block/mq-deadline.c:134:1: warning: unused function 'deadline_earlier_request'. block/mq-deadline.c:148:1: warning: unused function 'deadline_latter_request'. Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8803 Signed-off-by:
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240419025610.34298-1-jiapeng.chong@linux.alibaba.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 17, 2024
-
-
Damien Le Moal authored
With the block layer generic plugging of write operations for zoned block devices, mq-deadline, or any other scheduler, can only ever see at most one write operation per zone at any time. There are thus no sequentiality requirements for these writes and thus no need to tightly control the dispatching of write requests using zone write locking. Remove all the code that implements this control in the mq-deadline scheduler and stop advertising support for the ELEVATOR_F_ZBD_SEQ_WRITE elevator feature. Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Tested-by:
Hans Holmberg <hans.holmberg@wdc.com> Tested-by:
Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-22-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Mar 13, 2024
-
-
Bart Van Assche authored
The code "max(1U, 3 * (1U << shift) / 4)" comes from the Kyber I/O scheduler. The Kyber I/O scheduler maintains one internal queue per hwq and hence derives its async_depth from the number of hwq tags. Using this approach for the mq-deadline scheduler is wrong since the mq-deadline scheduler maintains one internal queue for all hwqs combined. Hence this revert. Cc: stable@vger.kernel.org Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: Zhiguo Niu <Zhiguo.Niu@unisoc.com> Fixes: d47f9717 ("block/mq-deadline: use correct way to throttling write requests") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240313214218.1736147-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 08, 2023
-
-
Zhiguo Niu authored
The original formula was inaccurate: dd->async_depth = max(1UL, 3 * q->nr_requests / 4); For write requests, when we assign a tag from sched_tags, data->shallow_depth is passed to sbitmap_find_bit; see the following code: nr = sbitmap_find_bit_in_word(&sb->map[index], min_t(unsigned int, __map_depth(sb, index), depth), alloc_hint, wrap); The smaller of data->shallow_depth and __map_depth(sb, index) is used as the maximum range when allocating bits. For an mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so with the previous calculation method dd->async_depth = data->shallow_depth = 96. The platform is 64-bit with 8 CPUs, sched_tags.bitmap_tags.sb.shift=5 and sb.maps[]=32/32/32/32. Since 32 is smaller than 96, tags can be allocated from the full range of a word each time, whether the I/O is a read or a write, which has no throttling effect. In addition, following the methods of the bfq/kyber I/O schedulers, limit ratios are calculated based on sched_tags.bitmap_tags.sb.shift. With this patch, write requests are actually throttled. Fixes: 07757588 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by:
Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
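The arithmetic in the message above can be checked with a short user-space sketch. Both function names are hypothetical. The first mirrors the min_t() clamp quoted in the message, and the second evaluates the word-relative expression max(1U, 3 * (1U << shift) / 4) that a later commit in this log attributes to this change.

```c
/*
 * Per-word allocation limit: the smaller of the word size and the shallow
 * depth, as in the min_t() expression quoted in the commit message.
 */
static unsigned int effective_word_depth(unsigned int map_depth,
                                         unsigned int shallow_depth)
{
    return shallow_depth < map_depth ? shallow_depth : map_depth;
}

/* Word-relative depth: max(1U, 3 * (1U << shift) / 4). */
static unsigned int word_relative_depth(unsigned int shift)
{
    unsigned int depth = 3 * (1U << shift) / 4;

    return depth > 1U ? depth : 1U;
}
```

For the mmc example (shift 5, words of 32 bits), a shallow depth of 96 is clamped to 32 and therefore limits nothing, whereas the word-relative value is 24.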
-
- Jul 12, 2023
-
-
Bart Van Assche authored
A bug was introduced in deadline_from_pos() while implementing the suggestion to use round_down() in the following code: pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos); This patch makes deadline_from_pos() use round_down() such that 'pos' is rounded down. Reported-by:
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/ Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Fixes: 0effb390 ("block: mq-deadline: Handle requeued requests correctly") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
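As a rough illustration of what rounding 'pos' down means here, the sketch below separates two quantities that are easy to confuse: the start of the zone containing a position, and the offset of that position within its zone. The function names are hypothetical, and the zone size is assumed to be a power of two, which the kernel's round_down() macro also requires.

```c
#include <stdint.h>

/* Start of the zone containing pos; zone_sectors must be a power of two. */
static uint64_t zone_start(uint64_t pos, uint64_t zone_sectors)
{
    return pos & ~(zone_sectors - 1);
}

/* Distance of pos from the start of its zone. */
static uint64_t offset_in_zone(uint64_t pos, uint64_t zone_sectors)
{
    return pos & (zone_sectors - 1);
}
```

The two values always add up to the original position, so subtracting one where the other is intended yields a wrong sector.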
-
- May 19, 2023
-
-
Bart Van Assche authored
Before dispatching a zoned write from the FIFO list, check whether there are any zoned writes in the RB-tree with a lower LBA for the same zone. This patch ensures that zoned writes happen in order even if at_head is set for some writes for a zone and not for others. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-12-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Start dispatching from the start of a zone instead of from the starting position of the most recently dispatched request. If a zoned write is requeued with an LBA that is lower than already inserted zoned writes, make sure that it is submitted first. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-11-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Track the position (sector_t) of the most recently dispatched request instead of tracking a pointer to the next request to dispatch. This patch is the basis for patch "Handle requeued requests correctly". Without this patch it would be significantly more complicated to make sure that zoned writes are dispatched in LBA order per zone. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-10-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
blk_mq_free_requests() calls dd_finish_request() indirectly. Prevent nested locking of dd->lock and dd->zone_lock by moving the code for freeing requests. Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-9-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Make the deadline_skip_seq_writes() code shorter without changing its functionality. Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-8-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Change the return type of deadline_check_fifo() from 'int' into 'bool'. Use time_is_before_eq_jiffies() instead of time_after_eq(). No functionality has been changed. Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-7-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
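Both helpers named in the message above compare tick counts in a way that stays correct when the counter wraps around. A minimal user-space sketch of that comparison follows. The function names are stand-ins rather than the kernel macros, and the current time is passed in explicitly instead of being read from jiffies.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is at or after b" for a free-running tick counter.
 * The unsigned difference is reinterpreted as signed, which behaves as
 * intended on the two's complement platforms this sketch assumes.
 */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Hypothetical stand-in: has 'deadline' been reached at time 'now'? */
static bool deadline_expired(unsigned long now, unsigned long deadline)
{
    return tick_after_eq(now, deadline);
}
```

A deadline set shortly before the counter wraps is still reported as expired once the counter has wrapped to a small value.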
-
Bart Van Assche authored
Add the missing word "and". Cc: Damien Le Moal <dlemoal@kernel.org> Suggested-by:
Damien Le Moal <dlemoal@kernel.org> Fixes: 945ffb60 ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Tested-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-2-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 13, 2023
-
-
Christoph Hellwig authored
Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
blk_mq_sched_insert_request is the main request insert helper and not directly I/O scheduler related. Move blk_mq_sched_insert_request to blk-mq.c, rename it to blk_mq_insert_request and mark it static. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Nov 29, 2022
-
-
Damien Le Moal authored
Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the "s" plural) to more correctly reflect the fact that this function tests a single request, not multiple requests. Fixes: 015d02f4 ("block: mq-deadline: Do not break sequential write streams to zoned HDDs") Signed-off-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20221126025550.967914-2-damien.lemoal@opensource.wdc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Nov 24, 2022
-
-
Damien Le Moal authored
mq-deadline ensures in-order dispatching of write requests to zoned block devices using a per-zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such a sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone from being issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being written and not locked). This results in an out-of-order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard disks react badly to seeing these out-of-order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in LBA order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: <stable@vger.kernel.org> Signed-off-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
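The idea of always dispatching the first request of a sequential write stream can be sketched in user-space C. The struct and both functions are hypothetical simplifications: the scheduler works on its own request structures and sort tree, while this sketch uses a plain array sorted by start sector.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified request: start sector, length, direction. */
struct sketch_req {
    uint64_t pos;
    uint32_t sectors;
    bool is_write;
};

/* True if rq is a write that starts exactly where the write prev ends. */
static bool is_seq_write(const struct sketch_req *prev,
                         const struct sketch_req *rq)
{
    return prev->is_write && rq->is_write &&
           prev->pos + prev->sectors == rq->pos;
}

/*
 * Given requests sorted by start sector, walk back from index i to the
 * first request of the sequential write stream that contains it.
 */
static size_t seq_stream_head(const struct sketch_req *sorted, size_t i)
{
    while (i > 0 && is_seq_write(&sorted[i - 1], &sorted[i]))
        i--;
    return i;
}
```

Because the check only compares sector ranges, a stream that continues across a zone boundary is still treated as one stream, which matches the "regardless of the zones targeted" wording above.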
-
Damien Le Moal authored
dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorities does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support") Cc: <stable@vger.kernel.org> Signed-off-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
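The first problem described above, checking only the priority level of the completed request, can be illustrated with a simplified sketch. All names are hypothetical, the number of priority levels is an assumption, and plain counters stand in for the fifo lists. The sketch does not model the concurrency concern that list_empty_careful() addresses.

```c
#include <stdbool.h>

enum { SKETCH_PRIO_COUNT = 3 };   /* assumed number of priority levels */

/* Hypothetical scheduler state: pending writes per priority level. */
struct sketch_sched {
    unsigned int queued_writes[SKETCH_PRIO_COUNT];
};

/* The problematic check: only one priority level is examined. */
static bool has_write_work_one_level(const struct sketch_sched *s, int prio)
{
    return s->queued_writes[prio] > 0;
}

/* The fixed check: report pending writes at any priority level. */
static bool has_write_work(const struct sketch_sched *s)
{
    for (int p = 0; p < SKETCH_PRIO_COUNT; p++)
        if (s->queued_writes[p] > 0)
            return true;
    return false;
}
```

If a request at level 0 completes while the only blocked write sits at level 2, the single-level check reports no work and a restart would be missed, while the all-levels check reports it.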
-
- Jul 14, 2022
-
-
Bart Van Assche authored
Use the new blk_opf_t type for an argument that represents a bitwise combination of a request operation and request flags. Rename that argument from 'op' into 'opf'. This patch does not change any functionality. Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-9-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 16, 2022
-
-
Ming Lei authored
q->elevator is referenced in blk_mq_has_sqsched() without any protection: no .q_usage_counter is held and neither the queue SRCU nor the RCU read lock is held, so a use-after-free may be triggered. Fix the issue by adding a queue flag for checking whether the elevator uses single-queue style dispatch. With that flag, the ELEVATOR_F_MQ_AWARE elevator feature flag is no longer needed. Cc: Jan Kara <jack@suse.cz> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 13, 2022
-
-
Bart Van Assche authored
Before commit 322cff70 the fifo_time member of requests on a dispatch list was not used. Commit 322cff70 introduces code that reads the fifo_time member of requests on dispatch lists. Hence this patch that sets the fifo_time member when adding a request to a dispatch list. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Fixes: 322cff70 ("block/mq-deadline: Prioritize high-priority requests") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jan 20, 2022
-
-
Jens Axboe authored
A previous commit added this feature, but it inadvertently used the wrong variable to show/store the setting from/to, victimized by copy/paste. Fix it up so that the async_depth sysfs interface reads and writes from the right setting. Fixes: 07757588 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485 Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 18, 2021
-
-
John Garry authored
Now that we use shared tags for shared sbitmap support, we don't require the tags sbitmap pointers, so drop them. This essentially reverts commit 222a5ae0 ("blk-mq: Use pointers for blk_mq_tags bitmap tags"). Function blk_mq_init_bitmap_tags() is removed also, since it would be only a wrapper for blk_mq_init_bitmaps(). Reviewed-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Hannes Reinecke <hare@suse.de> Signed-off-by:
John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
In addition to reverting commit 7b05bf77 ("Revert "block/mq-deadline: Prioritize high-priority requests""), this patch uses 'jiffies' instead of ktime_get() in the code for aging lower priority requests. This patch has been tested as follows: Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: 555 K IOPS. Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: about 380 K IOPS. Ran the following script: set -e scriptdir=$(dirname "$0") if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi modprobe scsi_debug ndelay=1000000 max_queue=16 sd='' while [ -z "$sd" ]; do sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire" if [ -e /sys/fs/cgroup/io.prio.class ]; then cd /sys/fs/cgroup echo restrict-to-be >io.prio.class echo +io > cgroup.subtree_control else cd /sys/fs/cgroup/blkio/ echo restrict-to-be >blkio.prio.class fi echo $$ >cgroup.procs mkdir -p hipri cd hipri if [ -e io.prio.class ]; then echo none-to-rt >io.prio.class else echo none-to-rt >blkio.prio.class fi { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & } echo $$ >cgroup.procs "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the prio aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. 
The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org [axboe: @latest -> @latest_start] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Calculating the sum over all CPUs of per-CPU counters frequently is inefficient. Hence switch from per-CPU to individual counters. Three counters are protected by the mq-deadline spinlock since these are only accessed from contexts that already hold that spinlock. The fourth counter is atomic because protecting it with the mq-deadline spinlock would trigger lock contention. Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
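The counter layout described above, three plain counters updated under a lock plus one atomic counter, might look roughly like the following C11 sketch. The struct, the function names, and the choice of which counter is atomic are assumptions made for illustration. Locking is left to the caller and is not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical statistics for one priority level. */
struct sketch_stats {
    uint32_t inserted;      /* caller is assumed to hold the scheduler lock */
    uint32_t merged;        /* same */
    uint32_t dispatched;    /* same */
    atomic_uint completed;  /* assumed to be updated without that lock */
};

/* Record one completion without taking the scheduler lock. */
static void stats_complete_one(struct sketch_stats *s)
{
    atomic_fetch_add_explicit(&s->completed, 1, memory_order_relaxed);
}

/* Requests inserted but not yet completed: a single subtraction. */
static uint32_t stats_queued(struct sketch_stats *s)
{
    return s->inserted -
           atomic_load_explicit(&s->completed, memory_order_relaxed);
}
```

Reading the queued count is then one load and one subtraction, with no loop over CPUs.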
-
Bart Van Assche authored
Check a statistics invariant at module unload time. When running blktests, the invariant is verified every time a request queue is removed and hence is verified at least once per test. Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
The scheduler .insert_requests() callback is called when a request is queued for the first time and also when it is requeued. Only count a request the first time it is queued. Additionally, since the mq-deadline scheduler only performs zone locking for requests that have been inserted, skip the zone unlock code for requests that have not been inserted into the mq-deadline scheduler. Fixes: 38ba64d1 ("block/mq-deadline: Track I/O statistics") Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Except for the features passed to blk_queue_required_elevator_features, elevator.h is only needed internally to the block layer. Move the ELEVATOR_F_* definitions to blkdev.h, and move elevator.h to block/, dropping all the spurious includes outside of that. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Sep 02, 2021
-
-
Geert Uytterhoeven authored
If CONFIG_BLK_DEBUG_FS=n: block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function] 274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) | ^~~~~~~~~ Fix this by moving dd_queued() just before the sole function that calls it. Fixes: 7b05bf77 ("Revert "block/mq-deadline: Prioritize high-priority requests"") Signed-off-by:
Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 38ba64d1 ("block/mq-deadline: Track I/O statistics") Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 26, 2021
-
-
Jens Axboe authored
This reverts commit fb926032. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by:
Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 24, 2021
-
-
Bart Van Assche authored
The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents lower priority I/O from being delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by:
Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by:
Niklas Cassel <niklas.cassel@wdc.com> Tested-by:
Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 11, 2021
-
-
Tejun Heo authored
This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 09, 2021
-
-
Ming Lei authored
When merging a bio into a request, if both are discard I/O and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because neither the block core nor the related drivers (nvme, virtio-blk) handle mixed discard I/O merges (a traditional I/O merge combined with a discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Tested-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 2705dfb2 ("block: fix discard request merge") Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 27, 2021
-
-
Bart Van Assche authored
The purpose of the WARN_ON_ONCE() statement in dd_insert_request() is to verify that dd_prepare_request() cleared rq->elv.priv[0]. Since dd_prepare_request() is called during request initialization but not if a request is requeued, a warning is triggered if a request is requeued. Fix this by removing the WARN_ON_ONCE() statement. This patch suppresses the following kernel warning: WARNING: CPU: 28 PID: 432 at block/mq-deadline-main.c:740 dd_insert_request+0x4d4/0x5b0 Workqueue: kblockd blk_mq_requeue_work Call Trace: dd_insert_requests+0xfa/0x130 blk_mq_sched_insert_request+0x22c/0x240 blk_mq_requeue_work+0x21c/0x2d0 process_one_work+0x4c2/0xa70 worker_thread+0x2e5/0x6d0 kthread+0x21c/0x250 ret_from_fork+0x1f/0x30 Reported-by:
Sachin Sant <sachinp@linux.vnet.ibm.com> Fixes: 08a9ad8b ("block/mq-deadline: Add cgroup support") Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210627211112.12720-1-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 25, 2021
-
-
Jan Kara authored
Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that up to the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 21, 2021
-
-
Bart Van Assche authored
While one or more requests with a certain I/O priority are pending, do not dispatch lower priority requests. Dispatch lower priority requests anyway after the "aging" time has expired. This patch has been tested as follows: modprobe scsi_debug ndelay=1000000 max_queue=16 && sd='' && while [ -z "$sd" ]; do sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done && echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire && cd /sys/fs/cgroup/blkio/ && echo $$ >cgroup.procs && echo restrict-to-be >blkio.prio.class && mkdir -p hipri && cd hipri && echo none-to-rt >blkio.prio.class && { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } && echo $$ >cgroup.procs && max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Reviewed-by:
Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
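The dispatch rule stated at the start of the message above (hold back lower priority requests while higher priority requests are pending, unless the aging time has expired) reduces to a small predicate. The function and parameter names are hypothetical, and both time values are assumed to use the same unit.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: may a lower priority request be dispatched now?
 * 'waited' is how long that request has been queued.
 */
static bool may_dispatch_lower_prio(bool higher_prio_pending,
                                    unsigned long waited,
                                    unsigned long aging_expire)
{
    /* Nothing more urgent is pending, or the request has aged enough. */
    return !higher_prio_pending || waited >= aging_expire;
}
```

The sketch covers only the decision for a single request and says nothing about how the scheduler tracks waiting times or selects among several lower priority levels.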
-
Bart Van Assche authored
Maintain statistics per cgroup and export these to user space. These statistics are essential for verifying whether the proper I/O priorities have been assigned to requests. An example of the statistics data with this patch applied: $ cat /sys/fs/cgroup/io.stat 11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-