    • block/mq-deadline: use correct way to throttling write requests · d47f9717
      Zhiguo Niu authored
      
      The original formula was inaccurate:
      dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
      
      For write requests, when a tag is allocated from sched_tags,
      data->shallow_depth is passed to sbitmap_find_bit,
      see the following code:
      
      nr = sbitmap_find_bit_in_word(&sb->map[index],
      			min_t(unsigned int,
      			      __map_depth(sb, index),
      			      depth),
      			alloc_hint, wrap);
      
      The smaller of data->shallow_depth and __map_depth(sb, index)
      will be used as the maximum range when allocating bits.
      
      For an mmc device (one hw queue, deadline I/O scheduler):
      q->nr_requests = sched_tags = 128, so according to the previous
      calculation method, dd->async_depth = data->shallow_depth = 96.
      The platform is 64-bit with 8 CPUs, so sched_tags.bitmap_tags.sb.shift = 5
      and sb.maps[] = 32/32/32/32. Since 32 is smaller than 96, tags can be
      allocated over the full range of each word for both read and write I/O,
      which has no throttling effect.
      
      In addition, following the approach of the bfq/kyber I/O schedulers,
      limit ratios are calculated based on sched_tags.bitmap_tags.sb.shift.
      
      With this patch, write requests are actually throttled.
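      The effect of both formulas can be sketched in plain C (the constants
      mirror the mmc example above; the names are illustrative, not the
      kernel's):

      ```c
      /* Effective per-word allocation limit: the smaller of the scheduler's
       * shallow depth and the depth of one sbitmap word. */
      static unsigned int effective_depth(unsigned int shallow_depth,
      				    unsigned int map_depth)
      {
      	return shallow_depth < map_depth ? shallow_depth : map_depth;
      }

      /* Old formula: 3/4 of nr_requests, independent of the sbitmap layout. */
      static unsigned int async_depth_old(unsigned int nr_requests)
      {
      	unsigned int d = 3 * nr_requests / 4;

      	return d ? d : 1;
      }

      /* New formula: 3/4 of one sbitmap word (1 << shift), as bfq/kyber do. */
      static unsigned int async_depth_new(unsigned int shift)
      {
      	unsigned int d = 3 * (1U << shift) / 4;

      	return d ? d : 1;
      }
      ```

      For the mmc example (nr_requests = 128, shift = 5, word depth 32) the
      old formula yields 96, so the effective limit is min(96, 32) = 32, the
      whole word; the new formula yields 24, which is below 32 and therefore
      really limits async allocations.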
      
      Fixes: 07757588 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
      
      Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. Nov 24, 2022
    • block: mq-deadline: Do not break sequential write streams to zoned HDDs · 015d02f4
      Damien Le Moal authored
      
      mq-deadline ensures an in order dispatching of write requests to zoned
      block devices using a per zone lock (a bit). This implies that for any
      purely sequential write workload, the drive is exercised most of the
      time at a maximum queue depth of one.
      
      However, when such a sequential write workload crosses a zone boundary
      (when sequentially writing multiple contiguous zones), zone write
      locking may prevent the last write to one zone from being issued (as the
      previous write is still being executed) but allow the first write to the
      following zone to be issued (as that zone is not yet being written and
      not locked). This results in an out-of-order delivery of the sequential
      write commands to the device every time a zone boundary is crossed.
      
      While such behavior does not break the sequential write constraint of
      zoned block devices (and does not generate any write error), some zoned
      hard-disks react badly to seeing these out of order writes, resulting in
      lower write throughput.
      
      This problem can be addressed by always dispatching the first request
      of a stream of sequential write requests, regardless of the zones
      targeted by these sequential writes. To do so, the function
      deadline_skip_seq_writes() is introduced and used in
      deadline_next_request() to select the next write command to issue if the
      target device is an HDD (blk_queue_nonrot() being false).
      deadline_fifo_request() is modified using the new
      deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
      requests in the fifo list that have a preceding request in lba order
      that is sequential.
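      The sequentiality test at the heart of these helpers can be sketched
      as follows (a minimal model, assuming requests described only by start
      sector and length; the helper names mirror the commit, not the exact
      kernel signatures):

      ```c
      #include <stdbool.h>

      /* Minimal request model: start sector and number of 512B sectors. */
      struct rq {
      	unsigned long long sector;
      	unsigned int nr_sectors;
      };

      /* Spirit of deadline_is_seq_write(): @next is a sequential
       * continuation of @prev if it starts exactly where @prev ends. */
      static bool is_seq_write(const struct rq *prev, const struct rq *next)
      {
      	return prev->sector + prev->nr_sectors == next->sector;
      }

      /* Spirit of deadline_skip_seq_writes(): skip over a run of sequential
       * writes in an LBA-sorted array, returning the index of the first
       * request that is not a continuation of its predecessor. */
      static unsigned int skip_seq_writes(const struct rq *rqs, unsigned int n)
      {
      	unsigned int i;

      	for (i = 1; i < n; i++)
      		if (!is_seq_write(&rqs[i - 1], &rqs[i]))
      			break;
      	return i;
      }
      ```

      A request that passes is_seq_write() against an earlier request is
      exactly the kind of request the modified deadline_fifo_request() now
      refuses to dispatch ahead of its predecessor.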
      
      With this fix, a sequential write workload executed with the following
      fio command:
      
      fio  --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
           --size=68719476736  --ioengine=libaio --iodepth=32 --rw=write \
           --bs=65536
      
      results in an increase from 225 MB/s to 250 MB/s of the write throughput
      of an SMR HDD (11% increase).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: mq-deadline: Fix dd_finish_request() for zoned devices · 2820e5d0
      Damien Le Moal authored
      
      dd_finish_request() tests if the per prio fifo_list is not empty to
      determine if request dispatching must be restarted for handling blocked
      write requests to zoned devices with a call to
      blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
      2 problems:
      
      1) Only the priority level of the completed request is considered.
         However, writes to a zone may be blocked due to other writes to the
         same zone using a different priority level. While this is unlikely to
         happen in practice, as writing a zone with different IO priorities
         does not make sense, nothing in the code prevents this from
         happening.
      2) The use of list_empty() is dangerous as dd_finish_request() does not
         take dd->lock and may run concurrently with the insert and dispatch
         code.
      
      Fix these 2 problems by testing the write fifo list of all priority
      levels using the new helper dd_has_write_work(), and by testing each
      fifo list using list_empty_careful().
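      The shape of the fix can be sketched in userspace C (a toy list_head
      and an illustrative dd_has_write_work(); the kernel's actual types and
      locking are richer than this):

      ```c
      #include <stdbool.h>

      /* Minimal circular doubly-linked list, in the style of the kernel's
       * struct list_head. */
      struct list_head {
      	struct list_head *next, *prev;
      };

      static void list_init(struct list_head *h)
      {
      	h->next = h->prev = h;
      }

      static void list_add_tail(struct list_head *n, struct list_head *h)
      {
      	n->prev = h->prev;
      	n->next = h;
      	h->prev->next = n;
      	h->prev = n;
      }

      /* Like list_empty_careful(): check both link directions so a list
       * that is concurrently being modified is not misread as empty. */
      static bool list_empty_careful(const struct list_head *h)
      {
      	const struct list_head *next = h->next;

      	return next == h && next == h->prev;
      }

      enum { DD_PRIO_COUNT = 3 };	/* RT, BE, IDLE */

      struct dd_per_prio {
      	struct list_head write_fifo;
      };

      /* Sketch of dd_has_write_work(): consult every priority level, not
       * just the level of the completed request. */
      static bool dd_has_write_work(struct dd_per_prio per_prio[DD_PRIO_COUNT])
      {
      	int prio;

      	for (prio = 0; prio < DD_PRIO_COUNT; prio++)
      		if (!list_empty_careful(&per_prio[prio].write_fifo))
      			return true;
      	return false;
      }
      ```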
      
      Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. Aug 11, 2021
    • Revert "block/mq-deadline: Add cgroup support" · 0f783995
      Tejun Heo authored
      
      This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support")
      and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a
      WARN_ON_ONCE() call"). The added cgroup support has the following issues:
      
      * It breaks cgroup interface file format rule by adding custom elements to a
        nested key-value file.
      
      * It registers mq-deadline as a cgroup-aware policy even though all it's
        doing is collecting per-cgroup stats. Even if we need these stats, this
        isn't the right way to add them.
      
      * It hasn't been reviewed from cgroup side.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  22. Jun 25, 2021
    • blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Jan Kara authored
      
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
      ioc->bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
      To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
      free the merged request but rather leave that up to the caller, similarly
      to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
      to free all the merged requests after dropping bfqd->lock.
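      The locking pattern adopted here can be sketched generically (names
      and types are illustrative stand-ins, not the kernel API; the lock is
      a stand-in for bfqd->lock):

      ```c
      #include <stdlib.h>

      /* Stand-in for spin_lock_irq(&bfqd->lock) / spin_unlock_irq(). */
      struct lock { int held; };
      static void lock_acquire(struct lock *l) { l->held = 1; }
      static void lock_release(struct lock *l) { l->held = 0; }

      struct request {
      	struct request *next;	/* link for the local free list */
      };

      /* The merge path no longer frees the merged request directly; it
       * pushes it onto a caller-provided list instead. */
      static void defer_free(struct request *rq, struct request **free_list)
      {
      	rq->next = *free_list;
      	*free_list = rq;
      }

      /* Caller pattern from bfq_insert_requests(): merge under the lock,
       * free the collected requests only after dropping it, so the freeing
       * path (which takes ioc->lock) never runs inside bfqd->lock. */
      static void insert_and_free(struct lock *bfqd_lock, struct request *rq,
      			    struct request **free_list)
      {
      	lock_acquire(bfqd_lock);
      	defer_free(rq, free_list);	/* stand-in for the merge */
      	lock_release(bfqd_lock);

      	while (*free_list) {		/* lock dropped: safe to free */
      		struct request *next = (*free_list)->next;

      		free(*free_list);
      		*free_list = next;
      	}
      }
      ```

      Deferring the free in this way removes the bfqd->lock -> ioc->lock
      edge from the first trace while leaving the merge logic unchanged.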
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. Jun 21, 2021
    • block/mq-deadline: Prioritize high-priority requests · fb926032
      Bart Van Assche authored
      
      While one or more requests with a certain I/O priority are pending, do not
      dispatch lower priority requests. Dispatch lower priority requests anyway
      after the "aging" time has expired.
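      The selection rule can be sketched as follows (a simplified model:
      one age value per priority level, with 0 meaning the level is empty;
      names are illustrative, not the kernel's):

      ```c
      enum { PRIO_RT, PRIO_BE, PRIO_IDLE, PRIO_COUNT };

      /* Pick the priority level to dispatch from. @age[] holds the
       * timestamp of the oldest pending request per level (0 = empty).
       * Returns -1 if nothing is pending. */
      static int pick_prio(const unsigned long age[PRIO_COUNT],
      		     unsigned long now, unsigned long aging_expire)
      {
      	int prio;

      	/* A request that has waited longer than aging_expire is
      	 * dispatched even if higher-priority requests are pending,
      	 * which prevents starvation of lower priority levels. */
      	for (prio = 0; prio < PRIO_COUNT; prio++)
      		if (age[prio] && now - age[prio] > aging_expire)
      			return prio;

      	/* Otherwise: strict priority order, highest first. */
      	for (prio = 0; prio < PRIO_COUNT; prio++)
      		if (age[prio])
      			return prio;
      	return -1;
      }
      ```

      With a large aging_expire the first loop never fires and dispatch is
      strictly by priority (the 11000 vs 40 IOPS split below); with
      aging_expire set to 0 almost every request is "aged" and the two jobs
      converge (the 6712 vs 6796 IOPS result).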
      
      This patch has been tested as follows:
      
      modprobe scsi_debug ndelay=1000000 max_queue=16 &&
      sd='' &&
      while [ -z "$sd" ]; do
        sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
      done &&
      echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
      cd /sys/fs/cgroup/blkio/ &&
      echo $$ >cgroup.procs &&
      echo restrict-to-be >blkio.prio.class &&
      mkdir -p hipri &&
      cd hipri &&
      echo none-to-rt >blkio.prio.class &&
      { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
      echo $$ >cgroup.procs &&
      max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
      
      Result:
      * 11000 IOPS for the high-priority job
      *    40 IOPS for the low-priority job
      
      If the aging expiry time is reduced from 100 s to 0, the IOPS results
      change to 6712 and 6796 IOPS respectively.
      
      The max-iops script is a script that runs fio with the following arguments:
      --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
      --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
      --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
      --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
      --filename=${positional_argument_1}
      
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/mq-deadline: Add cgroup support · 08a9ad8b
      Bart Van Assche authored
      
      Maintain statistics per cgroup and export these to user space. These
      statistics are essential for verifying whether the proper I/O priorities
      have been assigned to requests. An example of the statistics data with
      this patch applied:
      
      $ cat /sys/fs/cgroup/io.stat
      11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
      8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
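      Each priority class section of such a line is a run of key=value pairs
      following a [CLASS] marker. A minimal parser for one line could look
      like this (an illustrative userspace helper, not a kernel interface):

      ```c
      #include <stdlib.h>
      #include <string.h>

      /* Return the value of @key (e.g. "merged") within the section that
       * follows the class marker @cls (e.g. "[NONE]") of one io.stat line,
       * or -1 if the class or key is not present in that section. */
      static long iostat_class_value(const char *line, const char *cls,
      			       const char *key)
      {
      	const char *p = strstr(line, cls);
      	const char *next;

      	if (!p)
      		return -1;
      	/* The section ends at the next '[' (start of the next class). */
      	next = strchr(p + strlen(cls), '[');

      	p = strstr(p, key);
      	if (!p || (next && p > next))
      		return -1;
      	p += strlen(key);
      	if (*p != '=')
      		return -1;
      	return strtol(p + 1, NULL, 10);
      }
      ```

      Applied to the first sample line above, it would report 171 merged
      requests in the [NONE] class and 0 in the [RT] class.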
      
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>