  1. Oct 18, 2021
  2. Sep 02, 2021
  3. Aug 26, 2021
  4. Aug 24, 2021
  5. Aug 11, 2021
      Revert "block/mq-deadline: Add cgroup support" · 0f783995
      Tejun Heo authored
      
      This reverts commit 08a9ad8b ("block/mq-deadline: Add cgroup support")
      and a follow-up commit c06bc5a3 ("block/mq-deadline: Remove a
      WARN_ON_ONCE() call"). The added cgroup support has the following issues:
      
      * It breaks the cgroup interface file format rules by adding custom
        elements to a nested key-value file.
      
      * It registers mq-deadline as a cgroup-aware policy even though all it's
        doing is collecting per-cgroup stats. Even if we need these stats, this
        isn't the right way to add them.
      
      * It hasn't been reviewed from the cgroup side.
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Aug 09, 2021
  7. Jun 27, 2021
  8. Jun 25, 2021
      blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Jan Kara authored
      
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
      ioc -> bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
      To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
      free the merged request but rather leave that up to the caller, similarly
      to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
      to free all the merged requests after dropping bfqd->lock.
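
      A minimal sketch of the resulting pattern on the bfq side (simplified;
      the list-based blk_mq_sched_try_insert_merge() and the
      blk_mq_free_requests() helper are from this patch, the surrounding body
      is abbreviated):

          static void bfq_insert_request(struct blk_mq_hw_ctx *hctx,
                                         struct request *rq, bool at_head)
          {
                  struct request_queue *q = hctx->queue;
                  struct bfq_data *bfqd = q->elevator->elevator_data;
                  LIST_HEAD(free);

                  spin_lock_irq(&bfqd->lock);
                  if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
                          /*
                           * Merged requests are only collected here; freeing
                           * them (which takes ioc->lock) happens only after
                           * bfqd->lock is dropped, so bfqd->lock is never
                           * held while acquiring ioc->lock.
                           */
                          spin_unlock_irq(&bfqd->lock);
                          blk_mq_free_requests(&free);
                          return;
                  }
                  /* ... normal insertion path ... */
                  spin_unlock_irq(&bfqd->lock);
          }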
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. Jun 21, 2021
  10. May 11, 2021
      kyber: fix out of bounds access when preempted · efed9a33
      Omar Sandoval authored
      
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctxs than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch.
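
      A sketch of the fixed Kyber side under the new ->bio_merge(q, bio,
      nr_segs) prototype (close to the patch in shape; details are
      abbreviated):

          static bool kyber_bio_merge(struct request_queue *q, struct bio *bio,
                                      unsigned int nr_segs)
          {
                  struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
                  /*
                   * Map the hctx from this ctx rather than trusting a
                   * ctx/hctx pair looked up at two different times.
                   */
                  struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
                  struct kyber_hctx_data *khd = hctx->sched_data;
                  struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];
                  unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
                  bool merged;

                  spin_lock(&kcq->lock);
                  merged = blk_bio_list_merge(q, &kcq->rq_list[sched_domain],
                                              bio, nr_segs);
                  spin_unlock(&kcq->lock);

                  return merged;
          }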
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. Apr 16, 2021
      bfq/mq-deadline: remove redundant check for passthrough request · 7687b38a
      Lin Feng authored
      
      Since commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly"), passthrough requests should no longer appear
      in the I/O scheduler, so the blk_rq_is_passthrough() check in add-on
      I/O schedulers is redundant.

      (Note: this patch passes a generic I/O load test with HDDs behind a SAS
      controller and HDDs behind an AHCI controller, but obviously does not
      cover everything. It is not certain whether a passthrough request can
      still escape into the I/O scheduler from blk_mq_sched_insert_requests(),
      which is used by blk_mq_flush_plug_list() and has lots of indirect
      callers.)
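
      For illustration, the mq-deadline hunk of this patch has roughly this
      shape (a reconstruction of the removed guard; since 01e99aec
      passthrough requests are routed to hctx->dispatch before the scheduler
      is ever reached, at_head alone decides):

          -	if (at_head || blk_rq_is_passthrough(rq)) {
          +	if (at_head) {
          		/* ... insert into dd->dispatch ... */
          	} else {
          		/* ... normal sorted/fifo insertion ... */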
      
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. Feb 22, 2021
  13. Jan 25, 2021
  14. Sep 03, 2020
  15. May 29, 2020
  16. Sep 06, 2019
      block: Introduce elevator features · 68c43f13
      Damien Le Moal authored
      
      Introduce the definition of elevator features through the
      elevator_features flags in the elevator_type structure. Each flag can
      represent a feature supported by an elevator. The first feature defined
      by this patch is support for zoned block device sequential write
      constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
      by the mq-deadline elevator using zone write locking.
      
      Other possible features are IO priorities, write hints, latency targets
      or single-LUN dual-actuator disks (for which the elevator could maintain
      one LBA ordered list per actuator).
      
      The required_elevator_features field is also added to the request_queue
      structure to allow a device driver to specify elevator feature flags
      that an elevator must support for the correct operation of the device
      (e.g. device drivers for zoned block devices can have the
      ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
      The helper function blk_queue_required_elevator_features() is
      defined for setting this new field.
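
      A sketch of both sides of the interface (the flag, the field and the
      helper are from this patch; the surrounding code is illustrative):

          /* Elevator side: advertise the feature in the elevator_type. */
          static struct elevator_type mq_deadline = {
                  /* ... ops and attrs ... */
                  .elevator_name = "mq-deadline",
                  .elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
          };

          /* Driver side: require the feature when setting up a zoned queue. */
          blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);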
      
      With these two new fields in place, the elevator functions
      elevator_match() and elevator_find() are modified to allow a user to
      select only an elevator whose set of features satisfies the device's
      required features. Elevators not matching the device requirements are
      not shown in the device sysfs queue/scheduler file, to prevent their use.
      
      The "none" elevator can always be selected as before.
      
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. Sep 03, 2019
      block: mq-deadline: Fix queue restart handling · cb8acabb
      Damien Le Moal authored
      
      Commit 7211aef8 ("block: mq-deadline: Fix write completion
      handling") added a call to blk_mq_sched_mark_restart_hctx() in
      dd_dispatch_request() to make sure that write request dispatching does
      not stall when all target zones are locked. This fix left a subtle race
      when a write completion happens during a dispatch execution on another
      CPU:
      
      CPU 0: Dispatch                           CPU 1: write completion

      dd_dispatch_request()
          lock(&dd->lock);
          ...
          lock(&dd->zone_lock);                 dd_finish_request()
          rq = find request                     lock(&dd->zone_lock);
          unlock(&dd->zone_lock);
                                                zone write unlock
                                                unlock(&dd->zone_lock);
                                                ...
                                                __blk_mq_free_request()
                                                    check restart flag (not set)
                                                    -> queue not run
          ...
          if (!rq && have writes)
              blk_mq_sched_mark_restart_hctx()
          unlock(&dd->lock)
      
      Since the dispatch context finishes after the write request completion
      handling, marking the queue as needing a restart is not seen by
      __blk_mq_free_request(), and blk_mq_sched_restart() is not executed,
      leading to the dispatch stall under 100% write workloads.
      
      Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from
      dd_dispatch_request() into dd_finish_request() under the zone lock to
      ensure full mutual exclusion between write request dispatch selection
      and zone unlock on write request completion.
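
      A sketch of where the mark ends up (simplified from the patch; the
      function and helper names are real, the body is abbreviated):

          static void dd_finish_request(struct request *rq)
          {
                  struct request_queue *q = rq->q;
                  struct deadline_data *dd = q->elevator->elevator_data;

                  if (blk_queue_is_zoned(q)) {
                          unsigned long flags;

                          spin_lock_irqsave(&dd->zone_lock, flags);
                          blk_req_zone_write_unlock(rq);
                          /*
                           * Done under dd->zone_lock, so it cannot race with
                           * the zone-locked request selection in dispatch.
                           */
                          if (!list_empty(&dd->fifo_list[WRITE]))
                                  blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
                          spin_unlock_irqrestore(&dd->zone_lock, flags);
                  }
          }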
      
      Fixes: 7211aef8 ("block: mq-deadline: Fix write completion handling")
      Cc: stable@vger.kernel.org
      Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
      Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. Jul 15, 2019
  19. Jun 20, 2019
      block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Christoph Hellwig authored
      
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and instead return it from a variant
      of blk_queue_split so that it can be passed as an argument to those
      functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this hot path down, remove the
      pointless struct request_queue argument from any of the functions that
      had it and grew a nr_segs argument.
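
      A sketch of the resulting submission-path convention (assuming the
      post-patch __blk_queue_split() and blk_mq_sched_bio_merge() prototypes;
      the surrounding code is abbreviated):

          unsigned int nr_segs;

          /* Split once and get the segment count back by reference. */
          __blk_queue_split(q, &bio, &nr_segs);

          /* Hand the count to callees instead of recounting from the bio. */
          if (blk_mq_sched_bio_merge(q, bio, nr_segs))
                  return BLK_QC_T_NONE;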
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. Apr 30, 2019
  21. Dec 17, 2018
      block: mq-deadline: Fix write completion handling · 7211aef8
      Damien Le Moal authored
      
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() is not called. This in turn
      prevents the call to blk_mq_sched_restart() from __blk_mq_free_request()
      from running the queue when the already dispatched write request
      completes. The newly inserted request then stays stuck in the scheduler
      queue until another request is eventually submitted.
      
      This problem does not affect SCSI disks, as the SCSI stack handles queue
      restarts on request completion. However, it can be triggered with the
      nullblk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
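
      A sketch of the fix (close in shape to the patch; the split into
      __dd_dispatch_request() follows mq-deadline of that era):

          static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
          {
                  struct deadline_data *dd = hctx->queue->elevator->elevator_data;
                  struct request *rq;

                  spin_lock(&dd->lock);
                  rq = __dd_dispatch_request(dd);
                  /*
                   * If nothing was dispatched while writes are queued, they
                   * are all waiting on locked zones; mark the hctx so that a
                   * write completion reruns the queue.
                   */
                  if (!rq && blk_queue_is_zoned(hctx->queue) &&
                      !list_empty(&dd->fifo_list[WRITE]))
                          blk_mq_sched_mark_restart_hctx(hctx);
                  spin_unlock(&dd->lock);

                  return rq;
          }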
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      
      Add missing export of blk_mq_sched_restart()
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  22. Nov 07, 2018
  23. May 24, 2018