  1. Dec 02, 2023
  2. Apr 13, 2023
  3. Feb 03, 2023
  4. Sep 22, 2022
  5. Jul 20, 2022
    • block: don't allow the same type rq_qos add more than once · 14a6e2eb
      Jinke Han authored
      
      In our testing of iocost, we encountered list add/del corruptions of the
      inner_walk list in ioc_timer_fn.
      
      The reason can be described as follows:
      
      cpu 0					cpu 1
      ioc_qos_write				ioc_qos_write
      
      ioc = q_to_ioc(queue);
      if (!ioc) {
              ioc = kzalloc();
      					ioc = q_to_ioc(queue);
      					if (!ioc) {
      						ioc = kzalloc();
      						...
      						rq_qos_add(q, rqos);
      					}
              ...
              rq_qos_add(q, rqos);
              ...
      }
      
      When the io.cost.qos file is written by two CPUs concurrently, rq_qos may
      be added to one disk twice. In that case, there will be two iocs enabled
      and running on one disk. They own different iocgs on their active lists.
      In ioc_timer_fn, because the iocgs from the two iocs share the same root
      iocg, the root iocg's walk_list may be overwritten by each other, and
      this leads to list add/del corruptions when building or destroying the
      inner_walk list.
      
      So far, the blk-rq-qos framework only works if there is at most one
      instance of each rq_qos type per queue. This patch makes that assumption
      explicit and also fixes the crash above.
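      
      As a rough illustration, a minimal userspace sketch of the duplicate-type
      check this patch adds to rq_qos_add() could look like the following; the
      types are simplified stand-ins for the real kernel structures, and the
      queue locking is omitted:
      
        #include <errno.h>
        #include <stddef.h>
      
        struct rq_qos {
                int id;                 /* rq_qos type: wbt, latency, cost, ... */
                struct rq_qos *next;
        };
      
        struct request_queue {
                struct rq_qos *rq_qos;  /* singly linked chain of active policies */
        };
      
        /* Reject a second instance of the same type; in the kernel this walk
         * happens under the queue lock. */
        static int rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
        {
                struct rq_qos *cur;
      
                for (cur = q->rq_qos; cur; cur = cur->next)
                        if (cur->id == rqos->id)
                                return -EBUSY;
      
                rqos->next = q->rq_qos;
                q->rq_qos = rqos;
                return 0;
        }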
      
      Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Jun 17, 2022
  7. Mar 14, 2022
    • block: fix rq-qos breakage from skipping rq_qos_done_bio() · aa1b46dc
      Tejun Heo authored
      
      a647a524 ("block: don't call rq_qos_ops->done_bio if the bio isn't
      tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
      While this fixed a potential oops, it also broke blk-iocost by skipping the
      done_bio callback for merged bios.
      
      Before, whether a bio went through rq_qos_throttle() or rq_qos_merge(),
      rq_qos_done_bio() would be called on the bio on completion, with
      BIO_TRACKED distinguishing the former from the latter. Now,
      rq_qos_done_bio() is no longer called for bios which went through
      rq_qos_merge(). This royally confuses blk-iocost, as the merged bios
      never finish and are considered perpetually in-flight.
      
      One reliably reproducible failure mode is an intermediate cgroup getting
      stuck active, preventing its children from being activated due to the
      leaf-only rule and leading to loss of control. The following is from the
      resctl-bench protection scenario, which emulates isolating a
      web-server-like workload from a memory bomb, run on an iocost
      configuration which should yield a reasonable level of protection.
      
        # cat /sys/block/nvme2n1/device/model
        Samsung SSD 970 PRO 512GB
        # cat /sys/fs/cgroup/io.cost.model
        259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
        # cat /sys/fs/cgroup/io.cost.qos
        259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
        # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
        ...
        Memory Hog Summary
        ==================
      
        IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
                    W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m
      
        Isolation and Request Latency Impact Distributions:
      
                      min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
        isol%       15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
        lat-imp%        0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6
      
        Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%
      
      The isolation result of 58.12% is close to what this device would show
      without any IO control.
      
      Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and
      calling rq_qos_done_bio() on them too. For consistency and clarity, rename
      BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
      rq_qos_done_bio() so that they sit next to the code paths that set the flags.
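      
      A minimal sketch of the resulting shape, using simplified stand-in types
      rather than the real struct bio and rq_qos plumbing:
      
        #include <stdbool.h>
      
        enum {
                BIO_QOS_THROTTLED = 1 << 0,  /* went through rq_qos_throttle() */
                BIO_QOS_MERGED    = 1 << 1,  /* went through rq_qos_merge() */
        };
      
        struct bio { unsigned int flags; };
      
        /* After the patch, the flag test lives in rq_qos_done_bio() itself and
         * merged bios are completed too, so blk-iocost sees them finish. */
        static bool rq_qos_done_bio(struct bio *bio)
        {
                if (!(bio->flags & (BIO_QOS_THROTTLED | BIO_QOS_MERGED)))
                        return false;  /* bio never entered the rq-qos path */
                /* ... invoke each policy's ->done_bio callback here ... */
                return true;
        }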
      
      With the patch applied, the same benchmark shows:
      
        # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
        ...
        Memory Hog Summary
        ==================
      
        IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
                    W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m
      
        Isolation and Request Latency Impact Distributions:
      
                      min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
        isol%       84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
        lat-imp%        0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68
      
        Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a647a524 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
      Cc: stable@vger.kernel.org # v5.15+
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. Oct 18, 2021
  9. Jun 21, 2021
  10. Jun 16, 2021
    • block: fix race between adding/removing rq qos and normal IO · 2cafe29a
      Ming Lei authored
      
      Yi reported several kernel panics on:
      
      [16687.001777] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
      ...
      [16687.163549] pc : __rq_qos_track+0x38/0x60
      
      or
      
      [  997.690455] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      ...
      [  997.850347] pc : __rq_qos_done+0x2c/0x50
      
      It turns out this is caused by a race between adding an rq qos (wbt) and
      normal IO, because rq_qos_add() can run while IO is being submitted. Fix
      the issue by freezing the queue before adding/deleting an rq qos to/from
      the queue.
      
      rq_qos_exit() doesn't need to freeze the queue because it is called
      after the queue has already been frozen.
      
      iolatency calls rq_qos_add() while allocating the queue, so freezing
      won't add delay because the queue usage refcount works in atomic mode at
      that time.
      
      iocost calls rq_qos_add() when the cgroup attribute file is written. It
      is fine to freeze the queue at that point, since we usually freeze the
      queue when storing to a queue sysfs attribute, and the iocost attribute
      only exists in the root cgroup anyway.
      
      wbt_init() calls it from blk_register_queue() and from the queue sysfs
      attribute store path (queue_wb_lat_store(), when written for the first
      time in case of !BLK_WBT_MQ); the following patch will speed up the
      queue freezing in wbt_init().
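      
      The shape of the fix, as a hedged sketch; the stand-in types and freeze
      stubs below only model the kernel's request_queue and
      blk_mq_freeze_queue()/blk_mq_unfreeze_queue():
      
        struct rq_qos { struct rq_qos *next; };
        struct request_queue { struct rq_qos *rq_qos; int frozen; };
      
        static void blk_mq_freeze_queue(struct request_queue *q)   { q->frozen = 1; }
        static void blk_mq_unfreeze_queue(struct request_queue *q) { q->frozen = 0; }
      
        static void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
        {
                /* Freezing drains in-flight IO and blocks new submissions, so
                 * no bio can observe a half-linked rq_qos chain. */
                blk_mq_freeze_queue(q);
                rqos->next = q->rq_qos;
                q->rq_qos = rqos;
                blk_mq_unfreeze_queue(q);
        }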
      
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Link: https://lore.kernel.org/r/20210609015822.103433-2-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. Oct 15, 2019
  12. Oct 06, 2019
  13. Aug 29, 2019
    • blkcg: implement blk-iocost · 7caa4715
      Tejun Heo authored
      
      This patchset implements a work-conserving proportional IO controller
      based on an IO cost model.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model the better, but the controller adapts based on IO completion
      latency, and as long as the relative costs across different IO
      patterns are consistent and sensible, it will adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
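      
      Purely as an illustration of the idea (the formula and parameter usage
      below are assumptions for this sketch, not the controller's actual
      fixed-point arithmetic), a linear model can estimate the device time an
      IO consumes from per-device parameters like those in io.cost.model:
      
        #include <stdint.h>
        #include <stdio.h>
      
        /* Estimated device-seconds consumed by one random read of `bytes`
         * bytes: a fixed per-IO cost plus a size-proportional cost. */
        static double linear_read_cost(uint64_t bytes, double rbps,
                                       double rrandiops)
        {
                return 1.0 / rrandiops + (double)bytes / rbps;
        }
      
        int main(void)
        {
                /* rbps/rrandiops values from the io.cost.model line quoted
                 * in the 2022 commit above. */
                double c = linear_read_cost(64 * 1024, 834913556.0, 102913.0);
      
                printf("estimated cost: %.9f device-seconds\n", c);
                return 0;
        }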
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ · beab17fc
      Tejun Heo authored
      
      io.weight is gonna be another rq_qos cgroup mechanism.  Let's rename
      RQ_QOS_CGROUP which is being used by io.latency to RQ_QOS_LATENCY in
      preparation.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/rq_qos: implement rq_qos_ops->queue_depth_changed() · 9677a3e0
      Tejun Heo authored
      
      wbt already gets queue depth changed notification through
      wbt_set_queue_depth().  Generalize it into
      rq_qos_ops->queue_depth_changed() so that other rq_qos policies can
      easily hook into the events too.
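      
      A rough sketch of the generalized hook's shape, with simplified stand-in
      types; the real kernel walks the per-queue rq_qos chain the same way:
      
        struct rq_qos;
      
        struct rq_qos_ops {
                void (*queue_depth_changed)(struct rq_qos *rqos);
        };
      
        struct rq_qos {
                const struct rq_qos_ops *ops;
                struct rq_qos *next;
        };
      
        /* Notify every registered policy that the queue depth changed;
         * policies without the callback are simply skipped. */
        static void rq_qos_queue_depth_changed(struct rq_qos *chain)
        {
                struct rq_qos *rqos;
      
                for (rqos = chain; rqos; rqos = rqos->next)
                        if (rqos->ops->queue_depth_changed)
                                rqos->ops->queue_depth_changed(rqos);
        }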
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/rq_qos: add rq_qos_merge() · d3e65fff
      Tejun Heo authored
      
      Add a merge hook for rq_qos.  This will be used by io.weight.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. Apr 30, 2019
  15. Dec 17, 2018
    • block: fix blk-iolatency accounting underflow · 13369816
      Dennis Zhou authored
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1] to tag all bios with blkgs undermined how
      iolatency determined which bios it was charging and should process in
      rq_qos_done_bio(). Because all bios are now tagged, the atomic_t
      inflight count in struct rq_wait underflowed, resulting in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
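      
      A simplified sketch of that check, with stand-in types (the real code
      tests a bit in bio->bi_flags):
      
        enum { BIO_TRACKED = 1 << 0 };  /* bio went through the rq_qos path */
      
        struct bio { unsigned int flags; };
      
        /* blk-iolatency's completion handler bails out unless the bio was
         * marked at throttle time, so merely blkg-tagged bios are ignored
         * and the rq_wait inflight count can no longer underflow. */
        static void blkcg_iolatency_done_bio(struct bio *bio)
        {
                if (!(bio->flags & BIO_TRACKED))
                        return;
                /* ... decrement inflight count, record latency ... */
        }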
      
      Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but that flag is set for all bios
      that have gone through the blk-throttle code. Overloading a flag comes
      with the burden of making sure that when either implementation changes,
      a change in the setting rules for one doesn't cause a bug in the other.
      So here, we unfortunately opt for adding a new flag.
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq-debugfs: support rq_qos · cc56694f
      Ming Lei authored
      
      blk-mq-debugfs has proved very helpful for debugging some tough issues,
      such as IO hangs.
      
      We have seen blk-wbt-related IO hangs several times; there is even an
      unsolved report in the Red Hat BZ. So this patch adds debugfs support
      to rq_qos.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. Dec 08, 2018
    • block: add rq_qos_wait to rq_qos · 84f60324
      Josef Bacik authored
      
      Originally when I split out the common code from blk-wbt into rq_qos I
      left the wbt_wait() where it was and simply copied and modified it
      slightly to work for io-latency.  However, they are both basically the
      same thing, and as time has gone on wbt_wait() has ended up much smarter
      and kinder than it was when I copied it into io-latency, which means
      io-latency has lost out on these improvements.
      
      Since they are essentially the same thing except for a few minor
      details, create rq_qos_wait() to replicate what wbt_wait() currently
      does, with callbacks that can be passed in for the snowflakes to do
      their own thing as appropriate.
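      
      The resulting helper's shape, as a hedged sketch; the callback names
      approximate the real parameters, the types are stand-ins, and the
      waitqueue slow path is elided:
      
        #include <stdbool.h>
      
        struct rq_wait { int inflight; int limit; };  /* simplified stand-in */
      
        typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *data);
        typedef void (cleanup_cb_t)(struct rq_wait *rqw, void *data);
      
        /* Shared wait loop: policy-specific behavior is injected through the
         * two callbacks instead of each policy keeping its own copy. */
        static void rq_qos_wait(struct rq_wait *rqw, void *data,
                                acquire_inflight_cb_t *acquire_inflight_cb,
                                cleanup_cb_t *cleanup_cb)
        {
                /* Fast path: try to take an inflight slot without sleeping. */
                if (acquire_inflight_cb(rqw, data))
                        return;
      
                /* Slow path (elided here): join rqw's waitqueue and retry
                 * the callback until it succeeds; cleanup_cb lets the policy
                 * undo its state when a wait is abandoned. */
                cleanup_cb(rqw, data);
        }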
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. Nov 16, 2018
  18. Nov 15, 2018
  19. Jul 22, 2018
  20. Jul 09, 2018