  5. Apr 13, 2023
    • blkcg: Restructure blkg_conf_prep() and friends · faffaab2
      Tejun Heo authored
      
       We want to support lazy init of rq-qos policies so that iolatency is enabled
       lazily on configuration instead of at gendisk initialization. The way the
       blkg config helpers are structured now is a bit awkward for that. Let's
       restructure:
      
      * blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_
        prefix was used because the bdev opening step is blkg-independent.
        However, the distinction is too subtle and confuses more than helps. Let's
        switch to blkg prefix so that it's consistent with the type and other
        helper names.
      
      * struct blkg_conf_ctx now remembers the original input string and is always
        initialized by the new blkg_conf_init().
      
      * blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like
        blkg_conf_prep() and can be called multiple times safely. Instead of
        modifying the double pointer to input string directly,
        blkg_conf_open_bdev() now sets blkg_conf_ctx->body.
      
      * blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now
        must be called on all blkg_conf_ctx's which were initialized with
        blkg_conf_init().
      
       Combined, this allows the users to either open the bdev first or do it all
       at once with blkg_conf_prep(), which will help implement lazy init of
       rq-qos policies.
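
A sketch of the calling convention described above, as a hypothetical config handler; the helper signatures are inferred from this description, and the handler name and body are invented for illustration, not taken from the patch:

```c
/* Hypothetical config handler sketching the new blkg_conf_* convention. */
static int hypothetical_conf_write(struct blkcg *blkcg,
				   const struct blkcg_policy *pol, char *input)
{
	struct blkg_conf_ctx ctx;
	int ret;

	blkg_conf_init(&ctx, input);

	/* Optional: open the bdev early; safe to call multiple times. */
	ret = blkg_conf_open_bdev(&ctx);
	if (ret)
		goto out;

	/* Or do it all at once: prep opens the bdev if not already open. */
	ret = blkg_conf_prep(blkcg, pol, &ctx);
	if (ret)
		goto out;

	/* ctx.blkg and ctx.body are valid here; apply the configuration. */

out:
	/* Must be called on every initialized ctx, success or failure. */
	blkg_conf_exit(&ctx);
	return ret;
}
```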
      
       blkg_conf_init/exit() will also be used to implement synchronization against
       device removal. This is necessary because iolat / iocost are configured
       through cgroupfs instead of one of the files under /sys/block/DEVICE. As
       cgroupfs operations aren't synchronized with the block layer, the lazy init
       and other configuration operations may race against device removal. This
       patch makes blkg_conf_init/exit() used consistently for all
       cgroup-originating configurations, making them a good place to implement
       explicit synchronization.
      
      Users are updated accordingly. No behavior change is intended by this patch.
      
      v2: bfq wasn't updated in v1 causing a build error. Fixed.
      
      v3: Update the description to include future use of blkg_conf_init/exit() as
          synchronization points.
      
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Cc: Josef Bacik <josef@toxicpanda.com>
       Cc: Christoph Hellwig <hch@lst.de>
       Cc: Yu Kuai <yukuai1@huaweicloud.com>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. Jan 29, 2023
    • block, bfq: inject I/O to underutilized actuators · 2d31c684
      Davide Zini authored
      
      The main service scheme of BFQ for sync I/O is serving one sync
      bfq_queue at a time, for a while. In particular, BFQ enforces this
      scheme when it deems the latter necessary to boost throughput or
      to preserve service guarantees. Unfortunately, when BFQ enforces
      this policy, only one actuator at a time gets served for a while,
      because each bfq_queue contains I/O only for one actuator. The
      other actuators may remain underutilized.
      
      Actually, BFQ may serve (inject) extra I/O, taken from other
      bfq_queues, in parallel with that of the in-service queue. This
      injection mechanism may provide the ground for dealing also with
      the above actuator-underutilization problem. Yet BFQ does not take
      the actuator load into account when choosing which queue to pick
      extra I/O from. In addition, BFQ may happen to inject extra I/O
      only when the in-service queue is temporarily empty.
      
      In view of these facts, this commit extends the
      injection mechanism in such a way that the latter:
      (1) takes into account also the actuator load;
      (2) checks such a load on each dispatch, and injects I/O for an
          underutilized actuator, if there is one and there is I/O for it.
      
      To perform the check in (2), this commit introduces a load
      threshold, currently set to 4.  A linear scan of each actuator is
      performed, until an actuator is found for which the following two
      conditions hold: the load of the actuator is below the threshold,
      and there is at least one non-in-service queue that contains I/O
      for that actuator. If such a pair (actuator, queue) is found, then
      the head request of that queue is returned for dispatch, instead
      of the head request of the in-service queue.
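
The linear scan in step (2) can be modeled in userspace as follows; the function and array names are invented for illustration, and only the threshold value 4 comes from the text (BFQ's actual data structures differ):

```c
#include <assert.h>

#define INJECTION_THRESHOLD 4	/* load threshold from the commit message */

/*
 * Model of the linear scan: return the index of the first actuator whose
 * in-flight load is below the threshold and for which some non-in-service
 * queue has pending I/O, or -1 if no such actuator exists.
 */
static int pick_underutilized_actuator(const int load[],
				       const int pending[], int nr)
{
	for (int i = 0; i < nr; i++)
		if (load[i] < INJECTION_THRESHOLD && pending[i] > 0)
			return i;
	return -1;
}
```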
      
       We have set the threshold, empirically, to the minimum possible
       value for which an actuator is fully utilized, or close to
       fully utilized. By doing so, injected I/O 'steals' as few
       drive-queue slots as possible from the in-service queue. This
       reduces as much as possible the probability that the service of
       I/O from the in-service bfq_queue gets delayed because of slot
       exhaustion, i.e., because all the slots of the drive queue are
       filled with I/O injected from other queues (NCQ provides for 32
       slots).
      
       This new mechanism also counters actuator underutilization in the
       case of asymmetric configurations of bfq_queues: namely, when there
       are few bfq_queues containing I/O for some actuators and many
       bfq_queues containing I/O for other actuators, or when the
       bfq_queues containing I/O for some actuators have lower weights
       than the other bfq_queues.
      
       Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
       Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
       Signed-off-by: Davide Zini <davidezini2@gmail.com>
       Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: split sync bfq_queues on a per-actuator basis · 9778369a
      Paolo Valente authored
      Single-LUN multi-actuator SCSI drives, as well as all multi-actuator
      SATA drives, appear as a single device to the I/O subsystem [1]. Yet
      they address commands to different actuators internally, as a function
      of the logical block address (LBA). A given sector is reachable by
      only one of the actuators. For example, Seagate's Serial Advanced
      Technology Attachment (SATA) version contains two actuators and maps
      the lower half of the SATA LBA space to the lower actuator and the
      upper half to the upper actuator.
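
The address split in the dual-actuator example can be sketched with a toy model; the function name, capacity, and equal-range split are illustrative assumptions (real drives advertise their actuator ranges to the host, which this model ignores):

```c
#include <assert.h>

/*
 * Toy model: split the LBA space into equal contiguous ranges, one per
 * actuator, as in the dual-actuator SATA example above.
 */
static int lba_to_actuator(unsigned long long lba,
			   unsigned long long capacity, int nr_actuators)
{
	unsigned long long per_range = capacity / nr_actuators;
	int idx = (int)(lba / per_range);

	/* clamp any trailing partial range onto the last actuator */
	return idx < nr_actuators ? idx : nr_actuators - 1;
}
```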
      
      Evidently, to fully utilize actuators, no actuator must be left idle
      or underutilized while there is pending I/O for it. The block layer
      must somehow control the load of each actuator individually. This
      commit lays the ground for allowing BFQ to provide such a per-actuator
      control.
      
      BFQ associates an I/O-request sync bfq_queue with each process doing
      synchronous I/O, or with a group of processes, in case of queue
      merging. Then BFQ serves one bfq_queue at a time. While in service, a
      bfq_queue is emptied in request-position order. Yet the same process,
      or group of processes, may generate I/O for different actuators. In
      this case, different streams of I/O (each for a different actuator)
      get all inserted into the same sync bfq_queue. So there is basically
      no individual control on when each stream is served, i.e., on when the
      I/O requests of the stream are picked from the bfq_queue and
      dispatched to the drive.
      
      This commit enables BFQ to control the service of each actuator
      individually for synchronous I/O, by simply splitting each sync
      bfq_queue into N queues, one for each actuator. In other words, a sync
      bfq_queue is now associated with a pair (process, actuator). As a
      consequence of this split, the per-queue proportional-share policy
      implemented by BFQ will guarantee that the sync I/O generated for each
      actuator, by each process, receives its fair share of service.
      
      This is just a preparatory patch. If the I/O of the same process
      happens to be sent to different queues, then each of these queues may
      undergo queue merging. To handle this event, the bfq_io_cq data
      structure must be properly extended. In addition, stable merging must
      be disabled to avoid loss of control on individual actuators. Finally,
      also async queues must be split. These issues are described in detail
      and addressed in the next commits. As for this commit, although multiple
      per-process bfq_queues are provided, the I/O of each process or group
      of processes is still sent to only one queue, regardless of the
      actuator the I/O is for. The forwarding to distinct bfq_queues will be
      enabled after addressing the above issues.
      
      [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
      
      
      
       Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
       Signed-off-by: Gabriele Felici <felicigb@gmail.com>
       Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com>
       Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
       Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. Nov 08, 2022
    • block, bfq: fix null pointer dereference in bfq_bio_bfqg() · f02be900
      Yu Kuai authored
      
       Our tests found the following problem in kernel 5.10; the same problem
       should exist in mainline:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000094
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP
      CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4
      Workqueue: kthrotld blk_throtl_dispatch_work_fn
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       bfq_bic_update_cgroup+0x3c/0x350
       ? ioc_create_icq+0x42/0x270
       bfq_init_rq+0xfd/0x1060
       bfq_insert_requests+0x20f/0x1cc0
       ? ioc_create_icq+0x122/0x270
       blk_mq_sched_insert_requests+0x86/0x1d0
       blk_mq_flush_plug_list+0x193/0x2a0
       blk_flush_plug_list+0x127/0x170
       blk_finish_plug+0x31/0x50
       blk_throtl_dispatch_work_fn+0x151/0x190
       process_one_work+0x27c/0x5f0
       worker_thread+0x28b/0x6b0
       ? rescuer_thread+0x590/0x590
       kthread+0x153/0x1b0
       ? kthread_flush_work+0x170/0x170
       ret_from_fork+0x1f/0x30
      Modules linked in:
      CR2: 0000000000000094
      ---[ end trace e2e59ac014314547 ]---
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Root cause is quite complex:
      
      1) use bfq elevator for the test device.
      2) create a cgroup CG
      3) config blk throtl in CG
      
         blkg_conf_prep
          blkg_create
      
      4) create a thread T1 and issue async io in CG:
      
         bio_init
          bio_associate_blkg
         ...
         submit_bio
          submit_bio_noacct
           blk_throtl_bio -> io is throttled
           // io submit is done
      
      5) switch elevator:
      
         bfq_exit_queue
          blkcg_deactivate_policy
           list_for_each_entry(blkg, &q->blkg_list, q_node)
            blkg->pd[] = NULL
            // bfq policy is removed
      
       6) thread T1 exits, then remove the cgroup CG:

          blkcg_unpin_online
           blkcg_destroy_blkgs
            blkg_destroy
             list_del_init(&blkg->q_node)
             // blkg is removed from queue list

       7) switch elevator back to bfq:

        bfq_init_queue
         bfq_create_group_hierarchy
          blkcg_activate_policy
           list_for_each_entry_reverse(blkg, &q->blkg_list)
            // blkg was removed from the list, hence bfq policy is still NULL

       8) throttled io is dispatched to bfq:

        bfq_insert_requests
         bfq_init_rq
          bfq_bic_update_cgroup
           bfq_bio_bfqg
            bfqg = blkg_to_bfqg(blkg)
            // bfqg is NULL because bfq policy is NULL
      
       The problem is only possible in bfq because only bfq can be deactivated and
       activated while the queue is online, while other policies can only be
       deactivated when the device is removed.

       Fix the problem in bfq by checking whether the blkg is online before calling
       blkg_to_bfqg().
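
A hedged sketch of the shape of such a check (not the verbatim patch; the surrounding lookup loop and fallback are elided and approximated):

```c
/* Sketch: only trust blkg-derived policy data while the blkg is online,
 * since an offline blkg may already have had its policy data cleared. */
if (blkg->online) {
	bfqg = blkg_to_bfqg(blkg);
	if (bfqg)
		return bfqg;
}
/* otherwise fall back, e.g., to the root group */
```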
      
       Signed-off-by: Yu Kuai <yukuai3@huawei.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
       Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. Feb 18, 2022
    • block, bfq: don't move oom_bfqq · 8410f709
      Yu Kuai authored
      
       Our tests reported a UAF:
      
      [ 2073.019181] ==================================================================
      [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
      [ 2073.019192]
      [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
      [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [ 2073.019200] Call trace:
      [ 2073.019203]  dump_backtrace+0x0/0x310
      [ 2073.019206]  show_stack+0x28/0x38
      [ 2073.019210]  dump_stack+0xec/0x15c
      [ 2073.019216]  print_address_description+0x68/0x2d0
      [ 2073.019220]  kasan_report+0x238/0x2f0
      [ 2073.019224]  __asan_store8+0x88/0xb0
      [ 2073.019229]  __bfq_put_async_bfqq+0xa0/0x168
      [ 2073.019233]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019236]  bfq_pd_offline+0x178/0x238
      [ 2073.019240]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019244]  bfq_exit_queue+0x128/0x178
      [ 2073.019249]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019252]  elevator_exit+0xc8/0xd0
      [ 2073.019256]  blk_exit_queue+0x50/0x88
      [ 2073.019259]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019267]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019274]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019278]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019282]  el0_svc_common+0xc8/0x320
      [ 2073.019287]  el0_svc_handler+0xf8/0x160
      [ 2073.019290]  el0_svc+0x10/0x218
      [ 2073.019291]
      [ 2073.019294] Allocated by task 14163:
      [ 2073.019301]  kasan_kmalloc+0xe0/0x190
      [ 2073.019305]  kmem_cache_alloc_node_trace+0x1cc/0x418
      [ 2073.019308]  bfq_pd_alloc+0x54/0x118
      [ 2073.019313]  blkcg_activate_policy+0x250/0x460
      [ 2073.019317]  bfq_create_group_hierarchy+0x38/0x110
      [ 2073.019321]  bfq_init_queue+0x6d0/0x948
      [ 2073.019325]  blk_mq_init_sched+0x1d8/0x390
      [ 2073.019330]  elevator_switch_mq+0x88/0x170
      [ 2073.019334]  elevator_switch+0x140/0x270
      [ 2073.019338]  elv_iosched_store+0x1a4/0x2a0
      [ 2073.019342]  queue_attr_store+0x90/0xe0
      [ 2073.019348]  sysfs_kf_write+0xa8/0xe8
      [ 2073.019351]  kernfs_fop_write+0x1f8/0x378
      [ 2073.019359]  __vfs_write+0xe0/0x360
      [ 2073.019363]  vfs_write+0xf0/0x270
      [ 2073.019367]  ksys_write+0xdc/0x1b8
      [ 2073.019371]  __arm64_sys_write+0x50/0x60
      [ 2073.019375]  el0_svc_common+0xc8/0x320
      [ 2073.019380]  el0_svc_handler+0xf8/0x160
      [ 2073.019383]  el0_svc+0x10/0x218
      [ 2073.019385]
      [ 2073.019387] Freed by task 72584:
      [ 2073.019391]  __kasan_slab_free+0x120/0x228
      [ 2073.019394]  kasan_slab_free+0x10/0x18
      [ 2073.019397]  kfree+0x94/0x368
      [ 2073.019400]  bfqg_put+0x64/0xb0
      [ 2073.019404]  bfqg_and_blkg_put+0x90/0xb0
      [ 2073.019408]  bfq_put_queue+0x220/0x228
      [ 2073.019413]  __bfq_put_async_bfqq+0x98/0x168
      [ 2073.019416]  bfq_put_async_queues+0xbc/0x208
      [ 2073.019420]  bfq_pd_offline+0x178/0x238
      [ 2073.019424]  blkcg_deactivate_policy+0x1f0/0x420
      [ 2073.019429]  bfq_exit_queue+0x128/0x178
      [ 2073.019433]  blk_mq_exit_sched+0x12c/0x160
      [ 2073.019437]  elevator_exit+0xc8/0xd0
      [ 2073.019440]  blk_exit_queue+0x50/0x88
      [ 2073.019443]  blk_cleanup_queue+0x228/0x3d8
      [ 2073.019451]  null_del_dev+0xfc/0x1e0 [null_blk]
      [ 2073.019459]  null_exit+0x90/0x114 [null_blk]
      [ 2073.019462]  __arm64_sys_delete_module+0x358/0x5a0
      [ 2073.019467]  el0_svc_common+0xc8/0x320
      [ 2073.019471]  el0_svc_handler+0xf8/0x160
      [ 2073.019474]  el0_svc+0x10/0x218
      [ 2073.019475]
      [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
       which belongs to the cache kmalloc-1024 of size 1024
      [ 2073.019484] The buggy address is located 552 bytes inside of
       1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
      [ 2073.019486] The buggy address belongs to the page:
      [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
      [ 2073.020123] flags: 0x7ffff0000008100(slab|head)
      [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
      [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
      [ 2073.020411] page dumped because: kasan: bad access detected
      [ 2073.020412]
      [ 2073.020414] Memory state around the buggy address:
      [ 2073.020420]  ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020424]  ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020430]                                   ^
      [ 2073.020434]  ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020438]  ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [ 2073.020439] ==================================================================
      
       The same problem exists in mainline as well.

       This happens because oom_bfqq is moved to a non-root group, and thus
       root_group is freed earlier.

       Fix the problem by never moving oom_bfqq.
      
       Signed-off-by: Yu Kuai <yukuai3@huawei.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
       Acked-by: Paolo Valente <paolo.valente@linaro.org>
       Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: avoid moving bfqq to its parent bfqg · c5e4cb0f
      Yu Kuai authored
      
       Moving bfqq to its parent bfqg is pointless.
      
       Signed-off-by: Yu Kuai <yukuai3@huawei.com>
       Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.com
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. Mar 25, 2021
    • block, bfq: merge bursts of newly-created queues · 430a67f9
      Paolo Valente authored
      
      Many throughput-sensitive workloads are made of several parallel I/O
      flows, with all flows generated by the same application, or more
      generically by the same task (e.g., system boot). The most
      counterproductive action with these workloads is plugging I/O dispatch
      when one of the bfq_queues associated with these flows remains
      temporarily empty.
      
      To avoid this plugging, BFQ has been using a burst-handling mechanism
      for years now. This mechanism has proven effective for throughput, and
      not detrimental for service guarantees. This commit pushes this
       mechanism a little bit further, based on the following two facts.
      
       First, all the I/O flows of the same application or task contribute
      to the execution/completion of that common application or task. So the
      performance figures that matter are total throughput of the flows and
      task-wide I/O latency.  In particular, these flows do not need to be
      protected from each other, in terms of individual bandwidth or
      latency.
      
      Second, the above fact holds regardless of the number of flows.
      
       Putting these two facts together, this commit stably merges the
       bfq_queues associated with these I/O flows, i.e., with the processes
       that generate these I/O flows, regardless of how many processes are
       involved.
      
      To decide whether a set of bfq_queues is actually associated with the
      I/O flows of a common application or task, and to merge these queues
      stably, this commit operates as follows: given a bfq_queue, say Q2,
      currently being created, and the last bfq_queue, say Q1, created
      before Q2, Q2 is merged stably with Q1 if
      - very little time has elapsed since when Q1 was created
      - Q2 has the same ioprio as Q1
      - Q2 belongs to the same group as Q1
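
The three conditions above can be modeled as a simple predicate; the struct layout, field names, and the concrete time window below are invented placeholders (BFQ's actual window and bookkeeping differ):

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical window: Q2 must be created very soon after Q1 */
#define MERGE_WINDOW_NS (100ULL * 1000 * 1000)

struct queue_info {
	unsigned long long created_ns;	/* creation time */
	int ioprio;			/* I/O priority */
	int group_id;			/* owning group */
};

/* Model of the three conditions for stably merging Q2 with Q1. */
static bool should_stably_merge(const struct queue_info *q1,
				const struct queue_info *q2)
{
	return q2->created_ns - q1->created_ns <= MERGE_WINDOW_NS &&
	       q2->ioprio == q1->ioprio &&
	       q2->group_id == q1->group_id;
}
```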
      
       Merging bfq_queues also reduces scheduling overhead. A fio test with
       ten random readers on /dev/nullb shows a throughput boost of 40% on
       a quad-core. Since BFQ's execution time amounts to ~50% of the total
       per-request processing time, the above throughput boost implies that
       BFQ's overhead is reduced by more than 50%.
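
That last inference can be checked with a small model, normalizing per-request time to 1 and assuming the non-BFQ share of the work is unchanged; the function name is invented, and the 50% share and 40% boost are the figures from the text:

```c
#include <assert.h>

/*
 * Given BFQ's fraction of per-request processing time and a throughput
 * multiplier, return the fractional reduction of the BFQ share, assuming
 * the non-BFQ share of the per-request work is unchanged.
 */
static double bfq_overhead_reduction(double bfq_share, double boost)
{
	double new_total = 1.0 / boost;			/* new time per request */
	double new_bfq = new_total - (1.0 - bfq_share);	/* BFQ's new share */

	return (bfq_share - new_bfq) / bfq_share;
}
```

With bfq_share = 0.5 and boost = 1.4 this yields about 57%, consistent with the "more than 50%" claim.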
      
       Tested-by: Jan Kara <jack@suse.cz>
       Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
       Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
       Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  26. Aug 18, 2020
    • bfq: fix blkio cgroup leakage v4 · 2de791ab
      Dmitry Monakhov authored
      
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
       commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
       introduced a leak of bfq_group and blkcg_gq objects because of a get/put
       imbalance. In fact, the whole idea of the original commit is wrong, because
       a bfq_group entity cannot disappear under us: it is referenced by its child
       bfq_queues' entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
       So the parent entity cannot disappear while a child entity is in the tree,
       and child entities already have proper protection.
       This patch reverts commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree").
      
       bfq_group leak trace caused by the bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
        -> __bfq_deactivate_entity
             if (!entity->on_st_or_in_serv) /* true, because of (Note2) */
       		return false;
        -> bfq_pd_free
           -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)
       So bfq_group and blkcg_gq will leak forever, see the test case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
       Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
       Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
  27. Mar 21, 2020
    • block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline · 4d38a87f
      Paolo Valente authored
      
      In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
      flush the rb tree that contains all idle entities belonging to the pd
      (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
      invoked before bfq_reparent_active_queues(). Yet the latter may happen
      to add some entities to the idle tree. It happens if, in some of the
      calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
      the queue to move is empty and gets expired.
      
      This commit simply reverses the invocation order between
      bfq_flush_idle_tree() and bfq_reparent_active_queues().
      
       Tested-by: <cki-project@redhat.com>
       Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: make reparent_leaf_entity actually work only on leaf entities · 576682fa
      Paolo Valente authored
      
      bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
      entity represents just a bfq_queue in an entity tree). Yet, the input
      entity is guaranteed to always be a leaf entity only in two-level
      entity trees. In this respect, because of the error fixed by
      commit 14afc593 ("block, bfq: fix overwrite of bfq_group pointer
      in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
      to actually have only two levels. After the latter commit, this does not
      hold any longer.
      
      This commit fixes this problem by modifying
      bfq_reparent_leaf_entity(), so that it searches an active leaf entity
      down the path that stems from the input entity. Such a leaf entity is
      guaranteed to exist when bfq_reparent_leaf_entity() is invoked.
      
       Tested-by: <cki-project@redhat.com>
       Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>