- Sep 09, 2020
-
-
Ritesh Harjani authored
If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so are not merging this page into this bio anyway, then it makes sense to set same_page to false as well before returning. Without this patch, we hit the WARNING below in iomap. This mostly happens on systems with very large memory and/or after tweaking the vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
Call Trace:
__remove_mapping+0x154/0x320 (unreliable)
iomap_releasepage+0x80/0x180
try_to_release_page+0x94/0xe0
invalidate_inode_page+0xc8/0x110
invalidate_mapping_pages+0x1dc/0x540
generic_fadvise+0x3c8/0x450
xfs_file_fadvise+0x2c/0xe0 [xfs]
vfs_fadvise+0x3c/0x60
ksys_fadvise64_64+0x68/0xe0
sys_fadvise64+0x28/0x40
system_call_exception+0xf8/0x1c0
system_call_common+0xf0/0x278

Fixes: cc90bc68 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
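A hedged sketch of the shape of the fix in __bio_try_merge_page() (field names follow the upstream block layer, but this is an illustration, not the verbatim patch):

    bool __bio_try_merge_page(struct bio *bio, struct page *page,
            unsigned int len, unsigned int off, bool *same_page)
    {
        if (bio->bi_vcnt > 0) {
            struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

            if (page_is_mergeable(bv, page, len, off, same_page)) {
                if (bio->bi_iter.bi_size > UINT_MAX - len) {
                    /*
                     * The fix: we are not merging, so the caller
                     * must not treat this as the same page either.
                     */
                    *same_page = false;
                    return false;
                }
                bv->bv_len += len;
                bio->bi_iter.bi_size += len;
                return true;
            }
        }
        return false;
    }

iomap uses the same_page result to decide whether to take another per-page writeback reference; reporting true while refusing the merge skews that count, which is what the WARNING above catches.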
-
- Sep 08, 2020
-
-
Omar Sandoval authored
Yang Yang reported the following crash caused by requeueing a flush request in Kyber:

[ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
...
[ 2.517468] pc : clear_bit+0x18/0x2c
[ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
[ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
...
[ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
[ 2.517602] Call trace:
[ 2.517606] clear_bit+0x18/0x2c
[ 2.517619] kyber_finish_request+0x74/0x80
[ 2.517627] blk_mq_requeue_request+0x3c/0xc0
[ 2.517637] __scsi_queue_insert+0x11c/0x148
[ 2.517640] scsi_softirq_done+0x114/0x130
[ 2.517643] blk_done_softirq+0x7c/0xb0
[ 2.517651] __do_softirq+0x208/0x3bc
[ 2.517657] run_ksoftirqd+0x34/0x60
[ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
[ 2.517667] kthread+0x110/0x120
[ 2.517669] ret_from_fork+0x10/0x18

This happens because Kyber doesn't track flush requests, so kyber_finish_request() reads a garbage domain token. Only call the scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for the finish_request() hook in blk_mq_free_request()). Now that we're handling it in blk-mq, also remove the check from BFQ.

Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
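A hedged sketch of the described guard in blk-mq's scheduler glue (the hook is only invoked for requests the elevator actually owns):

    static inline void blk_mq_sched_requeue_request(struct request *rq)
    {
        if (rq->rq_flags & RQF_ELVPRIV) {
            struct request_queue *q = rq->q;
            struct elevator_queue *e = q->elevator;

            /* flush requests carry no elevator state; they skip this */
            if (e && e->type->ops.requeue_request)
                e->type->ops.requeue_request(rq);
        }
    }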
-
Christoph Hellwig authored
mdadm relies on the fact that deleting an invalid partition returns -ENXIO or -ENOTTY to detect if a block device is a partition or a whole device.

Fixes: 08fc1ab6 ("block: fix locking in bdev_del_partition")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
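A hedged userspace sketch of the probing trick mdadm is described as relying on (reconstructed from memory of mdadm's test_partition(); BLKPG and its structures are the stock Linux partition ioctl):

    #include <sys/ioctl.h>
    #include <linux/blkpg.h>
    #include <string.h>
    #include <errno.h>

    /* returns 0 for a whole device, 1 for a partition */
    static int test_partition(int fd)
    {
        struct blkpg_partition p;
        struct blkpg_ioctl_arg a;

        memset(&p, 0, sizeof(p));
        p.pno = (1 << 30) - 1;      /* certainly an invalid partition number */
        a.op = BLKPG_DEL_PARTITION;
        a.data = &p;
        a.datalen = sizeof(p);
        a.flags = 0;

        if (ioctl(fd, BLKPG, &a) == 0)
            return 0;               /* very unlikely, but not a partition */
        if (errno == ENXIO || errno == ENOTTY)
            return 0;               /* whole device */
        return 1;                   /* partitions reject BLKPG differently */
    }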
-
- Sep 01, 2020
-
-
Tejun Heo authored
blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock, which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's make it irqsafe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tejun Heo authored
ioc_pd_free() grabs the irq-safe ioc->lock without making sure that irqs are disabled, even though it can be called with irqs either disabled or enabled. This has a small chance of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa4715 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
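Both of these iocost fixes follow the same conversion pattern; a hedged sketch, with the surrounding ioc_pd_free() logic elided:

    static void ioc_pd_free(struct blkg_policy_data *pd)
    {
        struct ioc_gq *iocg = pd_to_iocg(pd);
        struct ioc *ioc = iocg->ioc;
        unsigned long flags;

        if (ioc) {
            /* irqsave works whether the caller has irqs on or off */
            spin_lock_irqsave(&ioc->lock, flags);
            /* ... unlink iocg from the ioc ... */
            spin_unlock_irqrestore(&ioc->lock, flags);
        }
        /* ... free per-cpu state and the iocg itself ... */
    }

Taking an irq-safe lock without disabling irqs leaves a window where an interrupt on the same CPU can try to take the same lock (an A-A deadlock); spin_lock_irqsave() closes that window while preserving whatever irq state the caller had.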
-
Christoph Hellwig authored
We need to hold the whole device's bd_mutex to protect against another thread concurrently deleting our partition out from under us, which would cause a use-after-free.

Fixes: cddae808 ("block: pass a hd_struct to delete_partition")
Reported-by: <syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Commit e8c7d14a ("block: revert back to synchronous request_queue removal") stopped releasing the request queue from workqueue context, because that commit assumed all blk_put_queue() calls happen in a context that is allowed to sleep. However, this assumption isn't true: we release the disk's reference in the partition's percpu_ref ->release() callback, which isn't allowed to sleep because it runs via call_rcu(). Fix this issue by moving the put of the disk reference into hd_struct_free_work().

Fixes: e8c7d14a ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
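A hedged sketch of the shape of the fix, assuming hd_struct embeds an rcu_work item as upstream does (the exact field layout and remaining cleanup are from memory and illustrative):

    static void hd_struct_free_work(struct work_struct *work)
    {
        struct hd_struct *part = container_of(to_rcu_work(work),
                                              struct hd_struct, rcu_work);

        /*
         * Process context: put_disk() may drop the last queue reference
         * and sleep, which the percpu_ref ->release() callback (run via
         * call_rcu()) must never do.
         */
        put_disk(part_to_disk(part));
        /* ... remaining cleanup of the hd_struct ... */
    }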
-
Jens Axboe authored
If a driver leaves the limit settings as the defaults, then we don't initialize bdi->io_pages. This means that file systems may need to work around bdi->io_pages == 0, which is somewhat messy. Initialize the default value just like we do for ->ra_pages.

Cc: stable@vger.kernel.org
Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
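A hedged sketch of the default initialization in bdi_alloc() (VM_READAHEAD_PAGES is the existing kernel default for ->ra_pages; giving ->io_pages the same default is the gist of the fix):

    struct backing_dev_info *bdi_alloc(int node_id)
    {
        struct backing_dev_info *bdi;

        bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id);
        if (!bdi)
            return NULL;
        if (bdi_init(bdi)) {
            kfree(bdi);
            return NULL;
        }
        /* the fix: no more 0 default that filesystems must work around */
        bdi->ra_pages = VM_READAHEAD_PAGES;
        bdi->io_pages = VM_READAHEAD_PAGES;
        return bdi;
    }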
-
- Aug 23, 2020
-
-
Gustavo A. R. Silva authored
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough [1]. Also, remove fall-through markings where they are unnecessary.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
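As an illustration of the conversion (a hypothetical switch; prep_write_same() and handle_write() are made-up helpers, while fallthrough is the real macro from <linux/compiler_attributes.h>):

    switch (req_op(rq)) {
    case REQ_OP_WRITE_SAME:
        prep_write_same(rq);    /* hypothetical helper */
        fallthrough;            /* was: a bare fall-through comment */
    case REQ_OP_WRITE:
        handle_write(rq);       /* hypothetical helper */
        break;
    default:
        break;
    }

Unlike a comment, the macro expands to the compiler's fallthrough attribute, so -Wimplicit-fallthrough can verify every unmarked fall-through is a bug.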
-
- Aug 21, 2020
-
-
Yufen Yu authored
Normally, blkcg_iolatency_exit() frees the related iolatency memory when the queue is cleaned up. But if blk_throtl_init() returns an error and queue init fails, blkcg_iolatency_exit() will not do that for us, causing a memory leak.

Fixes: d7067512 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
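A hedged sketch of the error-path shape being described, assuming the fix tears down rq-qos policies (iolatency included) when blk_throtl_init() fails; the helper names exist upstream but the exact diff may differ:

    int blkcg_init_queue(struct request_queue *q)
    {
        int ret;

        /* ... blkg allocation and activation elided ... */

        ret = blk_iolatency_init(q);
        if (ret)
            goto err_destroy_all;

        ret = blk_throtl_init(q);
        if (ret)
            goto err_destroy_all;   /* must also undo iolatency */

        return 0;

    err_destroy_all:
        blkg_destroy_all(q);
        rq_qos_exit(q);     /* frees what blkcg_iolatency_exit() would have */
        return ret;
    }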
-
Keith Busch authored
A previous commit aligning splits to physical block sizes inadvertently modified one return case such that it now returns 0-length splits when the number of sectors doesn't exceed the physical offset. This later hits a BUG in bio_split(). Restore the previous working behavior.

Fixes: 9cc5169c ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
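A hedged sketch of get_max_io_size() with the restored guard (structure reconstructed from memory of the upstream helper):

    static inline unsigned get_max_io_size(struct request_queue *q,
                                           struct bio *bio)
    {
        unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
        unsigned max_sectors = sectors;
        unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT;
        unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT;
        unsigned start_offset = bio->bi_iter.bi_sector & (pbs - 1);

        max_sectors += start_offset;
        max_sectors &= ~(pbs - 1);
        if (max_sectors > start_offset)
            return max_sectors - start_offset;

        /* restored: fall back instead of returning a 0-length split */
        return sectors & ~(lbs - 1);
    }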
-
Ming Lei authored
Commit c616cbee ("blk-mq: punt failed direct issue to dispatch list") was supposed to add requests that have been through ->queue_rq() to the hw queue dispatch list; however, it adds requests that ran out of budget or driver tag to the hw queue too. This basically bypasses request merging, causes too many requests to be dispatched to the LLD, and unnecessarily increases %system time. Fix this issue by inserting requests that haven't been through ->queue_rq() into the sw/scheduler queue; this is safe because no ->queue_rq() has been called on those requests yet. High %system can be observed on Azure storvsc devices, and even soft lockups have been observed. This patch reduces %system during heavy sequential IO and decreases soft lockup risk.

Fixes: c616cbee ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
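A hedged sketch of the distinction in the direct-issue path (the real logic is threaded through __blk_mq_try_issue_directly(); the helper below is hypothetical and only condenses the decision):

    /* hypothetical helper condensing the direct-issue fallback decision */
    static void handle_direct_issue_result(struct request *rq, blk_status_t ret,
                                           bool reached_queue_rq)
    {
        if (ret == BLK_STS_OK)
            return;

        if (!reached_queue_rq) {
            /*
             * No budget or driver tag: ->queue_rq() was never called,
             * so the request can still go through the sw/scheduler
             * queue and be merged with others.
             */
            blk_mq_sched_insert_request(rq, false, true, false);
        } else {
            /* rejected by ->queue_rq(): punt to hctx->dispatch */
            blk_mq_request_bypass_insert(rq, false, true);
        }
    }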
-
- Aug 18, 2020
-
-
Dmitry Monakhov authored
Changes from v1:
- update commit description with proper ref-accounting justification

Commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree") introduced a leak for bfq_group and blkcg_gq objects because of a get/put imbalance. In fact, the whole idea of the original commit is wrong, because a bfq_group entity cannot disappear under us: it is referenced by its child bfq_queue's entities from here:

-> bfq_init_entity()
   -> bfqg_and_blkg_get(bfqg);
   -> entity->parent = bfqg->my_entity

-> bfq_put_queue(bfqq) FINAL_PUT
   -> bfqg_and_blkg_put(bfqq_group(bfqq))
   -> kmem_cache_free(bfq_pool, bfqq);

So the parent entity cannot disappear while a child entity is in the tree, and child entities already have proper protection. This patch reverts commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree").

bfq_group leak trace caused by the bad commit:

-> blkg_alloc
   -> bfq_pd_alloc
      -> bfqg_get (+1)
-> bfq_activate_bfqq
   -> bfq_activate_requeue_entity
      -> __bfq_activate_entity
         -> bfq_get_entity
            -> bfqg_and_blkg_get (+1)  <==== : Note1
-> bfq_del_bfqq_busy
   -> bfq_deactivate_entity+0x53/0xc0 [bfq]
      -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
         -> bfq_forget_entity(is_in_service = true)
            entity->on_st_or_in_serv = false  <=== : Note2
            if (is_in_service)
                return;  ==> do not touch reference
-> blkcg_css_offline
   -> blkcg_destroy_blkgs
      -> blkg_destroy
         -> bfq_pd_offline
            -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv)  /* true, because of (Note2) */
                   return false;
-> bfq_pd_free
   -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)

So bfq_group and blkcg_gq will leak forever; see the test case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
# prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matthew Wilcox (Oracle) authored
If we pass in an offset which is larger than PAGE_SIZE, then page_is_mergeable() thinks it's not mergeable with the previous bio_vec, leading to a large number of bio_vecs being used. Use a slightly more obvious test that the two pages are compatible with each other.

Fixes: 52d52d1c ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
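A hedged sketch of page_is_mergeable() with the more obvious compatibility test (reconstructed from memory of the upstream helper; the key change is the final return):

    static inline bool page_is_mergeable(const struct bio_vec *bv,
            struct page *page, unsigned int len, unsigned int off,
            bool *same_page)
    {
        size_t bv_end = bv->bv_offset + bv->bv_len;
        phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
        phys_addr_t page_addr = page_to_phys(page);

        if (vec_end_addr + 1 != page_addr + off)
            return false;
        if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
            return false;

        *same_page = ((vec_end_addr & PAGE_MASK) ==
                      ((page_addr + off) & PAGE_MASK));
        if (*same_page)
            return true;
        /* compare page structs directly, so off > PAGE_SIZE still merges */
        return (bv->bv_page + bv_end / PAGE_SIZE) ==
               (page + off / PAGE_SIZE);
    }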
-
- Aug 17, 2020
-
-
Ming Lei authored
When queue_max_discard_segments(q) is 1, blk_discard_mergable() returns false for discard requests, and normal request merging is applied. However, only queue_max_segments() is checked in that path, so the max discard segment limit isn't respected. Check the max discard segment limit in the request merge code to fix the issue. This fixes discard request failures on virtio_blk.

Fixes: 69840466 ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
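A hedged sketch of the limit selection the merge code needs, plus one of its call sites (close to the upstream fix as I recall it, but treat details as illustrative):

    static inline unsigned int blk_rq_get_max_segments(struct request *rq)
    {
        if (req_op(rq) == REQ_OP_DISCARD)
            return queue_max_discard_segments(rq->q);
        return queue_max_segments(rq->q);
    }

    static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
                                        unsigned int nr_phys_segs)
    {
        /* was queue_max_segments(req->q): the wrong limit for discards */
        if (req->nr_phys_segments + nr_phys_segs >
            blk_rq_get_max_segments(req))
            return 0;       /* reject the merge */

        req->nr_phys_segments += nr_phys_segs;
        return 1;
    }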
-
Ming Lei authored
The SCHED_RESTART code path is relied on to re-run the queue for requests sitting in hctx->dispatch, and the SCHED_RESTART flag is checked when adding requests to hctx->dispatch. Memory barriers have to be used for ordering the following two pairs of operations:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there are requests in hctx->dispatch in blk_mq_sched_restart()

Without the added memory barriers, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while blk_mq_dispatch_rq_list() observes SCHED_RESTART and doesn't run the queue on the dispatch side, or

2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and doesn't run the queue on the dispatch side, while the check for requests in hctx->dispatch from blk_mq_sched_restart() is missed.

An IO hang in the ltp/fs_fill test was reported by the kernel test robot: https://lkml.org/lkml/2020/7/26/77. It turns out to be caused by the above out-of-order operations, and the IO hang can no longer be observed after applying this patch.

Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
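A hedged sketch of the two sides and their pairing barriers (condensed from the description above; the real code lives in blk-mq.c and blk-mq-sched.c):

    /* dispatch side: blk_mq_dispatch_rq_list() */
    spin_lock(&hctx->lock);
    list_splice_tail_init(list, &hctx->dispatch);
    spin_unlock(&hctx->lock);

    /* order: write hctx->dispatch vs. read SCHED_RESTART */
    smp_mb();

    if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
        blk_mq_run_hw_queue(hctx, true);    /* restart side won't, so we must */

    /* restart side: blk_mq_sched_restart() */
    clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

    /*
     * order: clear SCHED_RESTART vs. read hctx->dispatch;
     * pairs with the smp_mb() on the dispatch side
     */
    smp_mb__after_atomic();

    if (!list_empty_careful(&hctx->dispatch))
        blk_mq_run_hw_queue(hctx, true);    /* dispatch side skipped it */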
-
Xu Wang authored
Replace a comma between expression statements by a semicolon.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- Aug 16, 2020
-
-
Randy Dunlap authored
Fix a kernel-doc warning in block/blk-mq.c:

../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'

Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: André Almeida <andrealmeid@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
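The fix is a one-line kernel-doc addition; a hedged rendering of the resulting comment (the wording of the @at_head line is illustrative):

    /**
     * blk_mq_request_bypass_insert - Insert a request at dispatch list.
     * @rq: Pointer to request to be inserted.
     * @at_head: true to insert at the head of the dispatch list
     * @run_queue: If we should run the hardware queue after inserting the request.
     */
    void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
                                      bool run_queue);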
-
- Aug 11, 2020
-
-
Ming Lei authored
In the case of the none scheduler, we share the data request's driver tag for the flush request, so we have to mark the flush request as INFLIGHT to avoid double-accounting this driver tag.

Fixes: 568f2700 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
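A hedged sketch of the tag-borrowing spot in blk_kick_flush() (the RQF_MQ_INFLIGHT marking is the point; surrounding details are from memory):

    if (!q->elevator) {
        /* borrow the first data request's driver tag ... */
        flush_rq->tag = first_rq->tag;

        /*
         * ... and mark the flush request in-flight, so the shared
         * tag isn't accounted a second time when the flush request
         * itself completes.
         */
        flush_rq->rq_flags |= RQF_MQ_INFLIGHT;
    } else {
        flush_rq->internal_tag = first_rq->internal_tag;
    }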
-
- Aug 05, 2020
-
-
Coly Li authored
If a loop device is created with a backing NVMe SSD, the current loop device driver doesn't correctly set its queue's limits.discard_granularity and leaves it as 0. If a discard request is issued at LBA 0 on this loop device, the req_sects calculated in __blkdev_issue_discard() will be 0, and a zero-length discard request triggers a BUG() panic in generic block layer code at block/blk-mq.c:563.

[ 955.565006][ C39] ------------[ cut here ]------------
[ 955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI
[ 955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40
[ 955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
[ 955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110
[ 955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
[ 955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
[ 955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
[ 955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 955.749280][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
[ 955.749281][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
[ 955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
[ 955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
[ 955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 955.749284][ C39] PKRU: 55555554
[ 955.749285][ C39] Call Trace:
[ 955.749290][ C39] blk_done_softirq+0x99/0xc0
[ 957.550669][ C39] __do_softirq+0xd3/0x45f
[ 957.550677][ C39] ? smpboot_thread_fn+0x2f/0x1e0
[ 957.550679][ C39] ? smpboot_thread_fn+0x74/0x1e0
[ 957.550680][ C39] ? smpboot_thread_fn+0x14e/0x1e0
[ 957.550684][ C39] run_ksoftirqd+0x30/0x60
[ 957.550687][ C39] smpboot_thread_fn+0x149/0x1e0
[ 957.886225][ C39] ? sort_range+0x20/0x20
[ 957.886226][ C39] kthread+0x137/0x160
[ 957.886228][ C39] ? kthread_park+0x90/0x90
[ 957.886231][ C39] ret_from_fork+0x22/0x30
[ 959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]---

This is the procedure to reproduce the panic:

  # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
  # losetup -f /dev/nvme0n1 --direct-io=on
  # blkdiscard /dev/loop0 -o 0 -l 0x200

This patch fixes the issue by checking q->limits.discard_granularity in __blkdev_issue_discard() before composing the discard bio. If the value is 0, print a warning and return -EOPNOTSUPP to the caller to indicate that this buggy device driver doesn't support discard requests.

Fixes: 9b15d109 ("block: improve discard bio alignment in __blkdev_issue_discard()")
Fixes: c52abf56 ("loop: Better discard support for block devices")
Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
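A hedged sketch of the added guard at the top of __blkdev_issue_discard() (the message text and the disk-name lookup are approximations):

    int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
            sector_t nr_sects, gfp_t gfp_mask, int flags,
            struct bio **biop)
    {
        struct request_queue *q = bdev_get_queue(bdev);

        if (!q)
            return -ENXIO;
        if (!blk_queue_discard(q))
            return -EOPNOTSUPP;

        /* the fix: a 0 granularity would yield 0-length discard bios */
        if (!q->limits.discard_granularity) {
            pr_err_ratelimited("%s: Error: discard_granularity is 0.\n",
                               bdev->bd_disk->disk_name);
            return -EOPNOTSUPP;
        }

        /* ... bio building loop continues as before ... */
        return 0;
    }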
-
- Aug 03, 2020
-
-
Johannes Thumshirn authored
When we lose a device for whatever reason while (re)scanning zones, we trip over a NULL pointer in blk_revalidate_zone_cb, like in the following log:

sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0xb [current]
sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
sda: failed to revalidate zones
sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
sda: detected capacity change from 14000519643136 to 0

==================================================================
BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58

CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 dump_stack+0x7d/0xb0
 ? blk_revalidate_zone_cb+0x1b7/0x550
 kasan_report.cold+0x5/0x37
 ? blk_revalidate_zone_cb+0x1b7/0x550
 check_memory_region+0x145/0x1a0
 blk_revalidate_zone_cb+0x1b7/0x550
 sd_zbc_parse_report+0x1f1/0x370
 ? blk_req_zone_write_trylock+0x200/0x200
 ? sectors_to_logical+0x60/0x60
 ? blk_req_zone_write_trylock+0x200/0x200
 ? blk_req_zone_write_trylock+0x200/0x200
 sd_zbc_report_zones+0x3c4/0x5e0
 ? sd_dif_config_host+0x500/0x500
 blk_revalidate_disk_zones+0x231/0x44d
 ? _raw_write_lock_irqsave+0xb0/0xb0
 ? blk_queue_free_zone_bitmaps+0xd0/0xd0
 sd_zbc_read_zones+0x8cf/0x11a0
 sd_revalidate_disk+0x305c/0x64e0
 ? __device_add_disk+0x776/0xf20
 ? read_capacity_16.part.0+0x1080/0x1080
 ? blk_alloc_devt+0x250/0x250
 ? create_object.isra.0+0x595/0xa20
 ? kasan_unpoison_shadow+0x33/0x40
 sd_probe+0x8dc/0xcd2
 really_probe+0x20e/0xaf0
 __driver_attach_async_helper+0x249/0x2d0
 async_run_entry_fn+0xbe/0x560
 process_one_work+0x764/0x1290
 ? _raw_read_unlock_irqrestore+0x30/0x30
 worker_thread+0x598/0x12f0
 ? __kthread_parkme+0xc6/0x1b0
 ? schedule+0xed/0x2c0
 ? process_one_work+0x1290/0x1290
 kthread+0x36b/0x440
 ? kthread_create_worker_on_cpu+0xa0/0xa0
 ret_from_fork+0x22/0x30
==================================================================

When the device is already gone, we end up with the following scenario: the device's capacity is 0, and thus the number of zones will be 0 as well. When allocating the bitmap for the conventional zones, we then trip over a NULL pointer. So if we encounter a zoned block device with a 0 capacity, don't dare to revalidate the zone sizes.

Fixes: 6c6b3549 ("block: set the zone size in blk_revalidate_disk_zones atomically")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
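A hedged sketch of the early-out in blk_revalidate_disk_zones() (the capacity check is the fix; the rest of the function is elided):

    int blk_revalidate_disk_zones(struct gendisk *disk)
    {
        struct request_queue *q = disk->queue;

        if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
            return -EIO;

        /* the fix: a vanished device reports 0 capacity, hence 0 zones */
        if (!get_capacity(disk))
            return -EIO;

        /* ... report zones and update the zone bitmaps atomically ... */
        return 0;
    }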
-
- Jul 31, 2020
-
-
Randy Dunlap authored
Drop the repeated word "request". Change to the correct kernel-doc notation for function name separtor. Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Drop the repeated word "to". Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Drop the repeated word "the". Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Drop the repeated word "to" in multiple places. Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Drop the repeated word "the". Fix typos of "features" and "specified". Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Drop the repeated words "a" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Randy Dunlap authored
Change "at at" to "at a". Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 30, 2020
-
-
Chengming Zhou authored
We shouldn't skip an iocg when its abs_vdebt is not zero.

Fixes: 0b80f986 ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
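A hedged illustration of the intent in iocost's timer path (the condition's shape is an assumption, not the verbatim diff):

    /* in ioc_timer_fn(), walking the active iocgs */
    list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) {
        /*
         * Skip only when there is truly nothing to service: no
         * waiters AND no outstanding debt (the fix adds the debt
         * check).
         */
        if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
            !iocg_is_idle(iocg))
            continue;

        /* wake waiters / pay down debt under iocg->waitq.lock */
    }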
-
- Jul 29, 2020
-
-
Ahmed S. Darwish authored
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows associating a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled, this lock association is compiled out and has neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de
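A hedged, generic illustration of the API (a made-up structure, not the exact block-layer fields this patch touches):

    struct zone_info {                  /* hypothetical structure */
        spinlock_t          lock;
        seqcount_spinlock_t seq;
        u64                 nr_sects;
    };

    static void zone_info_init(struct zone_info *z)
    {
        spin_lock_init(&z->lock);
        /* associate the write-serializing lock with the counter */
        seqcount_spinlock_init(&z->seq, &z->lock);
    }

    static void zone_info_set(struct zone_info *z, u64 v)
    {
        spin_lock(&z->lock);
        write_seqcount_begin(&z->seq);  /* lockdep checks z->lock is held */
        z->nr_sects = v;
        write_seqcount_end(&z->seq);
        spin_unlock(&z->lock);
    }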
-
- Jul 28, 2020
-
-
Daniel Wagner authored
tag_set_list is only accessed under the tag_set_lock lock. There is no need for using the _rcu list functions. The _rcu list functions were introduced to allow read access to the tag_set_list protected under RCU, see 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") and 05b79413 ("Revert "blk-mq: don't handle TAG_SHARED in restart""). Those changes got reverted later, but the cleanup commit missed a couple of places to undo the changes.

Fixes: 97889f9a ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
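The conversion is mechanical; a hedged sketch of the kind of change involved (the exact call sites are in blk-mq.c):

    /* before: RCU variants, but no RCU readers are left */
    list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
    list_del_rcu(&q->tag_set_list);

    /* after: plain list ops, since tag_set_lock is always held */
    list_add_tail(&q->tag_set_list, &set->tag_list);
    list_del(&q->tag_set_list);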
-
- Jul 25, 2020
-
-
Alan Stern authored
Commit 05d18ae1 ("scsi: pm: Balance pm_only counter of request queue during system resume") fixed a problem in the block layer's runtime-PM code: blk_set_runtime_active() failed to call blk_clear_pm_only(). However, the commit's implementation was awkward; it forced the SCSI system-resume handler to choose whether to call blk_post_runtime_resume() or blk_set_runtime_active(), depending on whether or not the SCSI device had previously been runtime suspended. This patch simplifies the situation considerably by adding the missing function call directly into blk_set_runtime_active() (under the condition that the queue is not already in the RPM_ACTIVE state). This allows the SCSI routine to revert back to its original form. Furthermore, making this change reveals that blk_post_runtime_resume() (in its success pathway) does exactly the same thing as blk_set_runtime_active(). The duplicate code is easily removed by making one routine call the other. No functional changes are intended.

Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
CC: Can Guo <cang@codeaurora.org>
CC: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
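A hedged sketch of the described end state, reconstructed from the prose above rather than from the verbatim patch:

    void blk_set_runtime_active(struct request_queue *q)
    {
        int old_status;

        if (!q->dev)
            return;

        spin_lock_irq(&q->queue_lock);
        old_status = q->rpm_status;
        q->rpm_status = RPM_ACTIVE;
        pm_runtime_mark_last_busy(q->dev);
        pm_request_autosuspend(q->dev);
        spin_unlock_irq(&q->queue_lock);

        /* the previously missing call, skipped if already active */
        if (old_status != RPM_ACTIVE)
            blk_clear_pm_only(q);
    }

    void blk_post_runtime_resume(struct request_queue *q, int err)
    {
        if (!q->dev)
            return;
        if (!err) {
            /* the success path duplicated the above: just call it */
            blk_set_runtime_active(q);
        } else {
            spin_lock_irq(&q->queue_lock);
            q->rpm_status = RPM_SUSPENDED;
            spin_unlock_irq(&q->queue_lock);
        }
    }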
-
- Jul 20, 2020
-
-
Christoph Hellwig authored
This function is just a tiny wrapper around blk_stack_limits. Open code it in the two callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This function is just a tiny wrapper around blk_stack_limits and has two callers. Simplify the stack a bit by open coding it in the two callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Lift the code from device mapper into blk_stack_limits to inherit the stacking limitations. This ensures we do the right thing for all stacked zoned block devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
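A hedged sketch of the zoned-model inheritance inside blk_stack_limits() (assuming the enum ordering none < host-aware < host-managed, which is what makes a max() work):

    int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
                         sector_t start)
    {
        /* ... existing stacking of logical/physical sizes etc. ... */

        /*
         * The combined device is as constrained as its most
         * constrained component: BLK_ZONED_NONE < BLK_ZONED_HA <
         * BLK_ZONED_HM, so take the max of both zoned models.
         */
        t->zoned = max(t->zoned, b->zoned);
        return 0;
    }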
-
- Jul 18, 2020
-
-
Boris Burkov authored
In order to improve consistency and usability in cgroup stat accounting, we would like to support the root cgroup's io.stat. Since the root cgroup has processes doing io even if the system has no explicitly created cgroups, we need to be careful to avoid overhead in that case. For that reason, the rstat algorithms don't handle the root cgroup, so just turning the file on wouldn't give correct statistics. To get around this, we simulate flushing the iostat struct by filling it out directly from global disk stats. The result is a root cgroup io.stat file consistent with both /proc/diskstats and io.stat. Note that in order to collect the disk stats, we needed to iterate over devices. To facilitate that, we had to change the linkage of disk_type to external so that it can be used from blk-cgroup.c to iterate over disks.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Boris Burkov authored
Previously, the code which printed io.stat only needed access to the generic rstat flushing code, but since we plan to write some more specific code for preparing root cgroup stats, we need to manipulate iostat structs directly. Since declaring static functions ahead does not seem like common practice in this file, simply move the iostat functions up. We only plan to use blkg_iostat_set, but it seems better to keep them all together.

Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- Jul 17, 2020
-
-
Coly Li authored
This patch improves discard bio splitting with respect to address and size alignment in __blkdev_issue_discard(). Aligned discard bios may help the underlying device controller to perform better discard and internal garbage collection, and avoid unnecessary internal fragmentation.

The current discard bio split algorithm in __blkdev_issue_discard() may leave non-discarded fragments on the device even when the discard bio's LBA and size are both aligned to the device's discard granularity. Here are the steps to reproduce the problem:

- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as /dev/sdb, fill data into the first 50GB by:
  # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb:
  # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device:
  # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has a 50<<30 bytes size, both perfectly aligned to the discard granularity, the above list shows that many 1MB (2048-sector) internal fragments unexpectedly remain. The problem is in __blkdev_issue_discard(): an improper algorithm produces an improper bio size which is not aligned.

25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26         sector_t nr_sects, gfp_t gfp_mask, int flags,
27         struct bio **biop)
28 {
29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
56
57         while (nr_sects) {
58                 sector_t req_sects = min_t(sector_t, nr_sects,
59                         bio_allowed_max_sectors(q));
60
61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63                 bio = blk_next_bio(bio, 0, gfp_mask);
64                 bio->bi_iter.bi_sector = sector;
65                 bio_set_dev(bio, bdev);
66                 bio_set_op_attrs(bio, op, 0);
67
68                 bio->bi_iter.bi_size = req_sects << 9;
69                 sector += req_sects;
70                 nr_sects -= req_sects;
   [snipped]
79         }
80
81         *biop = bio;
82         return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);

At lines 58-59, to discard a 50GB range, req_sects is set to the return value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above case, the discard granularity is 2048 sectors; although the start LBA and discard length are aligned to the discard granularity, req_sects never has a chance to be aligned to the discard granularity. This is why there are still-mapped 2048-sector fragments in every 4 or 8 GB range. If req_sects at line 58 is set to a value aligned to discard_granularity and close to UINT_MAX, then all consequent split bios inside the device driver are (almost) aligned to the discard_granularity of the device queue, and the 2048-sector still-mapped fragments disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the value which is aligned to q->limits.discard_granularity and closest to UINT_MAX, and replaces bio_allowed_max_sectors() with this new routine to decide a more proper split bio length. But we still need to handle the situation when the discard start LBA is not aligned to q->limits.discard_granularity; otherwise, even with an aligned length, the current code may still leave 2048-sector fragments around every 4GB range. Therefore, to calculate req_sects, the start LBA of the discard range is checked first (including the partition offset); if it is not aligned to the discard granularity, the first split location makes sure the following bio has bi_sector aligned to the discard granularity. Then there won't be still-mapped fragments in the middle of the discard range.

The above is how this patch improves discard bio alignment in __blkdev_issue_discard(). Now with this patch, after a discard with the same command line mentioned previously, sg_get_lba_status returns:

descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We can see there is no 2048-sector segment anymore; everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
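A hedged sketch of the new helper and its use in the split-size calculation (the helper matches the description above; the handling of a misaligned start is condensed):

    /*
     * Largest sector count that is both <= UINT_MAX bytes and a
     * multiple of the discard granularity.
     */
    static inline unsigned int
    bio_aligned_discard_max_sectors(struct request_queue *q)
    {
        return round_down(UINT_MAX, q->limits.discard_granularity) >>
                SECTOR_SHIFT;
    }

    /* inside the __blkdev_issue_discard() loop */
    sector_t granularity_aligned_lba, req_sects;
    sector_t sector_mapped = sector + bdev->bd_part->start_sect;

    granularity_aligned_lba = round_up(sector_mapped,
            q->limits.discard_granularity >> SECTOR_SHIFT);

    if (granularity_aligned_lba == sector_mapped)
        /* start is aligned: take an aligned, near-UINT_MAX chunk */
        req_sects = min_t(sector_t, nr_sects,
                          bio_aligned_discard_max_sectors(q));
    else
        /* misaligned start: split so the next bio starts aligned */
        req_sects = min_t(sector_t, nr_sects,
                          granularity_aligned_lba - sector_mapped);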
-
Yufen Yu authored
Commit 7520872c ("block: don't defer flushes on blk-mq + scheduling") tried to fix a deadlock caused by a cyclic wait between flush requests and data requests on flush_data_in_flight: the former held all driver tags while waiting for data request completion, but the latter could not complete while waiting for free driver tags. After commit 923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq"), flush requests no longer get a driver tag before being queued into the flush queue:

* With an elevator, a flush request just gets a sched_tag before being inserted into the flush queue. It will not get a driver tag until it is issued to the driver, so data requests on the fq->flush_data_in_flight list will complete in the end.

* Without an elevator, each flush request gets a driver tag when the request is allocated, so data requests on fq->flush_data_in_flight don't need to worry about lacking a driver tag.

In both of these cases, the cyclic wait cannot happen, so we may allow deferring flush requests again.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Wei Yongjun authored
The sparse tool complains as follows:

block/blk-timeout.c:93:12: warning: symbol 'blk_timeout_init' was not declared. Should it be static?

Function blk_timeout_init() is not used outside of blk-timeout.c, so mark it static.

Fixes: 9054650f ("block: relax jiffies rounding for timeouts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-