- 02 Nov, 2018 1 commit
-
-
Dennis Zhou authored
This reverts a series committed earlier due to a null pointer exception bug report in [1]. It seems there are edge-case interactions that I did not consider, and it will take some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow-up series in [3].
[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/
This reverts the following commits: d459d853, b2c3fa54, 101246ec, b3b9f24f, e2b09899, f0fcb3ec, c839e7a0, bdc24917, 74b7c02a, 5bf9a1f3, a7b39b4e, 07b05bcc, 49f4c2dc, 27e6fa99
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Oct, 2018 3 commits
-
-
Damien Le Moal authored
Drivers exposing zoned block devices have to initialize and maintain the correctness of (i.e. revalidate) the device zone bitmaps attached to the device request queue (seq_zones_bitmap and seq_zones_wlock). To simplify this, introduce the generic helper function blk_revalidate_disk_zones() suitable for most (and likely all) cases. This new function always updates the seq_zones_bitmap and seq_zones_wlock bitmaps as well as the queue nr_zones field when called for a disk using a request-based queue. For a disk using a BIO-based queue, only the number of zones is updated since these queues do not have schedulers and so do not need the zone bitmaps. With this change, the zone bitmap initialization code in sd_zbc.c can be replaced with a call to this function in sd_zbc_read_zones(), which is called from the disk revalidate block operation method. A call to blk_revalidate_disk_zones() is also added to the null_blk driver for devices created with the zoned mode enabled. Finally, to ensure that zoned devices created with dm-linear or dm-flakey expose the correct number of zones through sysfs, a call to blk_revalidate_disk_zones() is added to dm_table_set_restrictions(). The zone bitmaps allocated and initialized with blk_revalidate_disk_zones() are freed automatically from __blk_release_queue() using the block internal function blk_queue_free_zone_bitmaps().
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
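A minimal sketch of how a zoned driver might use the new helper from its revalidation path; the driver function name and the error handling are hypothetical, and the helper's exact signature is assumed from the description above:

    /* Hypothetical driver revalidation hook; blk_revalidate_disk_zones()
     * rebuilds seq_zones_bitmap/seq_zones_wlock and the queue nr_zones. */
    static int my_zoned_revalidate_disk(struct gendisk *disk)
    {
        int ret;

        /* ... refresh capacity and queue limits first ... */

        ret = blk_revalidate_disk_zones(disk);
        if (ret)
            pr_warn("%s: zone revalidation failed (%d)\n",
                    disk->disk_name, ret);
        return ret;
    }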
-
Christoph Hellwig authored
Dispatching a report zones command through the request queue is a major pain due to the command reply payload rewriting necessary. Given that blkdev_report_zones() executes everything synchronously, implement report zones as a block device file operation instead, allowing major simplification of the code in many places. Since sd, null_blk, dm-linear and dm-flakey are the only block device drivers supporting zoned block devices, these drivers are modified to provide the device-side implementation of the report_zones() block device file operation. For device mappers, a new report_zones() target type operation is defined so that calls to blkdev_report_zones() from the upper block layer can be propagated down to the underlying devices of the dm targets. An implementation of this new operation is added to the dm-linear and dm-flakey targets.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Changed method block_device argument to gendisk
* Various bug fixes and improvements
* Added support for null_blk, dm-linear and dm-flakey.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
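A rough sketch of what wiring the new method into a driver could look like; the prototype is inferred from the description (gendisk argument, synchronous reporting) and my_report_zones is a made-up name, not a real driver function:

    /* Hypothetical device-side report_zones() implementation. */
    static int my_report_zones(struct gendisk *disk, sector_t sector,
                               struct blk_zone *zones,
                               unsigned int *nr_zones, gfp_t gfp_mask)
    {
        /* Fill zones[] with up to *nr_zones descriptors starting at
         * 'sector' and set *nr_zones to the number actually reported. */
        *nr_zones = 0;
        return 0;
    }

    static const struct block_device_operations my_fops = {
        .owner        = THIS_MODULE,
        .report_zones = my_report_zones,
    };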
-
Damien Le Moal authored
Introduce the blkdev_nr_zones() helper function to get the total number of zones of a zoned block device. This number is always 0 for a regular block device (q->limits.zoned == BLK_ZONED_NONE case). Replace the hard-coded number-of-zones calculation in dmz_get_zoned_device() with a call to this helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
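As a hedged illustration of those semantics (not the actual block-layer code), a helper along these lines would return 0 for a regular device and the zone count otherwise, assuming power-of-two zone sizes and using the whole-disk capacity as an approximation:

    static unsigned int blkdev_nr_zones_sketch(struct block_device *bdev)
    {
        unsigned int zone_shift;

        if (!blk_queue_is_zoned(bdev_get_queue(bdev)))
            return 0;    /* regular device: no zones */

        /* zone sizes are a power-of-two number of sectors */
        zone_shift = ilog2(bdev_zone_sectors(bdev));
        return (get_capacity(bdev->bd_disk) +
                ((sector_t)1 << zone_shift) - 1) >> zone_shift;
    }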
-
- 22 Oct, 2018 3 commits
-
-
Xiao Ni authored
The flush_pool element is leaked when the flush bio size is zero.
Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Jack Wang authored
I noticed a kmemleak report of a memory leak when running create/stop md in a loop; backtrace:
[<000000001ca975e7>] mempool_create_node+0x86/0xd0
[<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
[<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
[<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
[<000000004142cacf>] blkdev_ioctl+0x680/0xd00
The root cause is that we allocate mddev->flush_pool and mddev->flush_bio_pool in md_run(), but do_md_stop() does not call into md_stop(), only __md_stop(); moving the mempool_destroy() calls to __md_stop() fixes the problem for me. The bug was introduced in 5a409b4f, so the fix should go to 4.18+.
Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
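A minimal sketch of the fix described above; the function body is illustrative, not the actual md.c code. mempool_destroy() tolerates NULL, and clearing the pointers keeps a second stop path harmless:

    static void __md_stop(struct mddev *mddev)
    {
        /* ... tear down the personality ... */

        /* destroy here, so both md_stop() and do_md_stop() free them */
        mempool_destroy(mddev->flush_pool);
        mddev->flush_pool = NULL;
        mempool_destroy(mddev->flush_bio_pool);
        mddev->flush_bio_pool = NULL;
    }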
-
Mike Snitzer authored
This dead branch was missed during review. It only makes memory_entry() more inefficient due to a needless call to is_power_of_2(), etc.
Reported-by: shenghui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
- 18 Oct, 2018 14 commits
-
-
Damien Le Moal authored
dmz_fetch_mblock() called from dmz_get_mblock() has a race since the allocation of the new metadata block descriptor and its insertion in the cache rbtree with the READING state is not atomic. Two different contexts requesting the same block may each end up adding a different descriptor for the same block to the cache. Another problem with this function is that the BIO for processing the block read is allocated after the metadata block descriptor is inserted in the cache rbtree. If the BIO allocation fails, the metadata block descriptor is freed without first being removed from the rbtree. Fix the first problem by checking again, atomically under the mblk_lock spinlock, whether the requested block is in the cache right before inserting the newly allocated descriptor. The second problem is fixed by simply allocating the BIO before inserting the new block in the cache. Finally, since dmz_fetch_mblock() also increments a block reference counter, rename the function to dmz_get_mblock_slow(). To be symmetric and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and increment the block reference counter directly in that function rather than in dmz_get_mblock().
Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
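A condensed sketch of the "allocate first, recheck under the lock" pattern the fix applies; the dmz_* names follow the description above, but the alloc/free/insert helpers and their signatures are assumptions, not the actual dm-zoned code:

    static struct dmz_mblock *dmz_get_mblock_slow(struct dmz_metadata *zmd,
                                                  sector_t mblk_no)
    {
        struct dmz_mblock *mblk, *found;
        struct bio *bio;

        mblk = dmz_alloc_mblock(zmd, mblk_no);    /* hypothetical helper */
        if (!mblk)
            return NULL;
        bio = bio_alloc(GFP_NOIO, 1);             /* BIO before insertion */
        if (!bio) {
            dmz_free_mblock(zmd, mblk);           /* hypothetical helper */
            return NULL;
        }

        spin_lock(&zmd->mblk_lock);
        found = dmz_get_mblock_fast(zmd, mblk_no);
        if (found) {
            /* lost the race: drop ours, use the cached descriptor */
            spin_unlock(&zmd->mblk_lock);
            dmz_free_mblock(zmd, mblk);
            bio_put(bio);
            return found;
        }
        dmz_insert_mblock(zmd, mblk);    /* hypothetical rbtree insert, READING */
        spin_unlock(&zmd->mblk_lock);

        /* ... submit the read BIO for mblk using 'bio' ... */
        return mblk;
    }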
-
Damien Le Moal authored
Since the ref field of struct dmz_mblock is always used with the spinlock of struct dmz_metadata locked, there is no need to use an atomic_t type. Change the type of the ref field to an unsigned integer.
Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
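A small illustrative helper (hypothetical name) showing the resulting pattern, where a plain integer suffices because the count is only ever touched under mblk_lock:

    static void dmz_mblock_ref(struct dmz_metadata *zmd,
                               struct dmz_mblock *mblk)
    {
        spin_lock(&zmd->mblk_lock);
        mblk->ref++;    /* was: atomic_inc(&mblk->ref) */
        spin_unlock(&zmd->mblk_lock);
    }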
-
Heinz Mauelshagen authored
With raid4/5/6, a journal device and a write-intent bitmap are mutually exclusive.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Guoqing Jiang authored
Previously, we allowed multiple nodes to resync a device, but we have since changed this so that only one node can resync at a time; however, suspend_info is still used. Now, remove the structure and use suspend_lo/hi to record the range.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
We need to continue the reshape if it was interrupted on the original node, so the original node should call resync_bitmap in case the reshape is aborted. A BITMAP_NEEDS_SYNC message is then broadcast to the other nodes, and the node which continues the reshape should restart it from mddev->reshape_position instead of from the very beginning.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
When a reshape is happening on one node, other nodes can receive lots of RESYNCING messages, causing md_bitmap_sync_with_cluster to be called. Since the resync window is typically small in these RESYNCING messages, the WARN is always triggered, so we should not call the function while a reshape is happening.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
remove_and_add_spares is not needed if a reshape is happening on another node, because raid10_add_disk, called inside raid10_start_reshape, handles the role changes of the disk. Besides, remove_and_add_spares can't deal with a role change caused by reshape.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
We need to change the capacity on all nodes after one node finishes its reshape. As before, we can't change the capacity directly in md_do_sync; instead, the capacity should only be changed in update_size or when a CHANGE_CAPACITY message is received. So the master node calls update_size after it completes the reshape in md_reap_sync_thread, but we need to skip ops->update_size if MD_CLOSING is set, since the reshape could not be finished in that case.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
Since the resync region from suspend_info means that one node is reshaping that area, the position of reshape_progress should be included in the area.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Guoqing Jiang authored
For the clustered raid10 scenario, we need to let all the nodes know that a new disk has been added to the array. The reshape caused by adding the new member only needs to happen on one node, but the other nodes should know about the change. Since a reshape means reading data from somewhere (which is already used by the array) and writing it to an unused region, it is obviously awful if one node is reading data from an address while another node is writing to the same address. Considering that we have already implemented suspending writes in the resyncing area, we can just broadcast the reading address to the other nodes to avoid the trouble. The master node calls reshape_request and then updates the sb during the reshape period; to avoid the above trouble, we call resync_info_update to send a RESYNC message in reshape_request. From a slave node's view, it then receives two types of messages:
1. RESYNCING message: the slave node adds the address (where the master node is reading data from) to the suspend list.
2. METADATA_UPDATED message: once the slave nodes know the reshape has started on the master node, it is time to update the reshape position and call start_reshape to follow the master node's steps.
After the reshape is done, only the reshape position needs to be updated, so the majority of the reshaping work happens on the master node.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
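A minimal sketch of the broadcast step described above, assuming the md-cluster interface exposes a resync_info_update(mddev, lo, hi) operation; the wrapper function here is invented for illustration:

    /* Master side: announce the region about to be read so that the
     * other nodes suspend writes to it before reshaping it. */
    static void announce_reshape_window(struct mddev *mddev,
                                        sector_t lo, sector_t hi)
    {
        if (mddev_is_clustered(mddev))
            md_cluster_ops->resync_info_update(mddev, lo, hi);
    }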
-
Guoqing Jiang authored
To support adding a disk under grow mode, we need to resize the bitmaps of each node before the reshape, so that all nodes have the same view of the bitmap of the clustered raid. So after the master node has resized its bitmap, it broadcasts a message to the other slave nodes, and it checks whether the size of each bitmap is the same by comparing pages. We can only continue the reshape after all nodes have updated the bitmap to the same size (checked via the pages); otherwise the bitmap size is reverted to the previous value. The resize_bitmaps interface and the BITMAP_RESIZE message are introduced in md-cluster.c for this purpose.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Michał Mirosław authored
Make cpu-usage debugging easier by naming workqueues per device. Example ps output:
root 413 0.0 0.0 0 0 ? I< paź02 0:00 [kcryptd_io/253:0]
root 414 0.0 0.0 0 0 ? I< paź02 0:00 [kcryptd/253:0]
root 415 0.0 0.0 0 0 ? S paź02 1:10 [dmcrypt_write/253:0]
root 465 0.0 0.0 0 0 ? I< paź02 0:00 [kcryptd_io/253:2]
root 466 0.0 0.0 0 0 ? I< paź02 0:00 [kcryptd/253:2]
root 467 0.0 0.0 0 0 ? S paź02 2:06 [dmcrypt_write/253:2]
root 15359 0.2 0.0 0 0 ? I< 19:43 0:25 [kworker/u17:8-kcryptd/253:0]
root 16563 0.2 0.0 0 0 ? I< 20:10 0:18 [kworker/u17:0-kcryptd/253:2]
root 23205 0.1 0.0 0 0 ? I< 21:21 0:04 [kworker/u17:4-kcryptd/253:0]
root 13383 0.1 0.0 0 0 ? I< 21:32 0:02 [kworker/u17:2-kcryptd/253:2]
root 2610 0.1 0.0 0 0 ? I< 21:42 0:01 [kworker/u17:12-kcryptd/253:2]
root 20124 0.1 0.0 0 0 ? I< 21:56 0:01 [kworker/u17:1-kcryptd/253:2]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
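A sketch of the naming scheme visible above, relying on alloc_workqueue()'s printf-style name formatting; the flags and the devname variable (the dm device's major:minor as a string, e.g. "253:0") are illustrative assumptions:

    cc->io_queue = alloc_workqueue("kcryptd_io/%s",
                                   WQ_HIGHPRI | WQ_CPU_INTENSIVE |
                                   WQ_MEM_RECLAIM,
                                   1, devname);
    cc->crypt_queue = alloc_workqueue("kcryptd/%s",
                                      WQ_HIGHPRI | WQ_CPU_INTENSIVE |
                                      WQ_MEM_RECLAIM,
                                      num_online_cpus(), devname);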
-
Michał Mirosław authored
Add a shortcut for dm_device_name(dm_table_get_md(t)).
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
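The shortcut presumably amounts to something like the following one-liner (a sketch, not verified against the tree):

    const char *dm_table_device_name(struct dm_table *t)
    {
        return dm_device_name(dm_table_get_md(t));
    }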
-
Wenwen Wang authored
In copy_params(), the struct 'dm_ioctl' is first copied from the user space buffer 'user' to 'param_kernel' and the field 'data_size' is checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload up to its 'data' member). If the check fails, an error code EINVAL will be returned. Otherwise, param_kernel->data_size is used to do a second copy, which copies from the same user-space buffer to 'dmi'. After the second copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'. Given that the buffer 'user' resides in the user space, a malicious user-space process can race to change the content in the buffer between the two copies. This way, the attacker can inject inconsistent data into 'dmi' (versus previously validated 'param_kernel'). Fix redundant copying of 'minimum_data_size' from user-space buffer by using the first copy stored in 'param_kernel'. Also remove the 'data_size' check after the second copy because it is now unnecessary.
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
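A condensed sketch of the double-fetch fix pattern described above: the header validated in the first copy is reused, and only the remaining payload is fetched again, so a racing writer cannot swap in a different 'data_size' between the copies. Variable names follow the description; the surrounding copy_params() plumbing is omitted:

    /* first fetch: header only, then validate */
    if (copy_from_user(&param_kernel, user, minimum_data_size))
        return -EFAULT;
    if (param_kernel.data_size < minimum_data_size)
        return -EINVAL;

    /* re-use the validated header instead of re-reading it ... */
    memcpy(dmi, &param_kernel, minimum_data_size);

    /* ... and copy only the rest of the payload from user space */
    if (copy_from_user((char *)dmi + minimum_data_size,
                       (char __user *)user + minimum_data_size,
                       param_kernel.data_size - minimum_data_size))
        return -EFAULT;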
-
- 16 Oct, 2018 3 commits
-
-
Igor Stoppa authored
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it into another.
Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
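For illustration (hypothetical condition), the two forms below generate the same branch hints:

    if (unlikely(WARN_ON(!ptr)))    /* redundant: WARN_ON() wraps unlikely() */
        return -EINVAL;

    if (WARN_ON(!ptr))              /* equivalent, and simpler */
        return -EINVAL;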
-
John Pittman authored
The API surrounding refcount_t should be used in place of atomic_t when variables are being used as reference counters. This API can prevent issues such as counter overflows and use-after-free conditions. Within the dm zoned target stack, the atomic_t API is used for bioctx->ref and cw->refcount. Change these to use refcount_t, avoiding the issues mentioned.
Signed-off-by: John Pittman <jpittman@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
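A small sketch of the conversion pattern (see include/linux/refcount.h); unlike atomic_t, refcount_t saturates and warns on overflow and underflow instead of silently wrapping:

    refcount_set(&bioctx->ref, 1);    /* was: atomic_set(&bioctx->ref, 1) */
    refcount_inc(&bioctx->ref);       /* was: atomic_inc(&bioctx->ref)    */
    if (refcount_dec_and_test(&bioctx->ref))    /* was: atomic_dec_and_test() */
        bio_endio(bio);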
-
John Pittman authored
The API surrounding refcount_t should be used in place of atomic_t when variables are being used as reference counters. It can potentially prevent reference counter overflows and use-after-free conditions. In the dm thin layer, one such example is tc->refcount. Change this from the atomic_t API to the refcount_t API to prevent the mentioned conditions.
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
- 15 Oct, 2018 1 commit
-
-
Shaohua Li authored
Commit d595567d (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
- 11 Oct, 2018 4 commits
-
-
Mike Snitzer authored
Now that request-based DM (multipath) is blk-mq only, remove this restriction: it was only required while the legacy request-based IO path still existed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Now that request-based DM is only using blk-mq, there is no need to differentiate between legacy "rq" and new "mq". We're back to a single request-based DM -- and there was much rejoicing!
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Jens Axboe authored
dm supports both the legacy and blk-mq request paths, and since we're killing off the legacy path in general, get rid of it in dm.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Damien Le Moal authored
The dm-linear target is independent of the dm-zoned target. For code requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED instead of CONFIG_DM_ZONED. While at it, similarly to dm-linear, also enable the DM_TARGET_ZONED_HM feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.
Fixes: beb9caac ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled")
Fixes: 0be12c1c ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
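A sketch of the gating described above, with the struct trimmed to the relevant field (the other target_type fields are omitted):

    static struct target_type flakey_target = {
        .name     = "flakey",
    #ifdef CONFIG_BLK_DEV_ZONED    /* block-layer zoned support compiled in */
        .features = DM_TARGET_ZONED_HM,
    #endif
        /* other fields omitted */
    };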
-
- 10 Oct, 2018 3 commits
-
-
Jack Wang authored
After 9e1cc0a5 ("md: use mddev_suspend/resume instead of ->quiesce()"), we still have similar usage left in the bitmap functions. Replace quiesce() with mddev_suspend/resume. Also move md_bitmap_create out of mddev_suspend and move mddev_resume after md_bitmap_destroy, as we did in set_bitmap_file.
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
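The core substitution, sketched; the bitmap call in the middle stands in for whatever work needs IO quiesced:

    mddev_suspend(mddev);          /* was: mddev->pers->quiesce(mddev, 1) */
    rv = md_bitmap_load(mddev);    /* bitmap work while IO is quiesced */
    mddev_resume(mddev);           /* was: mddev->pers->quiesce(mddev, 0) */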
-
Colin Ian King authored
An earlier commit removed the error label to two statements that are now never reachable. Since this code is now dead, remove it. Detected by CoverityScan, CID#1462409 ("Structurally dead code")
Fixes: d5d885fd ("md: introduce new personality funciton start()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
-
Mike Snitzer authored
It is best to avoid any extra overhead associated with bio completion. DM core will indirectly call a DM target's .end_io if it is defined. In the case of DM linear, there is no need to do so (for every bio that completes) if CONFIG_DM_ZONED is not enabled. Avoiding an extra indirect call for every bio completion is very important for ensuring DM linear doesn't incur extra overhead that further widens the performance gap between dm-linear and raw block devices.
Fixes: 0be12c1c ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
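A sketch of how such conditional hook registration looks (fields trimmed; linear_end_io is the completion handler named in the Fixes line above):

    static struct target_type linear_target = {
        .name   = "linear",
    #ifdef CONFIG_DM_ZONED
        /* only pay the indirect .end_io call when zoned support
         * actually needs bio completion remapping */
        .end_io = linear_end_io,
    #endif
        /* other fields omitted */
    };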
-
- 09 Oct, 2018 2 commits
-
-
Damien Le Moal authored
If dm-linear or dm-flakey are layered on top of a partition of a zoned block device, the start sector and write pointer position of the zones reported by a report zones BIO must be remapped to account for the target table entry mapping (start offset within the device and entry mapping with the dm device). If the target's backing device is a partition of a whole disk, the start sector on the physical device of the partition must also be accounted for when modifying the zone information. However, dm_remap_zone_report() was not considering this last case, resulting in incorrect zone information remapping with targets using disk partitions. Fix this by calculating the target backing device start sector using the position of the completed report zones BIO and the unchanged position and size of the original report zone BIO. With this value calculated, the start sector and write pointer position of the target zones can be correctly remapped.
Fixes: 10999307 ("dm: introduce dm_remap_zone_report()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
Shenghui Wang authored
Commit 7e6358d2 ("dm: fix various targets to dm_register_target after module __init resources created") inadvertently introduced this bug when it moved dm_register_target() after the call to KMEM_CACHE().
Fixes: 7e6358d2 ("dm: fix various targets to dm_register_target after module __init resources created")
Cc: stable@vger.kernel.org
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
- 08 Oct, 2018 6 commits
-
-
Dongbo Cao authored
When the nbuckets of a cache device is smaller than 1024, making the cache device will trigger a BUG_ON in the kernel. Add a condition to avoid this.
Reported-by: nitroxis <n@nxs.re>
Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
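A sketch of the kind of guard described, using the 1024-bucket threshold from the message; where exactly it sits in the bcache super-block validation path is not shown here:

    /* reject cache devices with too few buckets instead of
     * hitting a BUG_ON() later during cache set construction */
    if (sb->nbuckets < 1024) {
        pr_err("bcache: device has too few buckets (%llu < 1024)",
               (unsigned long long)sb->nbuckets);
        return -EINVAL;
    }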
-
Dongbo Cao authored
Split the combined '||' statements in the if() check to make the code easier to debug.
Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
The current cache_set has at most MAX_CACHES_PER_SET caches, and the macro is used for "struct cache *cache_by_alloc[MAX_CACHES_PER_SET];" in the definition of struct cache_set. Use MAX_CACHES_PER_SET instead of the magic number 8 in __bch_bucket_alloc_set.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In extents.c:bch_extent_bad(), the number 96 is used as a parameter to call btree_bug_on(). The purpose is to check whether the stale gen value exceeds BUCKET_GC_GEN_MAX, so it is better to use the macro BUCKET_GC_GEN_MAX to make the code more understandable.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dongbo Cao authored
The parameter "struct kobject *kobj" in bch_debug_init() is unused; remove it in this patch.
Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
struct kmem_cache *bch_passthrough_cache is not used in the bcache code. Remove it.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-