Commit 0e9da3fb authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'for-4.21/block-20181221' of git://

Pull block updates from Jens Axboe:
 "This is the main pull request for block/storage for 4.21.

  Larger than usual, it was a busy round with lots of goodies queued up.
  Most notable is the removal of the old IO stack, which has been a long
   time coming. No new features for a while; everything coming in this
   week has been fixes for things that were previously merged.

  This contains:

   - Use atomic counters instead of semaphores for mtip32xx (Arnd)

   - Cleanup of the mtip32xx request setup (Christoph)

   - Fix for circular locking dependency in loop (Jan, Tetsuo)

   - bcache (Coly, Guoju, Shenghui)
      * Optimizations for writeback caching
      * Various fixes and improvements

   - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
      * host and target support for NVMe over TCP
      * Error log page support
      * Support for separate read/write/poll queues
      * Much improved polling
      * discard OOM fallback
      * Tracepoint improvements

   - lightnvm (Hans, Hua, Igor, Matias, Javier)
      * Igor added packed metadata to pblk. Now drives without metadata
        per LBA can be used as well.
      * Fix from Geert on uninitialized value on chunk metadata reads.
      * Fixes from Hans and Javier to pblk recovery and write path.
      * Fix from Hua Su for a race condition in pblk recovery.
      * Scan optimization added to pblk recovery from Zhoujie.
      * Small geometry cleanup from me.

   - Conversion of the last few drivers that used the legacy path to
     blk-mq (me)

   - Removal of legacy IO path in SCSI (me, Christoph)

   - Removal of legacy IO stack and schedulers (me)

   - Support for much better polling, now without interrupts at all.
     blk-mq adds support for multiple queue maps, which enables us to
     have a map per type. This in turn enables nvme to have separate
     completion queues for polling, which can then be interrupt-less.
     Also means we're ready for async polled IO, which is hopefully
     coming in the next release.

   - Killing of (now) unused block exports (Christoph)

   - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

   - Support for zoned testing with null_blk (Masato)

   - sx8 conversion to per-host tag sets (Christoph)

   - IO priority improvements (Damien)

   - mq-deadline zoned fix (Damien)

   - Ref count blkcg series (Dennis)

   - Lots of blk-mq improvements and speedups (me)

   - sbitmap scalability improvements (me)

   - Make core inflight IO accounting per-cpu (Mikulas)

   - Export timeout setting in sysfs (Weiping)

   - Cleanup the direct issue path (Jianchao)

   - Export blk-wbt internals in block debugfs for easier debugging

   - Lots of other fixes and improvements"

* tag 'for-4.21/block-20181221' of git:// (364 commits)
  kyber: use sbitmap add_wait_queue/list_del wait helpers
  sbitmap: add helpers for add/del wait queue handling
  block: save irq state in blkg_lookup_create()
  dm: don't reuse bio for flushes
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1
  blk-mq: enable IO poll if .nr_queues of type poll > 0
  blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
  blk-mq: skip zero-queue maps in blk_mq_map_swqueue
parents b12a9124 00203ba4
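The "much better polling" item in the log above rests on blk-mq's new per-type queue maps: a tag set can now carry more than one map, and the map of type HCTX_TYPE_POLL can be backed by hardware queues whose completions are never wired to an interrupt. As a rough illustration only, a hypothetical driver's map_queues callback might look like the sketch below; my_dev, nr_default_queues, nr_poll_queues and my_map_queues are invented names, and the blk-mq member names follow the 4.21-era API, so treat the details as approximate rather than authoritative.

    #include <linux/blk-mq.h>

    /* Invented driver state: how many hardware queues back each map type. */
    struct my_dev {
            unsigned int nr_default_queues;
            unsigned int nr_poll_queues;    /* completed by polling, no IRQ */
    };

    /* Sketch of a .map_queues callback for a tag set with set->nr_maps == 2. */
    static int my_map_queues(struct blk_mq_tag_set *set)
    {
            struct my_dev *dev = set->driver_data;
            struct blk_mq_queue_map *map;

            map = &set->map[HCTX_TYPE_DEFAULT];
            map->nr_queues = dev->nr_default_queues;
            map->queue_offset = 0;
            blk_mq_map_queues(map);

            map = &set->map[HCTX_TYPE_POLL];
            map->nr_queues = dev->nr_poll_queues;
            map->queue_offset = dev->nr_default_queues;
            blk_mq_map_queues(map);

            return 0;
    }

nvme-pci does something along these lines in its own map_queues callback, which is what lets its poll queues complete I/O without interrupts.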
......@@ -244,7 +244,7 @@ Description:
What: /sys/block/<disk>/queue/zoned
Date: September 2016
Contact: Damien Le Moal <>
Contact: Damien Le Moal <>
zoned indicates if the device is a zoned block device
and the zone model of the device if it is indeed zoned.
......@@ -259,6 +259,14 @@ Description:
zone commands, they will be treated as regular block
devices and zoned will report "none".
What: /sys/block/<disk>/queue/nr_zones
Date: November 2018
Contact: Damien Le Moal <>
nr_zones indicates the total number of zones of a zoned block
device ("host-aware" or "host-managed" zone model). For regular
block devices, the value is always 0.
What: /sys/block/<disk>/queue/chunk_sectors
Date: September 2016
Contact: Hannes Reinecke <>
......@@ -268,6 +276,6 @@ Description:
indicates the size in 512B sectors of the RAID volume
stripe segment. For a zoned block device, either
host-aware or host-managed, chunk_sectors indicates the
size of 512B sectors of the zones of the device, with
size in 512B sectors of the zones of the device, with
the eventual exception of the last zone of the device
which may be smaller.
......@@ -1879,8 +1879,10 @@ following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
associates the bio with the inode's owner cgroup. Can be
called anytime between bio allocation and submission.
associates the bio with the inode's owner cgroup and the
corresponding request queue. This must be called after
a queue (device) has been associated with the bio and
before submission.
wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
......@@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
cases by skipping wbc_init_bio() and using bio_associate_blkg()
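The wbc_init_bio() change above is easy to get wrong: the bio must already know its device (and therefore its request queue) before wbc_init_bio() is called, and the call must happen before submission. A minimal, purely illustrative sketch of that ordering in a filesystem writeback path follows; fs_write_one_page() is an invented helper and error handling is omitted.

    #include <linux/bio.h>
    #include <linux/writeback.h>

    /* Invented helper: demonstrates only the required call ordering. */
    static void fs_write_one_page(struct writeback_control *wbc,
                                  struct block_device *bdev,
                                  struct page *page, sector_t sector)
    {
            struct bio *bio = bio_alloc(GFP_NOFS, 1);

            bio_set_dev(bio, bdev);          /* queue association comes first */
            bio->bi_iter.bi_sector = sector;
            bio->bi_opf = REQ_OP_WRITE;
            bio_add_page(bio, page, PAGE_SIZE, 0);

            wbc_init_bio(wbc, bio);          /* after bio_set_dev(), before submit */
            wbc_account_io(wbc, page, PAGE_SIZE);

            submit_bio(bio);
    }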
......@@ -65,7 +65,6 @@ Description of Contents:
3.2.3 I/O completion
3.2.4 Implications for drivers that do not interpret bios (don't handle
multiple segments)
3.2.5 Request command tagging
3.3 I/O submission
4. The I/O scheduler
5. Scalability related changes
......@@ -708,93 +707,6 @@ is crossed on completion of a transfer. (The end*request* functions should
be used only if the request has come down from block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio)
3.2.5 Generic request command tagging

Tag helpers
Block now offers some simple generic functionality to help support command
queueing (typically known as tagged command queueing), ie manage more than
one outstanding command on a queue at any given time.
blk_queue_init_tags(struct request_queue *q, int depth)
Initialize internal command tagging structures for a maximum
depth of 'depth'.
blk_queue_free_tags((struct request_queue *q)
Teardown tag info associated with the queue. This will be done
automatically by block if blk_queue_cleanup() is called on a queue
that is using tagging.
The above are initialization and exit management, the main helpers during
normal operations are:
blk_queue_start_tag(struct request_queue *q, struct request *rq)
Start tagged operation for this request. A free tag number between
0 and 'depth' is assigned to the request (rq->tag holds this number),
and 'rq' is added to the internal tag management. If the maximum depth
for this queue is already achieved (or if the tag wasn't started for
some other reason), 1 is returned. Otherwise 0 is returned.
blk_queue_end_tag(struct request_queue *q, struct request *rq)
End tagged operation on this request. 'rq' is removed from the internal
book keeping structures.
To minimize struct request and queue overhead, the tag helpers utilize some
of the same request members that are used for normal request queue management.
This means that a request cannot both be an active tag and be on the queue
list at the same time. blk_queue_start_tag() will remove the request, but
the driver must remember to call blk_queue_end_tag() before signalling
completion of the request to the block layer. This means ending tag
operations before calling end_that_request_last()! For an example of a user
of these helpers, see the IDE tagged command queueing support.

Tag info
Some block functions exist to query current tag status or to go from a
tag number to the associated request. These are, in no particular order:
Returns 1 if the queue 'q' is using tagging, 0 if not.
blk_queue_tag_request(q, tag)
Returns a pointer to the request associated with tag 'tag'.
Return current queue depth.
Returns 1 if the queue can accept a new queued command, 0 if we are
at the maximum depth already.
Returns 1 if the request 'rq' is tagged.

Internal structure
Internally, block manages tags in the blk_queue_tag structure:
struct blk_queue_tag {
        struct request **tag_index;     /* array of pointers to rq */
        unsigned long *tag_map;         /* bitmap of free tags */
        struct list_head busy_list;     /* fifo list of busy tags */
        int busy;                       /* queue depth */
        int max_depth;                  /* max queue depth */
};
Most of the above is simple and straightforward, however busy_list may need
a bit of explaining. Normally we don't care too much about request ordering,
but in the event of any barrier requests in the tag queue we need to ensure
that requests are restarted in the order they were queued.
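For readers who never used the interface whose documentation is being deleted above, the pattern it describes looked roughly like this; my_issue_to_hw() is an invented stand-in for the driver's hardware submission and the legacy path's queue-lock requirements are glossed over, so this is a sketch of the described usage, not a drop-in driver.

    #include <linux/blkdev.h>

    static void my_issue_to_hw(int tag, struct request *rq);   /* invented */

    /* Legacy (pre-blk-mq) tagged command queueing pattern, now removed. */
    static void my_dispatch(struct request_queue *q, struct request *rq)
    {
            if (blk_queue_start_tag(q, rq))
                    return;                 /* no free tag (or not startable), retry later */

            my_issue_to_hw(rq->tag, rq);    /* rq->tag now holds the assigned tag */
    }

    static void my_complete(struct request_queue *q, struct request *rq)
    {
            blk_queue_end_tag(q, rq);       /* end tagging before signalling completion */
            __blk_end_request_all(rq, BLK_STS_OK);
    }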
3.3 I/O Submission
The routine submit_bio() is used to submit a single io. Higher level i/o
......@@ -64,7 +64,7 @@ guess, the kernel will put the process issuing IO to sleep for an amount
of time, before entering a classic poll loop. This mode might be a
little slower than pure classic polling, but it will be more efficient.
If set to a value larger than 0, the kernel will put the process issuing
IO to sleep for this amont of microseconds before entering classic
IO to sleep for this amount of microseconds before entering classic
iostats (RW)
......@@ -194,4 +194,31 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups
have more smooth throughput, but higher CPU overhead. This exists only when
zoned (RO)
This indicates if the device is a zoned block device and the zone model of the
device if it is indeed zoned. The possible values indicated by zoned are
"none" for regular block devices and "host-aware" or "host-managed" for zoned
block devices. The characteristics of host-aware and host-managed zoned block
devices are described in the ZBC (Zoned Block Commands) and ZAC
(Zoned Device ATA Command Set) standards. These standards also define the
"drive-managed" zone model. However, since drive-managed zoned block devices
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".
nr_zones (RO)
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), this indicates the total number of zones of the device.
This is always 0 for regular block devices.
chunk_sectors (RO)
This has different meaning depending on the type of the block device.
For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
of the RAID volume stripe segment. For a zoned block device, either host-aware
or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
of the device, with the eventual exception of the last zone of the device which
may be smaller.
Jens Axboe <>, February 2009
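The new nr_zones attribute is meant to be read together with the existing zoned attribute from user space. As a small, purely hypothetical illustration (a stand-alone program, not part of this merge), the following prints both attributes for a given disk.

    #include <stdio.h>

    /* Read /sys/block/<disk>/queue/<attr> into buf; returns 0 on success. */
    static int read_queue_attr(const char *disk, const char *attr, char *buf, int len)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (!fgets(buf, len, f))
                    buf[0] = '\0';
            fclose(f);
            return 0;
    }

    int main(int argc, char **argv)
    {
            const char *disk = argc > 1 ? argv[1] : "sda";
            char zoned[64], nr_zones[64];

            if (read_queue_attr(disk, "zoned", zoned, sizeof(zoned)) ||
                read_queue_attr(disk, "nr_zones", nr_zones, sizeof(nr_zones)))
                    return 1;

            /* "none" devices report 0 zones; host-aware/host-managed report the total. */
            printf("%s zoned: %s", disk, zoned);
            printf("%s nr_zones: %s", disk, nr_zones);
            return 0;
    }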
......@@ -97,11 +97,6 @@ parameters may be changed at runtime by the command
allowing boot to proceed. none ignores them, expecting
user space to do the scan.
[SCSI] use blk-mq I/O path by default
See SCSI_MQ_DEFAULT in drivers/scsi/Kconfig.
Format: <y/n>
sim710= [SCSI,HW]
See header of drivers/scsi/sim710.c.
......@@ -155,12 +155,6 @@ config BLK_CGROUP_IOLATENCY
Note, this is an experimental interface and could be changed someday.
config BLK_WBT_SQ
bool "Single queue writeback throttling"
depends on BLK_WBT
Enable writeback throttling by default on legacy single queue devices
config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
......@@ -3,67 +3,6 @@ if BLOCK
menu "IO Schedulers"
default y
The no-op I/O scheduler is a minimal scheduler that does basic merging
and sorting. Its main uses include non-disk based block devices like
memory devices, and specialised software or hardware environments
that do their own scheduling and require only minimal assistance from
the kernel.
tristate "Deadline I/O scheduler"
default y
The deadline I/O scheduler is simple and compact. It will provide
CSCAN service with FIFO expiration of requests, switching to
a new point in the service tree and doing a batch of IO from there
in case of expiry.
tristate "CFQ I/O scheduler"
default y
The CFQ I/O scheduler tries to distribute bandwidth equally
among all processes in the system. It should provide a fair
and low latency working environment, suitable for both desktop
and server systems.
This is the default I/O scheduler.
bool "CFQ Group Scheduling support"
Enable group IO scheduling in CFQ.
prompt "Default I/O scheduler"
Select the I/O scheduler which will be used by default for all
block devices.
bool "Deadline" if IOSCHED_DEADLINE=y
bool "CFQ" if IOSCHED_CFQ=y
bool "No-op"
default "deadline" if DEFAULT_DEADLINE
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
tristate "MQ deadline I/O scheduler"
default y
......@@ -3,7 +3,7 @@
# Makefile for the kernel block layer
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
......@@ -18,9 +18,6 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
......@@ -334,7 +334,7 @@ static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
parent = bfqg_parent(bfqg);
if (unlikely(!parent))
......@@ -642,7 +642,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
uint64_t serial_nr;
serial_nr = bio_blkcg(bio)->css.serial_nr;
serial_nr = __bio_blkcg(bio)->css.serial_nr;
* Check whether blkcg has changed. The condition may trigger
......@@ -651,7 +651,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
goto out;
bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio));
* Update blkg_path for bfq_log_* functions. We cache this
* path, and update it here, for the following
......@@ -399,9 +399,9 @@ static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
unsigned long flags;
struct bfq_io_cq *icq;
spin_lock_irqsave(q->queue_lock, flags);
spin_lock_irqsave(&q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(ioc, q));
spin_unlock_irqrestore(q->queue_lock, flags);
spin_unlock_irqrestore(&q->queue_lock, flags);
return icq;
......@@ -4066,7 +4066,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
if (idle_timer_disabled)
* Since the idle timer has been disabled,
......@@ -4085,7 +4085,7 @@ static void bfq_update_dispatch_stats(struct request_queue *q,
bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
static inline void bfq_update_dispatch_stats(struct request_queue *q,
......@@ -4416,7 +4416,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio));
bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio));
if (!bfqg) {
bfqq = &bfqd->oom_bfqq;
goto out;
......@@ -4669,11 +4669,11 @@ static void bfq_update_insert_stats(struct request_queue *q,
* In addition, the following queue lock guarantees that
* bfqq_group(bfqq) exists as well.
bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, cmd_flags);
if (idle_timer_disabled)
static inline void bfq_update_insert_stats(struct request_queue *q,
......@@ -5414,9 +5414,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = bfqd;
q->elevator = eq;
* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
......@@ -5756,7 +5756,7 @@ static struct elv_fs_entry bfq_attrs[] = {
static struct elevator_type iosched_bfq_mq = { = {
.ops = {
.limit_depth = bfq_limit_depth,
.prepare_request = bfq_prepare_request,
.requeue_request = bfq_finish_requeue_request,
......@@ -5777,7 +5777,6 @@ static struct elevator_type iosched_bfq_mq = {
.exit_sched = bfq_exit_queue,
.uses_mq = true,
.icq_size = sizeof(struct bfq_io_cq),
.icq_align = __alignof__(struct bfq_io_cq),
.elevator_attrs = bfq_attrs,
......@@ -390,7 +390,6 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
bip->bip_iter.bi_sector += bytes_done >> 9;
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
* bio_integrity_trim - Trim integrity vector
......@@ -460,7 +459,6 @@ void bioset_integrity_free(struct bio_set *bs)
void __init bio_integrity_init(void)
......@@ -244,7 +244,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
void bio_uninit(struct bio *bio)
......@@ -571,14 +571,13 @@ void bio_put(struct bio *bio)
inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
int bio_phys_segments(struct request_queue *q, struct bio *bio)
if (unlikely(!bio_flagged(bio, BIO_SEG_VALID)))
blk_recount_segments(q, bio);
return bio->bi_phys_segments;
* __bio_clone_fast - clone a bio that shares the original bio's biovec
......@@ -610,7 +609,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
bio_clone_blkcg_association(bio, bio_src);
bio_clone_blkg_association(bio, bio_src);
......@@ -901,7 +901,6 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
static void submit_bio_wait_endio(struct bio *bio)
......@@ -1592,7 +1591,6 @@ void bio_set_pages_dirty(struct bio *bio)
static void bio_release_pages(struct bio *bio)
......@@ -1662,17 +1660,33 @@ void bio_check_pages_dirty(struct bio *bio)
spin_unlock_irqrestore(&bio_dirty_lock, flags);
void update_io_ticks(struct hd_struct *part, unsigned long now)
{
        unsigned long stamp;
again:
        stamp = READ_ONCE(part->stamp);
        if (unlikely(stamp != now)) {
                if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
                        __part_stat_add(part, io_ticks, 1);
                }
        }
        if (part->partno) {
                part = &part_to_disk(part)->part0;
                goto again;
        }
}
void generic_start_io_acct(struct request_queue *q, int op,
unsigned long sectors, struct hd_struct *part)
const int sgrp = op_stat_group(op);
int cpu = part_stat_lock();
part_round_stats(q, cpu, part);
part_stat_inc(cpu, part, ios[sgrp]);
part_stat_add(cpu, part, sectors[sgrp], sectors);
update_io_ticks(part, jiffies);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, sectors[sgrp], sectors);
part_inc_in_flight(q, part, op_is_write(op));
......@@ -1682,12 +1696,15 @@ EXPORT_SYMBOL(generic_start_io_acct);
void generic_end_io_acct(struct request_queue *q, int req_op,
struct hd_struct *part, unsigned long start_time)
unsigned long duration = jiffies - start_time;
unsigned long now = jiffies;
unsigned long duration = now - start_time;
const int sgrp = op_stat_group(req_op);
int cpu = part_stat_lock();
part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_round_stats(q, cpu, part);
update_io_ticks(part, now);
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_add(part, time_in_queue, duration);
part_dec_in_flight(q, part, op_is_write(req_op));
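The io_ticks handling above deserves a note: instead of the old part_round_stats() sampling, the new code adds at most one tick per jiffy, and cmpxchg() decides which CPU gets to add it, so io_ticks becomes an approximation that no longer needs the per-request round-stats work. A rough userspace analogy of that idea, with invented names and C11 atomics in place of the kernel primitives:

    #include <stdatomic.h>

    static _Atomic unsigned long stamp;       /* last tick already accounted */
    static _Atomic unsigned long busy_ticks;  /* analogue of io_ticks */

    /* Call on every I/O start/completion with the current tick ("jiffy"). */
    static void account_tick(unsigned long now)
    {
            unsigned long old = atomic_load(&stamp);

            /* Only the caller that wins the compare-and-swap for a new tick
             * value adds a tick, so concurrent callers add at most one per tick. */
            if (old != now && atomic_compare_exchange_strong(&stamp, &old, now))
                    atomic_fetch_add(&busy_ticks, 1);
    }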
......@@ -1957,102 +1974,133 @@ EXPORT_SYMBOL(bioset_init_from_src);
* bio_associate_blkcg_from_page - associate a bio with the page's blkcg
* bio_disassociate_blkg - puts back the blkg reference if associated
* @bio: target bio
* @page: the page to lookup the blkcg from
* Associate @bio with the blkcg from @page's owning memcg. This works like
* every other associate function wrt references.
* Helper to disassociate the blkg from @bio if a blkg is associated.
int bio_associate_blkcg_from_page(struct bio *bio, struct page *page)
void bio_disassociate_blkg(struct bio *bio)
struct cgroup_subsys_state *blkcg_css;
if (unlikely(bio->bi_css))
return -EBUSY;
if (!page->mem_cgroup)
return 0;
blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
bio->bi_css = blkcg_css;
return 0;
if (bio->bi_blkg) {
bio->bi_blkg = NULL;
#endif /* CONFIG_MEMCG */
* bio_associate_blkcg - associate a bio with the specified blkcg
* __bio_associate_blkg - associate a bio with the a blkg
* @bio: target bio
* @blkcg_css: css of the blkcg to associate
* @blkg: the blkg to associate
* Associate @bio with the blkcg specified by @blkcg_css. Block layer will
* treat @bio as if it were issued by a task which belongs to the blkcg.
* This tries to associate @bio with the specified @blkg. Association failure
* is handled by walking up the blkg tree. Therefore, the blkg associated can
* be anything between @blkg and the root_blkg. This situation only happens
* when a cgroup is dying and then the remaining bios will spill to the closest
* alive blkg.
* This function takes an extra reference of @blkcg_css which will be put
* when @bio is released. The caller must own @bio and is responsible for
* synchronizing calls to this function.
* A reference will be taken on the @blkg and will be released when @bio is
* freed.
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
static void __bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
if (unlikely(bio->bi_css))
return -EBUSY;
bio->bi_css = blkcg_css;
return 0;
bio->bi_blkg = blkg_tryget_closest(blkg);
* bio_associate_blkg - associate a bio with the specified blkg
* bio_associate_blkg_from_css - associate a bio with a specified css
* @bio: target bio
* @blkg: the blkg to associate
* @css: target css
* Associate @bio with the blkg specified by @blkg. This is the queue specific
* blkcg information associated with the @bio, a reference will be taken on the
* @blkg and will be freed when the bio is freed.
* Associate @bio with the blkg found by combining the css's blkg and the
* request_queue of the @bio. This falls back to the queue's root_blkg if
* the association fails with the css.
int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
void bio_associate_blkg_from_css(struct bio *bio,
struct cgroup_subsys_state *css)
if (unlikely(bio->bi_blkg))
return -EBUSY;
if (!blkg_try_get(blkg))
return -ENODEV;
bio->bi_blkg = blkg;
return 0;
struct request_queue *q = bio->bi_disk->queue;
struct blkcg_gq *blkg;
if (!css || !css->parent)
blkg = q->root_blkg;
blkg = blkg_lookup_create(css_to_blkcg(css), q);