1. 04 Nov, 2017 3 commits
    • Ming Lei's avatar
      blk-mq: don't allocate driver tag upfront for flush rq · 923218f6
      Ming Lei authored
      The idea behind it is simple:
      
      1) for none scheduler, driver tag has to be borrowed for flush rq,
         otherwise we may run out of tag, and that causes an IO hang. And
         get/put driver tag is actually noop for none, so reordering tags
         isn't necessary at all.
      
      2) for a real I/O scheduler, we need not allocate a driver tag upfront
         for flush rq. It works just fine to follow the same approach as
         normal requests: allocate driver tag for each rq just before calling
         ->queue_rq().
      
      One driver visible change is that the driver tag isn't shared in the
      flush request sequence. That won't be a problem, since we always do that
      in legacy path.
      
      Then flush rq need not be treated specially wrt. get/put driver tag.
      This cleans up the code - for instance, reorder_tags_to_front() can be
      removed, and we needn't worry about request ordering in dispatch list
      for avoiding I/O deadlock.
      
      Also we have to put the driver tag before requeueing.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      923218f6
    • Ming Lei's avatar
      blk-flush: use blk_mq_request_bypass_insert() · 598906f8
      Ming Lei authored
      In the following patch, we will use RQF_FLUSH_SEQ to decide:
      
      1) if the flag isn't set, the flush rq need to be inserted via
      blk_insert_flush()
      
      2) otherwise, the flush rq need to be dispatched directly since
      it is in flush machinery now.
      
      So we use blk_mq_request_bypass_insert() for requests of bypassing
      flush machinery, just like the legacy path did.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      598906f8
    • Ming Lei's avatar
      blk-flush: don't run queue for requests bypassing flush · 9c71c83c
      Ming Lei authored
      blk_insert_flush() should only insert request since run queue always
      follows it.
      
      In case of bypassing flush, we don't need to run queue because every
      blk_insert_flush() follows one run queue.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9c71c83c
  2. 25 Aug, 2017 1 commit
  3. 23 Aug, 2017 1 commit
    • Christoph Hellwig's avatar
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      74d46992
  4. 21 Jun, 2017 1 commit
  5. 09 Jun, 2017 1 commit
    • Christoph Hellwig's avatar
      block: introduce new block status code type · 2a842aca
      Christoph Hellwig authored
      Currently we use nornal Linux errno values in the block layer, and while
      we accept any error a few have overloaded magic meanings.  This patch
      instead introduces a new  blk_status_t value that holds block layer specific
      status codes and explicitly explains their meaning.  Helpers to convert from
      and to the previous special meanings are provided for now, but I suspect
      we want to get rid of them in the long run - those drivers that have a
      errno input (e.g. networking) usually get errnos that don't know about
      the special block layer overloads, and similarly returning them to userspace
      will usually return somethings that strictly speaking isn't correct
      for file system operations, but that's left as an exercise for later.
      
      For now the set of errors is a very limited set that closely corresponds
      to the previous overloaded errno values, but there is some low hanging
      fruite to improve it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2a842aca
  6. 19 Apr, 2017 1 commit
  7. 24 Mar, 2017 1 commit
  8. 17 Feb, 2017 1 commit
    • Jens Axboe's avatar
      block: don't defer flushes on blk-mq + scheduling · 7520872c
      Jens Axboe authored
      For blk-mq with scheduling, we can potentially end up with ALL
      driver tags assigned and sitting on the flush queues. If we
      defer because of an inlfight data request, then we can deadlock
      if that data request doesn't already have a tag assigned.
      
      This fixes a deadlock with running the xfs/297 xfstest, where
      thousands of syncs can cause the drive queue to stall.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      7520872c
  9. 31 Jan, 2017 1 commit
    • Christoph Hellwig's avatar
      block: fold cmd_type into the REQ_OP_ space · aebf526b
      Christoph Hellwig authored
      Instead of keeping two levels of indirection for requests types, fold it
      all into the operations.  The little caveat here is that previously
      cmd_type only applied to struct request, while the request and bio op
      fields were set to plain REQ_OP_READ/WRITE even for passthrough
      operations.
      
      Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
      private requests, althought it has to add two for each so that we
      can communicate the data in/out nature of the request.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      aebf526b
  10. 27 Jan, 2017 2 commits
  11. 17 Jan, 2017 1 commit
  12. 09 Dec, 2016 1 commit
  13. 09 Nov, 2016 1 commit
  14. 02 Nov, 2016 1 commit
  15. 01 Nov, 2016 1 commit
  16. 28 Oct, 2016 2 commits
    • Christoph Hellwig's avatar
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig authored
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ef295ecf
    • Christoph Hellwig's avatar
      block: split out request-only flags into a new namespace · e8064021
      Christoph Hellwig authored
      A lot of the REQ_* flags are only used on struct requests, and only of
      use to the block layer and a few drivers that dig into struct request
      internals.
      
      This patch adds a new req_flags_t rq_flags field to struct request for
      them, and thus dramatically shrinks the number of common requests.  It
      also removes the unfortunate situation where we have to fit the fields
      from the same enum into 32 bits for struct bio and 64 bits for
      struct request.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarShaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e8064021
  17. 26 Oct, 2016 1 commit
  18. 15 Sep, 2016 1 commit
  19. 07 Jun, 2016 4 commits
  20. 13 Apr, 2016 1 commit
  21. 25 Nov, 2015 2 commits
    • Jens Axboe's avatar
      Revert "blk-flush: Queue through IO scheduler when flush not required" · d7cf931d
      Jens Axboe authored
      This reverts commit 1b2ff19e.
      
      Jan writes:
      
      --
      
      Thanks for report! After some investigation I found out we allocate
      elevator specific data in __get_request() only for non-flush requests. And
      this is actually required since the flush machinery uses the space in
      struct request for something else. Doh. So my patch is just wrong and not
      easy to fix since at the time __get_request() is called we are not sure
      whether the flush machinery will be used in the end. Jens, please revert
      1b2ff19e. Thanks!
      
      I'm somewhat surprised that you can reliably hit the race where flushing
      gets disabled for the device just while the request is in flight. But I
      guess during boot it makes some sense.
      
      --
      
      So let's just revert it, we can fix the queue run manually after the
      fact. This race is rare enough that it didn't trigger in testing, it
      requires the specific disable-while-in-flight scenario to trigger.
      d7cf931d
    • Jens Axboe's avatar
      Revert "blk-flush: Queue through IO scheduler when flush not required" · dcd8376c
      Jens Axboe authored
      This reverts commit 1b2ff19e.
      
      Jan writes:
      
      --
      
      Thanks for report! After some investigation I found out we allocate
      elevator specific data in __get_request() only for non-flush requests. And
      this is actually required since the flush machinery uses the space in
      struct request for something else. Doh. So my patch is just wrong and not
      easy to fix since at the time __get_request() is called we are not sure
      whether the flush machinery will be used in the end. Jens, please revert
      1b2ff19e. Thanks!
      
      I'm somewhat surprised that you can reliably hit the race where flushing
      gets disabled for the device just while the request is in flight. But I
      guess during boot it makes some sense.
      
      --
      
      So let's just revert it, we can fix the queue run manually after the
      fact. This race is rare enough that it didn't trigger in testing, it
      requires the specific disable-while-in-flight scenario to trigger.
      dcd8376c
  22. 16 Nov, 2015 1 commit
    • Jan Kara's avatar
      blk-flush: Queue through IO scheduler when flush not required · 1b2ff19e
      Jan Kara authored
      Currently blk_insert_flush() just adds flush request to q->queue_head
      when flush is not required. That completely bypasses IO scheduler so
      e.g. CFQ can be idling waiting for new request to arrive and will idle
      through the whole window unnecessarily. Luckily this only happens in
      rare cases as usually checks in generic_make_request_checks() clear
      FLUSH and FUA flags early if they are not needed.
      
      When no flushing is actually required, we can easily fix the problem by
      properly queueing the request through the IO scheduler. Ideally IO
      scheduler should be also made aware of requests queued via
      blk_flush_queue_rq(). However inserting flush request through IO
      scheduler can have unwanted side-effects since due to flush batching
      delaying the flush request in IO scheduler will delay all flush requests
      possibly coming from other processes. So we keep adding the request
      directly to q->queue_head.
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      1b2ff19e
  23. 15 Aug, 2015 1 commit
    • Ming Lei's avatar
      blk-mq: fix race between timeout and freeing request · 0048b483
      Ming Lei authored
      Inside timeout handler, blk_mq_tag_to_rq() is called
      to retrieve the request from one tag. This way is obviously
      wrong because the request can be freed any time and some
      fiedds of the request can't be trusted, then kernel oops
      might be triggered[1].
      
      Currently wrt. blk_mq_tag_to_rq(), the only special case is
      that the flush request can share same tag with the request
      cloned from, and the two requests can't be active at the same
      time, so this patch fixes the above issue by updating tags->rqs[tag]
      with the active request(either flush rq or the request cloned
      from) of the tag.
      
      Also blk_mq_tag_to_rq() gets much simplified with this patch.
      
      Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
      make sure the request can't be freed, so in bt_for_each() this
      helper is replaced with tags->rqs[tag].
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
      [  439.700653] Dumping ftrace buffer:^M
      [  439.700653]    (ftrace buffer empty)^M
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
      [  439.730500] Stack:^M
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
      [  439.755663] Call Trace:^M
      [  439.755663]  <IRQ> ^M
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
      [  439.755663]  <EOI> ^M
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      ^M
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.790911]  RSP <ffff880819203da0>^M
      [  439.790911] CR2: 0000000000000158^M
      [  439.790911] ---[ end trace d40af58949325661 ]---^M
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0048b483
  24. 25 Sep, 2014 9 commits