1. 11 Sep, 2009 2 commits
    • Tejun Heo's avatar
      scsi,block: update SCSI to handle mixed merge failures · da6c5c72
      Tejun Heo authored
      Update scsi_io_completion() such that it only fails requests till the
      next error boundary and retry the leftover.  This enables block layer
      to merge requests with different failfast settings and still behave
      correctly on errors.  Allow merge of requests of different failfast
      As SCSI is currently the only subsystem which follows failfast status,
      there's no need to worry about other block drivers for now.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: implement mixed merge of different failfast requests · 80a761fd
      Tejun Heo authored
      Failfast has characteristics from other attributes.  When issuing,
      executing and successuflly completing requests, failfast doesn't make
      any difference.  It only affects how a request is handled on failure.
      Allowing requests with different failfast settings to be merged cause
      normal IOs to fail prematurely while not allowing has performance
      penalties as failfast is used for read aheads which are likely to be
      located near in-flight or to-be-issued normal IOs.
      This patch introduces the concept of 'mixed merge'.  A request is a
      mixed merge if it is merge of segments which require different
      handling on failure.  Currently the only mixable attributes are
      failfast ones (or lack thereof).
      When a bio with different failfast settings is added to an existing
      request or requests of different failfast settings are merged, the
      merged request is marked mixed.  Each bio carries failfast settings
      and the request always tracks failfast state of the first bio.  When
      the request fails, blk_rq_err_bytes() can be used to determine how
      many bytes can be safely failed without crossing into an area which
      requires further retrials.
      This allows request merging regardless of failfast settings while
      keeping the failure handling correct.
      This patch only implements mixed merge but doesn't enable it.  The
      next one will update SCSI to make use of mixed merge.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  2. 03 Jul, 2009 1 commit
    • Tejun Heo's avatar
      block: don't merge requests of different failfast settings · ab0fd1de
      Tejun Heo authored
      Block layer used to merge requests and bios with different failfast
      settings.  This caused regular IOs to fail prematurely when they were
      merged into failfast requests for readahead.
      Niel Lambrechts could trigger the problem semi-reliably on ext4 when
      resuming from STR.  ext4 uses readahead when reading inodes and
      combined with the deterministic extra SATA PHY exception cycle during
      resume on the specific configuration, non-readahead inode read would
      fail causing ext4 errors.  Please read the following thread for
      This patch makes block layer reject merging if the failfast settings
      don't match.  This is correct but likely to lower IO performance by
      preventing regular IOs from mingling into surrounding readahead
      requests.  Changes to allow such mixed merges and handle errors
      correctly will be added later.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarNiel Lambrechts <niel.lambrechts@gmail.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Signed-off-by: default avatarJens Axboe <axboe@carl.(none)>
  3. 22 May, 2009 1 commit
  4. 11 May, 2009 3 commits
    • Tejun Heo's avatar
      block: hide request sector and data_len · a2dec7b3
      Tejun Heo authored
      Block low level drivers for some reason have been pretty good at
      abusing block layer API.  Especially struct request's fields tend to
      get violated in all possible ways.  Make it clear that low level
      drivers MUST NOT access or manipulate rq->sector and rq->data_len
      directly by prefixing them with double underscores.
      This change is also necessary to break build of out-of-tree codes
      which assume the previous block API where internal fields can be
      manipulated and rq->data_len carries residual count on completion.
      [ Impact: hide internal fields, block API change ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: drop request->hard_* and *nr_sectors · 2e46e8b2
      Tejun Heo authored
      struct request has had a few different ways to represent some
      properties of a request.  ->hard_* represent block layer's view of the
      request progress (completion cursor) and the ones without the prefix
      are supposed to represent the issue cursor and allowed to be updated
      as necessary by the low level drivers.  The thing is that as block
      layer supports partial completion, the two cursors really aren't
      necessary and only cause confusion.  In addition, manual management of
      request detail from low level drivers is cumbersome and error-prone at
      the very least.
      Another interesting duplicate fields are rq->[hard_]nr_sectors and
      rq->{hard_cur|current}_nr_sectors against rq->data_len and
      rq->bio->bi_size.  This is more convoluted than the hard_ case.
      rq->[hard_]nr_sectors are initialized for requests with bio but
      blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
      initialized for all request but blk_rq_bytes() uses it only for pc
      requests.  This causes good amount of confusion throughout block layer
      and its drivers and determining the request length has been a bit of
      black magic which may or may not work depending on circumstances and
      what the specific LLD is actually doing.
      rq->{hard_cur|current}_nr_sectors represent the number of sectors in
      the contiguous data area at the front.  This is mainly used by drivers
      which transfers data by walking request segment-by-segment.  This
      value always equals rq->bio->bi_size >> 9.  However, data length for
      pc requests may not be multiple of 512 bytes and using this field
      becomes a bit confusing.
      In general, having multiple fields to represent the same property
      leads only to confusion and subtle bugs.  With recent block low level
      driver cleanups, no driver is accessing or manipulating these
      duplicate fields directly.  Drop all the duplicates.  Now rq->sector
      means the current sector, rq->data_len the current total length and
      rq->bio->bi_size the current segment length.  Everything else is
      defined in terms of these three and available only through accessors.
      * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
        now handles pc and fs requests equally other than rq->sector update.
        This means that now pc requests can use partial completion too (no
        in-kernel user yet tho).
      * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
        now uses byte count as the primary data length.
      * blk_rq_pos() is now guranteed to be always correct.  In-block users
      * blk_rq_bytes() is now guaranteed to be always valid as is
        blk_rq_sectors().  In-block users converted.
      * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
        More convenient one is used.
      * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
        pointer to request.
      [ Impact: API cleanup, single way to represent one property of a request ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: convert to pos and nr_sectors accessors · 83096ebf
      Tejun Heo authored
      With recent cleanups, there is no place where low level driver
      directly manipulates request fields.  This means that the 'hard'
      request fields always equal the !hard fields.  Convert all
      rq->sectors, nr_sectors and current_nr_sectors references to
      While at it, drop superflous blk_rq_pos() < 0 test in swim.c.
      [ Impact: use pos and nr_sectors accessors ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarGeert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Tested-by: default avatarGrant Likely <grant.likely@secretlab.ca>
      Acked-by: default avatarGrant Likely <grant.likely@secretlab.ca>
      Tested-by: default avatarAdrian McMenamin <adrian@mcmen.demon.co.uk>
      Acked-by: default avatarAdrian McMenamin <adrian@mcmen.demon.co.uk>
      Acked-by: default avatarMike Miller <mike.miller@hp.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Cc: Borislav Petkov <petkovbb@googlemail.com>
      Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
      Cc: Eric Moore <Eric.Moore@lsi.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Tim Waugh <tim@cyberelk.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Alex Dubov <oakad@yahoo.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Dario Ballabio <ballabio_dario@emc.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: unsik Kim <donari75@gmail.com>
      Cc: Laurent Vivier <Laurent@lvivier.info>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  5. 24 Apr, 2009 1 commit
  6. 07 Apr, 2009 1 commit
  7. 26 Mar, 2009 1 commit
  8. 06 Mar, 2009 1 commit
  9. 26 Feb, 2009 1 commit
    • Jens Axboe's avatar
      block: reduce stack footprint of blk_recount_segments() · 1e428079
      Jens Axboe authored
      blk_recalc_rq_segments() requires a request structure passed in, which
      we don't have from blk_recount_segments(). So the latter allocates one on
      the stack, using > 400 bytes of stack for that. This can cause us to spill
      over one page of stack from ext4 at least:
       0)     4560     400   blk_recount_segments+0x43/0x62
       1)     4160      32   bio_phys_segments+0x1c/0x24
       2)     4128      32   blk_rq_bio_prep+0x2a/0xf9
       3)     4096      32   init_request_from_bio+0xf9/0xfe
       4)     4064     112   __make_request+0x33c/0x3f6
       5)     3952     144   generic_make_request+0x2d1/0x321
       6)     3808      64   submit_bio+0xb9/0xc3
       7)     3744      48   submit_bh+0xea/0x10e
       8)     3696     368   ext4_mb_init_cache+0x257/0xa6a [ext4]
       9)     3328     288   ext4_mb_regular_allocator+0x421/0xcd9 [ext4]
      10)     3040     160   ext4_mb_new_blocks+0x211/0x4b4 [ext4]
      11)     2880     336   ext4_ext_get_blocks+0xb61/0xd45 [ext4]
      12)     2544      96   ext4_get_blocks_wrap+0xf2/0x200 [ext4]
      13)     2448      80   ext4_da_get_block_write+0x6e/0x16b [ext4]
      14)     2368     352   mpage_da_map_blocks+0x7e/0x4b3 [ext4]
      15)     2016     352   ext4_da_writepages+0x2ce/0x43c [ext4]
      16)     1664      32   do_writepages+0x2d/0x3c
      17)     1632     144   __writeback_single_inode+0x162/0x2cd
      18)     1488      96   generic_sync_sb_inodes+0x1e3/0x32b
      19)     1392      16   sync_sb_inodes+0xe/0x10
      20)     1376      48   writeback_inodes+0x69/0xb3
      21)     1328     208   balance_dirty_pages_ratelimited_nr+0x187/0x2f9
      22)     1120     224   generic_file_buffered_write+0x1d4/0x2c4
      23)      896     176   __generic_file_aio_write_nolock+0x35f/0x393
      24)      720      80   generic_file_aio_write+0x6c/0xc8
      25)      640      80   ext4_file_write+0xa9/0x137 [ext4]
      26)      560     320   do_sync_write+0xf0/0x137
      27)      240      48   vfs_write+0xb3/0x13c
      28)      192      64   sys_write+0x4c/0x74
      29)      128     128   system_call_fastpath+0x16/0x1b
      Split the segment counting out into a __blk_recalc_rq_segments() helper
      to avoid allocating an onstack request just for checking the physical
      segment count.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  10. 06 Nov, 2008 1 commit
  11. 17 Oct, 2008 1 commit
    • FUJITA Tomonori's avatar
      block: fix nr_phys_segments miscalculation bug · 86771427
      FUJITA Tomonori authored
      This fixes the bug reported by Nikanth Karthikesan <knikanth@suse.de>:
      The root cause of the bug is that blk_phys_contig_segment
      miscalculates q->max_segment_size.
      blk_phys_contig_segment checks:
      req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size
      But blk_recalc_rq_segments might expect that req->biotail and the
      previous bio in the req are supposed be merged into one
      segment. blk_recalc_rq_segments might also expect that next_req->bio
      and the next bio in the next_req are supposed be merged into one
      segment. In such case, we merge two requests that can't be merged
      here. Later, blk_rq_map_sg gives more segments than it should.
      We need to keep track of segment size in blk_recalc_rq_segments and
      use it to see if two requests can be merged. This patch implements it
      in the similar way that we used to do for hw merging (virtual
      Signed-off-by: default avatarFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  12. 09 Oct, 2008 8 commits
    • Jens Axboe's avatar
      block: inherit CPU completion on bio->rq and rq->rq merges · ab780f1e
      Jens Axboe authored
      Somewhat incomplete, as we do allow merges of requests and bios
      that have different completion CPUs given. This is done on the
      assumption that a larger IO is still more beneficial than CPU
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: move stats from disk to part0 · 074a7aca
      Tejun Heo authored
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      * {disk|all}_stat_*() are gone.
      * part_round_stats() is updated similary.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      * part_{inc|dec}_in_fligh() is implemented which automatically updates
        part0 stats for parts other than part0.
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: fix diskstats access · c9959059
      Tejun Heo authored
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemtion is disabled on
      entry as some callers don't do that.
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disables preemption.  They're removed
      and double underbars ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      disk_stat_lock() uses get_cpu() and returns the cpu index and all
      diskstat functions which access per-cpu counters now has @cpu
      argument to help RT.
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: fix disk->part[] dereferencing race · e71bf0d0
      Tejun Heo authored
      disk->part[] is protected by its matching bdev's lock.  However,
      non-critical accesses like collecting stats and printing out sysfs and
      proc information used to be performed without any locking.  As
      partitions can come and go dynamically, partitions can go away
      underneath those non-critical accesses.  As some of those accesses are
      writes, this theoretically can lead to silent corruption.
      This patch fixes the race by using RCU for the partition array and dev
      reference counter to hold partitions.
      * Rename disk->part[] to disk->__part[] to make sure no one outside
        genhd layer proper accesses it directly.
      * Use RCU for disk->__part[] dereferencing.
      * Implement disk_{get|put}_part() which can be used to get and put
        partitions from gendisk respectively.
      * Iterators are implemented to help iterate through all partitions
      * Functions which require RCU readlock are marked with _rcu suffix.
      * Use disk_put_part() in __blkdev_put() instead of directly putting
        the contained kobject.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: misc updates · 310a2c10
      Tejun Heo authored
      This patch makes the following misc updates in preparation for
      disk->part dereference fix and extended block devt support.
      * implment part_to_disk()
      * fix comment about gendisk->part indexing
      * rename get_part() to disk_map_sector()
      * don't use n which is always zero while printing disk information in
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Mikulas Patocka's avatar
      drop vmerge accounting · 5df97b91
      Mikulas Patocka authored
      Remove hw_segments field from struct bio and struct request. Without virtual
      merge accounting they have no purpose.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Mikulas Patocka's avatar
      block: drop virtual merging accounting · b8b3e16c
      Mikulas Patocka authored
      Remove virtual merge accounting.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • David Woodhouse's avatar
      Allow elevators to sort/merge discard requests · e17fc0a1
      David Woodhouse authored
      But blkdev_issue_discard() still emits requests which are interpreted as
      soft barriers, because naïve callers might otherwise issue subsequent
      writes to those same sectors, which might cross on the queue (if they're
      reallocated quickly enough).
      Callers still _can_ issue non-barrier discard requests, but they have to
      take care of queue ordering for themselves.
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  13. 03 Jul, 2008 1 commit
  14. 07 May, 2008 1 commit
  15. 29 Apr, 2008 1 commit
  16. 21 Apr, 2008 1 commit
    • FUJITA Tomonori's avatar
      block: move the padding adjustment to blk_rq_map_sg · f18573ab
      FUJITA Tomonori authored
      blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
      that req->data_len (the true data length) is equal to sum(bio). It
      broke the scsi command completion code.
      commit e97a294e
       was introduced to fix
      the above issue. However, the partial completion code doesn't work
      with it. The commit is also a layer violation (scsi mid-layer should
      not know about the block layer's padding).
      This patch moves the padding adjustment to blk_rq_map_sg (suggested by
      James). The padding works like the drain buffer. This patch breaks the
      rule that req->data_len is equal to sum(sg), however, the drain buffer
      already broke it. So this patch just restores the rule that
      req->data_len is equal to sub(bio) without breaking anything new.
      Now when a low level driver needs padding, blk_rq_map_user and
      blk_rq_map_user_iov guarantee there's enough room for padding.
      blk_rq_map_sg can safely extend the last entry of a scatter list.
      blk_rq_map_sg must extend the last entry of a scatter list only for a
      request that got through bio_copy_user_iov. This patches introduces
      new REQ_COPY_USER flag.
      Signed-off-by: default avatarFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  17. 04 Mar, 2008 1 commit
    • FUJITA Tomonori's avatar
      block: restore the meaning of rq->data_len to the true data length · 7a85f889
      FUJITA Tomonori authored
      The meaning of rq->data_len was changed to the length of an allocated
      buffer from the true data length. It breaks SG_IO friends and
      bsg. This patch restores the meaning of rq->data_len to the true data
      length and adds rq->extra_len to store an extended length (due to
      drain buffer and padding).
      This patch also removes the code to update bio in blk_rq_map_user
      introduced by the commit 40b01b9b
      The commit adjusts bio according to memory alignment
      (queue_dma_alignment). However, memory alignment is NOT padding
      alignment. This adjustment also breaks SG_IO friends and bsg. Padding
      alignment needs to be fixed in a proper way (by a separate patch).
      Signed-off-by: default avatarFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: default avatarJens Axboe <axboe@carl.home.kernel.dk>
  18. 19 Feb, 2008 3 commits
    • Tejun Heo's avatar
      block: clear drain buffer if draining for write command · db0a2e00
      Tejun Heo authored
      Clear drain buffer before chaining if the command in question is a
      Signed-off-by: default avatarTejun Heo <htejun@gmail.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: implement request_queue->dma_drain_needed · 2fb98e84
      Tejun Heo authored
      Draining shouldn't be done for commands where overflow may indicate
      data integrity issues.  Add dma_drain_needed callback to
      request_queue.  Drain buffer is appened iff this function returns
      Signed-off-by: default avatarTejun Heo <htejun@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
    • Tejun Heo's avatar
      block: add request->raw_data_len · 6b00769f
      Tejun Heo authored
      With padding and draining moved into it, block layer now may extend
      requests as directed by queue parameters, so now a request has two
      sizes - the original request size and the extended size which matches
      the size of area pointed to by bios and later by sgs.  The latter size
      is what lower layers are primarily interested in when allocating,
      filling up DMA tables and setting up the controller.
      Both padding and draining extend the data area to accomodate
      controller characteristics.  As any controller which speaks SCSI can
      handle underflows, feeding larger data area is safe.
      So, this patch makes the primary data length field, request->data_len,
      indicate the size of full data area and add a separate length field,
      request->raw_data_len, for the unmodified request size.  The latter is
      used to report to higher layer (userland) and where the original
      request size should be fed to the controller or device.
      Signed-off-by: default avatarTejun Heo <htejun@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  19. 08 Feb, 2008 1 commit
  20. 01 Feb, 2008 1 commit
  21. 29 Jan, 2008 1 commit