1. 02 Sep, 2014 1 commit
  2. 24 Jun, 2014 1 commit
    • Jens Axboe's avatar
      block: add support for limiting gaps in SG lists · 66cb45aa
      Jens Axboe authored
      Another restriction inherited for NVMe - those devices don't support
      SG lists that have "gaps" in them. Gaps refers to cases where the
      previous SG entry doesn't end on a page boundary. For NVMe, all SG
      entries must start at offset 0 (except the first) and end on a page
      boundary (except the last).
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      66cb45aa
  3. 29 May, 2014 1 commit
    • Jens Axboe's avatar
      block: add queue flag for disabling SG merging · 05f1dd53
      Jens Axboe authored
      If devices are not SG starved, we waste a lot of time potentially
      collapsing SG segments. Enough that 1.5% of the CPU time goes
      to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
      which just returns the number of vectors in a bio instead of looping
      over all segments and checking for collapsible ones.
      
      Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
      merging, if they so desire.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      05f1dd53
  4. 07 Feb, 2014 1 commit
  5. 03 Dec, 2013 1 commit
  6. 27 Nov, 2013 1 commit
  7. 24 Nov, 2013 3 commits
    • Kent Overstreet's avatar
      block: Kill bio_iovec_idx(), __bio_iovec() · f619d254
      Kent Overstreet authored
      bio_iovec_idx() and __bio_iovec() don't have any valid uses anymore -
      previous users have been converted to bio_iovec_iter() or other methods.
      
      __BVEC_END() has to go too - the bvec array can't be used directly for
      the last biovec because we might only be using the first portion of it,
      we have to iterate over the bvec array with bio_for_each_segment() which
      checks against the current value of bi_iter.bi_size.
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      f619d254
    • Kent Overstreet's avatar
      block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Kent Overstreet authored
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: default avatarGeoff Levand <geoff@infradead.org>
      7988613b
    • Kent Overstreet's avatar
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet authored
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37
  8. 29 Oct, 2013 1 commit
    • Jens Axboe's avatar
      blk-mq: don't disallow request merges for req->special being set · e7e24500
      Jens Axboe authored
      For blk-mq, if a driver has requested per-request payload data
      to carry command structures, they are stuffed into req->special.
      For an old style request based driver, req->special is used
      for the same purpose but indicates that a per-driver request
      structure has been prepared for the request already. So for the
      old style driver, we do not merge such requests.
      
      As most/all blk-mq drivers will use the payload feature, and
      since we have no problem merging on these, make this check
      dependent on whether it's a blk-mq enabled driver or not.
      Reported-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e7e24500
  9. 20 Mar, 2013 1 commit
  10. 20 Sep, 2012 3 commits
  11. 02 Aug, 2012 2 commits
    • Asias He's avatar
      block: Add blk_bio_map_sg() helper · 85b9f66a
      Asias He authored
      Add a helper to map a bio to a scatterlist, modelled after
      blk_rq_map_sg.
      
      This helper is useful for any driver that wants to create
      a scatterlist from its ->make_request_fn method.
      
      Changes in v2:
       - Use __blk_segment_map_sg to avoid duplicated code
       - Add cocbook style function comment
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      85b9f66a
    • Asias He's avatar
      block: Introduce __blk_segment_map_sg() helper · 963ab9e5
      Asias He authored
      Split the mapping code in blk_rq_map_sg() to a helper
      __blk_segment_map_sg(), so that other mapping function, e.g.
      blk_bio_map_sg(), can share the code.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Suggested-by: default avatarJens Axboe <axboe@kernel.dk>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      963ab9e5
  12. 08 Feb, 2012 1 commit
    • Tejun Heo's avatar
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo authored
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      bool.
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      050c8ea8
  13. 21 Mar, 2011 1 commit
    • Jens Axboe's avatar
      block: attempt to merge with existing requests on plug flush · 5e84ea3a
      Jens Axboe authored
      One of the disadvantages of on-stack plugging is that we potentially
      lose out on merging since all pending IO isn't always visible to
      everybody. When we flush the on-stack plugs, right now we don't do
      any checks to see if potential merge candidates could be utilized.
      
      Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
      It works just ELEVATOR_INSERT_SORT, but first checks whether we can
      merge with an existing request before doing the insertion (if we fail
      merging).
      
      This fixes a regression with multiple processes issuing IO that
      can be merged.
      
      Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
      an accounting bug.
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      5e84ea3a
  14. 07 Jan, 2011 1 commit
  15. 05 Jan, 2011 1 commit
    • Jerome Marchand's avatar
      block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand authored
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      09e099d4
  16. 17 Dec, 2010 1 commit
    • Martin K. Petersen's avatar
      block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Martin K. Petersen authored
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: default avatarEd Lin <ed.lin@promise.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      e692cb66
  17. 24 Oct, 2010 1 commit
  18. 19 Oct, 2010 1 commit
    • Yasuaki Ishimatsu's avatar
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu authored
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      7681bfee
  19. 25 Sep, 2010 1 commit
    • Adrian Hunter's avatar
      block: prevent merges of discard and write requests · f281fb5f
      Adrian Hunter authored
      Add logic to prevent two I/O requests being merged if
      only one of them is a discard.  Ditto secure discard.
      
      Without this fix, it is possible for write requests
      to transform into discard requests.  For example:
      
        Submit bio 1 to discard 8 sectors from sector n
        Submit bio 2 to write 8 sectors from sector n + 16
        Submit bio 3 to write 8 sectors from sector n + 8
      
      Bio 1 becomes request 1.  Bio 2 becomes request 2.
      Bio 3 is merged with request 2, and then subsequently
      request 2 is merged with request 1 resulting in just
      one I/O request which discards all 24 sectors.
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@nokia.com>
      
      (Moved the checks above the position checks /Jens)
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      f281fb5f
  20. 10 Sep, 2010 1 commit
  21. 07 Aug, 2010 3 commits
  22. 26 Feb, 2010 1 commit
  23. 06 Oct, 2009 1 commit
    • Nikanth Karthikesan's avatar
      block: Seperate read and write statistics of in_flight requests v2 · 316d315b
      Nikanth Karthikesan authored
      Commit a9327cac added seperate read
      and write statistics of in_flight requests. And exported the number
      of read and write requests in progress seperately through sysfs.
      
      But  Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
      output from "iostat -kx 2". Global values for service time and
      utilization were garbage. For interval values, utilization was always
      100%, and service time is higher than normal.
      
      So this was reverted by commit 0f78ab98
      
      The problem was in part_round_stats_single(), I missed the following:
              if (now == part->stamp)
                      return;
      
      -       if (part->in_flight) {
      +       if (part_in_flight(part)) {
                      __part_stat_add(cpu, part, time_in_queue,
                                      part_in_flight(part) * (now - part->stamp));
                      __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
      
      With this chunk included, the reported regression gets fixed.
      Signed-off-by: default avatarNikanth Karthikesan <knikanth@suse.de>
      
      --
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      316d315b
  24. 04 Oct, 2009 1 commit
    • Jens Axboe's avatar
      Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98
      Jens Axboe authored
      This reverts commit a9327cac.
      
      Corrado Zoccolo <czoccolo@gmail.com> reports:
      
      "with 2.6.32-rc1 I started getting the following strange output from
      "iostat -kx 2":
      Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10,70    0,00    3,16   15,75    0,00   70,38
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda              18,22     0,00    0,67    0,01    14,77     0,02
      43,94     0,01   10,53 39043915,03 2629219,87
      sdb              60,89     9,68   50,79    3,04  1724,43    50,52
      65,95     0,70   13,06 488437,47 2629219,87
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 2,72    0,00    0,74    0,00    0,00   96,53
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6,68    0,00    0,99    0,00    0,00   92,33
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 4,40    0,00    0,73    1,47    0,00   93,40
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     4,00    0,00    3,00     0,00    28,00
      18,67     0,06   19,50 333,33 100,00
      
      Global values for service time and utilization are garbage. For
      interval values, utilization is always 100%, and service time is
      higher than normal.
      
      I bisected it down to:
      [a9327cac] Seperate read and write
      statistics of in_flight requests
      and verified that reverting just that commit indeed solves the issue
      on 2.6.32-rc1."
      
      So until this is debugged, revert the bad commit.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      0f78ab98
  25. 14 Sep, 2009 1 commit
  26. 11 Sep, 2009 2 commits
    • Tejun Heo's avatar
      scsi,block: update SCSI to handle mixed merge failures · da6c5c72
      Tejun Heo authored
      Update scsi_io_completion() such that it only fails requests till the
      next error boundary and retry the leftover.  This enables block layer
      to merge requests with different failfast settings and still behave
      correctly on errors.  Allow merge of requests of different failfast
      settings.
      
      As SCSI is currently the only subsystem which follows failfast status,
      there's no need to worry about other block drivers for now.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      da6c5c72
    • Tejun Heo's avatar
      block: implement mixed merge of different failfast requests · 80a761fd
      Tejun Heo authored
      Failfast has characteristics from other attributes.  When issuing,
      executing and successuflly completing requests, failfast doesn't make
      any difference.  It only affects how a request is handled on failure.
      Allowing requests with different failfast settings to be merged cause
      normal IOs to fail prematurely while not allowing has performance
      penalties as failfast is used for read aheads which are likely to be
      located near in-flight or to-be-issued normal IOs.
      
      This patch introduces the concept of 'mixed merge'.  A request is a
      mixed merge if it is merge of segments which require different
      handling on failure.  Currently the only mixable attributes are
      failfast ones (or lack thereof).
      
      When a bio with different failfast settings is added to an existing
      request or requests of different failfast settings are merged, the
      merged request is marked mixed.  Each bio carries failfast settings
      and the request always tracks failfast state of the first bio.  When
      the request fails, blk_rq_err_bytes() can be used to determine how
      many bytes can be safely failed without crossing into an area which
      requires further retrials.
      
      This allows request merging regardless of failfast settings while
      keeping the failure handling correct.
      
      This patch only implements mixed merge but doesn't enable it.  The
      next one will update SCSI to make use of mixed merge.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      80a761fd
  27. 03 Jul, 2009 1 commit
    • Tejun Heo's avatar
      block: don't merge requests of different failfast settings · ab0fd1de
      Tejun Heo authored
      Block layer used to merge requests and bios with different failfast
      settings.  This caused regular IOs to fail prematurely when they were
      merged into failfast requests for readahead.
      
      Niel Lambrechts could trigger the problem semi-reliably on ext4 when
      resuming from STR.  ext4 uses readahead when reading inodes and
      combined with the deterministic extra SATA PHY exception cycle during
      resume on the specific configuration, non-readahead inode read would
      fail causing ext4 errors.  Please read the following thread for
      details.
      
        http://lkml.org/lkml/2009/5/23/21
      
      This patch makes block layer reject merging if the failfast settings
      don't match.  This is correct but likely to lower IO performance by
      preventing regular IOs from mingling into surrounding readahead
      requests.  Changes to allow such mixed merges and handle errors
      correctly will be added later.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarNiel Lambrechts <niel.lambrechts@gmail.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Signed-off-by: default avatarJens Axboe <axboe@carl.(none)>
      ab0fd1de
  28. 22 May, 2009 1 commit
  29. 11 May, 2009 3 commits
    • Tejun Heo's avatar
      block: hide request sector and data_len · a2dec7b3
      Tejun Heo authored
      Block low level drivers for some reason have been pretty good at
      abusing block layer API.  Especially struct request's fields tend to
      get violated in all possible ways.  Make it clear that low level
      drivers MUST NOT access or manipulate rq->sector and rq->data_len
      directly by prefixing them with double underscores.
      
      This change is also necessary to break build of out-of-tree codes
      which assume the previous block API where internal fields can be
      manipulated and rq->data_len carries residual count on completion.
      
      [ Impact: hide internal fields, block API change ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      a2dec7b3
    • Tejun Heo's avatar
      block: drop request->hard_* and *nr_sectors · 2e46e8b2
      Tejun Heo authored
      struct request has had a few different ways to represent some
      properties of a request.  ->hard_* represent block layer's view of the
      request progress (completion cursor) and the ones without the prefix
      are supposed to represent the issue cursor and allowed to be updated
      as necessary by the low level drivers.  The thing is that as block
      layer supports partial completion, the two cursors really aren't
      necessary and only cause confusion.  In addition, manual management of
      request detail from low level drivers is cumbersome and error-prone at
      the very least.
      
      Another interesting duplicate fields are rq->[hard_]nr_sectors and
      rq->{hard_cur|current}_nr_sectors against rq->data_len and
      rq->bio->bi_size.  This is more convoluted than the hard_ case.
      
      rq->[hard_]nr_sectors are initialized for requests with bio but
      blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
      initialized for all request but blk_rq_bytes() uses it only for pc
      requests.  This causes good amount of confusion throughout block layer
      and its drivers and determining the request length has been a bit of
      black magic which may or may not work depending on circumstances and
      what the specific LLD is actually doing.
      
      rq->{hard_cur|current}_nr_sectors represent the number of sectors in
      the contiguous data area at the front.  This is mainly used by drivers
      which transfers data by walking request segment-by-segment.  This
      value always equals rq->bio->bi_size >> 9.  However, data length for
      pc requests may not be multiple of 512 bytes and using this field
      becomes a bit confusing.
      
      In general, having multiple fields to represent the same property
      leads only to confusion and subtle bugs.  With recent block low level
      driver cleanups, no driver is accessing or manipulating these
      duplicate fields directly.  Drop all the duplicates.  Now rq->sector
      means the current sector, rq->data_len the current total length and
      rq->bio->bi_size the current segment length.  Everything else is
      defined in terms of these three and available only through accessors.
      
      * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
        now handles pc and fs requests equally other than rq->sector update.
        This means that now pc requests can use partial completion too (no
        in-kernel user yet tho).
      
      * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
        now uses byte count as the primary data length.
      
      * blk_rq_pos() is now guranteed to be always correct.  In-block users
        converted.
      
      * blk_rq_bytes() is now guaranteed to be always valid as is
        blk_rq_sectors().  In-block users converted.
      
      * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
        More convenient one is used.
      
      * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
        pointer to request.
      
      [ Impact: API cleanup, single way to represent one property of a request ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      2e46e8b2
    • Tejun Heo's avatar
      block: convert to pos and nr_sectors accessors · 83096ebf
      Tejun Heo authored
      With recent cleanups, there is no place where low level driver
      directly manipulates request fields.  This means that the 'hard'
      request fields always equal the !hard fields.  Convert all
      rq->sectors, nr_sectors and current_nr_sectors references to
      accessors.
      
      While at it, drop superflous blk_rq_pos() < 0 test in swim.c.
      
      [ Impact: use pos and nr_sectors accessors ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarGeert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Tested-by: default avatarGrant Likely <grant.likely@secretlab.ca>
      Acked-by: default avatarGrant Likely <grant.likely@secretlab.ca>
      Tested-by: default avatarAdrian McMenamin <adrian@mcmen.demon.co.uk>
      Acked-by: default avatarAdrian McMenamin <adrian@mcmen.demon.co.uk>
      Acked-by: default avatarMike Miller <mike.miller@hp.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Cc: Borislav Petkov <petkovbb@googlemail.com>
      Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
      Cc: Eric Moore <Eric.Moore@lsi.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Tim Waugh <tim@cyberelk.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Alex Dubov <oakad@yahoo.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Dario Ballabio <ballabio_dario@emc.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: unsik Kim <donari75@gmail.com>
      Cc: Laurent Vivier <Laurent@lvivier.info>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      83096ebf
  30. 24 Apr, 2009 1 commit