1. 09 Oct, 2008 9 commits
    • block: move stats from disk to part0 · 074a7aca
      Tejun Heo authored
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: always set bdev->bd_part · 0762b8bd
      Tejun Heo authored
      Until now, bdev->bd_part was set only if the bdev was for parts other
      than part0.  This patch makes bdev->bd_part always set, so code paths
      don't have to treat part0 specially.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move policy from disk to part0 · b7db9956
      Tejun Heo authored
      Move disk->policy to part0->policy.  Implement and use get_disk_ro().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: implement and use {disk|part}_to_dev() · ed9e1982
      Tejun Heo authored
      Implement {disk|part}_to_dev() and use them to access the generic
      device instead of directly dereferencing {disk|part}->dev.  To make
      sure no user is left behind, rename the generic device fields to __dev.
      
      This is in preparation of unifying partition 0 handling with other
      partitions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: fix diskstats access · c9959059
      Tejun Heo authored
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemption is disabled on
      entry, as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disable preemption.  They're removed
      and the double-underbar ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index, and all
      diskstat functions which access per-cpu counters now have a @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: don't depend on consecutive minor space · f331c029
      Tejun Heo authored
      * Implement disk_devt() and part_devt() and use them to directly
        access devt instead of computing it from ->major and ->first_minor.
      
        Note that all references to ->major and ->first_minor outside of
        the block layer are used to determine the devt of the disk (the
        part0), and as ->major and ->first_minor will continue to represent
        the devt for the disk, converting these users isn't strictly
        necessary.  However, convert them for consistency.
      
      * Implement disk_max_parts() to avoid directly dereferencing
        genhd->minors.
      
      * Update bdget_disk() such that it doesn't assume consecutive minor
        space.
      
      * Move devt computation from register_disk() to add_disk() and make it
        the only one (all other usages use the initially determined value).
      
      These changes clean up the code and will help the disk->part
      dereference fix and extended block device numbers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make bi_phys_segments an unsigned int instead of short · 5b99c2ff
      Jens Axboe authored
      raid5 can overflow with more than 255 stripes, and we can increase it
      to an int for free on both 32 and 64-bit archs due to the padding.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • 960e739d
      Jens Axboe authored
    • drop vmerge accounting · 5df97b91
      Mikulas Patocka authored
      Remove hw_segments field from struct bio and struct request. Without virtual
      merge accounting they have no purpose.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  2. 01 Oct, 2008 3 commits
    • dm mpath: add missing path switching locking · 7253a334
      Chandra Seetharaman authored
      Moving the path activation to a workqueue along with the scsi_dh
      patches introduced a race: the current_pgpath (in the multipath data
      structure) can be modified if changes happen in any of the paths
      leading to the LUN.  If the changes leave current_pgpath set to NULL,
      the result is the invalid access that produces the panic below.
      
      This patch fixes that by storing the pgpath to activate in the multipath data
      structure and properly protecting it.
      
      Note that if activate_path is called twice in succession with
      different pgpaths, with the second call arriving before the first is
      done, activate_path will be called twice for the second pgpath, which
      is fine.
      
      Unable to handle kernel paging request for data at address 0x00000020
      Faulting instruction address: 0xd000000000aa1844
      cpu 0x1: Vector: 300 (Data Access) at [c00000006b987a80]
          pc: d000000000aa1844: .activate_path+0x30/0x218 [dm_multipath]
          lr: c000000000087a2c: .run_workqueue+0x114/0x204
          sp: c00000006b987d00
         msr: 8000000000009032
         dar: 20
       dsisr: 40000000
        current = 0xc0000000676bb3f0
        paca    = 0xc0000000006f3680
          pid   = 2528, comm = kmpath_handlerd
      enter ? for help
      [c00000006b987da0] c000000000087a2c .run_workqueue+0x114/0x204
      [c00000006b987e40] c000000000088b58 .worker_thread+0x120/0x144
      [c00000006b987f00] c00000000008ca70 .kthread+0x78/0xc4
      [c00000006b987f90] c000000000027cc8 .kernel_thread+0x4c/0x68
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: cope with access beyond end of device in dm_merge_bvec · b01cd5ac
      Mikulas Patocka authored
      If for any reason dm_merge_bvec() is given an offset beyond the end of the
      device, avoid an oops and always allow one page to be added to an empty bio.
      We'll reject the I/O later after the bio is submitted.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: always allow one page in dm_merge_bvec · 5037108a
      Mikulas Patocka authored
      Some callers assume they can always add at least one page to an empty bio,
      so dm_merge_bvec should not return 0 in this case: we'll reject the I/O
      later after the bio is submitted.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  3. 19 Sep, 2008 1 commit
    • md: Don't wait UNINTERRUPTIBLE for other resync to finish · 9744197c
      NeilBrown authored
      When two md arrays share some block device (e.g each uses different
      partitions on the one device), a resync of one array will wait for
      the resync on the other to finish.
      
      This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
      the softlockup code notices and complains.
      
      So use TASK_INTERRUPTIBLE instead and make sure to flush signals
      before calling schedule().
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 01 Sep, 2008 2 commits
    • Fix problem with waiting while holding rcu read lock in md/bitmap.c · b2d2c4ce
      NeilBrown authored
      A recent patch to protect the rdev list with rcu locking leaves us
      with a problem: we can sleep on a memory allocation while holding the
      rcu lock.
      
      The rcu lock is only needed while walking the linked list as
      uninteresting devices (failed or spares) can be removed at any time.
      
      So only take the rcu lock while actually walking the linked list.
      Take a refcount on the rdev during the time when we drop the lock
      and do the memalloc to start IO.
      When we return to the locked code, all the interesting devices
      on the list will not have moved, so we can simply use
      list_for_each_continue_rcu to pick up where we left off.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Remove invalidate_partition call from do_md_stop. · 271f5a9b
      NeilBrown authored
      When stopping an md array, or just switching to read-only, we
      currently call invalidate_partition while holding the mddev lock.
      The main reason for this is probably to ensure all dirty buffers
      are flushed (invalidate_partition calls fsync_bdev).
      
      However if any dirty buffers are found, it will almost certainly cause
      a deadlock, as starting writeout will require an update to the
      superblock, and performing that update requires taking the mddev
      lock - which is already held.
      
      This deadlock can be demonstrated by running "reboot -f -n" with
      a root filesystem on md/raid, and some dirty buffers in memory.
      
      All other calls to stop an array should already happen after a flush.
      The normal sequence is to stop using the array (e.g. umount) which
      will cause __blkdev_put to call sync_blockdev.  Then open the
      array and issue the STOP_ARRAY ioctl while the buffers are all still
      clean.
      
      So this invalidate_partition is normally a no-op, except for one case
      where it will cause a deadlock.
      
      So remove it.
      
      This patch possibly addresses the regression recorded in
         http://bugzilla.kernel.org/show_bug.cgi?id=11460
      and
         http://bugzilla.kernel.org/show_bug.cgi?id=11452
      
      though it isn't yet clear how it ever worked.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 07 Aug, 2008 1 commit
  6. 05 Aug, 2008 6 commits
    • Allow raid10 resync to happen in larger chunks. · 0310fa21
      NeilBrown authored
      The raid10 resync/recovery code currently limits the amount of
      in-flight resync IO to 2Meg.  This was copied from raid1 where
      it seems quite adequate.  However for raid10, some layouts require
      a bit of seeking to perform a resync, and allowing a larger buffer
      size means that the seeking can be significantly reduced.
      
      There is probably no real need to limit the amount of in-flight
      IO at all.  Any shortage of memory will naturally reduce the
      amount of buffer space available down to a set minimum, and any
      concurrent normal IO will quickly cause resync IO to back off.
      
      The only problem would be that normal IO has to wait for all resync IO
      to finish, so a very large amount of resync IO could cause unpleasant
      latency when normal IO starts up.
      
      So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
      available) which seems to be a good amount.  Also reduce the amount
      of memory reserved as there is no need to keep 2Meg just for resync if
      memory is tight.
      
      Thanks to Keld for the suggestion.
      
      Cc: Keld Jørn Simonsen <keld@dkuug.dk>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Allow faulty devices to be removed from a readonly array. · c89a8eee
      NeilBrown authored
      Removing faulty devices from an array is a two stage process.
      First the device is moved from being a part of the active array
      to being similar to a spare device.  Then it can be removed
      by a request from user space.
      
      The first step is currently not performed for read-only arrays,
      so the second step can never succeed.
      
      So allow readonly arrays to remove failed devices (which aren't
      blocked).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Don't let a blocked_rdev interfere with read request in raid5/6 · ac4090d2
      NeilBrown authored
      When we have externally managed metadata, we need to mark a failed
      device as 'Blocked' and not allow any writes until that device has
      been marked as faulty in the metadata and the Blocked flag has been
      removed.
      
      However it is perfectly OK to allow read requests when there is a
      Blocked device, and with a readonly array, there may not be any
      metadata-handler watching for blocked devices.
      
      So in raid5/raid6, only allow a Blocked device to interfere with
      write requests or resync.  Read requests go through untouched.
      
      raid1 and raid10 already differentiate between read and write
      properly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Fail safely when trying to grow an array with a write-intent bitmap. · dba034ee
      NeilBrown authored
      We cannot currently change the size of a write-intent bitmap.
      So if we change the size of an array which has such a bitmap, it
      tries to set bits beyond the end of the bitmap.
      
      For now, simply reject any request to change the size of an array
      which has a bitmap.  mdadm can remove the bitmap and add a new one
      after the array has changed size.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Restore force switch of md array to readonly at reboot time. · 2b25000b
      NeilBrown authored
      A recent patch allowed do_md_stop to know whether it was being called
      via an ioctl or not, and thus whether to allow for an extra open file
      descriptor when checking if the array is in use.
      This broke the switch to readonly performed by the shutdown notifier,
      which needs to work even when the array is still (apparently) active
      (as md doesn't get told when the filesystem becomes readonly).
      
      So restore this feature by pretending that there can be lots of
      file descriptors open, but we still want do_md_stop to switch to
      readonly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Make writes to md/safe_mode_delay immediately effective. · 19052c0e
      NeilBrown authored
      If we reduce 'safe_mode_delay', the timer could still wait for the
      old, longer delay to expire before doing anything about safe_mode.
      Thus the effect of the change is delayed.
      
      To make the effect more immediate, run the timeout function
      immediately if the delay was reduced.  This may cause it to run
      slightly earlier than required, but that is the safer option.
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 01 Aug, 2008 3 commits
  8. 29 Jul, 2008 2 commits
  9. 26 Jul, 2008 1 commit
  10. 23 Jul, 2008 3 commits
  11. 21 Jul, 2008 9 commits