1. 09 Oct, 2008 3 commits
    • block: move stats from disk to part0 · 074a7aca
      Tejun Heo authored
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      * {disk|all}_stat_*() are gone.
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
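The mirroring of partition stats into part0 and the part0 fallback of the sector lookup can be pictured with a small user-space model. This is only a sketch: `struct part_model`, `model_stat_add_ios()`, `model_inc_in_flight()` and `model_map_sector()` are hypothetical names invented for illustration, not the kernel's data structures or macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model. These names are illustrative and are
 * not the kernel's struct hd_struct or its part_stat_*() macros. */
struct part_model {
    int partno;                /* 0 is the whole-disk partition (part0) */
    unsigned long start;       /* first sector covered */
    unsigned long nr_sects;    /* number of sectors covered */
    unsigned long ios;         /* completed I/O count */
    unsigned long in_flight;   /* requests currently outstanding */
    struct part_model *part0;  /* part0 of the owning disk */
};

/* Update a partition counter and mirror the update into part0. */
static void model_stat_add_ios(struct part_model *part, unsigned long n)
{
    assert(part != NULL && part->part0 != NULL);
    part->ios += n;
    if (part->partno != 0)
        part->part0->ios += n;
}

/* Same idea for the in-flight count. */
static void model_inc_in_flight(struct part_model *part)
{
    assert(part != NULL && part->part0 != NULL);
    part->in_flight++;
    if (part->partno != 0)
        part->part0->in_flight++;
}

/* Look up the partition containing a sector. parts[0] is part0, and it
 * is returned when nothing else matches, so callers never see NULL. */
static struct part_model *model_map_sector(struct part_model *parts,
                                           size_t nparts,
                                           unsigned long sector)
{
    assert(parts != NULL && nparts >= 1);
    for (size_t i = 1; i < nparts; i++) {
        if (sector >= parts[i].start &&
            sector - parts[i].start < parts[i].nr_sects)
            return &parts[i];
    }
    return &parts[0];
}
```

Because the lookup always yields a valid partition, a caller in this model can account an I/O with a single call and needs no separate whole-disk path.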
    • block: always set bdev->bd_part · 0762b8bd
      Tejun Heo authored
      Till now, bdev->bd_part is set only if the bdev was for parts other
      than part0.  This patch makes bdev->bd_part always set so that code
      paths don't have to differentiate common handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: implement and use {disk|part}_to_dev() · ed9e1982
      Tejun Heo authored
      Implement {disk|part}_to_dev() and use them to access generic device
      instead of directly dereferencing {disk|part}->dev.  To make sure no
      user is left behind, rename generic device fields to __dev.
      This is in preparation for unifying partition 0 handling with other partitions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
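The accessor-plus-rename pattern described in this commit might look roughly like the following. The `_model` types and macros are hypothetical stand-ins written for user space; only the idea (hide the field behind an accessor and rename it so that stale direct users stop compiling) comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct device_model {
    const char *name;
};

struct gendisk_model {
    /* Renamed from "dev": any code still writing disk->dev no longer
     * compiles, which exposes every user that was missed. The double
     * underscore follows kernel style; ordinary user-space code would
     * normally avoid such names. */
    struct device_model __dev;
};

struct hd_struct_model {
    struct device_model __dev;
};

/* Accessors: the only sanctioned way to reach the embedded device. */
#define disk_to_dev_model(disk) (&(disk)->__dev)
#define part_to_dev_model(part) (&(part)->__dev)

/* Example caller written against the accessor, not the field. */
static const char *disk_name_model(struct gendisk_model *disk)
{
    assert(disk != NULL);
    return disk_to_dev_model(disk)->name;
}
```

With all callers going through the accessors, the location of the embedded device can later change in one place without touching the callers.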
  2. 19 Sep, 2008 1 commit
    • md: Don't wait UNINTERRUPTIBLE for other resync to finish · 9744197c
      NeilBrown authored
      When two md arrays share some block device (e.g. each uses different
      partitions on the one device), a resync of one array will wait for
      the resync on the other to finish.
      This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
      the softlockup code notices and complains.
      So use TASK_INTERRUPTIBLE instead and make sure to flush signals
      before calling schedule.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 01 Sep, 2008 1 commit
    • Remove invalidate_partition call from do_md_stop. · 271f5a9b
      NeilBrown authored
      When stopping an md array, or just switching to read-only, we
      currently call invalidate_partition while holding the mddev lock.
      The main reason for this is probably to ensure all dirty buffers
      are flushed (invalidate_partition calls fsync_bdev).
      However if any dirty buffers are found, it will almost certainly cause
      a deadlock as starting writeout will require an update to the
      superblock, and performing that update requires taking the mddev
      lock - which is already held.
      This deadlock can be demonstrated by running "reboot -f -n" with
      a root filesystem on md/raid, and some dirty buffers in memory.
      All other calls to stop an array should already happen after a flush.
      The normal sequence is to stop using the array (e.g. umount) which
      will cause __blkdev_put to call sync_blockdev.  Then open the
      array and issue the STOP_ARRAY ioctl while the buffers are all still clean.
      So this invalidate_partition is normally a no-op, except for one case
      where it will cause a deadlock.
      So remove it.
      This patch possibly addresses the regression recorded in
      though it isn't yet clear how it ever worked.
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 07 Aug, 2008 1 commit
  5. 05 Aug, 2008 4 commits
    • Allow faulty devices to be removed from a readonly array. · c89a8eee
      NeilBrown authored
      Removing faulty devices from an array is a two stage process.
      First the device is moved from being a part of the active array
      to being similar to a spare device.  Then it can be removed
      by a request from user space.
      The first step is currently not performed for read-only arrays,
      so the second step can never succeed.
      So allow readonly arrays to remove failed devices (which aren't
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Fail safely when trying to grow an array with a write-intent bitmap. · dba034ee
      NeilBrown authored
      We cannot currently change the size of a write-intent bitmap.
      So if we change the size of an array which has such a bitmap, it
      tries to set bits beyond the end of the bitmap.
      For now, simply reject any request to change the size of an array
      which has a bitmap.  mdadm can remove the bitmap and add a new one
      after the array has changed size.
      Signed-off-by: NeilBrown <neilb@suse.de>
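The "reject rather than corrupt" behaviour described above amounts to an early return in the resize path. A minimal sketch, assuming a hypothetical `struct md_resize_model` and assuming `-EBUSY` as the error code (the commit message does not say which error is returned):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of an array that may carry a write-intent bitmap. */
struct md_resize_model {
    void *bitmap;                /* non-NULL when a bitmap is attached */
    unsigned long long sectors;  /* current array size */
};

/* Refuse any size change while a bitmap is present, because the bitmap
 * itself cannot be resized. -EBUSY is an assumption of this sketch. */
static int model_update_size(struct md_resize_model *mddev,
                             unsigned long long new_sectors)
{
    assert(mddev != NULL);
    if (mddev->bitmap != NULL)
        return -EBUSY;
    mddev->sectors = new_sectors;
    return 0;
}
```

In this model the caller detaches the bitmap, resizes, and attaches a new bitmap afterwards, which mirrors the mdadm workflow the commit message suggests.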
    • Restore force switch of md array to readonly at reboot time. · 2b25000b
      NeilBrown authored
      A recent patch allowed do_md_stop to know whether it was being called
      via an ioctl or not, and thus whether to allow for an extra open file
      descriptor when checking if it is in use.
      This broke the switch to readonly performed by the shutdown notifier,
      which needs to work even when the array is still (apparently) active
      (as md doesn't get told when the filesystem becomes readonly).
      So restore this feature by pretending that there can be lots of
      file descriptors open, but we still want do_md_stop to switch to readonly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Make writes to md/safe_mode_delay immediately effective. · 19052c0e
      NeilBrown authored
      If we reduce the 'safe_mode_delay', it could still wait for the old
      delay to completely expire before doing anything about safe_mode.
      Thus the effect of the change is delayed.
      To make the effect more immediate, run the timeout function
      immediately if the delay was reduced.  This may cause it to run
      slightly earlier than required, but that is the safer option.
      Signed-off-by: NeilBrown <neilb@suse.de>
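The rule "run the timeout handler at once when the delay shrinks" can be sketched as below. The structure and function names are hypothetical, and the real timer machinery is reduced to a counter so that the control flow is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the timer is replaced by a run counter. */
struct safemode_model {
    unsigned long delay;  /* current safe-mode delay, arbitrary units */
    int timeout_runs;     /* how often the timeout handler has run */
};

/* Stand-in for the safe-mode timeout handler. */
static void model_safemode_timeout(struct safemode_model *m)
{
    m->timeout_runs++;
}

/* Store a new delay. If it is shorter than the old one, run the handler
 * right away instead of waiting for the old, longer timer to expire. */
static void model_set_delay(struct safemode_model *m, unsigned long new_delay)
{
    assert(m != NULL);
    unsigned long old_delay = m->delay;

    m->delay = new_delay;
    if (new_delay < old_delay)
        model_safemode_timeout(m);
}
```

Increasing or keeping the delay leaves the pending timer alone; only a reduction triggers the early run, which may fire slightly too soon but never too late.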
  6. 29 Jul, 2008 1 commit
  7. 23 Jul, 2008 1 commit
  8. 21 Jul, 2008 6 commits
    • md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown authored
      All modifications and most access to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the call to export_rdev for that last case.
      Note also that ->same_set is sometimes used for lists other than
      mddev->list (e.g. candidates).  In these cases rcu is not needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: only count actual openers as access which prevent a 'stop' · f2ea68cf
      NeilBrown authored
      Open isn't the only thing that increments ->active.  e.g. reading
      /proc/mdstat will increment it briefly.  So to avoid false positives
      in testing for concurrent access, introduce a new counter that counts
      just the number of times the md device is open.
      Signed-off-by: NeilBrown <neilb@suse.de>
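The difference between the general reference count and the new opener count can be illustrated with a toy model. All names here are hypothetical, the counters are plain ints rather than atomics, and `-EBUSY` is an assumed error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the two counters. */
struct md_open_model {
    int active;   /* every temporary reference, e.g. reading /proc/mdstat */
    int openers;  /* only real opens of the md device */
};

static void model_open(struct md_open_model *m)
{
    m->active++;
    m->openers++;
}

static void model_release(struct md_open_model *m)
{
    assert(m->openers > 0 && m->active > 0);
    m->openers--;
    m->active--;
}

/* A transient reference: bumps active but is not an opener. */
static void model_mdstat_begin(struct md_open_model *m) { m->active++; }
static void model_mdstat_end(struct md_open_model *m)   { m->active--; }

/* Stopping is refused only when there are more openers than the caller
 * itself accounts for (caller_opens is 1 when invoked through an ioctl
 * on an open descriptor, 0 otherwise). */
static int model_stop(const struct md_open_model *m, int caller_opens)
{
    return m->openers > caller_opens ? -EBUSY : 0;
}
```

In this model a concurrent /proc/mdstat reader raises `active` but not `openers`, so it no longer produces a false "still in use" result.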
    • md: Make mddev->array_size sector-based. · f233ea5c
      Andre Noll authored
      This patch renames the array_size field of struct mddev_s to array_sectors
      and converts all instances to use units of 512-byte sectors instead of 1k blocks.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
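The unit change is a factor of two: one 1 KiB block is two 512-byte sectors. A small sketch of the conversion, using hypothetical helper names and a `uint64_t` stand-in for the kernel's `sector_t`:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's sector_t; an assumption of this sketch. */
typedef uint64_t sector_model_t;

/* 1 KiB blocks to 512-byte sectors. */
static sector_model_t kblocks_to_sectors(uint64_t kblocks)
{
    return kblocks * 2;
}

/* 512-byte sectors to whole 1 KiB blocks; an odd trailing sector is
 * dropped by the integer division. */
static uint64_t sectors_to_kblocks(sector_model_t sectors)
{
    return sectors / 2;
}
```

Storing the size in sectors avoids the rounding in the second helper and removes the need to remember which fields use which unit.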
    • md: Make super_type->rdev_size_change() take sector-based sizes. · 15f4a5fd
      Andre Noll authored
      Also, change the type of the size parameter from unsigned long long to
      sector_t and rename it to num_sectors.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix check for overlapping devices. · d07bd3bc
      Andre Noll authored
      The checks in overlaps() expect all parameters either in block-based
      or sector-based quantities. However, its single caller passes two
      rdev->data_offset arguments as well as two rdev->size arguments, the
      former being sector counts while the latter are measured in 1K blocks.
      This could cause rdev_size_store() to accept an invalid size from user
      space. Fix it by passing only sector-based quantities to overlaps().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
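To see why mixing units matters, consider a start/length overlap test of the kind the commit describes. The function below is a user-space sketch with a hypothetical name and is not copied from the kernel; the numbers in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the ranges [s1, s1 + l1) and [s2, s2 + l2) intersect.
 * All four arguments must be in the same unit (sectors here). */
static int overlaps_model(uint64_t s1, uint64_t l1, uint64_t s2, uint64_t l2)
{
    if (s1 + l1 <= s2)
        return 0;  /* first range ends before the second starts */
    if (s2 + l2 <= s1)
        return 0;  /* second range ends before the first starts */
    return 1;
}

/* Convert a size in 1 KiB blocks to 512-byte sectors. */
static uint64_t kblocks_to_sectors_model(uint64_t kblocks)
{
    return kblocks * 2;
}
```

Take two devices with data offsets of 0 and 150 sectors, each 100 KiB (200 sectors) long. Passing the sizes in 1 KiB blocks gives `overlaps_model(0, 100, 150, 100)`, which reports no overlap; passing them in sectors gives `overlaps_model(0, 200, 150, 200)`, which correctly reports one.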
    • md: Tidy up rdev_size_store a bit: · d7027458
      Neil Brown authored
       - use strict_strtoull in place of simple_strtoull
       - use my_mddev in place of rdev->mddev (they have the same value)
      and more significantly,
       - don't adjust mddev->size to fit, rather reject changes which make
         rdev->size smaller than mddev->size
      Adjusting mddev->size is a hangover from bind_rdev_to_array which
      does a similar thing.  But it really is a better design to insist that
      mddev->size is set as required, then the rdev->sizes are set to allow
      for that.  The previous way invites confusion.
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 11 Jul, 2008 10 commits
  10. 08 Jul, 2008 7 commits
  11. 01 Jul, 2008 1 commit
    • md: resolve external metadata handling deadlock in md_allow_write · b5470dc5
      Dan Williams authored
      md_allow_write() marks the metadata dirty while holding mddev->lock and then
      waits for the write to complete.  For externally managed metadata this causes a
      deadlock as userspace needs to take the lock to communicate that the metadata
      update has completed.
      Change md_allow_write() in the 'external' case to start the 'mark active'
      operation and then return -EAGAIN.  The expected side effects while waiting for
      userspace to write 'active' to 'array_state' are holding off reshape (code
      currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
      fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
      cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
      these failures can be mitigated by coordinating with mdmon.
      md_write_start() still prevents writes from occurring until the metadata
      handler has had a chance to take action as it unconditionally waits for
      MD_CHANGE_CLEAN to be cleared.
      [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
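The "start the transition, then return -EAGAIN" behaviour for externally managed metadata can be modelled as a small state machine. Everything below is a hypothetical user-space sketch: the field names, the synchronous "internal" path and `model_userspace_marks_active()` are assumptions made for illustration, and no locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the relevant array state. */
struct md_write_model {
    int external;      /* 1 if metadata is managed by userspace */
    int in_sync;       /* 1 while the array is marked clean */
    int change_clean;  /* 1 while a clean -> active update is pending */
};

/* Ask for the array to be marked active before an allocation-heavy
 * operation. With external metadata the update is only started and the
 * caller is told to retry, instead of sleeping with the lock held. */
static int model_allow_write(struct md_write_model *m)
{
    assert(m != NULL);
    if (m->in_sync) {
        m->in_sync = 0;
        m->change_clean = 1;
        if (!m->external)
            m->change_clean = 0;  /* internal metadata: written here */
    }
    return m->change_clean ? -EAGAIN : 0;
}

/* Stand-in for userspace writing 'active' to 'array_state'. */
static void model_userspace_marks_active(struct md_write_model *m)
{
    m->change_clean = 0;
}
```

A caller in this model treats `-EAGAIN` as "retry later": once the metadata handler has acknowledged the change, the same call returns 0.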
  12. 27 Jun, 2008 4 commits