1. 01 Dec, 2009 1 commit
    • NeilBrown's avatar
      md: revert incorrect fix for read error handling in raid1. · d0e26078
      NeilBrown authored
      commit 4706b349 was a forward port of a fix that was needed
      for SLES10.  But in fact it is not needed in mainline because
      the earlier commit dd00a99e
      
       fixes the same problem in a
      better way.
      Further, this commit introduces a bug in the way it interacts with
      the automatic read-error-correction.  If, after a read error is
      successfully corrected, the same disk is chosen to re-read - the
      re-read won't be attempted but an error will be returned instead.
      
      After reverting that commit, there is the possibility that a
      read error on a read-only array (where read errors cannot
      be corrected as that requires a write) will repeatedly read the same
      device and continue to get an error.
      So in the "Array is readonly" case, fail the drive immediately on
      a read error.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      d0e26078
  2. 16 Oct, 2009 2 commits
    • NeilBrown's avatar
      md: raid1/raid10: handle allocation errors during array setup. · ed9bfdf1
      NeilBrown authored
      
      
      Both raid1 and raid10 create a mempool during startup.
      If the 'alloc' function for this mempool fails, unplug_slaves
      is called.
      If that happens when the pool is being initialised, unplug_slaves
      will try to use the 'conf' structure that isn't filled in yet, and
      badness will happen.
      
      So ensure that unplug_slaves doesn't get called unless we know
      that the conf structure if fully initialised.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ed9bfdf1
    • NeilBrown's avatar
      md/raid1/raid10: add a cond_resched · 1d9d5241
      NeilBrown authored
      
      
      During 'check' of a raid1 or raid10 it is possible for the management
      thread to spend a lot of time running 'memcmp' on blocks from
      different devices, so make sure the thread has a chance to schedule.
      raid5d already has a cond_resched (in process_stripe).
      Reported-By: default avatarLee Howard <faxguy@howardsilvan.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      1d9d5241
  3. 23 Sep, 2009 3 commits
  4. 11 Sep, 2009 1 commit
  5. 03 Aug, 2009 2 commits
    • NeilBrown's avatar
      md: Use revalidate_disk to effect changes in size of device. · 449aad3e
      NeilBrown authored
      
      
      As revalidate_disk calls check_disk_size_change, it will cause
      any capacity change of a gendisk to be propagated to the blockdev
      inode.  So use that instead of mucking about with locks and
      i_size_write.
      
      Also add a call to revalidate_disk in do_md_run and a few other places
      where the gendisk capacity is changed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      449aad3e
    • Andre Noll's avatar
      md: Push down data integrity code to personalities. · ac5e7113
      Andre Noll authored
      
      
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ac5e7113
  6. 01 Jul, 2009 1 commit
  7. 17 Jun, 2009 3 commits
  8. 16 Jun, 2009 1 commit
    • NeilBrown's avatar
      md: remove mddev_to_conf "helper" macro · 070ec55d
      NeilBrown authored
      
      
      Having a macro just to cast a void* isn't really helpful.
      I would must rather see that we are simply de-referencing ->private,
      than have to know what the macro does.
      
      So open code the macro everywhere and remove the pointless cast.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      070ec55d
  9. 22 May, 2009 1 commit
  10. 15 Apr, 2009 1 commit
  11. 06 Apr, 2009 2 commits
  12. 31 Mar, 2009 8 commits
    • Dan Williams's avatar
      md: 'array_size' sysfs attribute · b522adcd
      Dan Williams authored
      
      
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      b522adcd
    • Dan Williams's avatar
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams authored
      
      
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      1f403624
    • Dan Williams's avatar
      md: add 'size' as a personality method · 80c3a6ce
      Dan Williams authored
      
      
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape emit
      a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
      Reviewed-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      80c3a6ce
    • NeilBrown's avatar
      md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown authored
      
      
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are all
      ready 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      409c57f3
    • Andre Noll's avatar
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll authored
      
      
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: default avatarAndre Noll <maan@systemlinux.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      58c0fed4
    • NeilBrown's avatar
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown authored
      
      
      It really is nicer to keep related code together..
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      43b2e5d8
    • NeilBrown's avatar
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown authored
      
      
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      bff61975
    • Christoph Hellwig's avatar
      md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig authored
      
      
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ef740c37
  13. 25 Feb, 2009 1 commit
    • NeilBrown's avatar
      md: avoid races when stopping resync. · 73d5c38a
      NeilBrown authored
      
      
      There has been a race in raid10 and raid1 for a long time
      which has only recently started showing up due to a scheduler changed.
      
      When a sync_read request finishes, as soon as reschedule_retry
      is called, another thread can mark the resync request as having
      completed, so md_do_sync can finish, ->stop can be called, and
      ->conf can be freed.  So using conf after reschedule_retry is not
      safe.
      
      Similarly, when finishing a sync_write, calling md_done_sync must be
      the last thing we do, as it allows a chain of events which will free
      conf and other data structures.
      
      The first of these requires action in raid10.c
      The second requires action in raid1.c and raid10.c
      
      Cc: stable@kernel.org
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      73d5c38a
  14. 06 Feb, 2009 1 commit
    • NeilBrown's avatar
      md: Allow read error in a single drive raid1 to be passed up. · 4706b349
      NeilBrown authored
      
      
      If a raid1 only has a single working device and gets a read error, 
      we choose to simply return that error up to the filesystem (or whatever)
      rather than failing the whole array.
      
      However the codes doesn't quite do that.  We attempt a readbalance
      which allocates the same drive, so we retry the read - indefinitely. 
      
      Instead:  If read_balance in the error case chooses the same drive that just
      failed, treat it as a failure and don't retry.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      4706b349
  15. 08 Jan, 2009 2 commits
    • NeilBrown's avatar
      md: don't retry recovery of raid1 that fails due to error on source drive. · 4044ba58
      NeilBrown authored
      
      
      If a raid1 has only one working drive and it has a sector which
      gives an error on read, then an attempt to recover onto a spare will
      fail, but as the single remaining drive is not removed from the
      array, the recovery will be immediately re-attempted, resulting
      in an infinite recovery loop.
      
      So detect this situation and don't retry recovery once an error
      on the lone remaining drive is detected.
      
      Allow recovery to be retried once every time a spare is added
      in case the problem wasn't actually a media error.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      4044ba58
    • Cheng Renquan's avatar
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan authored
      
      
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: default avatarCheng Renquan <crquan@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      159ec1fc
  16. 15 Oct, 2008 1 commit
    • Stephen Rothwell's avatar
      md: build failure due to missing delay.h · 25570727
      Stephen Rothwell authored
      
      
      Today's linux-next build (powerpc ppc64_defconfig) failed like this:
      
      drivers/md/raid1.c: In function 'sync_request':
      drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid1.o] Error 1
      make[3]: *** Waiting for unfinished jobs....
      drivers/md/raid10.c: In function 'sync_request':
      drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid10.o] Error 1
      drivers/md/md.c: In function 'md_do_sync':
      drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
      
      Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
      unnecessary #includes, #defines, and function declarations").  I added
      the following patch.
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      25570727
  17. 09 Oct, 2008 4 commits
    • Tejun Heo's avatar
      block: move stats from disk to part0 · 074a7aca
      Tejun Heo authored
      
      
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similary.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_fligh() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      074a7aca
    • Tejun Heo's avatar
      block: fix diskstats access · c9959059
      Tejun Heo authored
      
      
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemtion is disabled on
      entry as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disables preemption.  They're removed
      and double underbars ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index and all
      diskstat functions which access per-cpu counters now has @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      c9959059
    • Jens Axboe's avatar
      960e739d
    • Mikulas Patocka's avatar
      drop vmerge accounting · 5df97b91
      Mikulas Patocka authored
      
      
      Remove hw_segments field from struct bio and struct request. Without virtual
      merge accounting they have no purpose.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      5df97b91
  18. 21 Jul, 2008 1 commit
  19. 01 Jul, 2008 1 commit
    • Dan Williams's avatar
      md: resolve external metadata handling deadlock in md_allow_write · b5470dc5
      Dan Williams authored
      
      
      md_allow_write() marks the metadata dirty while holding mddev->lock and then
      waits for the write to complete.  For externally managed metadata this causes a
      deadlock as userspace needs to take the lock to communicate that the metadata
      update has completed.
      
      Change md_allow_write() in the 'external' case to start the 'mark active'
      operation and then return -EAGAIN.  The expected side effects while waiting for
      userspace to write 'active' to 'array_state' are holding off reshape (code
      currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
      fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
      cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
      these failures can be mitigated by coordinating with mdmon.
      
      md_write_start() still prevents writes from occurring until the metadata
      handler has had a chance to take action as it unconditionally waits for
      MD_CHANGE_CLEAN to be cleared.
      
      [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      b5470dc5
  20. 27 Jun, 2008 2 commits
  21. 24 May, 2008 1 commit
    • NeilBrown's avatar
      md: restart recovery cleanly after device failure. · dfc70645
      NeilBrown authored
      
      
      When we get any IO error during a recovery (rebuilding a spare), we abort
      the recovery and restart it.
      
      For RAID6 (and multi-drive RAID1) it may not be best to restart at the
      beginning: when multiple failures can be tolerated, the recovery may be
      able to continue and re-doing all that has already been done doesn't make
      sense.
      
      We already have the infrastructure to record where a recovery is up to
      and restart from there, but it is not being used properly.
      This is because:
        - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
          which causes the recovery not be be checkpointed.
        - We remove spares and then re-added them which loses important state
          information.
      
      The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
      needed.  If there is an error, the relevant drive will be marked as
      Faulty, and that is enough to ensure correct handling of the error.  So we
      first remove MD_RECOVERY_ERR, changing some of the uses of it to
      MD_RECOVERY_INTR.
      
      Then we cause the attempt to remove a non-faulty device from an array to
      fail (unless recovery is impossible as the array is too degraded).  Then
      when remove_and_add_spares attempts to remove the devices on which
      recovery can continue, it will fail, they will remain in place, and
      recovery will continue on them as desired.
      
      Issue:  If we are halfway through rebuilding a spare and another drive
      fails, and a new spare is immediately available,  do we want to:
       1/ complete the current rebuild, then go back and rebuild the new spare or
       2/ restart the rebuild from the start and rebuild both devices in
          parallel.
      
      Both options can be argued for.  The code currently takes option 2 as
        a/ this requires least code change
        b/ this results in a minimally-degraded array in minimal time.
      
      Cc: "Eivind Sarto" <ivan@kasenna.com>
      Signed-off-by: default avatarNeil Brown <neilb@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dfc70645