1. 09 Oct, 2009 1 commit
    • Wu Fengguang's avatar
      writeback: account IO throttling wait as iowait · d25105e8
      Wu Fengguang authored
      
      
      It makes sense to do IOWAIT when someone is blocked
      due to IO throttle, as suggested by Kame and Peter.
      
      There is an old comment for not doing IOWAIT on throttle,
      however it has been mismatching the code for a long time.
      
      If we stop accounting IOWAIT for 2.6.32, it could be an
      undesirable behavior change. So restore the io_schedule.
      
      CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      d25105e8
  2. 25 Sep, 2009 4 commits
  3. 24 Sep, 2009 1 commit
  4. 22 Sep, 2009 1 commit
  5. 21 Sep, 2009 2 commits
  6. 16 Sep, 2009 4 commits
    • Jens Axboe's avatar
      writeback: separate starting of sync vs opportunistic writeback · b6e51316
      Jens Axboe authored
      
      
      bdi_start_writeback() is currently split into two paths, one for
      WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
      for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
      only WB_SYNC_NONE.
      
      Push down the writeback_control allocation and only accept the
      parameters that make sense for each function. This cleans up
      the API considerably.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      b6e51316
    • Jens Axboe's avatar
      writeback: use RCU to protect bdi_list · cfc4ba53
      Jens Axboe authored
      
      
      Now that bdi_writeback_all() no longer handles integrity writeback,
      it doesn't have to block anymore. This means that we can switch
      bdi_list reader side protection to RCU.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      cfc4ba53
    • Jens Axboe's avatar
      writeback: get rid of wbc->for_writepages · 1fe06ad8
      Jens Axboe authored
      
      
      It's only set, it's never checked. Kill it.
      Acked-by: Jan Kara's avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      1fe06ad8
    • Wu Fengguang's avatar
      HWPOISON: shmem: call set_page_dirty() with locked page · 6746aff7
      Wu Fengguang authored
      
      
      The dirtying of page and set_page_dirty() can be moved into the page lock.
      
      - In shmem_write_end(), the page was dirtied while the page lock was held,
        but it's being marked dirty just after dropping the page lock.
      - In shmem_symlink(), both dirtying and marking can be moved into page lock.
      
      It's valuable for the hwpoison code to know whether one bad page can be dropped
      without losing data. It mainly judges by testing the PG_dirty bit after taking
      the page lock. So it becomes important that the dirtying of page and the
      marking of dirtiness are both done inside the page lock. Which is a common
      practice, but sadly not a rule.
      
      The noticeable exceptions are
      - mapped pages
      - pages with buffer_heads
      The above pages could go dirty at any time. Fortunately the hwpoison will
      unmap the page and release the buffer_heads beforehand anyway.
      
      Many other types of pages (eg. metadata pages) can also be dirtied at will by
      their owners, the hwpoison code cannot do meaningful things to them anyway.
      Only the dirtiness of pagecache pages owned by regular files are interested.
      
      v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)
      Acked-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: default avatarWANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      6746aff7
  7. 11 Sep, 2009 2 commits
    • Jens Axboe's avatar
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe authored
      
      
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • Jens Axboe's avatar
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe authored
      
      
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
  8. 10 Jul, 2009 1 commit
  9. 01 Jul, 2009 1 commit
    • Richard Kennedy's avatar
      mm: prevent balance_dirty_pages() from doing too much work · d7831a0b
      Richard Kennedy authored
      
      
      balance_dirty_pages can overreact and move all of the dirty pages to
      writeback unnecessarily.
      
      balance_dirty_pages makes its decision to throttle based on the number of
      dirty plus writeback pages that are over the calculated limit,so it will
      continue to move pages even when there are plenty of pages in writeback
      and less than the threshold still dirty.
      
      This allows it to overshoot its limits and move all the dirty pages to
      writeback while waiting for the drives to catch up and empty the writeback
      list.
      
      A simple fio test easily demonstrates this problem.
      
      fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10
      
      This is the simplest fix I could find, but I'm not entirely sure that it
      alone will be enough for all cases.  But it certainly is an improvement on
      my desktop machine writing to 2 disks.
      
      Do we need something more for machines with large arrays where
      bdi_threshold * number_of_drives is greater than the dirty_ratio ?
      Signed-off-by: default avatarRichard Kennedy <richard@rsk.demon.co.uk>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7831a0b
  10. 24 Jun, 2009 1 commit
    • Tejun Heo's avatar
      percpu: clean up percpu variable definitions · 245b2e70
      Tejun Heo authored
      
      
      Percpu variable definition is about to be updated such that all percpu
      symbols including the static ones must be unique.  Update percpu
      variable definitions accordingly.
      
      * as,cfq: rename ioc_count uniquely
      
      * cpufreq: rename cpu_dbs_info uniquely
      
      * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
      
      * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
        rename it
      
      * ipv4,6: rename cookie_scratch uniquely
      
      * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
        pmc_irq_entry and nmi_entry to pmc_nmi_entry
      
      * perf_counter: rename disable_count to perf_disable_count
      
      * ftrace: rename test_event_disable to ftrace_test_event_disable
      
      * kmemleak: rename test_pointer to kmemleak_test_pointer
      
      * mce: rename next_interval to mce_next_interval
      
      [ Impact: percpu usage cleanups, no duplicate static percpu var names ]
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      245b2e70
  11. 17 Jun, 2009 1 commit
  12. 17 May, 2009 1 commit
  13. 01 Apr, 2009 2 commits
  14. 26 Mar, 2009 1 commit
    • Wu Fengguang's avatar
      writeback: double the dirty thresholds · 1b5e62b4
      Wu Fengguang authored
      Enlarge default dirty ratios from 5/10 to 10/20.  This fixes [Bug
      #12809] iozone regression with 2.6.29-rc6.
      
      The iozone benchmarks are performed on a 1200M file, with 8GB ram.
      
        iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
        iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
      
      The performance regression is triggered by commit 1cf6e7d8
      
      (mm: task
      dirty accounting fix), which makes more correct/thorough dirty
      accounting.
      
      The default 5/10 dirty ratios were picked (a) with the old dirty logic
      and (b) largely at random and (c) designed to be aggressive.  In
      particular, that (a) means that having fixed some of the dirty
      accounting, maybe the real bug is now that it was always too aggressive,
      just hidden by an accounting issue.
      
      The enlarged 10/20 dirty ratios are just about enough to fix the regression.
      
      [ We will have to look at how this affects the old fsync() latency issue,
        but that probably will need independent work.  - Linus ]
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: default avatar"Lin, Ming M" <ming.m.lin@intel.com>
      Tested-by: default avatar"Lin, Ming M" <ming.m.lin@intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b5e62b4
  15. 18 Feb, 2009 1 commit
  16. 12 Feb, 2009 1 commit
    • Nick Piggin's avatar
      Fix page writeback thinko, causing Berkeley DB slowdown · 3a4c6800
      Nick Piggin authored
      A bug was introduced into write_cache_pages cyclic writeout by commit
      31a12666 ("mm: write_cache_pages cyclic
      fix").  The intention (and comments) is that we should cycle back and
      look for more dirty pages at the beginning of the file if there is no
      more work to be done.
      
      But the !done condition was dropped from the test.  This means that any
      time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
      will set index to 0, then goto again.  This will set done_index to
      index, then find done is set, so will proceed to the end of the
      function.  When updating mapping->writeback_index for cyclic writeout,
      we now use done_index == 0, so we're always cycling back to 0.
      
      This seemed to be causing random mmap writes (slapadd and iozone) to
      start writing more pages from the LRU and writeout would slowdown, and
      caused bugzilla entry
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=12604
      
      
      
      about Berkeley DB slowing down dramatically.
      
      With this patch, iozone random write performance is increased nearly
      5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Reported-and-tested-by: Jan Kara's avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a4c6800
  17. 11 Feb, 2009 2 commits
  18. 04 Feb, 2009 1 commit
    • Artem Bityutskiy's avatar
      write-back: fix nr_to_write counter · dcf6a79d
      Artem Bityutskiy authored
      Commit 05fe478d
      
       introduced some
      @wbc->nr_to_write breakage.
      
      It made the following changes:
       1. Decrement wbc->nr_to_write instead of nr_to_write
       2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
       3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
          WB_SYNC_NONE, otherwise keep going.
      
      However, according to the commit message, the intention was to only make
      change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
      and it breaks UBIFS expectations, so if needed, it should be done
      separately later.  And change 2 does not seem to be documented in the
      commit message.
      
      This patch does the following:
       1. Undo changes 1 and 2
       2. Add a comment explaining change 3 (it very useful to have comments
          in _code_, not only in the commit).
      Signed-off-by: default avatarArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcf6a79d
  19. 06 Jan, 2009 10 commits
    • David Rientjes's avatar
      mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes authored
      
      
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2da02997
    • David Rientjes's avatar
      mm: change dirty limit type specifiers to unsigned long · 364aeb28
      David Rientjes authored
      
      
      The background dirty and dirty limits are better defined with type
      specifiers of unsigned long since negative writeback thresholds are not
      possible.
      
      These values, as returned by get_dirty_limits(), are normally compared
      with ZVC values to determine whether writeback shall commence or be
      throttled.  Such page counts cannot be negative, so declaring the page
      limits as signed is unnecessary.
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      364aeb28
    • Andrew Morton's avatar
      mm: write_cache_pages more terminate quickly · 82fd1a9a
      Andrew Morton authored
      
      
      Now that we have the early-termination logic in place, it makes sense to
      bail out early in all other cases where done is set to 1.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82fd1a9a
    • Nick Piggin's avatar
      mm: write_cache_pages terminate quickly · d5482cdf
      Nick Piggin authored
      
      
      Terminate the write_cache_pages loop upon encountering the first page past
      end, without locking the page.  Pages cannot have their index change when
      we have a reference on them (truncate, eg truncate_inode_pages_range
      performs the same check without the page lock).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5482cdf
    • Nick Piggin's avatar
      mm: write_cache_pages optimise page cleaning · 515f4a03
      Nick Piggin authored
      
      
      In write_cache_pages, if we get stuck behind another process that is
      cleaning pages, we will be forced to wait for them to finish, then perform
      our own writeout (if it was redirtied during the long wait), then wait for
      that.
      
      If a page under writeout is still clean, we can skip waiting for it (if
      we're part of a data integrity sync, we'll be waiting for all writeout
      pages afterwards, so we'll still be waiting for the other guy's write
      that's cleaned the page).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      515f4a03
    • Nick Piggin's avatar
      mm: write_cache_pages cleanups · 5a3d5c98
      Nick Piggin authored
      
      
      Get rid of some complex expressions from flow control statements, add a
      comment, remove some duplicate code.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a3d5c98
    • Nick Piggin's avatar
      mm: write_cache_pages integrity fix · 05fe478d
      Nick Piggin authored
      
      
      In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
      so the function will return success after writing out nr_to_write pages,
      even if that was not sufficient to guarantee data integrity.
      
      The callers tend to set it to values that could break data interity
      semantics easily in practice.  For example, nr_to_write can be set to
      mapping->nr_pages * 2, however if a file has a single, dirty page, then
      fsync is called, subsequent pages might be concurrently added and dirtied,
      then write_cache_pages might writeout two of these newly dirty pages,
      while not writing out the old page that should have been written out.
      
      Fix this by ignoring nr_to_write if it is a data integrity sync.
      
      This is a data integrity bug.
      
      The reason this has been done in the past is to avoid stalling sync
      operations behind page dirtiers.
      
       "If a file has one dirty page at offset 1000000000000000 then someone
        does an fsync() and someone else gets in first and starts madly writing
        pages at offset 0, we want to write that page at 1000000000000000.
        Somehow."
      
      What we do today is return success after an arbitrary amount of pages are
      written, whether or not we have provided the data-integrity semantics that
      the caller has asked for.  Even this doesn't actually fix all stall cases
      completely: in the above situation, if the file has a huge number of pages
      in pagecache (but not dirty), then mapping->nrpages is going to be huge,
      even if pages are being dirtied.
      
      This change does indeed make the possibility of long stalls lager, and
      that's not a good thing, but lying about data integrity is even worse.  We
      have to either perform the sync, or return -ELINUXISLAME so at least the
      caller knows what has happened.
      
      There are subsequent competing approaches in the works to solve the stall
      problems properly, without compromising data integrity.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05fe478d
    • Nick Piggin's avatar
      mm: write_cache_pages writepage error fix · 00266770
      Nick Piggin authored
      
      
      In write_cache_pages, if ret signals a real error, but we still have some
      pages left in the pagevec, done would be set to 1, but the remaining pages
      would continue to be processed and ret will be overwritten in the process.
      
      It could easily be overwritten with success, and thus success will be
      returned even if there is an error.  Thus the caller is told all writes
      succeeded, wheras in reality some did not.
      
      Fix this by bailing immediately if there is an error, and retaining the
      first error code.
      
      This is a data integrity bug.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00266770
    • Nick Piggin's avatar
      mm: write_cache_pages early loop termination · bd19e012
      Nick Piggin authored
      
      
      We'd like to break out of the loop early in many situations, however the
      existing code has been setting mapping->writeback_index past the final
      page in the pagevec lookup for cyclic writeback.  This is a problem if we
      don't process all pages up to the final page.
      
      Currently the code mostly keeps writeback_index reasonable and hacked
      around this by not breaking out of the loop or writing pages outside the
      range in these cases.  Keep track of a real "done index" that enables us
      to terminate the loop in a much more flexible manner.
      
      Needed by the subsequent patch to preserve writepage errors, and then
      further patches to break out of the loop early for other reasons.  However
      there are no functional changes with this patch alone.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd19e012
    • Nick Piggin's avatar
      mm: write_cache_pages cyclic fix · 31a12666
      Nick Piggin authored
      
      
      In write_cache_pages, scanned == 1 is supposed to mean that cyclic
      writeback has circled through zero, thus we should not circle again.
      However it gets set to 1 after the first successful pagevec lookup.  This
      leads to cases where not enough data gets written.
      
      Counterexample: file with first 10 pages dirty, writeback_index == 5,
      nr_to_write == 10.  Then the 5 last pages will be found, and scanned will
      be set to 1, after writing those out, we will not cycle back to get the
      first 5.
      
      Rework this logic, now we'll always cycle unless we started off from index
      0.  When cycling, only write out as far as 1 page before the start page
      from the first cycle (so we don't write parts of the file twice).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31a12666
  20. 20 Oct, 2008 1 commit
    • Rik van Riel's avatar
      vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel authored
      
      
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
  21. 16 Oct, 2008 1 commit