1. 19 Jul, 2010 1 commit
  2. 27 May, 2010 2 commits
    • Lee Schermerhorn's avatar
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Lee Schermerhorn authored
      
      
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7aac7898
    • Lee Schermerhorn's avatar
      numa: add generic percpu var numa_node_id() implementation · 72812019
      Lee Schermerhorn authored
      
      
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implemention will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implemenation.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72812019
  3. 25 May, 2010 10 commits
    • Haicheng Li's avatar
      mem-hotplug: fix potential race while building zonelist for new populated zone · 4eaf3f64
      Haicheng Li authored
      
      
      Add global mutex zonelists_mutex to fix the possible race:
      
           CPU0                                  CPU1                    CPU2
      (1) zone->present_pages += online_pages;
      (2)                                       build_all_zonelists();
      (3)                                                               alloc_page();
      (4)                                                               free_page();
      (5) build_all_zonelists();
      (6)   __build_all_zonelists();
      (7)     zone->pageset = alloc_percpu();
      
      In step (3,4), zone->pageset still points to boot_pageset, so bad
      things may happen if 2+ nodes are in this state. Even if only 1 node
      is accessing the boot_pageset, (3) may still consume too much memory
      to fail the memory allocations in step (7).
      
      Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
      since there is a new fresh memory block added in step(6).
      
      [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4eaf3f64
    • Haicheng Li's avatar
      mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset · 1f522509
      Haicheng Li authored
      For each new populated zone of hotadded node, need to update its pagesets
      with dynamically allocated per_cpu_pageset struct for all possible CPUs:
      
          1) Detach zone->pageset from the shared boot_pageset
             at end of __build_all_zonelists().
      
          2) Use mutex to protect zone->pageset when it's still
             shared in onlined_pages()
      
      Otherwises, multiple zones of different nodes would share same boot strapping
      boot_pageset for same CPU, which will finally cause below kernel panic:
      
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:1239!
        invalid opcode: 0000 [#1
      
      ] SMP
        ...
        Call Trace:
         [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
         [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
         [<ffffffff81128407>] __page_cache_alloc+0x67/0x70
         [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
         [<ffffffff81132751>] ra_submit+0x21/0x30
         [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
         [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
         [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
         [<ffffffff81266cfa>] nfs_file_read+0xca/0x130
         [<ffffffff8117b20a>] do_sync_read+0xfa/0x140
         [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
         [<ffffffff8117c151>] sys_read+0x51/0x80
         [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
        RIP  [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
         RSP <ffff88000d1e78a8>
        ---[ end trace 4bda28328b9990db ]
      
      [akpm@linux-foundation.org: merge fix]
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f522509
    • Wu Fengguang's avatar
      mem-hotplug: separate setup_per_cpu_pageset() into separate functions · 319774e2
      Wu Fengguang authored
      
      
      No behavior change here.
      
      Move some of setup_per_cpu_pageset() code into a new function
      setup_zone_pageset() that will be useful for memory hotplug.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      319774e2
    • KOSAKI Motohiro's avatar
      mm: introduce free_pages_prepare() · ec95f53a
      KOSAKI Motohiro authored
      
      
      free_hot_cold_page() and __free_pages_ok() have very similar freeing
      preparation.  Consolidate them.
      
      [akpm@linux-foundation.org: fix busted coding style]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec95f53a
    • Mel Gorman's avatar
      mm: compaction: defer compaction using an exponential backoff when compaction fails · 4f92e258
      Mel Gorman authored
      
      
      The fragmentation index may indicate that a failure is due to external
      fragmentation but after a compaction run completes, it is still possible
      for an allocation to fail.  There are two obvious reasons as to why
      
        o Page migration cannot move all pages so fragmentation remains
        o A suitable page may exist but watermarks are not met
      
      In the event of compaction followed by an allocation failure, this patch
      defers further compaction in the zone (1 << compact_defer_shift) times.
      If the next compaction attempt also fails, compact_defer_shift is
      increased up to a maximum of 6.  If compaction succeeds, the defer
      counters are reset again.
      
      The zone that is deferred is the first zone in the zonelist - i.e.  the
      preferred zone.  To defer compaction in the other zones, the information
      would need to be stored in the zonelist or implemented similar to the
      zonelist_cache.  This would impact the fast-paths and is not justified at
      this time.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f92e258
    • Mel Gorman's avatar
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman authored
      
      
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined if an allocation failed due to external fragmentation instead
      of low memory and if so, the calling process will compact until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  As each block is
      compacted, it is checked if a suitable page has been freed and if so, it
      returns.
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • Mel Gorman's avatar
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      
      
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • David Rientjes's avatar
      mm: default to node zonelist ordering when nodes have only lowmem · e325c90f
      David Rientjes authored
      
      
      There are two types of zonelist ordering methodologies:
      
       - node order, preferring allocations on a node to stay local to and
      
       - zone order, preferring allocations come from a higher zone to avoid
         allocating in lowmem zones even though they may not be local.
      
      The ordering technique used by the kernel is configurable on the command
      line, but also has some logic to determine what the default should be.
      
      This logic currently lacks knowledge of systems where a node may only have
      lowmem.  For such systems, it is necessary to use node order so that
      GFP_KERNEL allocations may be satisfied by nodes consisting of only
      lowmem.
      
      If zone order is used, GFP_KERNEL allocations to such nodes are actually
      allocated on a node with local affinity that includes ZONE_NORMAL.
      
      This change defaults to node zonelist ordering if any node lacks
      ZONE_NORMAL.
      
      To force zone order, append 'numa_zonelist_order=zone' to the kernel
      command line.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e325c90f
    • Miao Xie's avatar
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie authored
      
      
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in the way, the allocator may find that
      there is no node to alloc memory.
      
      The reason is that cpuset rebinds the task's mempolicy, it cleans the
      nodes which the allocater can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first(set newly
      allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
      use a variable to tell the write-side task that read-side task is reading
      nodemask, and the write-side task clears newly disallowed nodes after
      read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • Corrado Zoccolo's avatar
      page allocator: reduce fragmentation in buddy allocator by adding buddies that... · 6dda9d55
      Corrado Zoccolo authored
      
      page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists
      
      In order to reduce fragmentation, this patch classifies freed pages in two
      groups according to their probability of being part of a high order merge.
       Pages belonging to a compound whose next-highest buddy is free are more
      likely to be part of a high order merge in the near future, so they will
      be added at the tail of the freelist.  The remaining pages are put at the
      front of the freelist.
      
      In this way, the pages that are more likely to cause a big merge are kept
      free longer.  Consequently there is a tendency to aggregate the
      long-living allocations on a subset of the compounds, reducing the
      fragmentation.
      
      This heuristic was tested on three machines, x86, x86-64 and ppc64 with
      3GB of RAM in each machine.  The tests were kernbench, netperf, sysbench
      and STREAM for performance and a high-order stress test for huge page
      allocations.
      
      KernBench X86
      Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
      User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
      System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
      CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
      
      KernBench X86-64
      Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
      User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
      System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
      CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
      
      KernBench PPC64
      Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
      User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
      System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
      CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
      
      Nothing notable for kernbench.
      
      NetPerf UDP X86
            64    42.68 ( 0.00%)     42.77 ( 0.21%)
           128    85.62 ( 0.00%)     85.32 (-0.35%)
           256   170.01 ( 0.00%)    168.76 (-0.74%)
          1024   655.68 ( 0.00%)    652.33 (-0.51%)
          2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
          3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
          4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
          8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
         16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
      
      NetPerf UDP X86-64
            64   148.82 ( 0.00%)    154.92 ( 3.94%)
           128   298.96 ( 0.00%)    312.95 ( 4.47%)
           256   583.67 ( 0.00%)    626.39 ( 6.82%)
          1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
          2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
          3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
          4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
          8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
         16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
                   1.64%             2.73%
      
      NetPerf UDP PPC64
            64    49.98 ( 0.00%)     50.25 ( 0.54%)
           128    98.66 ( 0.00%)    100.95 ( 2.27%)
           256   197.33 ( 0.00%)    191.03 (-3.30%)
          1024   761.98 ( 0.00%)    785.07 ( 2.94%)
          2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
          3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
          4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
          8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
         16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
      
      The tests are run to have confidence limits within 1%.  Results marked
      with a * were not confident although in this case, it's only outside by
      small amounts.  Even with some results that were not confident, the
      netperf UDP results were generally positive.
      
      NetPerf TCP X86
            64   652.25 ( 0.00%)*   648.12 (-0.64%)*
                  23.80%            22.82%
           128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
                  21.03%            18.90%
           256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
                   1.00%            16.46%
          1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
                  13.37%            11.39%
          2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
                   9.76%            12.48%
          3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
                   6.49%             8.75%
          4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
                   9.85%             8.50%
          8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
                   9.13%            13.04%
         16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
                   7.73%             8.68%
      
      NETPERF TCP X86-64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
                   5.79%             4.78%
           128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
                   3.68%             6.06%
           256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
                   3.02%             2.10%
          1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
                   5.89%             6.55%
          2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
                   7.08%             7.44%
          3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
                   6.87%             7.33%
          4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
                   6.86%             8.18%
          8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
                   7.49%             5.55%
         16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
                   7.36%             6.49%
      
      NETPERF TCP PPC64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64   594.17 ( 0.00%)    596.04 ( 0.31%)*
                   1.00%             2.29%
           128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
                   1.30%             1.40%
           256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
                   1.25%             1.00%
          1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
                   1.02%             1.00%
          2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
                   1.15%             1.04%
          3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
          4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
          8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
         16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
      
      There was a distinct lack of confidence in the X86* figures so I included
      what the devation was where the results were not confident.  Many of the
      results, whether gains or losses were within the standard deviation so no
      solid conclusion can be reached on performance impact.  Looking at the
      figures, only the X86-64 ones look suspicious with a few losses that were
      outside the noise.  However, the results were so unstable that without
      knowing why they vary so much, a solid conclusion cannot be reached.
      
      SYSBENCH X86
                    sysbench-vanilla     pgalloc-delay
                 1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
                 2 14901.11 ( 0.00%) 13683.44 (-8.90%)
                 3 15171.71 ( 0.00%) 14888.25 (-1.90%)
                 4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
                 5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
                 6 14870.33 ( 0.00%) 14845.57 (-0.17%)
                 7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
                 8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
      
      SYSBENCH X86-64
                 1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
                 2 34276.39 ( 0.00%) 34251.00 (-0.07%)
                 3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
                 4 66667.10 ( 0.00%) 66174.69 (-0.74%)
                 5 66003.91 ( 0.00%) 65685.25 (-0.49%)
                 6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
                 7 64933.16 ( 0.00%) 64379.23 (-0.86%)
                 8 63353.30 ( 0.00%) 63281.22 (-0.11%)
                 9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
                10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
                11 62092.81 ( 0.00%) 61787.75 (-0.49%)
                12 61330.11 ( 0.00%) 61036.34 (-0.48%)
                13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
                14 62304.48 ( 0.00%) 62064.90 (-0.39%)
                15 63296.48 ( 0.00%) 62875.16 (-0.67%)
                16 63951.76 ( 0.00%) 63769.09 (-0.29%)
      
      SYSBENCH PPC64
                                   -sysbench-pgalloc-delay-sysbench
                    sysbench-vanilla     pgalloc-delay
                 1  7645.08 ( 0.00%)  7467.43 (-2.38%)
                 2 14856.67 ( 0.00%) 14558.73 (-2.05%)
                 3 21952.31 ( 0.00%) 21683.64 (-1.24%)
                 4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
                 5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
                 6 27477.10 ( 0.00%) 27337.45 (-0.51%)
                 7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
                 8 26642.91 ( 0.00%) 25274.33 (-5.41%)
                 9 25137.27 ( 0.00%) 24810.06 (-1.32%)
                10 24451.99 ( 0.00%) 24275.85 (-0.73%)
                11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
                12 24234.81 ( 0.00%) 23640.89 (-2.51%)
                13 24577.75 ( 0.00%) 24433.50 (-0.59%)
                14 25640.19 ( 0.00%) 25116.52 (-2.08%)
                15 26188.84 ( 0.00%) 26181.36 (-0.03%)
                16 26782.37 ( 0.00%) 26255.99 (-2.00%)
      
      Again, there is little to conclude here.  While there are a few losses,
      the results vary by +/- 8% in some cases.  They are the results of most
      concern as there are some large losses but it's also within the variance
      typically seen between kernel releases.
      
      The STREAM results varied so little and are so verbose that I didn't
      include them here.
      
      The final test stressed how many huge pages can be allocated.  The
      absolute number of huge pages allocated are the same with or without the
      page.  However, the "unusability free space index" which is a measure of
      external fragmentation was slightly lower (lower is better) throughout the
      lifetime of the system.  I also measured the latency of how long it took
      to successfully allocate a huge page.  The latency was slightly lower and
      on X86 and PPC64, more huge pages were allocated almost immediately from
      the free lists.  The improvement is slight but there.
      
      [mel@csn.ul.ie: Tested, reworked for less branches]
      [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6dda9d55
  4. 16 Mar, 2010 1 commit
  5. 12 Mar, 2010 2 commits
  6. 06 Mar, 2010 6 commits
    • David Rientjes's avatar
      mm: suppress pfn range output for zones without pages · 72f0ba02
      David Rientjes authored
      
      
      free_area_init_nodes() emits pfn ranges for all zones on the system.
      There may be no pages on a higher zone, however, due to memory limitations
      or the use of the mem= kernel parameter.  For example:
      
      Zone PFN ranges:
        DMA      0x00000001 -> 0x00001000
        DMA32    0x00001000 -> 0x00100000
        Normal   0x00100000 -> 0x00100000
      
      The implementation copies the previous zone's highest pfn, if any, as the
      next zone's lowest pfn.  If its highest pfn is then greater than the
      amount of addressable memory, the upper memory limit is used instead.
      Thus, both the lowest and highest possible pfn for higher zones without
      memory may be the same.
      
      The pfn range for zones without memory is now shown as "empty" instead.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72f0ba02
    • Rafael J. Wysocki's avatar
      mm/pm: force GFP_NOIO during suspend/hibernation and resume · 452aa699
      Rafael J. Wysocki authored
      
      
      There are quite a few GFP_KERNEL memory allocations made during
      suspend/hibernation and resume that may cause the system to hang, because
      the I/O operations they depend on cannot be completed due to the
      underlying devices being suspended.
      
      Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
      gfp_allowed_mask before suspend/hibernation and restoring the original
      values of these bits in gfp_allowed_mask durig the subsequent resume.
      
      [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
      Signed-off-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Reported-by: default avatarMaxim Levitsky <maximlevitsky@gmail.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      452aa699
    • KOSAKI Motohiro's avatar
      mm: restore zone->all_unreclaimable to independence word · 93e4a89a
      KOSAKI Motohiro authored
      commit e815af95
      
       ("change all_unreclaimable zone member to flags") changed
      all_unreclaimable member to bit flag.  But it had an undesireble side
      effect.  free_one_page() is one of most hot path in linux kernel and
      increasing atomic ops in it can reduce kernel performance a bit.
      
      Thus, this patch revert such commit partially. at least
      all_unreclaimable shouldn't share memory word with other zone flags.
      
      [akpm@linux-foundation.org: fix patch interaction]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93e4a89a
    • Li Hong's avatar
      mm: remove free_hot_page() · fc91668e
      Li Hong authored
      
      
      free_hot_page() is just a wrapper around free_hot_cold_page() with
      parameter 'cold = 0'.  After adding a clear comment for
      free_hot_cold_page(), it is reasonable to remove a level of call.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarLi Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc91668e
    • Li Hong's avatar
      mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct · c475dab6
      Li Hong authored
      
      
      Move a call of trace_mm_page_free_direct() from free_hot_page() to
      free_hot_cold_page().  It is clearer and close to kmemcheck_free_shadow(),
      as it is done in function __free_pages_ok().
      Signed-off-by: default avatarLi Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c475dab6
    • Li Hong's avatar
      mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct · f650316c
      Li Hong authored
      
      
      trace_mm_page_free_direct() is called in function __free_pages().  But it
      is called again in free_hot_page() if order == 0 and produce duplicate
      records in trace file for mm_page_free_direct event.  As below:
      
      K-PID    CPU#    TIMESTAMP  FUNCTION
        gnome-terminal-1567  [000]  4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
      
      This patch removes the first call and adds a call to
      trace_mm_page_free_direct() in __free_pages_ok().
      Signed-off-by: default avatarLi Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f650316c
  7. 22 Feb, 2010 1 commit
    • Yinghai Lu's avatar
      x86: Fix non-bootmem compilation on PowerPC · 2ee78f7b
      Yinghai Lu authored
      
      
      These build errors on some non-x86 platforms (PowerPC for example):
      
       mm/page_alloc.c: In function '__alloc_memory_core_early':
         mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
         mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'
      
      The function is only needed on CONFIG_NO_BOOTMEM.
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      2ee78f7b
  8. 12 Feb, 2010 1 commit
  9. 29 Jan, 2010 1 commit
    • Hugh Dickins's avatar
      mm: fix migratetype bug which slowed swapping · a7016235
      Hugh Dickins authored
      After memory pressure has forced it to dip into the reserves, 2.6.32's
      5f8dcc21
      
       "page-allocator: split per-cpu
      list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
      pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.
      
      Fix that in the most straightforward way (which, considering the overheads
      of alternative approaches, is Mel's preference): the right migratetype is
      already in page_private(page), but free_pcppages_bulk() wasn't using it.
      
      How did this bug show up?  As a 20% slowdown in my tmpfs loop kbuild
      swapping tests, on PowerMac G5 with SLUB allocator.  Bisecting to that
      commit was easy, but explaining the magnitude of the slowdown not easy.
      
      The same effect appears, but much less markedly, with SLAB, and even
      less markedly on other machines (the PowerMac divides into fewer zones
      than x86, I think that may be a factor).  We guess that lumpy reclaim
      of short-lived high-order pages is implicated in some way, and probably
      this bug has been tickling a poor decision somewhere in page reclaim.
      
      But instrumentation hasn't told me much, I've run out of time and
      imagination to determine exactly what's going on, and shouldn't hold up
      the fix any longer: it's valid, and might even fix other misbehaviours.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7016235
  10. 17 Jan, 2010 1 commit
  11. 16 Jan, 2010 1 commit
  12. 05 Jan, 2010 1 commit
    • Christoph Lameter's avatar
      this_cpu: Page allocator conversion · 99dcc3e5
      Christoph Lameter authored
      
      
      Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
      
      This drastically reduces the size of struct zone for systems with large
      amounts of processors and allows placement of critical variables of struct
      zone in one cacheline even on very large systems.
      
      Another effect is that the pagesets of one processor are placed near one
      another. If multiple pagesets from different zones fit into one cacheline
      then additional cacheline fetches can be avoided on the hot paths when
      allocating memory from multiple zones.
      
      Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
      are reduced and we can drop the zone_pcp macro.
      
      Hotplug handling is also simplified since cpu alloc can bring up and
      shut down cpu areas for a specific cpu as a whole. So there is no need to
      allocate or free individual pagesets.
      
      V7-V8:
      - Explain chicken egg dilemmna with percpu allocator.
      
      V4-V5:
      - Fix up cases where per_cpu_ptr is called before irq disable
      - Integrate the bootstrap logic that was separate before.
      
      tj: Build failure in pageset_cpuup_callback() due to missing ret
          variable fixed.
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      99dcc3e5
  13. 23 Dec, 2009 1 commit
  14. 18 Dec, 2009 1 commit
    • Robert Jennings's avatar
      mm: Add notifier in pageblock isolation for balloon drivers · 925cc71e
      Robert Jennings authored and Benjamin Herrenschmidt's avatar Benjamin Herrenschmidt committed
      
      
      Memory balloon drivers can allocate a large amount of memory which is not
      movable but could be freed to accomodate memory hotplug remove.
      
      Prior to calling the memory hotplug notifier chain the memory in the
      pageblock is isolated.  Currently, if the migrate type is not
      MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
      for that page range to fail.
      
      Rather than failing pageblock isolation if the migrateteype is not
      MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
      and not on the LRU, are owned by a registered balloon driver (or other
      entity) using a notifier chain.  If all of the non-movable pages are owned
      by a balloon, they can be freed later through the memory notifier chain
      and the range can still be isolated in set_migratetype_isolate().
      Signed-off-by: default avatarRobert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Gerald Schaefer <geralds@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin Herrenschmidt's avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      925cc71e
  15. 17 Dec, 2009 1 commit
    • Yinghai Lu's avatar
      x86: Fix checking of SRAT when node 0 ram is not from 0 · 32996250
      Yinghai Lu authored
      Found one system that boot from socket1 instead of socket0, SRAT get rejected...
      
      [    0.000000] SRAT: Node 1 PXM 0 0-a0000
      [    0.000000] SRAT: Node 1 PXM 0 100000-80000000
      [    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
      [    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
      [    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
      [    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
      [    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
      [    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
      [    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
      [    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
      ...
      [    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
      [    0.000000] NUMA: Using 20 for the hash shift.
      [    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
      [    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
      [    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
      [    0.000000] SRAT: SRAT not used.
      
      the early_node_map is not sorted because node0 with non zero start come first.
      
      so try to sort it right away after all regions are registered.
      
      also fixs refression by 8716273c
      
       (x86: Export srat physical topology)
      
      -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
      -v3: update comments.
      Reported-and-tested-by: default avatarJens Axboe <jens.axboe@oracle.com>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B2579D2.3010201@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      32996250
  16. 16 Dec, 2009 2 commits
    • KAMEZAWA Hiroyuki's avatar
      oom-kill: fix NUMA constraint check with nodemask · 4365a567
      KAMEZAWA Hiroyuki authored
      
      
      Fix node-oriented allocation handling in oom-kill.c I myself think of this
      as a bugfix not as an ehnancement.
      
      In these days, things are changed as
        - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
        - mempolicy don't maintain its own private zonelists.
        (And cpuset doesn't use nodemask for __alloc_pages_nodemask())
      
      So, current oom-killer's check function is wrong.
      
      This patch does
        - check nodemask, if nodemask && nodemask doesn't cover all
          node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
        - Scan all zonelist under nodemask, if it hits cpuset's wall
          this faiulre is from cpuset.
      And
        - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
          This doesn't change "current" behavior. If callers use __GFP_THISNODE
          it should handle "page allocation failure" by itself.
      
        - handle __GFP_NOFAIL+__GFP_THISNODE path.
          This is something like a FIXME but this gfpmask is not used now.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4365a567
    • Wu Fengguang's avatar
      HWPOISON: detect free buddy pages explicitly · 8d22ba1b
      Wu Fengguang authored
      
      
      Most free pages in the buddy system have no PG_buddy set.
      Introduce is_free_buddy_page() for detecting them reliably.
      
      CC: Nick Piggin <npiggin@suse.de>
      CC: Mel Gorman <mel@linux.vnet.ibm.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      8d22ba1b
  17. 15 Dec, 2009 1 commit
    • Hugh Dickins's avatar
      mm: CONFIG_MMU for PG_mlocked · af8e3354
      Hugh Dickins authored
      
      
      Remove three degrees of obfuscation, left over from when we had
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af8e3354
  18. 26 Nov, 2009 2 commits
    • Ingo Molnar's avatar
      tracing: Fix kmem event exports · 4d795fb1
      Ingo Molnar authored
      Commit 53d0422c
      
       ("tracing: Convert some kmem events to DEFINE_EVENT")
      moved the kmem tracepoint creation from util.c to page_alloc.c,
      but forgot to move the exports.
      
      Move them back.
      
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      4d795fb1
    • Li Zefan's avatar
      tracing: Convert some kmem events to DEFINE_EVENT · 53d0422c
      Li Zefan authored
      
      
      Use DECLARE_EVENT_CLASS to remove duplicate code:
      
         text    data     bss     dec     hex filename
       333987   69800   27228  431015   693a7 mm/built-in.o.old
       330030   69800   27228  427058   68432 mm/built-in.o
      
      8 events are converted:
      
        kmem_alloc: kmalloc, kmem_cache_alloc
        kmem_alloc_node: kmalloc_node, kmem_cache_alloc_node
        kmem_free: kfree, kmem_cache_free
        mm_page: mm_page_alloc_zone_locked, mm_page_pcpu_drain
      
      No change in functionality.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      LKML-Reference: <4B0E286A.2000405@cn.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      53d0422c
  19. 12 Nov, 2009 2 commits
  20. 29 Oct, 2009 1 commit
    • Andrew Morton's avatar
      revert "mm: oom analysis: add buffer cache information to show_free_areas()" · b76146ed
      Andrew Morton authored
      Revert
      
          commit 71de1ccb
      
      
          Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
          AuthorDate: Mon Sep 21 17:01:31 2009 -0700
          Commit:     Linus Torvalds <torvalds@linux-foundation.org>
          CommitDate: Tue Sep 22 07:17:27 2009 -0700
      
              mm: oom analysis: add buffer cache information to show_free_areas()
      
      show_free_areas() is called during page allocation failures, and page
      allocation failures can occur in any calling context.
      
      But nr_blockdev_pages() takes VFS locks which should not be taken from
      hard IRQ context (at least).  The result is lockdep warnings (and
      deadlockability) during page allocation failures.
      
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b76146ed
  21. 24 Sep, 2009 1 commit