1. 26 Jul, 2016 2 commits
  2. 20 May, 2016 3 commits
    • Mel Gorman's avatar
      mm, page_alloc: avoid looking up the first zone in a zonelist twice · c33d6c06
      Mel Gorman authored
      The allocator fast path looks up the first usable zone in a zonelist and
      then get_page_from_freelist does the same job in the zonelist iterator.
      This patch preserves the necessary information.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                              fastmark-v1r20             initonce-v1r20
        Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
        Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
        Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
        Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
        Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
        Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
        Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
        Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
        Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
        Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
        Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
        Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
        Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
        Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
      
      The benefit is negligible and the results are within the noise but each
      cycle counts.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c33d6c06
    • Andrew Morton's avatar
      mm/mempolicy.c:offset_il_node() document and clarify · fee83b3a
      Andrew Morton authored
      This code was pretty obscure and was relying upon obscure side-effects
      of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to
      -1.
      
      Clean that all up and document the function's intent.
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fee83b3a
    • Andrew Morton's avatar
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
  3. 17 Mar, 2016 1 commit
  4. 15 Mar, 2016 1 commit
  5. 09 Mar, 2016 1 commit
  6. 16 Feb, 2016 1 commit
    • Dave Hansen's avatar
      mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Dave Hansen authored
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used.
      (implemented in previous patch)
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.comSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      d4edcf0d
  7. 06 Feb, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mempolicy: do not try to queue pages from !vma_migratable() · 77bf45e7
      Kirill A. Shutemov authored
      Maybe I miss some point, but I don't see a reason why we try to queue
      pages from non migratable VMAs.
      
      This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <numaif.h>
      
          #define SIZE 0x2000
      
          int foo;
      
          int main()
          {
              int fd;
              char *p;
              unsigned long mask = 2;
      
              fd = open("/dev/sg0", O_RDWR);
              p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
              /* Faultin pages */
              foo = p[0] + p[0x1000];
              mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
              return 0;
          }
      
      The only case when we can queue pages from such VMA is MPOL_MF_STRICT
      plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU,
      but gfp mask is not sutable for migaration (see mapping_gfp_mask() check
      in vma_migratable()).  That's looks like a bug to me.
      
      Let's filter out non-migratable vma at start of queue_pages_test_walk()
      and go to queue_pages_pte_range() only if MPOL_MF_MOVE or
      MPOL_MF_MOVE_ALL flag is set.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77bf45e7
  8. 16 Jan, 2016 3 commits
  9. 15 Jan, 2016 1 commit
    • Nathan Zimmer's avatar
      mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Nathan Zimmer authored
      When running the SPECint_rate gcc on some very large boxes it was
      noticed that the system was spending lots of time in
      mpol_shared_policy_lookup().  The gamess benchmark can also show it and
      is what I mostly used to chase down the issue since the setup for that I
      found to be easier.
      
      To be clear the binaries were on tmpfs because of disk I/O requirements.
      We then used text replication to avoid icache misses and having all the
      copies banging on the memory where the instruction code resides.  This
      results in us hitting a bottleneck in mpol_shared_policy_lookup() since
      lookup is serialised by the shared_policy lock.
      
      I have only reproduced this on very large (3k+ cores) boxes.  The
      problem starts showing up at just a few hundred ranks getting worse
      until it threatens to livelock once it gets large enough.  For example
      on the gamess benchmark at 128 ranks this area consumes only ~1% of
      time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
      90%.
      
      To alleviate the contention in this area I converted the spinlock to an
      rwlock.  This allows a large number of lookups to happen simultaneously.
      The results were quite good reducing this consumtion at max ranks to
      around 2%.
      
      [akpm@linux-foundation.org: tidy up code comments]
      Signed-off-by: default avatarNathan Zimmer <nzimmer@sgi.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a8c7bb5
  10. 08 Sep, 2015 2 commits
    • Vlastimil Babka's avatar
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka authored
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRobin Holt <robinmholt@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
    • Aristeu Rozanski's avatar
      mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range() · acda0c33
      Aristeu Rozanski authored
      This check was introduced as part of
         6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()")
      
      which got duplicated by
         48684a65 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)")
      
      by reintroducing it earlier on queue_page_test_walk()
      Signed-off-by: default avatarAristeu Rozanski <aris@redhat.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      acda0c33
  11. 04 Sep, 2015 1 commit
  12. 25 Jun, 2015 1 commit
    • Vlastimil Babka's avatar
      mm, thp: respect MPOL_PREFERRED policy with non-local node · 0867a57c
      Vlastimil Babka authored
      Since commit 077fcf11 ("mm/thp: allocate transparent hugepages on
      local node"), we handle THP allocations on page fault in a special way -
      for non-interleave memory policies, the allocation is only attempted on
      the node local to the current CPU, if the policy's nodemask allows the
      node.
      
      This is motivated by the assumption that THP benefits cannot offset the
      cost of remote accesses, so it's better to fallback to base pages on the
      local node (which might still be available, while huge pages are not due
      to fragmentation) than to allocate huge pages on a remote node.
      
      The nodemask check prevents us from violating e.g.  MPOL_BIND policies
      where the local node is not among the allowed nodes.  However, the
      current implementation can still give surprising results for the
      MPOL_PREFERRED policy when the preferred node is different than the
      current CPU's local node.
      
      In such case we should honor the preferred node and not use the local
      node, which is what this patch does.  If hugepage allocation on the
      preferred node fails, we fall back to base pages and don't try other
      nodes, with the same motivation as is done for the local node hugepage
      allocations.  The patch also moves the MPOL_INTERLEAVE check around to
      simplify the hugepage specific test.
      
      The difference can be demonstrated using in-tree transhuge-stress test
      on the following 2-node machine where half memory on one node was
      occupied to show the difference.
      
      > numactl --hardware
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
      node 0 size: 7878 MB
      node 0 free: 3623 MB
      node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
      node 1 size: 8045 MB
      node 1 free: 7818 MB
      node distances:
      node   0   1
        0:  10  21
        1:  21  10
      
      Before the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages
      
      Number of successful THP allocations corresponds to free memory on node 0 in
      the first case and node 1 in the second case, i.e. -p parameter is ignored and
      cpu binding "wins".
      
      After the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -p1 -C0 ./transhuge-stress
      transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages
      
      The -p parameter is respected regardless of cpu binding.
      
      > numactl -C0 ./transhuge-stress
      transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -C12 ./transhuge-stress
      transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages
      
      Without -p parameter, hugepage restriction to CPU-local node works as before.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0867a57c
  13. 15 May, 2015 1 commit
  14. 14 Apr, 2015 2 commits
    • David Rientjes's avatar
      mm, thp: really limit transparent hugepage allocation to local node · 5265047a
      David Rientjes authored
      Commit 077fcf11 ("mm/thp: allocate transparent hugepages on local
      node") restructured alloc_hugepage_vma() with the intent of only
      allocating transparent hugepages locally when there was not an effective
      interleave mempolicy.
      
      alloc_pages_exact_node() does not limit the allocation to the single node,
      however, but rather prefers it.  This is because __GFP_THISNODE is not set
      which would cause the node-local nodemask to be passed.  Without it, only
      a nodemask that prefers the local node is passed.
      
      Fix this by passing __GFP_THISNODE and falling back to small pages when
      the allocation fails.
      
      Commit 9f1b868a ("mm: thp: khugepaged: add policy for finding target
      node") suffers from a similar problem for khugepaged, which is also fixed.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Fixes: 9f1b868a ("mm: thp: khugepaged: add policy for finding target node")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5265047a
    • David Rientjes's avatar
      mm, mempolicy: migrate_to_node should only migrate to node · b360edb4
      David Rientjes authored
      migrate_to_node() is intended to migrate a page from one source node to
      a target node.
      
      Today, migrate_to_node() could end up migrating to any node, not only
      the target node.  This is because the page migration allocator,
      new_node_page() does not pass __GFP_THISNODE to
      alloc_pages_exact_node().  This causes the target node to be preferred
      but allows fallback to any other node in order of affinity.
      
      Prevent this by allocating with __GFP_THISNODE.  If memory is not
      available, -ENOMEM will be returned as appropriate.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b360edb4
  15. 14 Feb, 2015 1 commit
  16. 13 Feb, 2015 1 commit
  17. 12 Feb, 2015 4 commits
    • Naoya Horiguchi's avatar
      mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) · 48684a65
      Naoya Horiguchi authored
      walk_page_range() silently skips vma having VM_PFNMAP set, which leads to
      undesirable behaviour at client end (who called walk_page_range).  For
      example for pagemap_read(), when no callbacks are called against VM_PFNMAP
      vma, pagemap_read() may prepare pagemap data for next virtual address
      range at wrong index.  That could confuse and/or break userspace
      applications.
      
      This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows:
      - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
        over vma(VM_PFNMAP),
      - for clear_refs and queue_pages which have their own ->tests_walk,
        just return 1 and skip vma(VM_PFNMAP). This is no problem because
        these are not interested in hole regions,
      - for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarShiraz Hashim <shashim@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48684a65
    • Naoya Horiguchi's avatar
      mempolicy: apply page table walker on queue_pages_range() · 6f4576e3
      Naoya Horiguchi authored
      queue_pages_range() does page table walking in its own way now, but there
      is some code duplicate.  This patch applies page table walker to reduce
      lines of code.
      
      queue_pages_range() has to do some precheck to determine whether we really
      walk over the vma or just skip it.  Now we have test_walk() callback in
      mm_walk for this purpose, so we can do this replacement cleanly.
      queue_pages_test_walk() depends on not only the current vma but also the
      previous one, so queue_pages->prev is introduced to remember it.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6f4576e3
    • Vlastimil Babka's avatar
      mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma · be97a41b
      Vlastimil Babka authored
      The previous commit ("mm/thp: Allocate transparent hugepages on local
      node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
      special policy for THP allocations.  The function has the same interface
      as alloc_pages_vma(), shares a lot of boilerplate code and a long
      comment.
      
      This patch merges the hugepage special case into alloc_pages_vma.  The
      extra if condition should be cheap enough price to pay.  We also prevent
      a (however unlikely) race with parallel mems_allowed update, which could
      make hugepage allocation restart only within the fallback call to
      alloc_hugepage_vma() and not reconsider the special rule in
      alloc_hugepage_vma().
      
      Also by making sure mpol_cond_put(pol) is always called before actual
      allocation attempt, we can use a single exit path within the function.
      
      Also update the comment for missing node parameter and obsolete reference
      to mm_sem.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be97a41b
    • Aneesh Kumar K.V's avatar
      mm/thp: allocate transparent hugepages on local node · 077fcf11
      Aneesh Kumar K.V authored
      This make sure that we try to allocate hugepages from local node if
      allowed by mempolicy.  If we can't, we fallback to small page allocation
      based on mempolicy.  This is based on the observation that allocating
      pages on local node is more beneficial than allocating hugepages on remote
      node.
      
      With this patch applied we may find transparent huge page allocation
      failures if the current node doesn't have enough freee hugepages.  Before
      this patch such failures result in us retrying the allocation on other
      nodes in the numa node mask.
      
      [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      077fcf11
  18. 19 Dec, 2014 1 commit
  19. 17 Dec, 2014 1 commit
  20. 10 Oct, 2014 8 commits
  21. 25 Jun, 2014 1 commit
    • Gu Zheng's avatar
      cpuset,mempolicy: fix sleeping function called from invalid context · 391acf97
      Gu Zheng authored
      When runing with the kernel(3.15-rc7+), the follow bug occurs:
      [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
      [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
      [ 9969.441175] INFO: lockdep is turned off.
      [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
      [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
      [ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
      [ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
      [ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
      [ 9969.974071] Call Trace:
      [ 9970.003403]  [<ffffffff8162f523>] dump_stack+0x4d/0x66
      [ 9970.065074]  [<ffffffff8109995a>] __might_sleep+0xfa/0x130
      [ 9970.130743]  [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
      [ 9970.200638]  [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
      [ 9970.272610]  [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
      [ 9970.344584]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.409282]  [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
      [ 9970.471897]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.536585]  [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
      [ 9970.613763]  [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
      [ 9970.683660]  [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
      [ 9970.759795]  [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
      [ 9970.834885]  [<ffffffff8106a598>] do_fork+0xd8/0x380
      [ 9970.894375]  [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
      [ 9970.969470]  [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
      [ 9971.030011]  [<ffffffff81642009>] stub_clone+0x69/0x90
      [ 9971.091573]  [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
      
      The cause is that cpuset_mems_allowed() try to take
      mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
      __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
      under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
      protection region to protect the access to cpuset only in
      current_cpuset_is_being_rebound(). So that we can avoid this bug.
      
      This patch is a temporary solution that just addresses the bug
      mentioned above, can not fix the long-standing issue about cpuset.mems
      rebinding on fork():
      
      "When the forker's task_struct is duplicated (which includes
       ->mems_allowed) and it races with an update to cpuset_being_rebound
       in update_tasks_nodemask() then the task's mems_allowed doesn't get
       updated. And the child task's mems_allowed can be wrong if the
       cpuset's nodemask changes before the child has been added to the
       cgroup's tasklist."
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      391acf97
  22. 23 Jun, 2014 1 commit
  23. 06 Jun, 2014 1 commit