  1. Apr 15, 2022
    • mm, page_alloc: fix build_zonerefs_node() · e553f62f
      Juergen Gross authored
      Since commit 6aa303de ("mm, vmscan: only allocate and reclaim from
      zones with pages managed by the buddy allocator") only zones with free
      memory are included in a built zonelist.  This is problematic when e.g.
      all memory of a zone has been ballooned out when zonelists are being
      rebuilt.
      
      The decision whether to rebuild the zonelists when onlining new memory
      is based on populated_zone() returning 0 for the zone the memory will
      be added to.  The new zone is added to the zonelists only if it has
      free memory pages (managed_zone() returns a non-zero value) after the
      memory has been onlined.  This implies that onlining memory always
      frees the added pages to the allocator immediately, but this is not
      true in all cases: when e.g. running as a Xen guest, the onlined new
      memory is added only to the ballooned memory list and is freed only
      when the guest is ballooned up afterwards.
      
      Another problem with using managed_zone() to decide whether a zone is
      added to the zonelists is that a zone whose memory is all in use will
      in fact be removed from all zonelists if the zonelists happen to be
      rebuilt.
      
      Use populated_zone() when building a zonelist, as was done before that
      commit.
      
      There was a report that QubesOS (based on Xen) is hitting this problem.
      Xen switched to using the zone device functionality in kernel 5.9, and
      QubesOS wants to use memory hotplug for guests in order to be able to
      start a guest with minimal memory and expand it as needed.  This was
      the report leading to the patch.
      
      Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
      
      
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e553f62f
  2. Mar 22, 2022
    • mm: make free_area_init_node aware of memory less nodes · 7c30daac
      Michal Hocko authored
      free_area_init_node is also called from the memoryless node
      initialization path (free_area_init_memoryless_node).  It doesn't
      really make much sense to display the physical memory range for those
      nodes: Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
      
      Instead be explicit that the node is memoryless: Initmem setup node XX as
      memoryless
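      
      A sketch of the resulting message selection (illustrative; the real
      hunk in free_area_init_node() may differ in detail):
      
      	if (start_pfn != end_pfn) {
      		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
      			(u64)start_pfn << PAGE_SHIFT,
      			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
      	} else {
      		pr_info("Initmem setup node %d as memoryless\n", nid);
      	}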
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c30daac
    • mm, memory_hotplug: reorganize new pgdat initialization · 70b5b46a
      Michal Hocko authored
      When a !node_online node is brought up, it needs hotplug-specific
      initialization because the node could either be not yet initialized or
      have been recycled after a previous hotremove.  hotadd_init_pgdat is
      responsible for that.
      
      Internal pgdat state is initialized at two places currently
      	- hotadd_init_pgdat
      	- free_area_init_core_hotplug
      
      There is no clear cut for what should go where, but this patch chooses
      to move the whole internal state initialization into
      free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible
      for pulling all the parts together - most notably for initializing
      zonelists, because those depend on the overall topology.
      
      This patch doesn't introduce any functional change.
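      
      A rough sketch of the resulting split, using the names from this
      changelog (the real functions carry more detail, e.g. per-node lock and
      list initialization):
      
      	/* hotplug path for a !node_online node */
      	static pg_data_t *hotadd_init_pgdat(int nid)
      	{
      		pg_data_t *pgdat = NODE_DATA(nid);
      
      		/* the whole internal pgdat state lives in one place now */
      		free_area_init_core_hotplug(pgdat);
      
      		/* pull the parts together: zonelists depend on the topology */
      		build_all_zonelists(pgdat);
      
      		return pgdat;
      	}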
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70b5b46a
    • mm: handle uninitialized numa nodes gracefully · 09f49dca
      Michal Hocko authored
      We have had several reports [1][2][3] that page allocator blows up when an
      allocation from a possible node is requested.  The underlying reason is
      that NODE_DATA for the specific node is not allocated.
      
      NUMA-specific initialization is arch-specific and can vary a lot.
      E.g. x86 tries to initialize all nodes that have some CPU affinity
      (see init_cpu_to_node), but this can be insufficient because a node
      might be CPU-less.
      
      One way to address this problem would be to check for !node_online nodes
      when trying to get a zonelist and silently fall back to another node.
      That is unfortunately adding a branch into allocator hot path and it
      doesn't handle any other potential NODE_DATA users.
      
      This patch takes a different approach (following a lead of [3]) and
      preallocates pgdat for all possible nodes in arch-independent code -
      free_area_init.  All uninitialized nodes are treated as memoryless
      nodes.  node_state of the node is not changed, because that would lead
      to other side effects - e.g. a sysfs representation of such a node -
      and from past discussions [4] it is known that some tools might have
      problems digesting that.
      
      Newly allocated pgdat only gets a minimal initialization and the rest of
      the work is expected to be done by the memory hotplug - hotadd_new_pgdat
      (renamed to hotadd_init_pgdat).
      
      generic_alloc_nodedata is changed to use the memblock allocator because
      neither page nor slab allocators are available at the stage when all
      pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
      use the early boot allocator.  The only arch specific implementation is
      ia64 and that is changed to use the early allocator as well.
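      
      A condensed sketch of the idea in free_area_init(); the exact flow and
      helper names in the real patch may differ, but the allocation has to go
      through memblock because it runs before the page and slab allocators
      are up:
      
      	for_each_node(nid) {
      		pg_data_t *pgdat;
      
      		if (!node_online(nid)) {
      			/* too early for kmalloc/alloc_pages: use memblock */
      			pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
      			if (!pgdat)
      				panic("Cannot allocate %zuB for node %d.\n",
      				      sizeof(*pgdat), nid);
      			arch_refresh_nodedata(nid, pgdat);
      			free_area_init_memoryless_node(nid);
      			/* node_states deliberately left untouched */
      			continue;
      		}
      
      		pgdat = NODE_DATA(nid);
      		free_area_init_node(nid);
      	}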
      
      [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
      [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
      [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
      [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
      
      [akpm@linux-foundation.org: replace comment, per Mike]
      
      Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
      
      
      Reported-by: Alexey Makhalov <amakhalov@vmware.com>
      Tested-by: Alexey Makhalov <amakhalov@vmware.com>
      Reported-by: Nico Pache <npache@redhat.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Tested-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f49dca
    • NUMA balancing: optimize page placement for memory tiering system · c574bbe9
      Huang Ying authored
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called a memory tiering
      system, because the performance of the different types of memory
      usually differs.
      
      In such a system, because memory access patterns change over time,
      some pages in the slow memory may become globally hot.  So in this
      patch, the NUMA balancing mechanism is enhanced to dynamically
      optimize page placement among the different memory types according to
      hot/cold status.
      
      In a typical memory tiering system, there are CPUs, fast memory and slow
      memory in each physical NUMA node.  The CPUs and the fast memory will be
      put in one logical node (called fast memory node), while the slow memory
      will be put in another (faked) logical node (called slow memory node).
      That is, the fast memory is regarded as local while the slow memory is
      regarded as remote.  So it's possible for the recently accessed pages in
      the slow memory node to be promoted to the fast memory node via the
      existing NUMA balancing mechanism.
      
      The original NUMA balancing mechanism stops migrating pages when the
      free memory of the target node falls below the high watermark.  This
      is a reasonable policy if there's only one memory type, but it makes
      the original NUMA balancing mechanism almost useless for optimizing
      page placement among different memory types.  Details are as follows.
      
      In the common case, the working-set size of the workload is larger
      than the size of the fast memory nodes; otherwise it would be
      unnecessary to use the slow memory at all.  So there are almost never
      enough free pages in the fast memory nodes, which means the globally
      hot pages in the slow memory node cannot be promoted to the fast
      memory node.  To solve the issue, we have 2 choices as follows,
      
      a. Ignore the free pages watermark checking when promoting hot pages
         from the slow memory node to the fast memory node.  This will
         create some memory pressure in the fast memory node, thus trigger
         the memory reclaiming.  So that, the cold pages in the fast memory
         node will be demoted to the slow memory node.
      
      b. Define a new watermark called wmark_promo which is higher than
         wmark_high, and have kswapd reclaim pages until free pages reach
         that watermark.  The scenario is as follows: when we want to promote
         hot pages from slow memory to fast memory, but the fast memory's free
         pages would go below the high watermark with such a promotion, we wake
         up kswapd with the wmark_promo watermark in order to demote cold pages
         and free up some space.  So, next time we want to promote hot pages we
         might have a chance of doing so.
      
      Choice "a" may create high memory pressure in the fast memory node.
      If the memory pressure of the workload is high, the pressure may
      become so high that the memory allocation latency of the workload is
      affected, e.g. direct reclaim may be triggered.
      
      Choice "b" works much better in this respect.  If the memory pressure
      of the workload is high, hot page promotion stops earlier because its
      allocation watermark is higher than that of normal memory allocation.
      So in this patch, choice "b" is implemented: a new zone watermark
      (WMARK_PROMO) is added, which is larger than the high watermark and
      can be controlled via watermark_scale_factor.
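      
      A sketch of the new watermark and of how a promotion target can be
      checked against it; the enum placement and the use of
      sysctl_numa_balancing_mode follow this changelog, while the surrounding
      code is abbreviated and illustrative:
      
      	enum zone_watermarks {
      		WMARK_MIN,
      		WMARK_LOW,
      		WMARK_HIGH,
      		WMARK_PROMO,	/* promotion backs off below this level */
      		NR_WMARK
      	};
      
      	/* in the promotion path */
      	unsigned long wmark;
      
      	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
      		wmark = wmark_pages(zone, WMARK_PROMO);
      	else
      		wmark = high_wmark_pages(zone);
      
      	if (!zone_watermark_ok(zone, 0, wmark, ZONE_MOVABLE, 0))
      		return false;	/* not enough headroom; wake kswapd instead */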
      
      In addition to the original page placement optimization among sockets,
      the NUMA balancing mechanism is extended to optimize page placement
      according to hot/cold status among different memory types.  The sysctl
      user space interface (numa_balancing) is extended in a backward
      compatible way as follows, so that users can enable/disable each
      functionality individually.
      
      The sysctl is converted from a Boolean value to a bit field.  The
      flags are defined as follows (a usage sketch follows the list):
      
      - 0: NUMA_BALANCING_DISABLED
      - 1: NUMA_BALANCING_NORMAL
      - 2: NUMA_BALANCING_MEMORY_TIERING
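      
      A sketch of the flag definitions and how they can be consumed; the
      macro names come from the list above, while treating
      sysctl_numa_balancing_mode as the backing variable is an assumption:
      
      	#define NUMA_BALANCING_DISABLED		0x0
      	#define NUMA_BALANCING_NORMAL		0x1
      	#define NUMA_BALANCING_MEMORY_TIERING	0x2
      
      	/* e.g. "sysctl kernel.numa_balancing=3" enables both modes */
      	if (sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) {
      		/* classic cross-socket NUMA balancing */
      	}
      	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
      		/* hot-page promotion between memory tiers */
      	}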
      
      We have tested the patch with the pmbench memory accessing benchmark
      with an 80:20 read/write ratio and a Gauss access address distribution
      on a 2-socket Intel server with Optane DC Persistent Memory.  The test
      results show that the pmbench score can improve by up to 95.9%.
      
      Thanks to Andrew Morton for helping fix the documentation format
      error.
      
      Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c574bbe9
    • mm/hwpoison-inject: support injecting hwpoison to free page · a581865e
      Miaohe Lin authored
      memory_failure() can handle free buddy pages.  Support injecting
      hwpoison into a free page by adding an is_free_buddy_page() check when
      the hwpoison filter is disabled.
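      
      One plausible shape of the check in hwpoison_inject(); the changelog
      does not spell out the exact hunk, so the placement relative to the
      hwpoison filter logic is an assumption:
      
      	/* free buddy pages can now be injected; memory_failure() handles them */
      	if (!PageLRU(hpage) && !PageHuge(p) && !is_free_buddy_page(p))
      		return 0;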
      
      [akpm@linux-foundation.org: export is_free_buddy_page() to modules]
      
      Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a581865e
    • mm/page_alloc: check high-order pages for corruption during PCP operations · 77fe7f13
      Mel Gorman authored
      Eric Dumazet pointed out that commit 44042b44 ("mm/page_alloc: allow
      high-order pages to be stored on the per-cpu lists") only checks the
      head page during PCP refill and allocation operations.  This was an
      oversight and all pages should be checked.  This will incur a small
      performance penalty but it's necessary for correctness.
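      
      The PCP refill and allocation paths now check every page of a
      high-order block, along the lines of the following sketch (the helper
      name and loop shape are illustrative; check_new_page() is the existing
      per-page check in mm/page_alloc.c):
      
      	static bool check_new_pcp_pages(struct page *page, unsigned int order)
      	{
      		int i;
      
      		for (i = 0; i < (1 << order); i++) {
      			if (check_new_page(page + i))	/* was: head page only */
      				return true;		/* corruption found */
      		}
      		return false;
      	}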
      
      Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net
      
      
      Fixes: 44042b44 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77fe7f13
    • mm/page_alloc: call check_new_pages() while zone spinlock is not held · 3313204c
      Eric Dumazet authored
      For high-order pages not using the pcp, rmqueue() currently calls the
      costly check_new_pages() while the zone spinlock is held and hard irqs
      are masked.
      
      This is not needed; we can release the spinlock sooner to reduce zone
      spinlock contention.
      
      Note that after this patch, we call __mod_zone_freepage_state() before
      deciding to leak the page because it is in bad state.
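      
      A condensed sketch of the reordering in rmqueue() for the non-PCP path
      (control flow simplified; the locking and accounting calls are the
      existing ones):
      
      	do {
      		spin_lock_irqsave(&zone->lock, flags);
      		page = __rmqueue(zone, order, migratetype, alloc_flags);
      		if (page)
      			__mod_zone_freepage_state(zone, -(1 << order),
      						  get_pcppage_migratetype(page));
      		spin_unlock_irqrestore(&zone->lock, flags);
      		/* the costly check now runs with the lock released */
      	} while (page && check_new_pages(page, order));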
      
      Link: https://lkml.kernel.org/r/20220304170215.1868106-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3313204c
    • mm: count time in drain_all_pages during direct reclaim as memory pressure · fa7fc75f
      Suren Baghdasaryan authored
      When page allocation in direct reclaim path fails, the system will make
      one attempt to shrink per-cpu page lists and free pages from high alloc
      reserves.  Draining per-cpu pages into buddy allocator can be a very
      slow operation because it's done using workqueues and the task in direct
      reclaim waits for all of them to finish before proceeding.  Currently
      this time is not accounted as psi memory stall.
      
      While testing mobile devices under extreme memory pressure, when
      allocations are failing during direct reclaim, we noticed that psi
      events which would be expected in such conditions were not triggered.
      After profiling these cases it was determined that the reason for
      missing psi events was that a big chunk of the time spent in direct
      reclaim is not accounted as memory stall, therefore psi would not
      reach the levels at which an event is generated.  Further
      investigation revealed that the bulk of that unaccounted time was
      spent inside the drain_all_pages call.
      
      A typical captured case when drain_all_pages path gets activated:
      
      __alloc_pages_slowpath  took 44.644.613ns
          __perform_reclaim   took    751.668ns (1.7%)
          drain_all_pages     took 43.887.167ns (98.3%)
      
      PSI in this case records the time spent in __perform_reclaim but ignores
      drain_all_pages, IOW it misses 98.3% of the time spent in
      __alloc_pages_slowpath.
      
      Annotate __alloc_pages_direct_reclaim in its entirety so that delays
      from handling page allocation failure in the direct reclaim path are
      accounted as memory stall.
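      
      A sketch of the annotation; psi_memstall_enter()/psi_memstall_leave()
      are the existing psi API, and the function body is abbreviated:
      
      	static inline struct page *
      	__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
      				     unsigned int alloc_flags,
      				     const struct alloc_context *ac,
      				     unsigned long *did_some_progress)
      	{
      		struct page *page = NULL;
      		unsigned long pflags;
      		bool drained = false;
      
      		psi_memstall_enter(&pflags);	/* now covers the whole path */
      		*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
      		if (unlikely(!(*did_some_progress)))
      			goto out;
      retry:
      		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
      		if (!page && !drained) {
      			drain_all_pages(NULL);	/* slow, now accounted as a stall */
      			drained = true;
      			goto retry;
      		}
      out:
      		psi_memstall_leave(&pflags);
      		return page;
      	}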
      
      Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
      
      
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: Tim Murray <timmurray@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa7fc75f
    • arch/x86/mm/numa: Do not initialize nodes twice · 1ca75fa7
      Oscar Salvador authored
      On x86, prior to ("mm: handle uninitialized numa nodes gracefully"),
      NUMA nodes could be allocated at three different places.
      
       - numa_register_memblks
       - init_cpu_to_node
       - init_gi_nodes
      
      All these calls happen at setup_arch, and have the following order:
      
      setup_arch
        ...
        x86_numa_init
         numa_init
          numa_register_memblks
        ...
        init_cpu_to_node
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
        init_gi_nodes
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
      
      numa_register_memblks() is only interested in those nodes which have
      memory, so it skips over any memoryless node it finds.  Later on, when
      we have read ACPI's SRAT table, we call init_cpu_to_node() and
      init_gi_nodes(), which initialize any memoryless node we might have
      that has either CPU or Initiator affinity, meaning we allocate a
      pg_data_t struct for them and mark them as ONLINE.
      
      So far so good, but the thing is that after ("mm: handle uninitialized
      numa nodes gracefully"), we allocate all possible NUMA nodes in
      free_area_init(), meaning we have a picture like the following:
      
      setup_arch
        x86_numa_init
         numa_init
          numa_register_memblks  <-- allocate non-memoryless node
        x86_init.paging.pagetable_init
         ...
          free_area_init
           free_area_init_memoryless <-- allocate memoryless node
        init_cpu_to_node
         alloc_node_data             <-- allocate memoryless node with CPU
         free_area_init_memoryless_node
        init_gi_nodes
         alloc_node_data             <-- allocate memoryless node with Initiator
         free_area_init_memoryless_node
      
      free_area_init() already allocates all possible NUMA nodes, but
      init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
      go ahead and allocate a new pg_data_t struct without checking anything,
      meaning we end up allocating twice.
      
      It should be made clear that this only happens in the case where a
      memoryless NUMA node happens to have a CPU/Initiator affinity.
      
      So get rid of init_memory_less_node() and just set the node online.
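      
      A sketch of the resulting shape in init_cpu_to_node()/init_gi_nodes();
      the pgdat itself has already been preallocated by free_area_init():
      
      	/* before: init_memory_less_node(nid) allocated a second pg_data_t */
      	if (!node_online(nid))
      		node_set_online(nid);	/* just mark it online */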
      
      Note that setting the node online is needed, otherwise we choke down the
      chain when bringup_nonboot_cpus() ends up calling
      __try_online_node()->register_one_node()->...  and we blow up in
      bus_add_device().  As can be seen here:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000060
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
        RIP: 0010:bus_add_device+0x5a/0x140
        Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
        RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
        RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
        RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
        R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
        R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
        Call Trace:
         device_add+0x4c0/0x910
         __register_one_node+0x97/0x2d0
         __try_online_node+0x85/0xc0
         try_online_node+0x25/0x40
         cpu_up+0x4f/0x100
         bringup_nonboot_cpus+0x4f/0x60
         smp_init+0x26/0x79
         kernel_init_freeable+0x130/0x2f1
         kernel_init+0x17/0x150
         ret_from_fork+0x22/0x30
      
      The reason is simple: by the time bringup_nonboot_cpus() gets called,
      we have not registered the node_subsys bus yet, so we crash when
      bus_add_device() tries to dereference bus()->p.
      
      The following shows the order of the calls:
      
      kernel_init_freeable
       smp_init
        bringup_nonboot_cpus
         ...
           bus_add_device()      <- we did not register node_subsys yet
       do_basic_setup
        do_initcalls
         postcore_initcall(register_node_type);
          register_node_type
           subsys_system_register
            subsys_register
             bus_register         <- register node_subsys bus
      
      Why does setting the node online save us then?  Simply because
      __try_online_node() backs off when the node is online, meaning we do
      not end up calling register_one_node() in the first place.
      
      This is subtle and broken and deserves deep analysis and thought about
      how to put it into shape, but for now let us have this easy fix for
      the memory leak issue.
      
      [osalvador@suse.de: add comments]
        Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
      
      
      Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ca75fa7
    • mm/page_alloc: do not prefetch buddies during bulk free · 2a791f44
      Mel Gorman authored
      free_pcppages_bulk() has taken two passes through the pcp lists since
      commit 0a5f4e5b ("mm/free_pcppages_bulk: do not hold lock when
      picking pages to free") due to deferring the cost of selecting PCP lists
      until the zone lock is held.
      
      As the list processing now takes place under the zone lock, it's less
      clear that this is always a benefit, for two reasons.
      
      1. There is a guaranteed cost to calculating the buddy which definitely
         has to be calculated again. However, as the zone lock is held and
         there is no deferring of buddy merging, there is no guarantee that the
         prefetch will have completed when the second buddy calculation takes
         place and buddies are being merged.  With or without the prefetch, there
         may be further stalls depending on how many pages get merged. In other
         words, a stall due to merging is inevitable and at best only one stall
         might be avoided at the cost of calculating the buddy location twice.
      
      2. As the zone lock is held, prefetch_nr makes less sense as once
         prefetch_nr expires, the cache lines of interest have already been
         merged.
      
      The main concern is that there is a definite cost to calculating the
      buddy location early for the prefetch and it is a "maybe win" depending
      on whether the CPU prefetch logic and memory are fast enough.  Remove
      the prefetch logic on the basis that reduced instructions in a path is
      always a saving, whereas the prefetch might save one memory stall
      depending on the CPU and memory.
      
      In most cases, this has marginal benefit as the calculations are a small
      part of the overall freeing of pages.  However, it was detectable on at
      least one machine.
      
                                    5.17.0-rc3             5.17.0-rc3
                          mm-highpcplimit-v2r1     mm-noprefetch-v1r1
      Min       elapsed      630.00 (   0.00%)      610.00 (   3.17%)
      Amean     elapsed      639.00 (   0.00%)      623.00 *   2.50%*
      Max       elapsed      660.00 (   0.00%)      660.00 (   0.00%)
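      
      For reference, the removed deferred prefetching was of roughly this
      shape (simplified; prefetch_buddy() was the old helper, but its exact
      argument list here is an assumption):
      
      	/* in free_pcppages_bulk(), while walking the PCP lists */
      	if (prefetch_nr) {
      		prefetch_buddy(page, order);
      		prefetch_nr--;
      	}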
      
      Link: https://lkml.kernel.org/r/20220221094119.15282-2-mgorman@techsingularity.net
      
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Aaron Lu <aaron.lu@intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a791f44
    • mm/page_alloc: limit number of high-order pages on PCP during bulk free · f26b3fa0
      Mel Gorman authored
      When a PCP is mostly used for frees then high-order pages can exist on
      PCP lists for some time.  This is problematic when the allocation
      pattern is all allocations from one CPU and all frees from another
      resulting in colder pages being used.  When bulk freeing pages, limit
      the number of high-order pages that are stored on the PCP lists.
      
      Netperf running on localhost exhibits this pattern and while it does
      not matter for some machines, it does matter for others with smaller
      caches where cache misses cause problems due to reduced page reuse.
      Pages freed directly to the buddy list may be reused quickly while
      still cache hot, whereas pages stored on the PCP lists may be cold by
      the time free_pcppages_bulk() is called.
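      
      The changelog does not spell out the mechanism, but the intent can be
      sketched as follows; the helper name and the cap are purely
      hypothetical:
      
      	/* while bulk freeing: keep at most a few high-order pages per list */
      	if (order > 0 && pcp_list_count(pcp, order) > PCP_HIGH_ORDER_KEEP)
      		nr_to_free += pcp_list_count(pcp, order) - PCP_HIGH_ORDER_KEEP;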
      
      Using perf kmem:mm_page_alloc, the 5 most used page frames were
      
      5.17-rc3
        13041 pfn=0x111a30
        13081 pfn=0x5814d0
        13097 pfn=0x108258
        13121 pfn=0x689598
        13128 pfn=0x5814d8
      
      5.17-revert-highpcp
       192009 pfn=0x54c140
       195426 pfn=0x1081d0
       200908 pfn=0x61c808
       243515 pfn=0xa9dc20
       402523 pfn=0x222bb8
      
      5.17-full-series
       142693 pfn=0x346208
       162227 pfn=0x13bf08
       166413 pfn=0x2711e0
       166950 pfn=0x2702f8
      
      The spread is wider as there is still time before pages freed to one PCP
      get released with a tradeoff between fast reuse and reduced zone lock
      acquisition.
      
      On the machine used to gather the traces, the headline performance was
      equivalent.
      
      netperf-tcp
                                  5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                     vanilla  mm-reverthighpcp-v1r1     mm-highpcplimit-v2
      Hmean     64         839.93 (   0.00%)      840.77 (   0.10%)      841.02 (   0.13%)
      Hmean     128       1614.22 (   0.00%)     1622.07 *   0.49%*     1636.41 *   1.37%*
      Hmean     256       2952.00 (   0.00%)     2953.19 (   0.04%)     2977.76 *   0.87%*
      Hmean     1024     10291.67 (   0.00%)    10239.17 (  -0.51%)    10434.41 *   1.39%*
      Hmean     2048     17335.08 (   0.00%)    17399.97 (   0.37%)    17134.81 *  -1.16%*
      Hmean     3312     22628.15 (   0.00%)    22471.97 (  -0.69%)    22422.78 (  -0.91%)
      Hmean     4096     25009.50 (   0.00%)    24752.83 *  -1.03%*    24740.41 (  -1.08%)
      Hmean     8192     32745.01 (   0.00%)    31682.63 *  -3.24%*    32153.50 *  -1.81%*
      Hmean     16384    39759.59 (   0.00%)    36805.78 *  -7.43%*    38948.13 *  -2.04%*
      
      On a 1-socket skylake machine with a small CPU cache that suffers more if
      cache misses are too high
      
      netperf-tcp
                                  5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
                                     vanilla    mm-reverthighpcp-v1     mm-highpcplimit-v2
      Hmean     64         938.95 (   0.00%)      941.50 *   0.27%*      943.61 *   0.50%*
      Hmean     128       1843.10 (   0.00%)     1857.58 *   0.79%*     1861.09 *   0.98%*
      Hmean     256       3573.07 (   0.00%)     3667.45 *   2.64%*     3674.91 *   2.85%*
      Hmean     1024     13206.52 (   0.00%)    13487.80 *   2.13%*    13393.21 *   1.41%*
      Hmean     2048     22870.23 (   0.00%)    23337.96 *   2.05%*    23188.41 *   1.39%*
      Hmean     3312     31001.99 (   0.00%)    32206.50 *   3.89%*    31863.62 *   2.78%*
      Hmean     4096     35364.59 (   0.00%)    36490.96 *   3.19%*    36112.54 *   2.11%*
      Hmean     8192     48497.71 (   0.00%)    49954.05 *   3.00%*    49588.26 *   2.25%*
      Hmean     16384    58410.86 (   0.00%)    60839.80 *   4.16%*    62282.96 *   6.63%*
      
      Note that this was a machine that did not benefit from caching high-order
      pages and performance is almost restored with the series applied.  It's
      not fully restored as cache misses are still higher.  This is a trade-off
      between optimising for a workload that does all allocs on one CPU and
      frees on another or more general workloads that need high-order pages for
      SLUB and benefit from avoiding zone->lock for every SLUB refill/drain.
      
      Link: https://lkml.kernel.org/r/20220217002227.5739-7-mgorman@techsingularity.net
      
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f26b3fa0