  4. Aug 20, 2021
    • hugetlb: don't pass page cache pages to restore_reserve_on_error · c7b1850d
      Mike Kravetz authored
      syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1].
      This BUG triggers if the HPageRestoreReserve flag is set on a page in
      the page cache.  It should never be set, as the routine
      huge_add_to_page_cache explicitly clears the flag after adding a page to
      the cache.
      
      The only code other than huge page allocation which sets the flag is
      restore_reserve_on_error.  It will potentially set the flag in rare out
      of memory conditions.  syzbot was injecting errors to cause memory
      allocation errors which exercised this specific path.
      
      The code in restore_reserve_on_error is doing the right thing.  However,
      there are instances where pages in the page cache were being passed to
      restore_reserve_on_error.  This is incorrect, as once a page goes into
      the cache reservation information will not be modified for the page
      until it is removed from the cache.  Error paths do not remove pages
      from the cache, so even in the case of error, the page will remain in
      the cache and no reservation adjustment is needed.
      
      Modify routines that potentially call restore_reserve_on_error with a
      page cache page to no longer do so.
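
      The calling rule being enforced can be sketched roughly as below.  This
      is an illustrative sketch only: 'page_in_cache' is a hypothetical flag
      standing in for the per-caller bookkeeping, while the helper names
      follow the hugetlb code of this era.

        /*
         * Error path after a hugetlb page allocation.  Only adjust
         * reservations for pages that never made it into the page cache;
         * once huge_add_to_page_cache() succeeds, the cache owns the
         * reservation bookkeeping until the page is removed again.
         */
        if (err) {
                if (new_page && !page_in_cache)
                        restore_reserve_on_error(h, vma, address, page);
                put_page(page);
                return err;
        }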
      
      Note on fixes tag: Prior to commit 846be085 ("mm/hugetlb: expand
      restore_reserve_on_error functionality") the routine would not process
      page cache pages because the HPageRestoreReserve flag is not set on such
      pages.  Therefore, this issue could not be triggered.  The code added
      by commit 846be085 ("mm/hugetlb: expand restore_reserve_on_error
      functionality") is needed and correct.  It exposed incorrect calls to
      restore_reserve_on_error which is the root cause addressed by this
      commit.
      
      [1] https://lore.kernel.org/linux-mm/00000000000050776d05c9b7c7f0@google.com/
      
      Link: https://lkml.kernel.org/r/20210818213304.37038-1-mike.kravetz@oracle.com
      
      
      Fixes: 846be085 ("mm/hugetlb: expand restore_reserve_on_error functionality")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: <syzbot+67654e51e54455f1c585@syzkaller.appspotmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Jul 01, 2021
    • mm/swapops: rework swap entry manipulation code · 4dd845b5
      Alistair Popple authored
      Both migration and device private pages use special swap entries that are
      manipulated by a range of inline functions.  The arguments to these are
      somewhat inconsistent so rework them to remove flag type arguments and to
      make the arguments similar for both read and write entry creation.
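
      As a rough illustration of the reworked interface (a sketch based on the
      helpers this series introduces; the "before" line shows the old
      flag-style call, and callers are simplified):

        /* Before: a flag argument selects read vs. write entries. */
        swp_entry_t entry = make_migration_entry(page, write);

        /* After: separate read/write helpers taking a pfn, symmetric for
         * migration and device private entries. */
        entry = make_writable_migration_entry(page_to_pfn(page));
        entry = make_readable_migration_entry(page_to_pfn(page));
        entry = make_writable_device_private_entry(page_to_pfn(page));
        entry = make_readable_device_private_entry(page_to_pfn(page));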
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
      
      
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: address ref count racing in prep_compound_gigantic_page · 7118fc29
      Mike Kravetz authored
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
      if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
      that the extra ref count will be dropped during an RCU grace period.  This
      is not a performance critical code path and the wait should be
      acceptable.  If the ref count is still inflated after the grace period,
      then undo any modifications made and return an error.
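
      The core of the change can be sketched as follows (simplified; the real
      loop also handles writeback state and unwinds already-frozen tail pages
      on failure):

        /* inside prep_compound_gigantic_page(), which now returns bool */
        for (i = 1; i < nr_pages; i++) {
                struct page *p = page + i;

                if (!page_ref_freeze(p, 1)) {           /* cmpxchg 1 -> 0 */
                        synchronize_rcu();              /* wait out a speculative ref */
                        if (!page_ref_freeze(p, 1))
                                goto out_error;         /* still inflated: undo and fail */
                }
                set_compound_head(p, page);
        }
        return true;

        out_error:
                /* restore the ref counts of already-frozen tail pages */
                return false;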
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      
      
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: remove prep_compound_huge_page cleanup · 48b8d744
      Mike Kravetz authored
      Patch series "Fix prep_compound_gigantic_page ref count adjustment".
      
      These patches address the possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative as described
      by Jann Horn in [1].
      
      The first patch simply removes the unnecessary/obsolete helper routine
      prep_compound_huge_page to make the actual fix a little simpler.
      
      The second patch is the actual fix and has a detailed explanation in the
      commit message.
      
      This potential issue has existed for almost 10 years and I am unaware of
      anyone actually hitting the race.  I did not cc stable, but would be happy
      to squash the patches and send to stable if anyone thinks that is a good
      idea.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      This patch (of 2):
      
      I could not think of a reliable way to recreate the issue for testing.
      Rather, I 'simulated errors' to exercise all the error paths.
      
      The routine prep_compound_huge_page is a simple wrapper to call either
      prep_compound_gigantic_page or prep_compound_page.  However, it is only
      called from gather_bootmem_prealloc which only processes gigantic pages.
      Eliminate the routine and call prep_compound_gigantic_page directly.
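
      For reference, the removed wrapper looked roughly like this (sketch):

        static void __init prep_compound_huge_page(struct page *page,
                                                   unsigned int order)
        {
                if (unlikely(order > (MAX_ORDER - 1)))
                        prep_compound_gigantic_page(page, order);
                else
                        prep_compound_page(page, order);
        }

      Since gather_bootmem_prealloc() only deals with gigantic pages, its
      single call site can invoke prep_compound_gigantic_page() directly.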
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
      Mina Almasry authored
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller does the copy outside
      of the lock, we check the cache again, allocate a page that consumes the
      reservation, and copy over the contents.
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10
      	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
      Both tests succeed and produce no warnings.  After the
      test runs, the number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
      Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
      
      
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
      Christophe Leroy authored
      Patch series "Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      At the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
      Here the PMD level is 4M, it doesn't correspond to any supported
      page size.
      
      For now, implement use of 16k and 512k pages which is done
      at PTE level.
      
      Support of 8M pages will be implemented later, it requires use of
      hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
      
      This patch (of 5):
      
      At the time being, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
      vma is used to get the pages shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.
      
      In order to use this function without a vma, replace vma by shift and
      flags.  Also remove the unused parameters.
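
      The reworked interface then becomes (a sketch; the generic fallback is
      assumed to simply apply pte_mkhuge() when an architecture does not
      override it):

        pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                 vm_flags_t flags);

        /* Generic fallback for architectures without a special version. */
        static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                               vm_flags_t flags)
        {
                return pte_mkhuge(entry);
        }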
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
      
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate · 77490587
      Muchun Song authored
      All the infrastructure is ready, so we introduce a nr_free_vmemmap_pages
      field in the hstate to indicate how many vmemmap pages associated with a
      HugeTLB page can be freed to the buddy allocator, and initialize it in
      hugetlb_vmemmap_init().  This patch is the actual enablement of the
      feature.
      
      There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page
      structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled,
      so add a BUILD_BUG_ON to catch invalid usage of the tail struct pages.
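
      A sketch of the shape of the change (simplified; RESERVE_VMEMMAP_NR
      stands for RESERVE_VMEMMAP_SIZE >> PAGE_SHIFT, the vmemmap pages that
      must always be kept):

        struct hstate {
                /* ... */
                unsigned int nr_free_vmemmap_pages;
        };

        static void __init hugetlb_vmemmap_init(struct hstate *h)
        {
                unsigned int nr_pages = pages_per_huge_page(h);
                unsigned int vmemmap_pages;

                vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
                /* keep the reserved pages, the rest may go to the buddy allocator */
                if (vmemmap_pages > RESERVE_VMEMMAP_NR)
                        h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
                else
                        h->nr_free_vmemmap_pages = 0;
        }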
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      Muchun Song authored
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3). virtio-mem will handle migration errors gracefully.
          CMA might be able to fallback on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page freeing context.  So that those
      allocations are not disruptive (e.g. do not trigger the OOM killer),
      __GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
      a non-sleeping allocation would be too fragile and could fail too easily
      under memory pressure.  GFP_ATOMIC or other modes that access memory
      reserves are not used because we want to prevent consuming reserves under
      heavy hugetlb freeing.
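
      The allocation policy described above boils down to something like the
      following (sketch; the function name is illustrative and node/locking
      details are elided):

        /* Called on the HugeTLB freeing path, outside hugetlb_lock. */
        static struct page *alloc_one_vmemmap_page(int nid)
        {
                /*
                 * GFP_KERNEL so the allocation may sleep and reclaim, but
                 * __GFP_NORETRY so it gives up early instead of invoking the
                 * OOM killer; no GFP_ATOMIC, so memory reserves stay intact.
                 */
                return alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
        }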
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: defer freeing of HugeTLB pages · b65d4adb
      Muchun Song authored
      In the subsequent patch, we should allocate the vmemmap pages when freeing
      a HugeTLB page.  But update_and_free_page() can be called under any
      context, so we cannot use GFP_KERNEL to allocate vmemmap pages.  However,
      we can defer the actual freeing to a kworker so that we do not have to use
      GFP_ATOMIC to allocate the vmemmap pages.
      
      __update_and_free_page() is where the call to allocate vmemmap pages
      will be inserted.
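
      One way to picture the deferral (a sketch reusing the llist-plus-workqueue
      pattern hugetlb already uses for free_huge_page(); the real patch differs
      in detail):

        static LLIST_HEAD(hpage_freelist);

        static void free_hpage_workfn(struct work_struct *work)
        {
                struct llist_node *node = llist_del_all(&hpage_freelist);

                while (node) {
                        struct page *page = container_of((struct address_space **)node,
                                                         struct page, mapping);
                        node = node->next;
                        __update_and_free_page(page_hstate(page), page);
                }
        }
        static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

        static void update_and_free_page(struct hstate *h, struct page *page)
        {
                if (!in_task()) {       /* possibly atomic context */
                        if (llist_add((struct llist_node *)&page->mapping,
                                      &hpage_freelist))
                                schedule_work(&free_hpage_work);
                        return;
                }
                __update_and_free_page(h, page);
        }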
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: free the vmemmap pages associated with each HugeTLB page · f41f2ed4
      Muchun Song authored
      Every HugeTLB has more than one struct page structure.  We __know__ that
      we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
      store metadata associated with each HugeTLB.
      
      There are a lot of struct page structures associated with each HugeTLB
      page.  For tail pages, the value of compound_head is the same.  So we can
      reuse first page of tail page structures.  We map the virtual addresses of
      the remaining pages of tail page structures to the first tail page struct,
      and then free these page frames.  Therefore, we need to reserve two pages
      as vmemmap areas.
      
      When we allocate a HugeTLB page from the buddy, we can free some vmemmap
      pages associated with each HugeTLB page.  It is more appropriate to do it
      in the prep_new_huge_page().
      
      The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
      associated with a HugeTLB page can be freed, returns zero for now, which
      means the feature is disabled.  We will enable it once all the
      infrastructure is there.
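
      The hook point can be sketched like this (simplified):

        /*
         * How many vmemmap pages of a HugeTLB page can be returned to the
         * buddy allocator.  Returns 0 until the feature is wired up by a
         * later patch in the series.
         */
        static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
        {
                return 0;
        }

        static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
        {
                free_huge_page_vmemmap(h, page);   /* no-op while the above returns 0 */
                /* ... existing initialization ... */
        }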
      
      [willy@infradead.org: fix documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Chen Huang <chenhuang5@huawei.com>
      Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. Jun 25, 2021
    • mm, futex: fix shared futex pgoff on shmem huge page · fe19bd3d
      Hugh Dickins authored
      If more than one futex is placed on a shmem huge page, it can happen
      that waking the second wakes the first instead, and leaves the second
      waiting: the key's shared.pgoff is wrong.
      
      When 3.11 commit 13d60f4b ("futex: Take hugepages into account when
      generating futex_key") went in, the only shared huge pages came from hugetlbfs,
      and the code added to deal with its exceptional page->index was put into
      hugetlb source.  Then that was missed when 4.8 added shmem huge pages.
      
      page_to_pgoff() is what others use for this nowadays: except that, as
      currently written, it gives the right answer on hugetlbfs head, but
      nonsense on hugetlbfs tails.  Fix that by calling hugetlbfs-specific
      hugetlb_basepage_index() on PageHuge tails as well as on head.
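
      The intended result can be sketched as (simplified, not the exact hunk):

        static inline pgoff_t page_to_pgoff(struct page *page)
        {
                /* hugetlbfs: head *and* tail pages take the hugetlb index */
                if (unlikely(PageHuge(page)))
                        return hugetlb_basepage_index(page);

                return page_to_index(page);     /* handles THP tails */
        }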
      
      Yes, it's unconventional to declare hugetlb_basepage_index() there in
      pagemap.h, rather than in hugetlb.h; but I do not expect anything but
      page_to_pgoff() ever to need it.
      
      [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
      
      Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
      
      
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Reported-by: Neel Natu <neelnatu@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yi <wetpzy@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Jun 16, 2021
    • mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Mike Kravetz authored
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
      
      This sequence of operations is unlikely to happen; however, it was
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
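
      The resulting logic is roughly the following (a sketch;
      restore_reserve_count() is a placeholder for the pre-existing
      reserve-count handling, and the surplus-page corner cases are elided):

        void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
                                      unsigned long address, struct page *page)
        {
                if (HPageRestoreReserve(page)) {
                        /* reserve count was consumed: give it back as before */
                        restore_reserve_count(h, vma, address, page);
                } else {
                        /*
                         * Reserve count untouched, but alloc_huge_page() added a
                         * reserve map entry: remove it so a later allocation at
                         * this index does not decrement the count a second time.
                         */
                        vma_del_reservation(h, vma, address);
                }
        }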
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      
      
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: fix race with hugetlb page allocation · 25182f05
      Naoya Horiguchi authored
      When hugetlb page fault (under overcommitting situation) and
      memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
      race:
      
          CPU0:                           CPU1:
      
                                          gather_surplus_pages()
                                            page = alloc_surplus_huge_page()
          memory_failure_hugetlb()
            get_hwpoison_page(page)
              __get_hwpoison_page(page)
                get_page_unless_zero(page)
                                            zero = put_page_testzero(page)
                                            VM_BUG_ON_PAGE(!zero, page)
                                            enqueue_huge_page(h, page)
            put_page(page)
      
      __get_hwpoison_page() only checks the page refcount before taking an
      additional one for memory error handling, which is not enough because
      there's a time window where compound pages have non-zero refcount during
      hugetlb page initialization.
      
      So make __get_hwpoison_page() check page status a bit more for hugetlb
      pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
      under hugetlb_lock makes sure that the hugetlb page is not in a transient
      state.  It's notable that another new function, HWPoisonHandlable(), is
      helpful to prevent a race against other transient page states (like a
      generic compound page just before PageHuge becomes true).
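
      The hugetlb-side check can be sketched as (simplified; exact flags and
      locking follow the hugetlb code of this era):

        int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
        {
                int ret = 0;

                spin_lock_irq(&hugetlb_lock);
                if (PageHeadHuge(page)) {
                        *hugetlb = true;
                        /* only take a ref once the page is in a stable state */
                        if (HPageFreed(page) || HPageMigratable(page))
                                ret = get_page_unless_zero(page);
                }
                spin_unlock_irq(&hugetlb_lock);
                return ret;
        }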
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
      
      
      Fixes: ead07f6a ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. May 05, 2021
    • mm: honor PF_MEMALLOC_PIN for all movable pages · 8e3560d9
      Pavel Tatashin authored
      PF_MEMALLOC_PIN is only honored for CMA pages, extend this flag to work
      for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from
      gfp_mask when this flag is passed in the current context.
      
      Add is_pinnable_page() to return true if a page is pinnable.  A
      pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.
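
      Sketch of the two pieces described above (simplified; the gfp trimming
      sits in current_gfp_context()):

        /* True if the page may be long-term pinned. */
        static inline bool is_pinnable_page(struct page *page)
        {
                return zone_idx(page_zone(page)) != ZONE_MOVABLE &&
                       !is_migrate_cma_page(page);
        }

        /* Allocations made while PF_MEMALLOC_PIN is set must stay unmovable. */
        if (current->flags & PF_MEMALLOC_PIN)
                gfp_mask &= ~__GFP_MOVABLE;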
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN · 1a08ae36
      Pavel Tatashin authored
      PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
      return pages that might belong to CMA region.  This is currently used
      for long term gup to make sure that such pins are not going to be done
      on any CMA pages.
      
      When PF_MEMALLOC_NOCMA was introduced, we did not realize that it
      focuses on CMA pages too much and that there is a larger class of
      pages that need the same treatment.  The MOVABLE zone cannot contain any
      long term pins either, so it makes sense to reuse and redefine this flag
      for that use case as well.  Rename the flag to PF_MEMALLOC_PIN, which
      defines an allocation context which can only get pages suitable for
      long-term pins.
      
      Also rename: memalloc_nocma_save()/memalloc_nocma_restore to
      memalloc_pin_save()/memalloc_pin_restore() and make the new functions
      common.
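
      The renamed helpers end up looking like this (sketch, mirroring the
      existing memalloc_nocma_save()/memalloc_nocma_restore() pattern):

        static inline unsigned int memalloc_pin_save(void)
        {
                unsigned int flags = current->flags & PF_MEMALLOC_PIN;

                current->flags |= PF_MEMALLOC_PIN;
                return flags;
        }

        static inline void memalloc_pin_restore(unsigned int flags)
        {
                current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
        }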
      
      [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
        Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Axel Rasmussen authored
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
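
      The three modes become an explicit enum on the kernel side (sketch), and
      userspace resolves a minor fault with a UFFDIO_CONTINUE ioctl; in the
      snippet below, uffd, fault_addr and huge_page_size are assumed to come
      from the usual userfaultfd setup:

        enum mcopy_atomic_mode {
                MCOPY_ATOMIC_NORMAL,    /* UFFDIO_COPY: copy from user memory   */
                MCOPY_ATOMIC_ZEROPAGE,  /* UFFDIO_ZEROPAGE: map the zero page   */
                MCOPY_ATOMIC_CONTINUE,  /* UFFDIO_CONTINUE: reuse existing page */
        };

        struct uffdio_continue cont = {
                .range = { .start = fault_addr & ~(huge_page_size - 1),
                           .len   = huge_page_size },
                .mode  = 0,
        };
        if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno != EEXIST)
                err(1, "UFFDIO_CONTINUE");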
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
      
      
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>