  1. Oct 03, 2022
    • hugetlb: use new vma_lock for pmd sharing synchronization · 40549ba8
      Mike Kravetz authored
      The new hugetlb vma lock is used to address this race:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      
      The vma_lock is used as follows:
      - During fault processing. The lock is acquired in read mode before
        doing a page table lock and allocation (huge_pte_alloc).  The lock is
        held until code is finished with the page table entry (ptep).
      - The lock must be held in write mode whenever huge_pmd_unshare is
        called.
      
      Lock ordering issues come into play when unmapping a page from all
      vmas mapping the page.  The i_mmap_rwsem must be held to search for the
      vmas, and the vma lock must be held before calling unmap which will
      call huge_pmd_unshare.  This is done today in:
      - try_to_migrate_one and try_to_unmap_ for page migration and memory
        error handling.  In these routines we 'try' to obtain the vma lock and
        fail to unmap if unsuccessful.  Calling routines already deal with the
        failure of unmapping.
      - hugetlb_vmdelete_list for truncation and hole punch.  This routine
        also tries to acquire the vma lock.  If it fails, it skips the
        unmapping.  However, we cannot have file truncation or hole punch
        fail because of contention.  After hugetlb_vmdelete_list, truncation
        and hole punch call remove_inode_hugepages.  remove_inode_hugepages
        checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
        hugetlb_unmap_file_page is designed to drop locks and reacquire in the
        correct order to guarantee unmap success.
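
The locking rules above can be sketched in user space. This is only an illustration under stated assumptions: `vma_lock_sketch`, `fault_begin`, `fault_end` and `try_unshare` are hypothetical names, and a small C11 atomic counter stands in for the kernel's rw semaphore. It shows the read-mode hold across the fault path and the "try" write-mode acquisition used by the unmap paths.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the hugetlb vma lock (not the kernel type).
 * state > 0: number of readers, state == -1: one writer, 0: unlocked.
 */
struct vma_lock_sketch {
    atomic_int state;
};

static void vma_lock_init(struct vma_lock_sketch *l)
{
    atomic_init(&l->state, 0);
}

static bool vma_read_trylock(struct vma_lock_sketch *l)
{
    int s = atomic_load(&l->state);

    while (s >= 0) {
        /* On failure, s is reloaded with the current state. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;   /* a writer holds the lock */
}

/* Fault path: take the lock in read mode before looking up ptep. */
static void fault_begin(struct vma_lock_sketch *l)
{
    while (!vma_read_trylock(l))
        ;   /* the kernel would sleep here; the sketch just spins */
}

/* Fault path: drop the lock only once the code is done with ptep. */
static void fault_end(struct vma_lock_sketch *l)
{
    atomic_fetch_sub(&l->state, 1);
}

/*
 * Unmap path: only try for write mode.  On contention, report failure so
 * the caller can skip this vma.  *unshared counts the calls that would
 * have reached huge_pmd_unshare().
 */
static bool try_unshare(struct vma_lock_sketch *l, int *unshared)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&l->state, &expected, -1))
        return false;
    (*unshared)++;
    atomic_store(&l->state, 0);
    return true;
}
```

While a fault holds the lock in read mode, `try_unshare()` fails and leaves the page table alone, which is the property that closes the race in the diagram.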
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: add vma based lock for pmd sharing · 8d9bfb26
      Mike Kravetz authored
      Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
      synchronization use by vmas that could be involved in pmd sharing.  This
      data structure contains a rw semaphore that is the primary tool used for
      synchronization.
      
      This new structure is ref counted, so that it can exist when NOT attached
      to a vma.  This is only helpful in resolving lock ordering issues where
      code may need to obtain the vma_lock while there is no guarantee that the
      vma will not go away.  By obtaining a ref on the structure, it can be
      guaranteed that at least the rw semaphore will not go away.
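
A minimal single-threaded sketch of the reference-counting idea follows. The names (`lock_sketch`, `lock_get`, `lock_put`, `lock_detach`) are hypothetical, and a plain `int` and `bool` stand in for the kernel's reference count and rw semaphore. The point is only that a holder of an extra reference can still touch the lock after the owner has detached from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the ref counted lock structure. */
struct lock_sketch {
    int refs;      /* a real implementation would use an atomic refcount */
    bool locked;   /* stands in for the rw semaphore */
    void *owner;   /* back-pointer to the vma, cleared on detach */
};

static struct lock_sketch *lock_alloc(void *owner)
{
    struct lock_sketch *l = malloc(sizeof(*l));

    if (!l)
        return NULL;
    l->refs = 1;          /* reference held by the owning vma */
    l->locked = false;
    l->owner = owner;
    return l;
}

static void lock_get(struct lock_sketch *l)
{
    l->refs++;
}

/* Drop one reference; returns true if the structure was freed. */
static bool lock_put(struct lock_sketch *l)
{
    if (--l->refs == 0) {
        free(l);
        return true;
    }
    return false;
}

/* The owner goes away: clear the back-pointer and drop its reference. */
static bool lock_detach(struct lock_sketch *l)
{
    l->owner = NULL;
    return lock_put(l);
}
```

A caller that took its own reference before dropping other locks can later find `owner == NULL`, back off, and release the structure with `lock_put()`.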
      
      Only add infrastructure for the new lock here.  Actual use will be added
      in subsequent patches.
      
      [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
        Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization · 3a47c54f
      Mike Kravetz authored
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  However, this has been shown to cause
      performance/scaling issues.  Revert the code and go back to only taking
      the semaphore in huge_pmd_share during the fault path.
      
      Keep the code that takes i_mmap_rwsem in write mode before calling
      try_to_unmap as this is required if huge_pmd_unshare is called.
      
      NOTE: Reverting this code does expose the following race condition.
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      It is unknown if the above race was ever experienced by a user.  It was
      discovered via code inspection when initially addressed.
      
      In subsequent patches, a new synchronization mechanism will be added to
      coordinate pmd sharing and eliminate this race.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • rmap: remove page_unlock_anon_vma_read() · 0c826c0b
      Matthew Wilcox (Oracle) authored
      This was simply an alias for anon_vma_unlock_read() since 2011.
      
      Link: https://lkml.kernel.org/r/20220902194653.1739778-56-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert page_get_anon_vma() to folio_get_anon_vma() · 29eea9b5
      Matthew Wilcox (Oracle) authored
      With all callers now passing in a folio, rename the function and convert
      all callers.  Removes a couple of calls to compound_head() and a reference
      to page->mapping.
      
      Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • rmap: convert page_move_anon_rmap() to use a folio · 595af4c9
      Matthew Wilcox (Oracle) authored
      Removes one call to compound_head() and a reference to page->mapping.
      
      Link: https://lkml.kernel.org/r/20220902194653.1739778-50-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Sep 27, 2022
    • mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Yu Zhao authored
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
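
As a rough illustration of the look-around step, the sketch below scans a bounded window of neighbouring entries, clears the accessed bit of each young one, and records the generation computed as (max_seq%MAX_NR_GENS)+1. The types and names (`pte_sketch`, `look_around`, `LOOK_AROUND_WINDOW`) are hypothetical, the value 4 for `MAX_NR_GENS` is an assumption made for the example, and the window simply starts at the beginning of the array rather than being aligned around the faulting address as real code would need.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NR_GENS 4U   /* assumed value, for illustration only */
/* BITS_PER_LONG - 1, the upper bound on entries scanned per call */
#define LOOK_AROUND_WINDOW (sizeof(long) * CHAR_BIT - 1)

/* Hypothetical stand-in for a PTE and the page it maps. */
struct pte_sketch {
    bool young;     /* accessed bit */
    unsigned gen;   /* generation counter of the mapped page */
};

static unsigned gen_from_seq(unsigned long max_seq)
{
    return (unsigned)(max_seq % MAX_NR_GENS) + 1;
}

/*
 * Scan at most LOOK_AROUND_WINDOW entries.  For each young entry, clear
 * the accessed bit and update the generation.  Returns how many young
 * entries were found.
 */
static size_t look_around(struct pte_sketch *ptes, size_t n,
                          unsigned long max_seq)
{
    size_t limit = n < LOOK_AROUND_WINDOW ? n : LOOK_AROUND_WINDOW;
    size_t found = 0;

    for (size_t i = 0; i < limit; i++) {
        if (!ptes[i].young)
            continue;
        ptes[i].young = false;
        ptes[i].gen = gen_from_seq(max_seq);
        found++;
    }
    return found;
}
```

Each young neighbour found this way is one page that no longer needs its own trip through the rmap.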
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      
      
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu authored
      When page migration happens, we always ignore the young/dirty bit settings
      in the old pgtable, mark the page as old in the new page table using
      either pte_mkold() or pmd_mkold(), and keep the pte clean.
      
      That's fine functionally, but it's not friendly to page reclaim because
      the page being moved can be actively accessed during the procedure.
      Not to mention that hardware setting the young bit can bring quite some
      overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
      set the bit.  The same slowdown applies to dirty bits when the memory is
      first written after page migration.
      
      Actually we can easily remember the A/D bit configuration and recover the
      information after the page is migrated.  To achieve it, define a new set
      of bits in the migration swap offset field to cache the A/D bits for old
      pte.  Then when removing/recovering the migration entry, we can recover
      the A/D bits even if the page changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
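
The bit-packing idea can be sketched as follows. Everything here is an assumption made for illustration: the 40-bit PFN width, the helper names, and the `offset_bits` parameter, which stands in for whatever width the real code derives from max_swapfile_size(). The sketch stores the young and dirty bits just above the PFN when the offset field is wide enough and silently drops them otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFN_BITS  40u   /* assumed PFN width, for illustration only */
#define PFN_MASK  ((UINT64_C(1) << PFN_BITS) - 1)
#define YOUNG_BIT (UINT64_C(1) << PFN_BITS)
#define DIRTY_BIT (UINT64_C(1) << (PFN_BITS + 1))

/* The feature is usable only if the offset holds the PFN plus two bits. */
static bool ad_bits_supported(unsigned offset_bits)
{
    return offset_bits >= PFN_BITS + 2;
}

/* Build a migration entry offset, caching A/D bits when there is room. */
static uint64_t make_migration_offset(uint64_t pfn, bool young, bool dirty,
                                      unsigned offset_bits)
{
    uint64_t off = pfn & PFN_MASK;

    if (ad_bits_supported(offset_bits)) {
        if (young)
            off |= YOUNG_BIT;
        if (dirty)
            off |= DIRTY_BIT;
    }
    return off;
}

static uint64_t offset_to_pfn(uint64_t off)
{
    return off & PFN_MASK;
}

static bool offset_young(uint64_t off)
{
    return (off & YOUNG_BIT) != 0;
}

static bool offset_dirty(uint64_t off)
{
    return (off & DIRTY_BIT) != 0;
}
```

When the migration entry is removed, the restored PTE can be marked young or dirty according to `offset_young()` and `offset_dirty()` instead of always starting old and clean.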
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. Sep 12, 2022
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      David Hildenbrand authored
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
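
The ordering in the diagram can be approximated in user space with C11 atomics. This is a sketch under stated assumptions, not the kernel code: the structure and function names are hypothetical, a seq_cst fence stands in for smp_mb(), and release/acquire fences stand in for smp_wmb()/smp_rmb() (they are somewhat stronger than the kernel primitives). The comments carry the step labels from the diagram.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one mapped anonymous page. */
struct anon_page_sketch {
    atomic_ulong pte;            /* 0 means "cleared" */
    atomic_int pincount;
    atomic_bool anon_exclusive;
};

static void page_init(struct anon_page_sketch *p, unsigned long pte)
{
    atomic_init(&p->pte, pte);
    atomic_init(&p->pincount, 0);
    atomic_init(&p->anon_exclusive, true);
}

/* Thread 0 side: returns true if exclusivity was cleared. */
static bool try_clear_exclusive(struct anon_page_sketch *p)
{
    /* (A1) clear the PTE */
    unsigned long old = atomic_exchange_explicit(&p->pte, 0UL,
                                                 memory_order_relaxed);
    bool pinned;

    atomic_thread_fence(memory_order_seq_cst);       /* smp_mb() */
    /* (A2) check pinned */
    pinned = atomic_load_explicit(&p->pincount, memory_order_relaxed) > 0;
    if (!pinned) {
        /* (A3) clear the exclusive marker */
        atomic_store_explicit(&p->anon_exclusive, false,
                              memory_order_relaxed);
        atomic_thread_fence(memory_order_release);   /* smp_wmb() stand-in */
    }
    /* (A4) restore the PTE */
    atomic_store_explicit(&p->pte, old, memory_order_relaxed);
    return !pinned;
}

/* Thread 1 side: returns true if the pin is kept. */
static bool pin_fast(struct anon_page_sketch *p)
{
    /* (B1) read the PTE */
    unsigned long pte = atomic_load_explicit(&p->pte, memory_order_relaxed);

    if (pte == 0)
        return false;
    /* (B3) pin the mapped page */
    atomic_fetch_add_explicit(&p->pincount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);       /* smp_mb() */
    /* (B4) check if the PTE changed */
    if (atomic_load_explicit(&p->pte, memory_order_relaxed) != pte)
        goto unpin;
    atomic_thread_fence(memory_order_acquire);       /* smp_rmb() stand-in */
    /* (B5) check the exclusive marker */
    if (!atomic_load_explicit(&p->anon_exclusive, memory_order_relaxed))
        goto unpin;
    return true;
unpin:
    atomic_fetch_sub_explicit(&p->pincount, 1, memory_order_relaxed);
    return false;
}
```

Run sequentially, either the pin is visible at (A2) and exclusivity is kept, or the cleared marker is visible at (B5) and the pin is dropped. The fences are what extend that guarantee to concurrent execution.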
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      
      
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage · 50722804
      Zach O'Keefe authored
      When scanning an anon pmd to see if it's eligible for collapse, return
      SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
      SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
      file-collapse path, since the latter might identify pte-mapped compound
      pages.  This is required by MADV_COLLAPSE which necessarily needs to know
      what hugepage-aligned/sized regions are already pmd-mapped.
      
      In order to determine if a pmd already maps a hugepage, refactor
      mm_find_pmd():
      
      Return mm_find_pmd() to its pre-commit f72e7dcd ("mm: let mm_find_pmd
      fix buggy race with THP fault") behavior.  ksm was the only caller that
      explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
      there (pmd_present() and pmd_trans_huge() checks).
      
      Undo revert change in commit f72e7dcd ("mm: let mm_find_pmd fix buggy
      race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
      and use mm_find_pmd() instead.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
      
      
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Aug 31, 2022
    • mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse · 2555283e
      Jann Horn authored
      
      anon_vma->degree tracks the combined number of child anon_vmas and VMAs
      that use the anon_vma as their ->anon_vma.
      
      anon_vma_clone() then assumes that for any anon_vma attached to
      src->anon_vma_chain other than src->anon_vma, it is impossible for it to
      be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
      elevated by 1 because of a child anon_vma, meaning that if ->degree
      equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
      
      This assumption is wrong because the ->degree optimization leads to leaf
      nodes being abandoned on anon_vma_clone() - an existing anon_vma is
      reused and no new parent-child relationship is created.  So it is
      possible to reuse an anon_vma for one VMA while it is still tied to
      another VMA.
      
      This is an issue because is_mergeable_anon_vma() and its callers assume
      that if two VMAs have the same ->anon_vma, the list of anon_vmas
      attached to the VMAs is guaranteed to be the same.  When this assumption
      is violated, vma_merge() can merge pages into a VMA that is not attached
      to the corresponding anon_vma, leading to dangling page->mapping
      pointers that will be dereferenced during rmap walks.
      
      Fix it by separately tracking the number of child anon_vmas and the
      number of VMAs using the anon_vma as their ->anon_vma.
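
The ambiguity and the fix can be illustrated with two counters. The field names and the exact reuse condition below are assumptions chosen for this sketch, not taken from the text above: an anon_vma is treated as reusable only when no VMA currently uses it and it has fewer than two children.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in holding the two separately tracked counts. */
struct anon_vma_sketch {
    unsigned long num_children;     /* child anon_vmas */
    unsigned long num_active_vmas;  /* VMAs whose ->anon_vma points here */
};

/* The old combined counter: the sum hides which kind of user is present. */
static unsigned long combined_degree(const struct anon_vma_sketch *av)
{
    return av->num_children + av->num_active_vmas;
}

/* Assumed reuse rule for the sketch: no active VMA and at most one child. */
static bool can_reuse(const struct anon_vma_sketch *av)
{
    return av->num_children < 2 && av->num_active_vmas == 0;
}
```

An anon_vma with one child and no VMA, and one with no child but one VMA, have the same combined degree, yet only the first is safe to reuse. Keeping the counts apart makes that distinction checkable.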
      
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Cc: stable@kernel.org
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Jul 18, 2022
  6. Jul 04, 2022
  7. Jul 03, 2022
    • mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one() · 1118234e
      David Hildenbrand authored
      The subpage we calculate is an invalid pointer for device private pages,
      because device private pages are mapped via non-present device private
      entries, not ordinary present PTEs.
      
      Let's just not compute broken pointers and fixup later.  Move the proper
      assignment of the correct subpage to the beginning of the function and
      assert that we really only have a single page in our folio.
      
      This currently results in a BUG when trying to compute anon_exclusive,
      because:
      
      [  528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
      [  528.739585] #PF: supervisor read access in kernel mode
      [  528.745324] #PF: error_code(0x0000) - not-present page
      [  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
      [  528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
      [  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
      [  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
      [  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
      c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
      0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
      [  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
      [  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
      [  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
      [  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
      [  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
      [  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
      [  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
      [  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
      [  528.874275] PKRU: 55555554
      [  528.877286] Call Trace:
      [  528.880016]  <TASK>
      [  528.882356]  ? lock_is_held_type+0xdf/0x130
      [  528.887033]  rmap_walk_anon+0x167/0x410
      [  528.891316]  try_to_migrate+0x90/0xd0
      [  528.895405]  ? try_to_unmap_one+0xe10/0xe10
      [  528.900074]  ? anon_vma_ctor+0x50/0x50
      [  528.904260]  ? put_anon_vma+0x10/0x10
      [  528.908347]  ? invalid_mkclean_vma+0x20/0x20
      [  528.913114]  migrate_vma_setup+0x5f4/0x750
      [  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
      [  528.923532]  do_swap_page+0xac0/0xe50
      [  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
      [  528.932199]  __handle_mm_fault+0x949/0x1440
      [  528.936876]  handle_mm_fault+0x13f/0x3e0
      [  528.941256]  do_user_addr_fault+0x215/0x740
      [  528.945928]  exc_page_fault+0x75/0x280
      [  528.950115]  asm_exc_page_fault+0x27/0x30
      [  528.954593] RIP: 0033:0x40366b
      ...
      
      Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com
      
      
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Alistair Popple <apopple@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. Jun 27, 2022
  9. May 19, 2022
    • mm: don't be stuck to rmap lock on reclaim path · 6d4675e6
      Minchan Kim authored
      The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended
      under memory pressure if processes keep working on their vmas (e.g., fork,
      mmap, munmap).  This makes the reclaim path get stuck.  In our real
      workload traces, we see kswapd waiting on the lock for 300ms+ (worst case,
      a second), which makes other processes enter direct reclaim, where they
      also get stuck on the lock.
      
      This patch makes the lru aging path use try_lock mode, like
      shrink_page_list, so the reclaim context keeps working on the next lru
      pages without being stuck.  If it finds the rmap lock contended, it
      rotates the page back to the head of the lru in both active/inactive lrus
      to make their behavior consistent, which is a basic starting point rather
      than adding more heuristics.
      
      Since this patch introduces a new "contended" field as an out-param along
      with the try_lock in-param in rmap_walk_control, the structure is no
      longer immutable if try_lock is set, so remove the const keywords on rmap
      related functions.  Since rmap walking is already an expensive operation,
      I doubt the const would bring a sizable benefit (and we didn't have it
      until 5.17).
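
The in-param/out-param pattern can be sketched in user space as below. The names are hypothetical (`walk_control_sketch`, `walk_one`, `age_pages`), and a C11 atomic boolean stands in for the rmap lock. The control structure is written to by the walk, which is why it can no longer be passed as const.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counterpart of the walk control structure. */
struct walk_control_sketch {
    bool try_lock;    /* in: do not wait for the lock */
    bool contended;   /* out: set when the lock was busy */
};

/* Hypothetical page with its own lock standing in for the rmap lock. */
struct lru_page_sketch {
    atomic_bool rmap_locked;
    bool referenced;
};

static void lru_page_init(struct lru_page_sketch *page)
{
    atomic_init(&page->rmap_locked, false);
    page->referenced = false;
}

/* Walk one page; in try_lock mode give up instead of waiting. */
static void walk_one(struct lru_page_sketch *page,
                     struct walk_control_sketch *wc)
{
    if (wc->try_lock) {
        if (atomic_exchange(&page->rmap_locked, true)) {
            wc->contended = true;   /* the caller decides what to do */
            return;
        }
    } else {
        while (atomic_exchange(&page->rmap_locked, true))
            ;   /* the kernel would sleep here; the sketch spins */
    }
    page->referenced = true;        /* stands in for the real rmap work */
    atomic_store(&page->rmap_locked, false);
}

/* Aging loop: contended pages are rotated, the rest are processed. */
static size_t age_pages(struct lru_page_sketch *pages, size_t n)
{
    size_t rotated = 0;

    for (size_t i = 0; i < n; i++) {
        struct walk_control_sketch wc = { .try_lock = true,
                                          .contended = false };

        walk_one(&pages[i], &wc);
        if (wc.contended)
            rotated++;   /* would go back to the head of the lru */
    }
    return rotated;
}
```

The loop never blocks on a busy lock; it only counts the page as rotated and moves on to the next one.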
      
      In a heavy app workload in Android, trace shows following statistics.  It
      almost removes rmap lock contention from reclaim path.
      
      Martin Liu reported:
      
      Before:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
               1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
                601            0             601   145.544681        28817    198  rmap_walk_file
      
      After:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
                NaN          NaN              NaN          NaN          NaN    0.0             NaN
                  0            0                0     0.127645            1     12  rmap_walk_file
      
      [minchan@kernel.org: add comment, per Matthew]
        Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
      Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d4675e6
  10. May 13, 2022
  11. May 10, 2022
    • David Hildenbrand's avatar
      mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      David Hildenbrand authored
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions that occur when a GUP R/W reference
      (FOLL_WRITE | FOLL_GET) was taken on an anonymous page and the COW logic
      fails to detect exclusivity of the page, then replaces the anonymous page
      by a copy in the page table: the GUP reference loses synchronicity with
      the page mapped into the page tables.  This series focuses on x86, arm64,
      s390x and
      ppc64/book3s -- other architectures are fairly easy to support by
      implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, which will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or if we have two
      threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner case, it unfortunately turns out to be a more generic problem.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not* consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this to keep fork() logic on swap entries easy and efficient:
      for example, if we didn't clear it when unmapping, we'd have to look up
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already refuse to unmap in case the page might be
      pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
      Reliably checking for other unexpected references *before* completely
      unmapping a page is unfortunately not really possible: THPs heavily
      overcomplicate the situation.  Once fully unmapped it's easier --
      we, for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) by remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependent pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
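
A minimal user-space sketch of the idea follows: one bit of the swap pte remembers exclusivity across swapout, and fork() clears it. The bit position, the offset shift and all `toy_` names are assumptions made for illustration; each real architecture chooses its own bit and encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_swp_pte_t;

/* Hypothetical layout: bit 2 is the exclusive marker, the offset starts at bit 8. */
#define TOY_SWP_EXCLUSIVE    ((toy_swp_pte_t)1 << 2)
#define TOY_SWP_OFFSET_SHIFT 8

static toy_swp_pte_t toy_swp_entry(uint64_t offset)
{
	assert((offset >> (64 - TOY_SWP_OFFSET_SHIFT)) == 0); /* offset must fit */
	return offset << TOY_SWP_OFFSET_SHIFT;
}

static uint64_t toy_swp_offset(toy_swp_pte_t pte)
{
	return pte >> TOY_SWP_OFFSET_SHIFT;
}

static toy_swp_pte_t toy_pte_swp_mkexclusive(toy_swp_pte_t pte)
{
	return pte | TOY_SWP_EXCLUSIVE;
}

static bool toy_pte_swp_exclusive(toy_swp_pte_t pte)
{
	return (pte & TOY_SWP_EXCLUSIVE) != 0;
}

static toy_swp_pte_t toy_pte_swp_clear_exclusive(toy_swp_pte_t pte)
{
	return pte & ~TOY_SWP_EXCLUSIVE;
}

/* Swapout: carry the page's exclusive marker over into the swap pte. */
static toy_swp_pte_t toy_make_swap_pte(toy_swp_pte_t entry, bool page_anon_exclusive)
{
	return page_anon_exclusive ? toy_pte_swp_mkexclusive(entry) : entry;
}

/* fork(): clear the bit in the parent and hand the same pte to the child. */
static toy_swp_pte_t toy_fork_copy_swap_pte(toy_swp_pte_t *parent)
{
	*parent = toy_pte_swp_clear_exclusive(*parent);
	return *parent;
}
```

On swapin, the restoring side would test the bit with the equivalent of `toy_pte_swp_exclusive()` instead of re-deriving exclusivity from reference counts, so an extra FOLL_GET reference no longer forces a copy.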
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1493a191
    • David Hildenbrand's avatar
      mm/rmap: fail try_to_migrate() early when setting a PMD migration entry fails · 7f5abe60
      David Hildenbrand authored
      Let's fail right away in case we cannot clear PG_anon_exclusive because
      the anon THP may be pinned.  Right now, we continue trying to install
      migration entries and the caller of try_to_migrate() will realize that the
      page is still mapped and has to restore the migration entries.  Let's
      fail fast instead, just like for PTE migration entries.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-14-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f5abe60
    • David Hildenbrand's avatar
      mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      David Hildenbrand authored
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE vs.  PMD-mapped
      anonymous pages require more thought, some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks apply.
      Further, due to MADV_WIPEONFORK we might not necessarily write-protect
      all PTEs, and only some subpages might be pinned.  Long story short: once
      PTE-mapped, we have to track information about exclusivity per sub-page,
      but until then, we can just track it for the compound page in the head
      page and do not have to update a whole bunch of subpages all of the time
      for a simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To not spill PG_anon_exclusive code all over the mm code-base, we let the
      anon rmap code handle all PG_anon_exclusive logic it can easily handle.
      
      If a writable, present page table entry points at an anonymous (sub)page,
      that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
      pin (FOLL_PIN) on an anonymous page referenced via a present page table
      entry, it must only pin if PG_anon_exclusive is set for the mapped
      (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap, this means unmapping
        will fail.  For migration, this means migration will fail early.  All
        three cases protect against concurrent GUP using the PT lock and a
        proper clear/invalidate+flush of the relevant page table entry.
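
The rule that clearing the exclusive marker fails on pinned pages, and what the callers do with that failure, can be modelled as below. This is a simplified sketch: the `toy_` names and the explicit pin counter are hypothetical, and the locking and flushing that the commit message says protect against concurrent GUP are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy anonymous page: an exclusive marker plus an explicit GUP pin count. */
struct toy_anon_page {
	bool anon_exclusive;
	int gup_pins;
};

/*
 * Try to mark the page "possibly shared". Returns 0 on success and -1 if
 * the page is exclusive and pinned, in which case the marker must stay set.
 */
static int toy_try_share(struct toy_anon_page *page)
{
	assert(page->gup_pins >= 0);
	if (page->anon_exclusive && page->gup_pins > 0)
		return -1;
	page->anon_exclusive = false;
	return 0;
}

enum toy_fork_result { TOY_FORK_SHARED, TOY_FORK_COPIED };

/* fork(): share the page if possible, otherwise give the child a copy. */
static enum toy_fork_result toy_fork_one_page(struct toy_anon_page *page)
{
	return toy_try_share(page) ? TOY_FORK_COPIED : TOY_FORK_SHARED;
}

/* Swapout: unmapping simply fails while the page is pinned. */
static bool toy_unmap_for_swap(struct toy_anon_page *page)
{
	return toy_try_share(page) == 0;
}
```

In this model a pinned page keeps its marker in both failure cases, which is the invariant the commit message is after: a pinned anonymous page never silently becomes "possibly shared".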
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
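
The two-step COW decision in I.2 might look roughly like the following sketch. The `toy_` names are hypothetical, and the slow-path test ("this mapping holds the only reference") is an assumption standing in for whatever the real handler checks under the page lock.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cow_page {
	bool anon_exclusive;
	int refcount;        /* all references, including this mapping */
};

enum toy_cow_result { TOY_COW_REUSE, TOY_COW_COPY };

/* Write fault on a write-protected anonymous page. */
static enum toy_cow_result toy_cow_fault(struct toy_cow_page *page)
{
	assert(page->refcount >= 1);

	/* Fast path: the marker alone is enough, extra references do not matter. */
	if (page->anon_exclusive)
		return TOY_COW_REUSE;

	/*
	 * Slow path (assumed to run under the page lock): if nobody else
	 * references the page, it is exclusive after all. Record that, as
	 * the text describes for page_move_anon_rmap(), and reuse it.
	 */
	if (page->refcount == 1) {
		page->anon_exclusive = true;
		return TOY_COW_REUSE;
	}

	return TOY_COW_COPY;
}
```

The fast path is what keeps a GUP-referenced page in place: a page that is already marked exclusive is reused even though its reference count is elevated.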
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at exclusive anonymous pages.
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exclusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
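
As a sketch of how the entry type could translate into rmap flags on restore: the enum and the `toy_` names below are hypothetical. The model assumes, in line with the fork() and GUP-fast sections of this message, that both writable and readable-exclusive entries stand for an exclusive anonymous page and only plain readable entries mean "possibly shared".

```c
#include <assert.h>

/* Toy migration entry types for anonymous pages. */
enum toy_migration_entry {
	TOY_MIG_READ,            /* possibly shared          */
	TOY_MIG_READ_EXCLUSIVE,  /* read-only, but exclusive */
	TOY_MIG_WRITE            /* writable, hence exclusive */
};

typedef unsigned int toy_rmap_t;
#define TOY_RMAP_NONE      0u
#define TOY_RMAP_EXCLUSIVE 1u

/* Pick the rmap flags to use when turning the entry back into a present PTE. */
static toy_rmap_t toy_rmap_flags_for_restore(enum toy_migration_entry entry)
{
	assert(entry <= TOY_MIG_WRITE);
	return entry == TOY_MIG_READ ? TOY_RMAP_NONE : TOY_RMAP_EXCLUSIVE;
}
```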
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7. fork() handling
      
      fork() handling is relatively easy, because PG_anon_exclusive is only
      expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table entry and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
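
The migration entry conversions from I.7 d) and I.8 can be written as two small transition functions. The names are hypothetical and this is only a model of the rules as stated in the text; the `is_anon` parameter is an assumption reflecting that readable-exclusive entries are described as specific to anonymous pages.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_migration_entry {
	TOY_MIG_READ,            /* possibly shared           */
	TOY_MIG_READ_EXCLUSIVE,  /* read-only, but exclusive  */
	TOY_MIG_WRITE            /* writable, hence exclusive */
};

/* fork(): every entry type ends up as a plain readable entry. */
static enum toy_migration_entry toy_fork_convert(enum toy_migration_entry entry)
{
	assert(entry <= TOY_MIG_WRITE);
	return TOY_MIG_READ;
}

/*
 * mprotect() write-protect: a writable entry loses write permission but,
 * for anonymous pages, keeps the exclusivity information.
 */
static enum toy_migration_entry
toy_mprotect_wrprotect(enum toy_migration_entry entry, bool is_anon)
{
	assert(entry <= TOY_MIG_WRITE);
	if (entry != TOY_MIG_WRITE)
		return entry;
	return is_anon ? TOY_MIG_READ_EXCLUSIVE : TOY_MIG_READ;
}
```

The asymmetry is the point: fork() creates a second user, so exclusivity has to go, while mprotect() only changes permissions and must not lose it.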
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c287605
    • David Hildenbrand's avatar
      mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() · 40f2bbf7
      David Hildenbrand authored
      New anonymous pages are always mapped natively: only THP/khugepaged code
      maps a new compound anonymous page and passes "true".  Otherwise, we're
      just dealing with simple, non-compound pages.
      
      Let's give the interface clearer semantics and document these.  Remove the
      PageTransCompound() sanity check from page_add_new_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40f2bbf7
    • David Hildenbrand's avatar
      mm/rmap: pass rmap flags to hugepage_add_anon_rmap() · 28c5209d
      David Hildenbrand authored
      Let's prepare for passing RMAP_EXCLUSIVE, similar to what we do for
      page_add_anon_rmap() now.  RMAP_COMPOUND is implicit for hugetlb pages and
      ignored.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      28c5209d
    • David Hildenbrand's avatar
      mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      David Hildenbrand authored
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we also
      want to pass RMAP_EXCLUSIVE more often in follow-up patches when detecting
      that an anonymous page is exclusive: for example, when restoring an
      anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f1e2db12
    • David Hildenbrand's avatar
      mm/rmap: convert RMAP flags to a proper distinct rmap_t type · 14f9135d
      David Hildenbrand authored
      We want to pass the flags to more than one anon rmap function, getting rid
      of special "do_page_add_anon_rmap()".  So let's pass around a distinct
      __bitwise type and refine documentation.
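
As a rough userspace illustration of why a distinct flags type reads better than a pair of bools, the sketch below models an `rmap_t`-like type. Only RMAP_EXCLUSIVE is named in this log; RMAP_NONE, RMAP_COMPOUND, and the helper `toy_add_anon_rmap()` are assumptions made for illustration, and the sparse `__bitwise` annotation is described only in a comment so that the code builds with a plain C compiler.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace stand-in. The kernel type is additionally annotated __bitwise
 * so that sparse can warn when plain integers or bools are mixed in.
 */
typedef int rmap_t;

#define RMAP_NONE      ((rmap_t)0)
#define RMAP_EXCLUSIVE ((rmap_t)(1 << 0))  /* page is exclusive to one process */
#define RMAP_COMPOUND  ((rmap_t)(1 << 1))  /* assumed: mapping covers a compound page */

struct toy_rmap_result {
	bool exclusive;
	bool compound;
};

/* Hypothetical helper: decode the flags the way an rmap function might. */
static struct toy_rmap_result toy_add_anon_rmap(rmap_t flags)
{
	/* Reject bits this sketch does not know about. */
	assert((flags & ~(RMAP_EXCLUSIVE | RMAP_COMPOUND)) == 0);

	struct toy_rmap_result res = {
		.exclusive = (flags & RMAP_EXCLUSIVE) != 0,
		.compound  = (flags & RMAP_COMPOUND) != 0,
	};
	return res;
}
```

A call site such as `toy_add_anon_rmap(RMAP_EXCLUSIVE | RMAP_COMPOUND)` documents itself, whereas `f(page, true, true)` does not.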
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14f9135d
    • David Hildenbrand's avatar
      mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
      David Hildenbrand authored
      Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
      
      This series is the result of the discussion on the previous approach [2]. 
      More information on the general COW issues can be found there.  It is
      based on latest linus/master (post v5.17, with relevant core-MM changes
      for v5.18-rc1).
      
      This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
      on an anonymous page and COW logic fails to detect exclusivity of the page,
      then replaces the anonymous page by a copy in the page table: the GUP
      pin loses synchronicity with the pages mapped into the page tables.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 3):
      "
        3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
      
        page_maybe_dma_pinned() is used to check if a page may be pinned for
        DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
        tolerable, false negatives are problematic: pages that are pinned for
        DMA must not be added to the swapcache.  If it happens, the (now pinned)
        page could be faulted back from the swapcache into page tables
        read-only.  Future write-access would detect the pinning and COW the
        page, losing synchronicity.  For the interested reader, this is nicely
        documented in feb889fb ("mm: don't put pinned pages into the swap
        cache").
      
        Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
        cases and can result in a violation of the documented semantics: giving
        false negatives because of the race.
      
        There are cases where we call it without properly taking a per-process
        sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
        one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
        handle, there is especially one rmap case (shrink_page_list) that's hard
        to fix: in the rmap world, we're not limited to a single process.
      
        The shrink_page_list() issue is really subtle.  If we race with
        someone pinning a page, we can trigger the same issue as in the FOLL_GET
        case.  See the detail section at the end of this mail on a discussion
        how bad this can bite us with VFIO or other FOLL_PIN user.
      
        It's harder to reproduce, but I managed to modify the O_DIRECT
        reproducer to use io_uring fixed buffers [15] instead, which ends up
        using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
        similarly trigger a loss of synchronicity and consequently a memory
        corruption.
      
        Again, the root issue is that a write-fault on a page that has
        additional references results in a COW and thereby a loss of
        synchronicity and consequently a memory corruption if two parties
        believe they are referencing the same page.
      "
      
      This series makes GUP pins (R/O and R/W) on anonymous pages fully
      reliable, especially also taking care of concurrent pinning via GUP-fast,
      for example, also fully fixing an issue reported regarding NUMA balancing
      [4] recently.  While doing that, it further reduces "unnecessary COWs",
      especially when we don't fork()/KSM and don't swapout, and fixes the COW
      security for hugetlb for FOLL_PIN.
      
      In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
      anonymous page is exclusive.  Exclusive anonymous pages that are mapped
      R/O can directly be mapped R/W by the COW logic in the write fault
      handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
      first have to be marked shared -- which will fail if there are GUP pins on
      the page.  GUP is only allowed to take a pin on anonymous pages that are
      exclusive.  The PT lock is the primary mechanism to synchronize
      modifications of PG_anon_exclusive.  We synchronize against GUP-fast
      either via the src_mm->write_protect_seq (during fork()) or via
      clear/invalidate+flush of the relevant page table entry.
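
The rules in the summary above can be modelled with a small userspace toy. This is only a sketch: `struct toy_page` and the `toy_*` helpers are hypothetical names, and all locking (PT lock, write_protect_seq) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an anonymous page; not the kernel's struct page. */
struct toy_page {
	bool anon_exclusive;  /* models PG_anon_exclusive */
	int  pins;            /* models outstanding GUP pins */
};

/* GUP may only take a pin on an anonymous page that is exclusive. */
static bool toy_gup_pin(struct toy_page *page)
{
	assert(page->pins >= 0);
	if (!page->anon_exclusive)
		return false;   /* the page would have to become exclusive first */
	page->pins++;
	return true;
}

/* fork()/KSM: marking the page shared fails while pins exist. */
static bool toy_try_share(struct toy_page *page)
{
	if (page->pins > 0)
		return false;   /* sharing is refused */
	page->anon_exclusive = false;
	return true;
}

/* Write fault on a R/O mapping: map R/W in place only if exclusive. */
static bool toy_write_fault_can_reuse(const struct toy_page *page)
{
	return page->anon_exclusive;
}
```

In this model a pinned page can never become shared, so a later write fault never replaces it with a copy behind the pin holder's back.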
      
      Special care has to be taken about swap, migration, and THPs (whereby a
      PMD-mapping can be converted to a PTE mapping and we have to track
      information for subpages).  Besides these, we let the rmap code handle
      most magic.  For reliable R/O pins of anonymous pages, we need
      FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however,
      it's now 100% mapcount free and I further simplified it a bit.
      
        #1 is a fix
        #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
        #11 introduces PG_anon_exclusive
        #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
         reliable
        #13 is a preparation for reliable R/O pins
        #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins and
         make R/O pins of anonymous pages reliable
        #16 adds sanity check when (un)pinning anonymous pages
      
      [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
      
      
      This patch (of 17):
      
      In case arch_unmap_one() fails, we already did a swap_duplicate().  Let's
      undo that properly via swap_free().
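
The shape of this fix is the usual "undo on the error path" pattern. A minimal userspace sketch, assuming a single counter in place of the real swap map and hypothetical `toy_*` helpers:

```c
#include <assert.h>
#include <stdbool.h>

static int toy_swap_refs;  /* models the swap entry's reference count */

static int toy_swap_duplicate(void)
{
	toy_swap_refs++;
	return 0;
}

static void toy_swap_free(void)
{
	assert(toy_swap_refs > 0);
	toy_swap_refs--;
}

/* @arch_fails simulates arch_unmap_one() returning an error. */
static bool toy_unmap_one(bool arch_fails)
{
	if (toy_swap_duplicate() < 0)
		return false;

	if (arch_fails) {
		toy_swap_free();  /* undo the duplicate taken above */
		return false;
	}
	return true;
}
```

Without the `toy_swap_free()` call, every failed attempt would leave one extra reference behind.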
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
      
      
      Fixes: ca827d55 ("mm, swap: Add infrastructure for saving page metadata on swap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      322842ea
  12. Apr 29, 2022
    • Muchun Song's avatar
      mm: rmap: introduce pfn_mkclean_range() to cleans PTEs · 6a8e0596
      Muchun Song authored
      page_mkclean_one() is supposed to be used with a pfn that has an
      associated struct page, but not all pfns (e.g.  DAX) have a struct
      page.  Introduce a new function, pfn_mkclean_range(), to clean the PTEs
      (including PMDs) that map a range of pfns which have no struct page
      associated with them.  This helper will be used by the DAX device in the
      next patch to make pfns clean.
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-4-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6a8e0596
    • Muchun Song's avatar
      mm: rmap: fix cache flush on THP pages · 7f9c9b60
      Muchun Song authored
      Patch series "Fix some bugs related to ramp and dax", v7.
      
      Patches 1-2 fix a cache flush bug; because subsequent patches depend on
      those changes, they are placed in this series.  Patches 3-4 are
      preparation for fixing a dax bug in patch 5.  Patch 6 is a code cleanup,
      since the previous patch removes the usage of follow_invalidate_pte().
      
      
      This patch (of 6):
      
      flush_cache_page() only removes a PAGE_SIZE-sized range from the cache.
      For a THP it therefore covers only the head page, not the full huge page.
      Replace it with flush_cache_range() to fix this issue.  No problems are
      known to have been caused by this so far, possibly because few
      architectures have virtually indexed caches.
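
To make the size mismatch concrete, here is a small arithmetic sketch. The constants assume 4 KiB base pages and 512 base pages per PMD-sized THP (typical for x86-64, but an assumption here), and the `toy_*` helpers are hypothetical; they only compute the address range each style of flush would cover.

```c
#include <assert.h>

#define TOY_PAGE_SIZE    4096UL  /* assumed base page size */
#define TOY_HPAGE_PMD_NR 512UL   /* assumed base pages per PMD-sized THP */

struct toy_range {
	unsigned long start;
	unsigned long end;       /* exclusive */
};

/* Range covered by a single-page flush at @addr. */
static struct toy_range toy_flush_page_range(unsigned long addr)
{
	assert(addr % TOY_PAGE_SIZE == 0);
	struct toy_range r = { addr, addr + TOY_PAGE_SIZE };
	return r;
}

/* Range that has to be flushed to cover the whole THP mapped at @addr. */
static struct toy_range toy_flush_thp_range(unsigned long addr)
{
	assert(addr % (TOY_HPAGE_PMD_NR * TOY_PAGE_SIZE) == 0);
	struct toy_range r = { addr, addr + TOY_HPAGE_PMD_NR * TOY_PAGE_SIZE };
	return r;
}
```

Under these assumptions the single-page flush covers 4 KiB of a 2 MiB mapping, that is, one page out of 512.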
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220403053957.10770-2-songmuchun@bytedance.com
      
      
      Fixes: f27176cf ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f9c9b60
  13. Apr 01, 2022
  14. Mar 25, 2022
    • Mauricio Faria de Oliveira's avatar
      mm: fix race between MADV_FREE reclaim and blkdev direct IO read · 6c8e2a25
      Mauricio Faria de Oliveira authored
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (this is discussed below), due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similar to how the page is not freed
      per shrink_page_list()/page_ref_freeze().
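
The reference-count rule described in this section can be written as a small predicate. This is a sketch only: `toy_can_discard_lazyfree()` is a hypothetical name, and the real check runs under the PTE lock with the barriers discussed in the next section.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Discard a lazyfree page's mapping only if the page is clean and holds no
 * references beyond the isolation reference plus one per mapping.
 */
static bool toy_can_discard_lazyfree(int ref_count, int map_count, bool dirty)
{
	assert(map_count >= 1 && ref_count >= map_count);
	return !dirty && ref_count == 1 + map_count;
}
```

For example, `toy_can_discard_lazyfree(3, 1, false)` is false: the third reference could belong to an in-flight direct IO read, so the PTE has to stay.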
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
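
The ordering argument can be sketched with C11 atomics as an analogy. This is not the kernel memory model: the `toy_*` names are hypothetical, the sequentially consistent fence stands in for smp_mb(), and a single mapping plus the isolation reference is assumed. Each side writes its own variable first and only then reads the other's, which is intended to rule out the case where both sides see stale values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int toy_pte = 1;       /* 1: PTE present, 0: PTE cleared */
static atomic_int toy_refcount = 2;  /* isolation ref + one mapping */

static void toy_reset(void)
{
	atomic_store(&toy_pte, 1);
	atomic_store(&toy_refcount, 2);
}

/* Reclaim side: write the PTE, full barrier, then read the refcount. */
static bool toy_reclaim_may_discard(void)
{
	atomic_store_explicit(&toy_pte, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* stands in for smp_mb() */

	int refs = atomic_load_explicit(&toy_refcount, memory_order_relaxed);
	assert(refs >= 2);
	if (refs != 2) {
		/* Someone else holds a reference: put the PTE back. */
		atomic_store_explicit(&toy_pte, 1, memory_order_relaxed);
		return false;
	}
	return true;
}

/* GUP-fast side: take a reference first, then re-read the PTE. */
static bool toy_gup_fast(void)
{
	atomic_fetch_add(&toy_refcount, 1);      /* seq_cst read-modify-write */
	if (atomic_load(&toy_pte) == 0) {        /* seq_cst load */
		atomic_fetch_sub(&toy_refcount, 1);  /* PTE changed: back off */
		return false;
	}
	return true;
}
```

Run one after the other, whichever side goes first wins and the other backs off; the barriers are there so that the same holds when the two sides overlap.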
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in case the unlikely TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() path
      is taken, which should not happen or should be rare, the page refcount
      should be greater than mapcount: the head page is referenced by tail
      pages.  That also prevents checking the head `page` and then incorrectly
      calling page_remove_rmap(subpage) for a tail page that isn't even in
      shrink_page_list()'s page_list (an effect of split huge pmd/pvmw), as
      might happen today in this unlikely scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seems to
      assume that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly).  Plus, the kernel fix for the corner case of the
      largely 'non-atomic write' encompassed by a direct IO read operation is
      relatively simple, and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c8e2a25
    • Anshuman Khandual's avatar
      mm/migration: add trace events for base page and HugeTLB migrations · 4cc79b33
      Anshuman Khandual authored
      This adds two trace events for base page and HugeTLB page migrations.
      These events closely follow implementation details such as setting and
      removing PTE migration entries, which are essential operations for
      migration.  The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
      <events/migration.h> and <events/tlb.h> based trace events.  Hence drop
      redundant CREATE_TRACE_POINTS from other places which could have otherwise
      conflicted during build.
      
      Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
      
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cc79b33
    • Hugh Dickins's avatar
      mm/thp: fix NR_FILE_MAPPED accounting in page_*_file_rmap() · 5d543f13
      Hugh Dickins authored
      NR_FILE_MAPPED accounting in mm/rmap.c (for /proc/meminfo "Mapped" and
      /proc/vmstat "nr_mapped" and the memcg's memory.stat "mapped_file") is
      slightly flawed for file or shmem huge pages.
      
      It is well thought out, and looks convincing, but there's a racy case when
      the careful counting in page_remove_file_rmap() (without page lock) gets
      discarded.  So that in a workload like two "make -j20" kernel builds under
      memory pressure, with cc1 on hugepage text, "Mapped" can easily grow by a
      spurious 5MB or more on each iteration, ending up implausibly bigger than
      most other numbers in /proc/meminfo.  And, hypothetically, might grow to
      the point of seriously interfering in mm/vmscan.c's heuristics, which do
      take NR_FILE_MAPPED into some consideration.
      
      Fixed by moving the __mod_lruvec_page_state() down to where it will not be
      missed before return (and I've grown a bit tired of that oft-repeated
      but-not-everywhere comment on the __ness: it gets lost in the move here).
      
      Does page_add_file_rmap() need the same change?  I suspect not, because
      page lock is held in all relevant cases, and its skipping case looks safe;
      but it's much easier to be sure, if we do make the same change.
      
      Link: https://lkml.kernel.org/r/e02e52a1-8550-a57c-ed29-f51191ea2375@google.com
      
      
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d543f13
  15. Mar 22, 2022