- Jan 26, 2024
-
-
Ryan Roberts authored
The addition of commit efa7df3e ("mm: align larger anonymous mappings on THP boundaries") caused the "virtual_address_range" mm selftest to start failing on arm64. Let's fix that regression. There were 2 visible problems when running the test: 1) it takes much longer to execute, and 2) the test fails. Both are related: The (first part of the) test allocates as many 1GB anonymous blocks as it can in the low 256TB of address space, passing NULL as the addr hint to mmap. Before the faulty patch, all allocations were abutted and contained in a single, merged VMA. However, after this patch, each allocation is in its own VMA, and there is a 2M gap between each VMA. This causes the 2 problems in the test: 1) mmap becomes MUCH slower because there are so many VMAs to check to find a new 1G gap. 2) mmap fails once it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then causes a subsequent calloc() to fail, which causes the test to fail. The problem is that arm64 (unlike x86) selects ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates len+2M then always aligns to the bottom of the discovered gap. That causes the 2M hole. Fix this by detecting cases where we can still achieve the alignment goal when moved to the top of the allocated area, if configured to prefer top-down allocation (see the sketch after this entry). While we are at it, fix thp_get_unmapped_area's use of pgoff, which should always be zero for anonymous mappings. Prior to the faulty change, while it was possible for user space to pass in pgoff!=0, the old mm->get_unmapped_area() handler would not use it. thp_get_unmapped_area() does use it, so let's explicitly zero it before calling the handler. This should also be the correct behavior for arches that define their own get_unmapped_area() handler. Link: https://lkml.kernel.org/r/20240123171420.3970220-1-ryan.roberts@arm.com Fixes: efa7df3e ("mm: align larger anonymous mappings on THP boundaries") Closes: https://lore.kernel.org/linux-mm/1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com/ Signed-off-by:
Ryan Roberts <ryan.roberts@arm.com> Reviewed-by:
Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
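A hedged sketch of the shape of the fix in __thp_get_unmapped_area() (paraphrased from the description above; variable names and surrounding code simplified, not the verbatim upstream diff). The area is over-allocated by one extra "size" (2M for PMD-sized THP) so that an aligned address always exists in the gap; on a top-down mm, if the bottom of the gap is already aligned, hand back the top of the padded area instead of leaving a 2M hole:

	ret = current->mm->get_unmapped_area(filp, addr, len_pad,
					     off >> PAGE_SHIFT, flags);
	if (IS_ERR_VALUE(ret))
		return 0;

	off_sub = (off - ret) & (size - 1);	/* shift needed for alignment */

	/* top-down mm and bottom already aligned: take the top instead */
	if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
	    !off_sub)
		return ret + size;	/* keeps abutting the VMA above */

	return ret + off_sub;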
-
Yang Shi authored
commit efa7df3e ("mm: align larger anonymous mappings on THP boundaries") caused two issues [1] [2] reported on 32-bit systems and compat userspace. It doesn't make much sense to force huge page alignment on 32-bit systems due to the constrained virtual address space. [1] https://lore.kernel.org/linux-mm/d0a136a0-4a31-46bc-adf4-2db109a61672@kernel.org/ [2] https://lore.kernel.org/linux-mm/CAJuCfpHXLdQy1a2B6xN2d7quTYwg2OoZseYPZTRpU0eHHKD-sQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240118180505.2914778-1-shy828301@gmail.com Fixes: efa7df3e ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by:
Yang Shi <yang@os.amperecomputing.com> Reported-by:
Jiri Slaby <jirislaby@kernel.org> Reported-by:
Suren Baghdasaryan <surenb@google.com> Tested-by:
Jiri Slaby <jirislaby@kernel.org> Tested-by:
Suren Baghdasaryan <surenb@google.com> Reviewed-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
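A hedged sketch of the early bail-out this fix adds near the top of __thp_get_unmapped_area() (paraphrased; the exact condition in the tree may differ):

	/* Sketch: skip THP alignment on 32-bit kernels and compat tasks. */
	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
		return 0;	/* 0 tells the caller to use the regular path */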
-
Lokesh Gidra authored
In mfill_atomic_hugetlb(), mmap_changing isn't being checked again if we drop mmap_lock and reacquire it. When the lock is not held, mmap_changing could have been incremented. This is also inconsistent with the behavior in mfill_atomic(). Link: https://lkml.kernel.org/r/20240117223729.1444522-1-lokeshgidra@google.com Fixes: df2cc96e ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by:
Lokesh Gidra <lokeshgidra@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
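A hedged sketch of the missing re-check in mfill_atomic_hugetlb()'s retry loop (assuming the function still receives the atomic_t *mmap_changing counter introduced by df2cc96e; paraphrased, not the verbatim diff):

	/*
	 * After re-taking mmap_lock, a non-cooperative event (e.g. mremap)
	 * may have bumped mmap_changing; bail out like mfill_atomic() does
	 * and let userspace retry.
	 */
	mmap_read_lock(dst_mm);
	err = -EAGAIN;
	if (mmap_changing && atomic_read(mmap_changing))
		goto out_unlock;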
-
David Hildenbrand authored
The correct folio replacement for "set_page_dirty()" is "folio_mark_dirty()", not "folio_set_dirty()". Using the latter won't properly inform the FS using the dirty_folio() callback. This has been found by code inspection, but likely this can result in some real trouble when zapping dirty PTEs that point at clean pagecache folios. Yuezhang Mo said: "Without this fix, testing the latest exfat with xfstests, test cases generic/029 and generic/030 will fail." Link: https://lkml.kernel.org/r/20240122171751.272074-1-david@redhat.com Fixes: c4626503 ("mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()") Signed-off-by:
David Hildenbrand <david@redhat.com> Reported-by:
Ryan Roberts <ryan.roberts@arm.com> Closes: https://lkml.kernel.org/r/2445cedb-61fb-422c-8bfb-caf0a2beed62@arm.com Reviewed-by:
Ryan Roberts <ryan.roberts@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
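The distinction, illustrated with a hedged sketch (not the upstream diff; the call site is representative of a zap path):

	/*
	 * folio_set_dirty() only sets the PG_dirty flag bit; the
	 * filesystem's aops->dirty_folio() bookkeeping is silently
	 * skipped. folio_mark_dirty() is the real set_page_dirty()
	 * replacement: it routes through the mapping and invokes
	 * dirty_folio().
	 */
	if (pte_dirty(ptent))
		folio_mark_dirty(folio);	/* correct */
		/* folio_set_dirty(folio);	   wrong: FS never told */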
-
David Hildenbrand authored
The correct folio replacement for "set_page_dirty()" is "folio_mark_dirty()", not "folio_set_dirty()". Using the latter won't properly inform the FS using the dirty_folio() callback. This has been found by code inspection, but likely this can result in some real trouble. Link: https://lkml.kernel.org/r/20240122175407.307992-1-david@redhat.com Fixes: a8e61d58 ("mm/huge_memory: page_remove_rmap() -> folio_remove_rmap_pmd()") Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Zach O'Keefe authored
(struct dirty_throttle_control *)->thresh is an unsigned long, but is passed as the u32 divisor argument to div_u64(). On architectures where unsigned long is 64 bits, the argument will be implicitly truncated. Use div64_u64() instead of div_u64() so that the value used in the "is this a safe division" check is the same as the divisor. Also, remove the redundant cast of the numerator to u64, as that should happen implicitly. This would be difficult to exploit in the memcg domain, given the ratio-based arithmetic domain_dirty_limits() uses, but is much easier in the global writeback domain with a BDI_CAP_STRICTLIMIT backing device, using e.g. vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32) (see the demo below). Link: https://lkml.kernel.org/r/20240118181954.1415197-1-zokeefe@google.com Fixes: f6789593 ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()") Signed-off-by:
Zach O'Keefe <zokeefe@google.com> Cc: Maxim Patlasov <MPatlasov@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
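A standalone userspace demonstration of the truncation (the kernel's div_u64() takes a u32 divisor; the narrowing cast below models that implicit truncation):

	#include <stdint.h>
	#include <stdio.h>

	/* Models div_u64(u64 dividend, u32 divisor): divisor is narrowed. */
	static uint64_t div_u64_like(uint64_t dividend, uint32_t divisor)
	{
		return dividend / divisor;
	}

	int main(void)
	{
		uint64_t thresh = 1ULL << 32;	/* dtc->thresh on a 64-bit arch */

		/* The "safe to divide?" check sees the full 64-bit value... */
		if (thresh)
			/* ...but the divisor actually passed is truncated to 0. */
			printf("u32 divisor = %u\n", (uint32_t)thresh);

		/* div_u64_like(100, thresh) would now divide by zero;
		 * div64_u64() keeps the divisor 64-bit and avoids this. */
		return 0;
	}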
-
Johannes Weiner authored
While investigating hosts with high cgroup memory pressure, Tejun found culprit zombie tasks that were holding on to a lot of memory, had SIGKILL pending, but were stuck in memory.high reclaim. In the past, we used to always force-charge allocations from tasks that were exiting in order to accelerate them dying and freeing up their rss. This changed for memory.max in a4ebf1b6 ("memcg: prohibit unconditional exceeding the limit of dying tasks"); it noted that this can cause (userspace inducible) containment failures, so it added a mandatory reclaim and OOM kill cycle before forcing charges. At the time, memory.high enforcement was handled in the userspace return path, which isn't reached by dying tasks, and so memory.high was still never enforced by dying tasks. When c9afe31e ("memcg: synchronously enforce memory.high for large overcharges") added synchronous reclaim for memory.high, it added unconditional memory.high enforcement for dying tasks as well. The call stack shows that this is the path where the zombie is stuck. We need to accelerate dying tasks getting past memory.high, but we cannot do it quite the same way as we do for memory.max: memory.max is enforced strictly, and tasks aren't allowed to move past it without FIRST reclaiming and OOM killing if necessary. This ensures very small levels of excess. With memory.high, though, enforcement happens lazily after the charge, and OOM killing is never triggered. A lot of concurrent threads could have pushed, or could actively be pushing, the cgroup into excess. The dying task will enter reclaim on every allocation attempt, with little hope of restoring balance. To fix this, skip synchronous memory.high enforcement on dying tasks altogether again. Update memory.high path documentation while at it. [hannes@cmpxchg.org: also handle tasks being killed during the reclaim] Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org Fixes: c9afe31e ("memcg: synchronously enforce memory.high for large overcharges") Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Reported-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Yosry Ahmed <yosryahmed@google.com> Acked-by:
Shakeel Butt <shakeelb@google.com> Acked-by:
Roman Gushchin <roman.gushchin@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
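A hedged sketch of the fix's shape (paraphrased from the description; placement and exact condition in the tree may differ): dying tasks skip synchronous memory.high reclaim entirely, so a pending SIGKILL isn't stalled behind hopeless reclaim of a cgroup that other threads keep pushing back into excess:

	if (unlikely(current->flags & PF_EXITING) ||
	    unlikely(fatal_signal_pending(current)))
		goto out;	/* let the task exit and free its rss */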
-
Sidhartha Kumar authored
has_extra_refcount() makes the assumption that the page cache adds a refcount of 1 and subtracts this in the extra_pins case. Commit a08c7193 ("mm/filemap: remove hugetlb special casing in filemap.c") modified __filemap_add_folio() to call folio_ref_add(folio, nr) for all cases (including hugetlb), where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracting the number of pages rather than 1 (see the sketch below). In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag. After these fixes, the hugetlb hwpoison read selftest passes all cases. Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by:
Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519 Reported-by:
Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by:
Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by:
Miaohe Lin <linmiaohe@huawei.com> Acked-by:
Muchun Song <muchun.song@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> [6.7+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
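A hedged sketch of the adjusted accounting in has_extra_refcount() (paraphrased and simplified; the real function also reports the excess):

	static bool has_extra_refcount(struct page_state *ps, struct page *p,
				       bool extra_pins)
	{
		struct folio *folio = page_folio(p);
		int count = folio_ref_count(folio) - 1;

		if (extra_pins)
			count -= folio_nr_pages(folio);	/* was: count -= 1 */

		return count > 0;
	}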
-
Jan Kara authored
ra_alloc_folio() marks a page that should trigger the next round of async readahead. However, it rounds up the computed index to the order of the page being allocated. This can lead to multiple consecutive pages being marked with the readahead flag. Consider a situation with index == 1, mark == 1, order == 0. We insert an order-0 page at index 1 and mark it. Then we bump order to 1 and index to 2; mark (still == 1) is rounded up to 2, so the page at index 2 is marked as well. Then we bump order to 2 and index is incremented to 4; mark gets rounded to 4, so the page at index 4 is marked as well. The fact that multiple pages get marked within a single readahead window confuses the readahead logic and results in the readahead window being trimmed back to 1. This situation is triggered in particular when the maximum readahead window size is not a power of two (in the observed case it was 768 KB), and as a result sequential read throughput suffers. Fix the problem by rounding 'mark' down instead of up (see the sketch below). Because the index is naturally aligned to 'order', we are guaranteed 'rounded mark' == index iff 'mark' is within the page we are allocating at 'index', and thus exactly one page is marked with the readahead flag as required by the readahead code, and sequential read performance is restored. This effectively reverts part of commit b9ff43dd ("mm/readahead: Fix readahead with large folios"). That commit changed the rounding with the rationale: "... we were setting the readahead flag on the folio which contains the last byte read from the block. This is wrong because we will trigger readahead at the end of the read without waiting to see if a subsequent read is going to use the pages we just read." Although this is true, the fact is this was always the case with read sizes not aligned to folio boundaries, and large folios in the page cache just make the situation more obvious (and frequent). Also, for sequential read workloads it is better to trigger the readahead earlier rather than later. It is true that the difference in the rounding, and thus earlier triggering of the readahead, can result in reading more for semi-random workloads. However, workloads really suffering from this seem to be rare. In particular, I have verified that the workload described in commit b9ff43dd ("mm/readahead: Fix readahead with large folios") of reading random 100k blocks from a file like:

	[reader]
	bs=100k
	rw=randread
	numjobs=1
	size=64g
	runtime=60s

is not impacted by the rounding change and achieves ~70MB/s in both cases. [jack@suse.cz: fix one more place where mark rounding was done as well] Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@suse.cz Fixes: b9ff43dd ("mm/readahead: Fix readahead with large folios") Signed-off-by:
Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Guo Xuenan <guoxuenan@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
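A hedged sketch of the one-line change in ra_alloc_folio() (paraphrased; surrounding code omitted):

	/*
	 * 'index' is naturally aligned to 'order', so rounding 'mark'
	 * down makes "mark == index" true for exactly one folio in the
	 * readahead window, and exactly one folio gets the flag.
	 */
	mark = round_down(mark, 1UL << order);	/* was: round_up() */
	if (index == mark)
		folio_set_readahead(folio);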
-
- Jan 19, 2024
-
-
Yajun Deng authored
After commit 61167ad5 ("mm: pass nid to reserve_bootmem_region()"), the nid of a reserved region is used by init_reserved_page() (with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y) to access the node structure. In many cases the nid of the reserved memory is not set, and this causes a crash. When the nid of a reserved region is not set, fall back to early_pfn_to_nid(), so that the nid of the first_online_node will be passed to init_reserved_page() (see the sketch below). Fixes: 61167ad5 ("mm: pass nid to reserve_bootmem_region()") Signed-off-by:
Yajun Deng <yajun.deng@linux.dev> Link: https://lore.kernel.org/r/20240118061853.2652295-1-yajun.deng@linux.dev [rppt: massaged the commit message] Signed-off-by:
Mike Rapoport (IBM) <rppt@kernel.org>
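A hedged sketch of the fallback (paraphrased), as applied while walking reserved memblock regions: if a region has no nid recorded, derive one from the pfn instead of passing an invalid nid down to init_reserved_page():

	nid = memblock_get_region_node(region);
	if (nid == NUMA_NO_NODE || nid >= MAX_NUMNODES)
		nid = early_pfn_to_nid(PFN_DOWN(region->base));

	reserve_bootmem_region(region->base,
			       region->base + region->size, nid);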
-
- Jan 12, 2024
-
-
Suren Baghdasaryan authored
While testing UFFDIO_MOVE ioctl, syzbot triggered VM_BUG_ON_PAGE caused by a call to PageAnonExclusive() with a huge_zero_page as a parameter. UFFDIO_MOVE does not yet handle zeropages and returns EBUSY when one is encountered. Add an early huge_zero_page check in the PMD move path to avoid this situation. Link: https://lkml.kernel.org/r/20240112013935.1474648-1-surenb@google.com Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI") Reported-by:
<syzbot+705209281e36404998f6@syzkaller.appspotmail.com> Signed-off-by:
Suren Baghdasaryan <surenb@google.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
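A hedged sketch of the early check in the PMD move path (paraphrased, not the verbatim diff): UFFDIO_MOVE does not handle zeropages yet, so a PMD mapping the huge zero page is rejected with -EBUSY before anything calls PageAnonExclusive() on it:

	if (is_huge_zero_pmd(*src_pmd)) {
		err = -EBUSY;
		goto out;
	}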
-
Sumanth Korikkar authored
set_memmap_mode() stores the memmap mode kernel parameter as an integer. However, get_memmap_mode() uses param_get_bool() to fetch the value as a boolean, leading to a potential endianness issue: on big-endian architectures, memmap_on_memory is consistently displayed as 'N' regardless of its actual status (see the demo below). To address this endianness problem, obtain the mode as an integer. This adjustment ensures the proper display of the memmap_on_memory parameter, presenting it as one of the following options: Force, Y, or N. Link: https://lkml.kernel.org/r/20240110140127.241451-1-sumanthk@linux.ibm.com Fixes: 2d1f649c ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks") Signed-off-by:
Sumanth Korikkar <sumanthk@linux.ibm.com> Suggested-by:
Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
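A standalone userspace demonstration of the bug's mechanism: param_get_bool() reads the parameter through a bool pointer, i.e. it looks at a single byte of what is really a multi-byte int:

	#include <stdbool.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		int mode = 1;	/* e.g. the enable mode, stored as an int */
		bool as_bool;

		memcpy(&as_bool, &mode, sizeof(as_bool));	/* first byte only */

		/* Little-endian: first byte is 0x01 -> prints Y.
		 * Big-endian: first byte is 0x00 -> always prints N. */
		printf("%c\n", as_bool ? 'Y' : 'N');
		return 0;
	}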
-
Ma Wupeng authored
If the system has no mirrored memory, or uses crashkernel.high while kernelcore=mirror is enabled on the command line, then the crash kernel will have limited mirrored memory, and this usually leads to OOM. To solve this problem, disable the mirror feature when running in the crash kernel. Link: https://lkml.kernel.org/r/20240109041536.3903042-1-mawupeng1@huawei.com Signed-off-by:
Ma Wupeng <mawupeng1@huawei.com> Acked-by:
Mike Rapoport (IBM) <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
With commit 63b85ac5 ("kasan: stop leaking stack trace handles"), KASAN zeroes out alloc meta when an object is freed. The zeroed out data purposefully includes alloc and auxiliary stack traces but also accidentally includes aux_lock. As aux_lock is only initialized for each object slot during slab creation, when the freed slot is reallocated, saving auxiliary stack traces for the new object leads to lockdep reports when taking the zeroed out aux_lock. Arguably, we could reinitialize aux_lock when the object is reallocated, but a simpler solution is to avoid zeroing out aux_lock when an object gets freed. Link: https://lkml.kernel.org/r/20240109221234.90929-1-andrey.konovalov@linux.dev Fixes: 63b85ac5 ("kasan: stop leaking stack trace handles") Signed-off-by:
Andrey Konovalov <andreyknvl@gmail.com> Reported-by:
Paul E. McKenney <paulmck@kernel.org> Closes: https://lore.kernel.org/linux-next/5cc0f83c-e1d6-45c5-be89-9b86746fe731@paulmck-laptop/ Reviewed-by:
Marco Elver <elver@google.com> Tested-by:
Paul E. McKenney <paulmck@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
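A hedged sketch of the fix's shape (field names assumed from Generic KASAN's metadata layout, not verified against the tree; paraphrased): on free, invalidate only the stack trace fields and leave aux_lock alone, since the lock is initialized once per slab slot at cache creation:

	memset(&meta->alloc_track, 0, sizeof(meta->alloc_track));
	memset(meta->aux_stack, 0, sizeof(meta->aux_stack));
	/* was: memset(meta, 0, sizeof(*meta)); -- wiped aux_lock too */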
-
- Jan 08, 2024
-
-
Kirill A. Shutemov authored
commit 23baf831 ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kirill A. Shutemov authored
NR_PAGE_ORDERS defines the number of page orders supported by the page allocator, ranging from 0 to MAX_ORDER (MAX_ORDER + 1 orders in total). NR_PAGE_ORDERS assists in defining arrays of page orders and allows for more natural iteration over them (see the sketch below). [kirill.shutemov@linux.intel.com: fixup for kerneldoc warning] Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
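A hedged sketch of the intended usage (names follow the adjacent rename to MAX_PAGE_ORDER; the array and loop are illustrative, not a specific call site):

	#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)

	unsigned long nr_free[NR_PAGE_ORDERS];
	int order;

	/* Iterate over every supported order without the off-by-one
	 * hazard of spelling "order <= MAX_PAGE_ORDER" at each site. */
	for (order = 0; order < NR_PAGE_ORDERS; order++)
		nr_free[order] = zone->free_area[order].nr_free;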
-
- Jan 05, 2024
-
-
Li Zhijian authored
Demotion can work well without CONFIG_NUMA_BALANCING. But the commit 23e9f013 ("mm/vmstat: move pgdemote_* to per-node stats") wrongly hid it behind CONFIG_NUMA_BALANCING. Fix it by moving them out of CONFIG_NUMA_BALANCING. Link: https://lkml.kernel.org/r/20231229022651.3229174-1-lizhijian@fujitsu.com Fixes: 23e9f013 ("mm/vmstat: move pgdemote_* to per-node stats") Signed-off-by:
Li Zhijian <lizhijian@fujitsu.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Barry Song authored
In the case where the "compressed" data is larger than the original data, it is better to return -ENOSPC, which helps zswap record a poor compression rather than an invalid request. Then we get friendlier counting for reject_compress_poor in debugfs.

	bool zswap_store(struct folio *folio)
	{
		...
		ret = zpool_malloc(zpool, dlen, gfp, &handle);
		if (ret == -ENOSPC) {
			zswap_reject_compress_poor++;
			goto put_dstmem;
		}
		if (ret) {
			zswap_reject_alloc_fail++;
			goto put_dstmem;
		}
		...
	}

Also, zbud_alloc() and z3fold_alloc() return -ENOSPC in the same case, e.g.

	static int z3fold_alloc(struct z3fold_pool *pool, size_t size,
				gfp_t gfp, unsigned long *handle)
	{
		...
		if (!size || (gfp & __GFP_HIGHMEM))
			return -EINVAL;
		if (size > PAGE_SIZE)
			return -ENOSPC;
		...
	}

Link: https://lkml.kernel.org/r/20231228061802.25280-1-v-songbaohua@oppo.com Signed-off-by:
Barry Song <v-songbaohua@oppo.com> Reviewed-by:
Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by:
Nhat Pham <nphamcs@gmail.com> Acked-by:
Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
There are no more callers of __mod_lruvec_page_state(), so convert the implementation to __lruvec_stat_mod_folio(), removing two calls to compound_head() (one explicit, one hidden inside page_memcg()). Link: https://lkml.kernel.org/r/20231228085748.1083901-7-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Zi Yan <ziy@nvidia.com> Acked-by:
Shakeel Butt <shakeelb@google.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
This function is not yet fully converted to the folio API, but this removes a few uses of old APIs. Link: https://lkml.kernel.org/r/20231228085748.1083901-6-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Zi Yan <ziy@nvidia.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Mirror the code in free_large_kmalloc() and alloc_pages_node() and use a folio directly. Avoid the use of folio_alloc() as that will set up an rmappable folio which we do not want here. Link: https://lkml.kernel.org/r/20231228085748.1083901-5-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by:
David Rientjes <rientjes@google.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Save a few calls to compound_head() by using the folio APIs directly. Link: https://lkml.kernel.org/r/20231228085748.1083901-4-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by:
David Rientjes <rientjes@google.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
For no apparent reason, we were open-coding alloc_pages_node() in this function. Link: https://lkml.kernel.org/r/20231228085748.1083901-3-willy@infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by:
David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Shakeel Butt authored
One of our workloads (Postgres 14 + sysbench OLTP) regressed on a newer upstream kernel, and on further investigation it seems like the cause is the always-synchronous rstat flush in count_shadow_nodes() added by commit f82e6bf9 ("mm: memcg: use rstat for non-hierarchical stats"). On further inspection, it seems we don't really need accurate stats in this function, as it was already approximating the number of appropriate shadow entries to keep for maintaining the refault information. Since there is already a 2-second periodic rstat flush, we don't need exact stats here. Let's ratelimit the rstat flush in this code path (see the sketch below). Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com Fixes: f82e6bf9 ("mm: memcg: use rstat for non-hierarchical stats") Signed-off-by:
Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
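A hedged sketch of the change in count_shadow_nodes() (paraphrased; surrounding code omitted): the shadow-entry target is an approximation anyway, so the periodic (~2s) flush is accurate enough:

	/* was: mem_cgroup_flush_stats();  -- synchronous on every call */
	mem_cgroup_flush_stats_ratelimited();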
-
Andrey Konovalov authored
Commit 773688a6 ("kasan: use stack_depot_put for Generic mode") added support for stack trace eviction for Generic KASAN. However, that commit didn't evict stack traces when the object is not put into quarantine. As a result, some stack traces are never evicted from the stack depot. In addition, with the "kasan: save mempool stack traces" series, the free stack traces for mempool objects are also not properly evicted from the stack depot. Fix both issues by:

1. Evicting all stack traces when an object is freed, if it was not put into quarantine;

2. Always evicting an existing free stack trace when a new one is saved.

Also do a few related clean-ups:

- Do not zero out the free track when initializing/invalidating free meta: set a value in shadow memory instead;
- Rename KASAN_SLAB_FREETRACK to KASAN_SLAB_FREE_META;
- Drop the kasan_init_cache_meta function as it's not used by KASAN;
- Add comments for the kasan_alloc_meta and kasan_free_meta structs.

[akpm@linux-foundation.org: make release_free_meta() and release_alloc_meta() static] Link: https://lkml.kernel.org/r/20231226225121.235865-1-andrey.konovalov@linux.dev Fixes: 773688a6 ("kasan: use stack_depot_put for Generic mode") Signed-off-by:
Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kinsey Ho authored
Improve code readability by removing CONFIG_TRANSPARENT_HUGEPAGE, since the compiler should be able to automatically optimize out the code that promotes THPs during page table walks. No functional changes. Link: https://lkml.kernel.org/r/20231227141205.2200125-6-kinseyho@google.com Signed-off-by:
Kinsey Ho <kinseyho@google.com> Co-developed-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by:
Donet Tom <donettom@linux.vnet.ibm.com> Acked-by:
Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kinsey Ho authored
Remove CONFIG_MEMCG in a refactoring to improve code readability at the cost of a few bytes in struct lru_gen_folio per node when CONFIG_MEMCG=n. Link: https://lkml.kernel.org/r/20231227141205.2200125-4-kinseyho@google.com Signed-off-by:
Kinsey Ho <kinseyho@google.com> Co-developed-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by:
Donet Tom <donettom@linux.vnet.ibm.com> Acked-by:
Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kinsey Ho authored
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that walks page tables to promote pages into the youngest generation will not be built. Also improves code readability by adding two helper functions get_mm_state() and get_next_mm(). Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com Signed-off-by:
Kinsey Ho <kinseyho@google.com> Co-developed-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by:
Donet Tom <donettom@linux.vnet.ibm.com> Acked-by:
Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Suren Baghdasaryan authored
While testing the split PMD path with lockdep enabled, I got an "Invalid wait context" error caused by split_huge_page_to_list() trying to lock anon_vma->rwsem while inside an RCU read section. The issue is due to move_pages_pte() calling split_folio() under the RCU read lock. Fix this by unmapping the PTEs and exiting the RCU read section before splitting the folio, and then retrying. The same retry pattern is used when locking the folio or anon_vma in this function. After splitting the large folio, we unlock and release it because after the split the old folio might not be the one that contains the src_addr. Link: https://lkml.kernel.org/r/20240102233256.1077959-1-surenb@google.com Fixes: adef4406 ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by:
Suren Baghdasaryan <surenb@google.com> Reviewed-by:
Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Tetsuo Handa authored
syzbot is reporting uninit-value at shrinker_alloc(): commit 307becec ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") assumed that ->unit was allocated with __GFP_ZERO, but forgot to replace kvmalloc_node() in expand_one_shrinker_info() with kvzalloc_node() (see the sketch below). Link: https://lkml.kernel.org/r/9226cc0a-10e0-4489-80c5-58c3b5b4359c@I-love.SAKURA.ne.jp Reported-by:
syzbot <syzbot+1e0ed05798af62917464@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=1e0ed05798af62917464 Fixes: 307becec ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by:
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by:
Qi Zheng <zhengqi.arch@bytedance.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
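A hedged sketch of the one-line fix (argument list paraphrased, not the verbatim diff):

	/* was: kvmalloc_node(...), leaving ->unit uninitialized */
	new = kvzalloc_node(sizeof(*new) + new_size, GFP_KERNEL, nid);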
-
- Dec 30, 2023
-
-
Nhat Pham authored
During our experiments with zswap, we sometimes observe swap IOs due to occasional zswap store failures and writebacks-to-swap. These swapping IOs prevent many users who cannot tolerate swapping from adopting zswap to save memory and improve performance where possible. This patch adds the option to disable this behavior entirely: do not write back to the backing swap device when a zswap store attempt fails, and do not write pages in the zswap pool back to the backing swap device (both when the pool is full, and when the new zswap shrinker is called). This new behavior can be opted in/out on a per-cgroup basis via a new cgroup file. By default, writeback to the swap device is enabled, which is the previous behavior. Initially, writeback is enabled for the root cgroup, and a newly created cgroup will inherit the current setting of its parent. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be stored in the zswap pool (which itself consumes swap space in its current form). This patch should be applied on top of the zswap shrinker series: https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/ as it also disables the zswap shrinker, a major source of zswap writebacks. For the most part, this feature is motivated by internal parties who have already established their opinions regarding swapping - the workloads that are highly sensitive to IO, and especially those who are using servers with really slow disk performance (for instance, massive but slow HDDs). For these folks, it's impossible to convince them to even entertain zswap if swapping also comes as a packaged deal. Writeback disabling is quite a useful feature in these situations - on a mixed-workload deployment, they can disable writeback for the more IO-sensitive workloads, and enable writeback for other background workloads. For instance, on a server with an HDD, I allocate memory and populate it with random values (so that zswap store will always fail), and specify memory.high low enough to trigger reclaim. The time it takes to allocate the memory and just read through it a couple of times (doing silly things like computing the values' average etc.):

	zswap.writeback disabled:
	real 0m30.537s
	user 0m23.687s
	sys  0m6.637s
	0 pages swapped in
	0 pages swapped out

	zswap.writeback enabled:
	real 0m45.061s
	user 0m24.310s
	sys  0m8.892s
	712686 pages swapped in
	461093 pages swapped out

(the swapped in/out counts are from vmstat -s). [nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency] Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com Signed-off-by:
Nhat Pham <nphamcs@gmail.com> Suggested-by:
Johannes Weiner <hannes@cmpxchg.org> Reviewed-by:
Yosry Ahmed <yosryahmed@google.com> Acked-by:
Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Heidelberg <david@ixit.cz> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
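A hedged sketch of how the new setting gates the writeback paths; mem_cgroup_zswap_writeback_enabled() is the helper name this series describes adding, and the call site below is an assumption paraphrased from the text above, not the verbatim diff:

	/* Both shrinker-driven and pool-full writeback consult the
	 * folio's cgroup before touching the backing swap device. */
	if (!mem_cgroup_zswap_writeback_enabled(memcg))
		return -EINVAL;	/* skip writeback for this cgroup */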
-
- Dec 29, 2023
-
-
Yuntao Wang authored
When detecting an error, the current code uses kexec_dprintk() to output the log message. This is not quite appropriate, as kexec_dprintk() is mainly used for outputting debugging messages rather than error messages. Replace kexec_dprintk() with pr_err(). This also aligns the output method for this error log with the other error logs in this function. Additionally, the last return statement in set_page_address() is unnecessary; remove it. Link: https://lkml.kernel.org/r/20231220030124.149160-1-ytcoode@gmail.com Signed-off-by:
Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Tanzir Hasan authored
asm-generic/mman-common.h can be replaced by linux/mman.h and the file will still build correctly. It is an asm-generic file which should be avoided if possible. Link: https://lkml.kernel.org/r/20231221-asmgenericvaddr-v1-1-742b170c914e@google.com Fixes: 6dea8add ("mm/damon/vaddr: support DAMON-based Operation Schemes") Signed-off-by:
Tanzir Hasan <tanzirh@google.com> Suggested-by:
Al Viro <viro@zeniv.linux.org.uk> Reviewed-by:
SeongJae Park <sj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
IA64 has gone with commit cf8e8658 ("arch: Remove Itanium (IA-64) architecture"), remove unnecessary ia64 special mm code and comment too. Link: https://lkml.kernel.org/r/20231222070203.2966980-1-wangkefeng.wang@huawei.com Signed-off-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by:
Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Let's fixup one remaining comment. Note that the only trace remaining of the old rmap interface is in an example in Documentation/trace/ftrace.rst, that we'll just leave alone. Link: https://lkml.kernel.org/r/20231220224504.646757-41-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
We removed all "bool compound" and RMAP_COMPOUND parameters. Let's remove the remaining "compound" terminology by making COMPOUND_MAPPED match the "folio->_entire_mapcount" terminology, renaming it to ENTIRELY_MAPPED. ENTIRELY_MAPPED is only used when the whole folio is mapped using a single page table entry (e.g., a single PMD mapping a PMD-sized THP). For now, we don't support mapping any THP bigger than that, so ENTIRELY_MAPPED only applies to PMD-mapped PMD-sized THP only. Link: https://lkml.kernel.org/r/20231220224504.646757-40-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Let's convert it like we converted all the other rmap functions. Don't introduce folio_try_share_anon_rmap_ptes() for now, as we don't have a user that wants rmap batching in sight. Pretty easy to add later. All users are easy to convert -- only ksm.c doesn't use folios yet but that is left for future work -- so let's just do it in a single shot. While at it, turn the BUG_ON into a WARN_ON_ONCE. Note that page_try_share_anon_rmap() so far didn't care about pte/pmd mappings (no compound parameter). We're changing that so we can perform better sanity checks and make the code actually more readable/consistent. For example, __folio_rmap_sanity_checks() will make sure that a PMD range actually falls completely into the folio. Link: https://lkml.kernel.org/r/20231220224504.646757-39-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Let's convert copy_nonpresent_pte(). While at it, perform some more folio conversion. Link: https://lkml.kernel.org/r/20231220224504.646757-37-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Let's convert copy_huge_pmd() and fixup the comment in copy_huge_pud(). While at it, perform more folio conversion in copy_huge_pmd(). Link: https://lkml.kernel.org/r/20231220224504.646757-36-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Let's convert page_dup_file_rmap() like the other rmap functions. As there is only a single caller, convert that single caller right away and remove page_dup_file_rmap(). Add folio_dup_file_rmap_ptes() right away, we want to perform rmap baching during fork() soon. Link: https://lkml.kernel.org/r/20231220224504.646757-34-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-