Commit f346b0be authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

 - large KASAN update to use arm's "software tag-based mode"

 - a few misc things

 - sh updates

 - ocfs2 updates

 - just about all of MM

* emailed patches from Andrew Morton <>: (167 commits)
  kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
  memcg, oom: notify on oom killer invocation from the charge path
  mm, swap: fix swapoff with KSM pages
  include/linux/gfp.h: fix typo
  mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  memory_hotplug: add missing newlines to debugging output
  mm: remove __hugepage_set_anon_rmap()
  include/linux/vmstat.h: remove unused page state adjustment macro
  mm/page_alloc.c: allow error injection
  mm: migrate: drop unused argument of migrate_page_move_mapping()
  blkdev: avoid migration stalls for blkdev pages
  mm: migrate: provide buffer_migrate_page_norefs()
  mm: migrate: move migrate_page_lock_buffers()
  mm: migrate: lock buffers before migrate_page_move_mapping()
  mm: migration: factor out code to compute expected number of page references
  mm, page_alloc: enable pcpu_drain with zone capability
  kmemleak: add config to select auto scan
  mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
parents 00d59fde 0f4991e8
......@@ -98,3 +98,35 @@ Description:
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
What: /sys/block/zram<id>/idle
Date: November 2018
Contact: Minchan Kim <>
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram<id>/block_state
What: /sys/block/zram<id>/writeback
Date: November 2018
Contact: Minchan Kim <>
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
What: /sys/block/zram<id>/bd_stat
Date: November 2018
Contact: Minchan Kim <>
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
What: /sys/block/zram<id>/writeback_limit
Date: November 2018
Contact: Minchan Kim <>
The writeback_limit file is read-write and specifies the maximum
amount of writeback ZRAM can do. The limit could be changed
in run time and "0" means disable the limit.
No limit is the initial state.
......@@ -164,11 +164,14 @@ reset WO trigger device reset
mem_used_max WO reset the `mem_used_max' counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can use
to store the compressed data
writeback_limit WO specifies the maximum amount of write IO zram can
write out to backing device as 4KB unit
max_comp_streams RW the number of possible concurrent compress operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle
User space is advised to use the following files to read the device statistics.
......@@ -220,6 +223,17 @@ line of text and contains the following stats separated by whitespace:
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
bd_count size of data written in backing device.
Unit: 4K bytes
bd_reads the number of reads from backing device
Unit: 4K bytes
bd_writes the number of writes to backing device
Unit: 4K bytes
9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
......@@ -237,11 +251,60 @@ line of text and contains the following stats separated by whitespace:
= writeback
With incompressible pages, there is no memory saving with zram.
Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
User should set up backing device via /sys/block/zramX/backing_dev
before disksize setting.
To use the feature, admin should set up backing device via
"echo /dev/sda5 > /sys/block/zramX/backing_dev"
before disksize setting. It supports only partition at this moment.
If admin want to use incompressible page writeback, they could do via
"echo huge > /sys/block/zramX/write"
To use idle page writeback, first, user need to declare zram pages
as idle.
"echo all > /sys/block/zramX/idle"
From now on, any pages on zram are idle pages. The idle mark
will be removed until someone request access of the block.
IOW, unless there is access request, those pages are still idle pages.
Admin can request writeback of those idle pages at right timing via
"echo idle > /sys/block/zramX/writeback"
With the command, zram writeback idle pages from memory to the storage.
If there are lots of write IO with flash device, potentially, it has
flash wearout problem so that admin needs to design write limitation
to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit".
The "writeback_limit"'s default value is 0 so that it doesn't limit
any writeback. If admin want to measure writeback count in a certain
period, he could know it via /sys/block/zram0/bd_stat's 3rd column.
If admin want to limit writeback as per-day 400M, he could do it
like below.
echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
If admin want to allow further write again, he could do it like below
echo 0 > /sys/block/zram0/writeback_limit
If admin want to see remaining writeback budget since he set,
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram(e.g.,
system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in next setting is user's job.
= memory tracking
......@@ -251,16 +314,17 @@ pages of the process with*pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows,
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
First column is zram's block index.
Second column is access time since the system was booted
Third column is state of the block.
(s: same page
w: written page to backing store
h: huge page)
h: huge page
i: idle page)
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
This diff is collapsed.
......@@ -182,6 +182,7 @@ read the file /proc/PID/status:
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
......@@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
THP_enabled process is allowed to use THP (returns 0 when
PR_SET_THP_DISABLE is set on the process
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
......@@ -425,6 +428,7 @@ SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me dw
the first of these lines shows the same information as is displayed for the
......@@ -462,6 +466,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
true, 0 otherwise.
"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
......@@ -496,7 +502,9 @@ manner. The codes are the following:
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
be vanished or the reverse -- new added.
be vanished or the reverse -- new added. Interpretation of their meaning
might change in future as well. So each consumer of these flags has to
follow each specific kernel version for the exact semantic.
This file is only present if the CONFIG_MMU kernel configuration option is
......@@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode
......@@ -856,6 +857,26 @@ ten times more freeable objects than there are.
This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.
To make it sensible with respect to the watermark_scale_factor parameter,
the unit is in fractions of 10,000. The default value of 15,000 means
that up to 150% of the high watermark will be reclaimed in the event of
a pageblock being mixed due to fragmentation. The level of reclaim is
determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblocks
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.
This factor controls the aggressiveness of kswapd. It defines the
......@@ -110,6 +110,7 @@ config ARM64
select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
......@@ -101,10 +101,19 @@ else
TEXT_OFFSET := 0x00080000
# - (1 << (64 - KASAN_SHADOW_SCALE_SHIFT))
# in 32-bit arithmetic
KASAN_SHADOW_OFFSET := $(shell printf "0x%08x00000000\n" $$(( \
(0xffffffff & (-1 << ($(CONFIG_ARM64_VA_BITS) - 32))) \
......@@ -16,10 +16,12 @@
* 0x400: for dynamic BRK instruction
* 0x401: for compile time BRK instruction
* 0x800: kernel-mode BUG() and WARN() traps
* 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)
#define FAULT_BRK_IMM 0x100
#define KGDB_DYN_DBG_BRK_IMM 0x400
#define BUG_BRK_IMM 0x800
#define KASAN_BRK_IMM 0x900
......@@ -4,12 +4,16 @@
#ifndef __ASSEMBLY__
#include <linux/linkage.h>
#include <asm/memory.h>
#include <asm/pgtable-types.h>
#define arch_kasan_set_tag(addr, tag) __tag_set(addr, tag)
#define arch_kasan_reset_tag(addr) __tag_reset(addr)
#define arch_kasan_get_tag(addr) __tag_get(addr)
* KASAN_SHADOW_START: beginning of the kernel virtual addresses.
* KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
......@@ -74,13 +74,11 @@
* KASAN requires 1/8th of the kernel virtual address space for the shadow
* region. KASAN can bloat the stack significantly, so double the (minimum)
* stack size when KASAN is in use, and then double it again if KASAN_EXTRA is
* on.
* Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual
* address space for the shadow region respectively. They can bloat the stack
* significantly, so double the (minimum) stack size when they are in use.
......@@ -220,6 +218,26 @@ extern u64 vabits_user;
* When dealing with data aborts, watchpoints, or instruction traps we may end
* up with a tagged userland pointer. Clear the tag to get a sane pointer to
* pass on to access_ok(), for instance.
#define untagged_addr(addr) \
((__typeof__(addr))sign_extend64((u64)(addr), 55))
#define __tag_shifted(tag) ((u64)(tag) << 56)
#define __tag_set(addr, tag) (__typeof__(addr))( \
((u64)(addr) & ~__tag_shifted(0xff)) | __tag_shifted(tag))
#define __tag_reset(addr) untagged_addr(addr)
#define __tag_get(addr) (__u8)((u64)(addr) >> 56)
#define __tag_set(addr, tag) (addr)
#define __tag_reset(addr) (addr)
#define __tag_get(addr) 0
* Physical vs virtual RAM address space conversion. These are
* private definitions which should NOT be used outside memory.h
......@@ -303,7 +321,13 @@ static inline void *phys_to_virt(phys_addr_t x)
#define __virt_to_pgoff(kaddr) (((u64)(kaddr) & ~PAGE_OFFSET) / PAGE_SIZE * sizeof(struct page))
#define __page_to_voff(kaddr) (((u64)(kaddr) & ~VMEMMAP_START) * PAGE_SIZE / sizeof(struct page))
#define page_to_virt(page) ((void *)((__page_to_voff(page)) | PAGE_OFFSET))
#define page_to_virt(page) ({ \
unsigned long __addr = \
((__page_to_voff(page)) | PAGE_OFFSET); \
__addr = __tag_set(__addr, page_kasan_tag(page)); \
((void *)__addr); \
#define virt_to_page(vaddr) ((struct page *)((__virt_to_pgoff(vaddr)) | VMEMMAP_START))
#define _virt_addr_valid(kaddr) pfn_valid((((u64)(kaddr) & ~PAGE_OFFSET) \
......@@ -311,9 +335,10 @@ static inline void *phys_to_virt(phys_addr_t x)
#define _virt_addr_is_linear(kaddr) (((u64)(kaddr)) >= PAGE_OFFSET)
#define virt_addr_valid(kaddr) (_virt_addr_is_linear(kaddr) && \
#define _virt_addr_is_linear(kaddr) \
(__tag_reset((u64)(kaddr)) >= PAGE_OFFSET)
#define virt_addr_valid(kaddr) \
(_virt_addr_is_linear(kaddr) && _virt_addr_valid(kaddr))
#include <asm-generic/memory_model.h>
......@@ -299,6 +299,7 @@
#define TCR_A1 (UL(1) << 22)
#define TCR_ASID16 (UL(1) << 36)
#define TCR_TBI0 (UL(1) << 37)
#define TCR_TBI1 (UL(1) << 38)
#define TCR_HA (UL(1) << 39)
#define TCR_HD (UL(1) << 40)
#define TCR_NFD1 (UL(1) << 54)
......@@ -95,13 +95,6 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si
return ret;
* When dealing with data aborts, watchpoints, or instruction traps we may end
* up with a tagged userland pointer. Clear the tag to get a sane pointer to
* pass on to access_ok(), for instance.
#define untagged_addr(addr) sign_extend64(addr, 55)
#define access_ok(type, addr, size) __range_ok(addr, size)
#define user_addr_max get_fs
......@@ -35,6 +35,7 @@
#include <linux/sizes.h>
#include <linux/syscalls.h>
#include <linux/mm_types.h>
#include <linux/kasan.h>
#include <asm/atomic.h>
#include <asm/bug.h>
......@@ -969,6 +970,58 @@ static struct break_hook bug_break_hook = {
.fn = bug_handler,
#define KASAN_ESR_RECOVER 0x20
#define KASAN_ESR_WRITE 0x10
#define KASAN_ESR_SIZE_MASK 0x0f
#define KASAN_ESR_SIZE(esr) (1 << ((esr) & KASAN_ESR_SIZE_MASK))
static int kasan_handler(struct pt_regs *regs, unsigned int esr)
bool recover = esr & KASAN_ESR_RECOVER;
bool write = esr & KASAN_ESR_WRITE;
size_t size = KASAN_ESR_SIZE(esr);
u64 addr = regs->regs[0];
u64 pc = regs->pc;
if (user_mode(regs))
kasan_report(addr, size, write, pc);
* The instrumentation allows to control whether we can proceed after
* a crash was detected. This is done by passing the -recover flag to
* the compiler. Disabling recovery allows to generate more compact
* code.
* Unfortunately disabling recovery doesn't work for the kernel right
* now. KASAN reporting is disabled in some contexts (for example when
* the allocator accesses slab object metadata; this is controlled by
* current->kasan_depth). All these accesses are detected by the tool,
* even though the reports for them are not printed.
* This is something that might be fixed at some point in the future.
if (!recover)
die("Oops - KASAN", regs, 0);
/* If thread survives, skip over the brk instruction and continue: */
arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
#define KASAN_ESR_VAL (0xf2000000 | KASAN_BRK_IMM)
#define KASAN_ESR_MASK 0xffffff00
static struct break_hook kasan_break_hook = {
.esr_val = KASAN_ESR_VAL,
.esr_mask = KASAN_ESR_MASK,
.fn = kasan_handler,
* Initial handler for AArch64 BRK exceptions
* This handler only used until debug_traps_init().
......@@ -976,6 +1029,10 @@ static struct break_hook bug_break_hook = {
int __init early_brk64(unsigned long addr, unsigned int esr,
struct pt_regs *regs)
return kasan_handler(regs, esr) != DBG_HOOK_HANDLED;
return bug_handler(regs, esr) != DBG_HOOK_HANDLED;
......@@ -983,4 +1040,7 @@ int __init early_brk64(unsigned long addr, unsigned int esr,
void __init trap_init(void)
......@@ -40,6 +40,7 @@
#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
#include <asm/esr.h>