  1. Oct 17, 2022
    • sched: Introduce struct balance_callback to avoid CFI mismatches · 8e5bad7d
      Kees Cook authored
      
      Introduce a distinct struct balance_callback instead of performing function
      pointer casting, which will trip CFI. This avoids warnings as found by
      Clang's future -Wcast-function-type-strict option:
      
      In file included from kernel/sched/core.c:84:
      kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict]
              head->func = (void (*)(struct callback_head *))func;
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      No binary differences result from this change.
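
      As a hedged illustration of the pattern described above, a minimal sketch
      with the struct shape inferred from this message (the actual kernel
      definitions may differ in detail):

      struct rq;

      struct balance_callback {
              struct balance_callback *next;
              void (*func)(struct rq *rq);
      };

      static inline void
      queue_balance_callback(struct rq *rq, struct balance_callback *head,
                             void (*func)(struct rq *rq))
      {
              /* No cast through void (*)(struct callback_head *), so CFI sees
               * a correctly typed function pointer. */
              head->func = func;
              /* Linking head into rq's callback list is omitted in this sketch. */
      }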
      
      This patch is a cleanup based on Brad Spengler/PaX Team's modifications
      to sched code in their last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code
      are mine and don't reflect the original grsecurity/PaX code.
      
      Reported-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Link: https://github.com/ClangBuiltLinux/linux/issues/1724
      Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org
    • sched/core: Fix comparison in sched_group_cookie_match() · e705968d
      Lin Shengwang authored
      
      In commit 97886d9d ("sched: Migration changes for core scheduling"),
      sched_group_cookie_match() was added to help determine if a cookie
      matches the core state.
      
      However, while it iterates the SMT group, it fails to actually use the
      RQ of each iterated CPU; use cpu_rq(cpu) instead of rq to fix this.
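
      A hedged sketch of the fixed loop (simplified; helper names follow the
      existing scheduler code, not necessarily this exact hunk):

      for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
              if (sched_core_cookie_match(cpu_rq(cpu), p))    /* was: rq */
                      return true;
      }
      return false;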
      
      Fixes: 97886d9d ("sched: Migration changes for core scheduling")
      Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20221008022709.642-1-linshengwang1@huawei.com
    • bpf: Fix sample_flags for bpf_perf_event_output · 21da7472
      Sumanth Korikkar authored
      
      * Raw data is also filled by bpf_perf_event_output.
      * Add sample_flags to indicate raw data.
      * This eliminates the segfaults as shown below:
        Run ./samples/bpf/trace_output
        BUG pid 9 cookie 1001000000004 sized 4
        BUG pid 9 cookie 1001000000004 sized 4
        BUG pid 9 cookie 1001000000004 sized 4
        Segmentation fault (core dumped)
      
      Fixes: 838d9bb6 ("perf: Use sample_flags for raw_data")
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/r/20221007081327.1047552-1-sumanthk@linux.ibm.com
    • perf: Fix missing SIGTRAPs · ca6c2132
      Peter Zijlstra authored
      
      Marco reported:
      
      Due to the implementation of how SIGTRAPs are delivered if
      perf_event_attr::sigtrap is set, we've noticed 3 issues:
      
        1. Missing SIGTRAP due to a race with event_sched_out() (more
           details below).
      
        2. Hardware PMU events being disabled due to returning 1 from
           perf_event_overflow(). The only way to re-enable the event is
           for user space to first "properly" disable the event and then
           re-enable it.
      
        3. The inability to automatically disable an event after a
           specified number of overflows via PERF_EVENT_IOC_REFRESH.
      
      The worst of the 3 issues is problem (1), which occurs when a
      pending_disable is "consumed" by a racing event_sched_out(), observed
      as follows:
      
      		CPU0			|	CPU1
      	--------------------------------+---------------------------
      	__perf_event_overflow()		|
      	 perf_event_disable_inatomic()	|
      	  pending_disable = CPU0	| ...
      					| _perf_event_enable()
      					|  event_function_call()
      					|   task_function_call()
      					|    /* sends IPI to CPU0 */
      	<IPI>				| ...
      	 __perf_event_enable()		+---------------------------
      	  ctx_resched()
      	   task_ctx_sched_out()
      	    ctx_sched_out()
      	     group_sched_out()
      	      event_sched_out()
      	       pending_disable = -1
      	</IPI>
      	<IRQ-work>
      	 perf_pending_event()
      	  perf_pending_event_disable()
      	   /* Fails to send SIGTRAP because no pending_disable! */
      	</IRQ-work>
      
      In the above case, not only is that particular SIGTRAP missed, but also
      all future SIGTRAPs because 'event_limit' is not reset back to 1.
      
      To fix this, rework the pending delivery of SIGTRAP via IRQ-work by
      introducing a separate 'pending_sigtrap', no longer using 'event_limit'
      and 'pending_disable' for its delivery.
      
      Additionally; and different to Marco's proposed patch:
      
       - recognise that pending_disable effectively duplicates oncpu for
         the case where it is set. As such, change the irq_work handler to
         use ->oncpu to target the event and use pending_* as boolean toggles.
      
       - observe that SIGTRAP targets the ctx->task, so the context switch
         optimization that carries contexts between tasks is invalid. If
         the irq_work were delayed enough to hit after a context switch the
         SIGTRAP would be delivered to the wrong task.
      
       - observe that if the event gets scheduled out
         (rotation/migration/context-switch/...) the irq-work would be
         insufficient to deliver the SIGTRAP when the event gets scheduled
         back in (the irq-work might still be pending on the old CPU).
      
         Therefore have event_sched_out() convert the pending sigtrap into a
         task_work which will deliver the signal at return_to_user.
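
      A hedged sketch of that last step in event_sched_out(); field names here
      are illustrative assumptions based on the description, not the exact patch:

      if (event->pending_sigtrap) {
              /* The irq_work may never run again on this CPU for this event;
               * hand the pending SIGTRAP to a task_work so it is delivered
               * when the target task returns to user space. */
              event->pending_sigtrap = 0;
              task_work_add(current, &event->pending_task, TWA_RESUME);
      }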
      
      Fixes: 97ba62b2 ("perf: Add support for SIGTRAP on perf events")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Debugged-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Marco Elver <elver@google.com>
      Debugged-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Marco Elver <elver@google.com>
      Tested-by: Marco Elver <elver@google.com>
  2. Oct 11, 2022
    • treewide: use get_random_bytes() when possible · 197173db
      Jason A. Donenfeld authored
      
      The prandom_bytes() function has been a deprecated inline wrapper around
      get_random_bytes() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. This was done as a basic find and replace.
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use get_random_u32() when possible · a251c17a
      Jason A. Donenfeld authored
      
      The prandom_u32() function has been a deprecated inline wrapper around
      get_random_u32() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. The same also applies to get_random_int(), which is
      just a wrapper around get_random_u32(). This was done as a basic find
      and replace.
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Acked-by: Helge Deller <deller@gmx.de> # for parisc
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Jason A. Donenfeld authored
      
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
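
      For context, a standalone illustration of the multiply-shift bound that
      prandom_u32_max() is built on (userspace C, with a toy xorshift generator
      standing in for the kernel RNG):

      #include <stdint.h>
      #include <stdio.h>

      /* Toy PRNG standing in for get_random_u32(); not for real use. */
      static uint32_t xorshift32(uint32_t *state)
      {
              *state ^= *state << 13;
              *state ^= *state >> 17;
              *state ^= *state << 5;
              return *state;
      }

      /* Map a full 32-bit random value into [0, range) without a division:
       * take the high 32 bits of range * r. */
      static uint32_t bounded(uint32_t range, uint32_t r)
      {
              return (uint32_t)(((uint64_t)range * r) >> 32);
      }

      int main(void)
      {
              uint32_t state = 0x12345678;

              for (int i = 0; i < 5; i++)
                      printf("%u\n", bounded(10, xorshift32(&state)));
              return 0;
      }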
      
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: KP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • bpf: cgroup_iter: support cgroup1 using cgroup fd · 35256d67
      Yosry Ahmed authored
      
      Use cgroup_v1v2_get_from_fd() in cgroup_iter to support attaching to
      both cgroup v1 and v2 using fds.
      
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: add cgroup_v1v2_get_from_[fd/file]() · a6d1ce59
      Yosry Ahmed authored
      
      Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that
      support both cgroup1 and cgroup2.
      
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. Oct 06, 2022
    • sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() · 585463f0
      Valentin Schneider authored
      
      This removes the second use of the sched_core_mask temporary mask.
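
      A hedged before/after sketch of the transformation named in the subject
      (do_something() and the masks are placeholders, not the actual call site):

      /* before: needs a temporary mask */
      cpumask_andnot(tmp_mask, cpu_possible_mask, smt_mask);
      for_each_cpu(cpu, tmp_mask)
              do_something(cpu);

      /* after: iterate (cpu_possible_mask & ~smt_mask) directly */
      for_each_cpu_andnot(cpu, cpu_possible_mask, smt_mask)
              do_something(cpu);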
      
      Suggested-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    • tracing: Do not free snapshot if tracer is on cmdline · a541a955
      Steven Rostedt (Google) authored
      The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
      snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
      particular requires the snapshot to be allocated because it will take a
      snapshot at the end of boot up, allowing the traces that happened during
      boot to be seen so that they are not lost when user space takes over.
      
      When a tracer is registered (started) there's a path that checks if it
      requires the snapshot buffer or not, and if it does not and it was
      allocated it will do a synchronization and free the snapshot buffer.
      
      This synchronization is only required if the previous tracer was using the
      snapshot buffer for "max latency" snapshots (like the irqsoff tracer and
      friends), as it needs to make sure all max snapshots are complete before
      freeing. But it does not make sense to free the snapshot if the previous
      tracer was not using it and the snapshot was allocated by the cmdline
      parameters. That basically takes away the point of allocating it in the
      first place!
      
      Note, the allocated snapshot worked fine for just trace events, but fails
      when a tracer is enabled on the cmdline.
      
      On further investigation, this goes back even further and does not require
      a tracer on the cmdline to fail. Simply enable snapshots and then enable a
      tracer, and it will remove the snapshot.
      
      Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 45ad21ca ("tracing: Have trace_array keep track if snapshot buffer is allocated")
      Reported-by: Ross Zwisler <zwisler@kernel.org>
      Tested-by: Ross Zwisler <zwisler@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ftrace: Still disable enabled records marked as disabled · cf04f2d5
      Steven Rostedt (Google) authored
      Weak functions started causing havoc as they showed up in the
      "available_filter_functions" and this confused people as to why some
      functions marked as "notrace" were listed, but when enabled they did
      nothing. This was because weak functions can still have fentry calls, and
      these addresses get added to the "available_filter_functions" file.
      kallsyms is what converts those addresses to names, and since the weak
      functions are not listed in kallsyms, it would just pick the function
      before that.
      
      To solve this, there was a trick to detect weak functions listed, and
      these records would be marked as DISABLED so that they do not get enabled
      and are mostly ignored. As the processing of the list of all functions to
      figure out what is weak or not can take a long time, this process is put
      off into a kernel thread and run in parallel with the rest of start up.
      
      Now the issue happens when function tracing is enabled via the kernel
      command line. As it starts very early in boot up, it can be enabled before
      the records that are weak are marked to be disabled. This causes an issue
      in the accounting, as the weak records are enabled by the command line
      function tracing, but after boot up, they are not disabled.
      
      The ftrace records have several accounting flags and a ref count. The
      DISABLED flag is just one. If the record is enabled before it is marked
      DISABLED it will get an ENABLED flag and also have its ref counter
      incremented. After it is marked for DISABLED, neither the ENABLED flag nor
      the ref counter is cleared. There are sanity checks on the records that are
      performed after an ftrace function is registered or unregistered, and these
      detected that there were records marked as ENABLED with a ref counter that
      should not have been.
      
      Note, the module loading code uses the DISABLED flag as well to keep its
      functions from being modified while it is being loaded, and some of these
      flags may get set in this process. So changing the verification code to
      ignore DISABLED records is a no go, as it still needs to verify that the
      module records are working too.
      
      Also, the weak functions are still calling a trampoline. Even though they
      should never be called, it is dangerous to leave these weak functions
      calling a trampoline that is freed, so they should still be set back to
      nops.
      
      There are two places that must not skip records that have both the ENABLED
      and the DISABLED flags set: where the ftrace_ops is processed and sets the
      records' ref counts, and then later when the function itself is to be
      updated and the ENABLED flag gets removed. Add a helper function
      "skip_record()" that returns true if the record has the DISABLED flag set
      but not the ENABLED flag.
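
      A hedged sketch of what such a helper looks like, based on the description
      above (flag names follow existing ftrace conventions):

      static bool skip_record(struct dyn_ftrace *rec)
      {
              /* Skip records marked DISABLED (e.g. weak functions) unless they
               * were already ENABLED and still need to be cleaned up. */
              return rec->flags & FTRACE_FL_DISABLED &&
                      !(rec->flags & FTRACE_FL_ENABLED);
      }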
      
      Link: https://lkml.kernel.org/r/20221005003809.27d2b97b@gandalf.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: b39181f7 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  4. Oct 03, 2022
    • relay: use kvcalloc to alloc page array in relay_alloc_page_array · 83d87a4d
      wuchi authored
      kvcalloc() is safer because it checks for integer overflows, and using it
      simplifies the allocation-size logic.
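
      A minimal sketch of the pattern (simplified, not the exact hunk):

      /* kvcalloc() checks n * size for overflow and picks kmalloc or vmalloc
       * as appropriate, replacing an open-coded size computation. */
      array = kvcalloc(n_pages, sizeof(struct page *), GFP_KERNEL);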
      
      Link: https://lkml.kernel.org/r/20220909101025.82955-1-wuchi.zero@gmail.com
      
      
      Signed-off-by: wuchi <wuchi.zero@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Zach O'Keefe authored
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process and attempt to update their page tables to map the THP by a
      pmd.
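
      A hedged userspace illustration of issuing the new madvise request on a
      file-backed mapping; MADV_COLLAPSE's value is taken from the asm-generic
      uapi header in case libc does not define it yet:

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #ifndef MADV_COLLAPSE
      #define MADV_COLLAPSE 25        /* asm-generic/mman-common.h */
      #endif

      int main(int argc, char **argv)
      {
              if (argc < 2)
                      return 1;
              int fd = open(argv[1], O_RDONLY);
              if (fd < 0)
                      return 1;
              size_t len = 2UL << 20;         /* one PMD-sized extent */
              void *addr = mmap(NULL, len, PROT_READ | PROT_EXEC,
                                MAP_PRIVATE, fd, 0);
              if (addr == MAP_FAILED)
                      return 1;
              /* Ask the kernel to synchronously back this range with a THP. */
              if (madvise(addr, len, MADV_COLLAPSE))
                      perror("madvise(MADV_COLLAPSE)");
              munmap(addr, len);
              close(fd);
              return 0;
      }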
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevents
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  Now, we can have the best of both worlds: Peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfy UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
      
      Since khugepaged is single threaded, this change now introduces the
      possibility of collapse contexts racing in the file collapse path.  There
      are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page still may be locked), but regardless what small page index
      	we were iterating over, we'll find the hugepage and identify it
      	as a suitably aligned compound page of order HPAGE_PMD_ORDER.
      
      	In the khugepaged path, we locklessly check the value of the pmd,
      	and only add it to the deferred collapse array if we find a pmd
      	mapping a pte table. This is fine, since other values that could
      	have raced in right afterwards denote failure, or that the
      	memory was successfully collapsed, so we don't need further
      	processing.
      
      	In the madvise path, we'll take mmap_lock() in write to serialize
      	against page table updates and will know what to do based on the
      	true value of the pmd: recheck all ptes if we point to a pte table;
      	directly install the pmd if the pmd has been cleared but memory has
      	not yet been faulted; or do nothing at all if we find a huge pmd.
      
      	It's worth putting emphasis here on how we treat the none pmd.
      	If khugepaged has processed this mm's page tables
      	already, it will have left the pmd cleared (ready for refault by
      	the process).  Depending on the VMA flags and sysfs settings,
      	amount of RAM on the machine, and the current load, this could be a
      	relatively common occurrence - and as such is one we'd like to
      	handle successfully in MADV_COLLAPSE.  When we see the none pmd
      	in collapse_pte_mapped_thp(), we've locked mmap_lock in write
      	and checked (a) hugepage_vma_check() to see if the backing
      	memory is appropriate still, along with VMA sizing and
      	appropriate hugepage alignment within the file, and (b) we've
      	found a hugepage head of order HPAGE_PMD_ORDER at the offset
      	in the file mapped by our hugepage-aligned virtual address.
      	Even though the common case is likely a race with khugepaged,
      	given these checks (regardless of how we got here - we could be
      	operating on a completely different file than originally checked
      	in hpage_collapse_scan_file() for all we know) it should be safe
      	to directly make the pmd a huge pmd pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (which already has its fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock
      	until after the hugepage has been added to the page cache (and
      	will unlock the hugepage after page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which case we'll exit cleanly when we discover an
      	unexpected refcount of 2 after isolate_lru_page()).  This is assuming we
      	are able to successfully lock the page we find - in shmem path,
      	we could just fail the trylock and exit cleanly anyways.
      
      	The failure path in collapse_file() is similar: once we hold the
      	lock on the 1st small page, we are serialized against other collapse
      	contexts.  Before the 1st small page is unlocked, we add it
      	back to the pagecache and unfreeze the refcount appropriately.
      	Contexts who lost the race to the 1st small page will then find
      	the same 1st small page with the correct refcount and will be
      	able to proceed.
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      
      
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • bpf: kmsan: initialize BPF registers with zeroes · a6a7aaba
      Alexander Potapenko authored
      When executing BPF programs, certain registers may get passed
      uninitialized to helper functions.  E.g.  when performing a JMP_CALL,
      registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
      many of them are actually used.
      
      Passing uninitialized values as function parameters is technically
      undefined behavior, so we work around it by always initializing the
      registers.
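
      A hedged sketch of the kind of change this describes in the interpreter
      (reduced to the declaration itself; MAX_BPF_EXT_REG is the existing bound
      on the register file):

      /* Zero-initialize the register file so helpers never observe
       * uninitialized values (and KMSAN sees the memory as initialized). */
      u64 regs[MAX_BPF_EXT_REG] = {};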
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • entry: kmsan: introduce kmsan_unpoison_entry_regs() · 6cae637f
      Alexander Potapenko authored
      struct pt_regs passed into IRQ entry code is set up by uninstrumented asm
      functions, therefore KMSAN may not notice the registers are initialized.
      
      kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs,
      preventing potential false positives.  Unlike kmsan_unpoison_memory(), it
      can be called under kmsan_in_runtime(), which is often the case in IRQ
      entry code.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-41-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kcov: kmsan: unpoison area->list in kcov_remote_area_put() · 74d89909
      Alexander Potapenko authored
      KMSAN does not instrument kernel/kcov.c for performance reasons (with
      CONFIG_KCOV=y virtually every place in the kernel invokes kcov
      instrumentation).  Therefore the tool may miss writes from kcov.c that
      initialize memory.
      
      When CONFIG_DEBUG_LIST is enabled, list pointers from kernel/kcov.c are
      passed to instrumented helpers in lib/list_debug.c, resulting in false
      positives.
      
      To work around these reports, we unpoison the contents of area->list after
      initializing it.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-30-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dma: kmsan: unpoison DMA mappings · 7ade4f10
      Alexander Potapenko authored
      KMSAN doesn't know about DMA memory writes performed by devices.  We
      unpoison such memory when it's mapped to avoid false positive reports.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-22-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: handle task creation and exiting · 50b5e49c
      Alexander Potapenko authored
      Tell KMSAN that a new task is created, so the tool creates a backing
      metadata structure for that task.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-17-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: disable instrumentation of unsupported common kernel code · 79dbd006
      Alexander Potapenko authored
      The EFI stub cannot be linked with the KMSAN runtime, so we disable
      instrumentation for it.
      
      Instrumenting kcov, stackdepot or lockdep leads to infinite recursion
      caused by instrumentation hooks calling instrumented code again.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-13-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: add vma based lock for pmd sharing · 8d9bfb26
      Mike Kravetz authored
      Allocate a new hugetlb_vma_lock structure and hang it off vm_private_data for
      synchronization use by vmas that could be involved in pmd sharing.  This
      data structure contains a rw semaphore that is the primary tool used for
      synchronization.
      
      This new structure is ref counted, so that it can exist when NOT attached
      to a vma.  This is only helpful in resolving lock ordering issues where
      code may need to obtain the vma_lock while there is no guarantee the vma
      will not go away.  By obtaining a ref on the structure, it can be guaranteed
      that at least the rw semaphore will not go away.
      
      Only add infrastructure for the new lock here.  Actual use will be added
      in subsequent patches.
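
      A hedged sketch of the structure this describes (field names are inferred
      from the message; the actual definition may differ):

      struct hugetlb_vma_lock {
              struct kref refs;               /* lets the lock outlive the vma */
              struct rw_semaphore rw_sema;    /* primary synchronization tool */
              struct vm_area_struct *vma;     /* hung off vma->vm_private_data */
      };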
      
      [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
        Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
      
      
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>