  2. Jan 05, 2022
    • tracing: Tag trace_percpu_buffer as a percpu pointer · f28439db
      Naveen N. Rao authored
      Tag trace_percpu_buffer as a percpu pointer to resolve warnings
      reported by sparse:
        /linux/kernel/trace/trace.c:3218:46: warning: incorrect type in initializer (different address spaces)
        /linux/kernel/trace/trace.c:3218:46:    expected void const [noderef] __percpu *__vpp_verify
        /linux/kernel/trace/trace.c:3218:46:    got struct trace_buffer_struct *
        /linux/kernel/trace/trace.c:3234:9: warning: incorrect type in initializer (different address spaces)
        /linux/kernel/trace/trace.c:3234:9:    expected void const [noderef] __percpu *__vpp_verify
        /linux/kernel/trace/trace.c:3234:9:    got int *
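
      The warnings go away once the static pointer itself carries the __percpu
      address-space annotation. A minimal sketch of the kind of declaration
      change involved (illustrative only, not the verbatim trace.c hunk):

        /* Before: a plain pointer, so this_cpu_ptr()/per_cpu_ptr() trip
         * sparse's address-space checks.
         */
        static struct trace_buffer_struct *trace_percpu_buffer;

        /* After: tagged __percpu, matching what the per-cpu accessors expect. */
        static struct trace_buffer_struct __percpu *trace_percpu_buffer;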
      
      Link: https://lkml.kernel.org/r/ebabd3f23101d89cb75671b68b6f819f5edc830b.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
      Cc: stable@vger.kernel.org
      Reported-by: kernel test robot <lkp@intel.com>
      Fixes: 07d777fe ("tracing: Add percpu buffers for trace_printk()")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Fix check for trace_percpu_buffer validity in get_trace_buf() · 823e670f
      Naveen N. Rao authored
      With the new osnoise tracer, we are seeing the below splat:
          Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0)
          BUG: Unable to handle kernel data access on read at 0xc7d880000
          Faulting instruction address: 0xc0000000002ffa10
          Oops: Kernel access of bad area, sig: 11 [#1]
          LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
          ...
          NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0
          LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0
          Call Trace:
          [c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable)
          [c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90
          [c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290
          [c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710
          [c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130
          [c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270
          [c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180
          [c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278
      
      The osnoise tracer on ppc64le triggers osnoise_taint() for a negative
      duration in get_int_safe_duration(), called from
      trace_sched_switch_callback()->thread_exit().

      The real problem, though, is that the check for a valid
      trace_percpu_buffer in get_trace_buf() is incorrect: the check is done on
      the pointer already calculated for the current cpu, rather than on the
      main percpu pointer. Fix the check to be against trace_percpu_buffer
      itself.
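
      A hedged sketch of the corrected check (simplified; the exact trace.c
      code may differ):

        static char *get_trace_buf(void)
        {
                struct trace_buffer_struct *buffer = this_cpu_ptr(trace_percpu_buffer);

                /* Validate the percpu pointer itself; the per-cpu address
                 * derived from it is meaningless if the buffers were never
                 * allocated.
                 */
                if (!trace_percpu_buffer || buffer->nesting >= 4)
                        return NULL;
                ...
        }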
      
      Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
      Cc: stable@vger.kernel.org
      Fixes: e2ace001 ("tracing: Choose static tp_printk buffer by explicit nesting count")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  5. Dec 17, 2021
    • timekeeping: Really make sure wall_to_monotonic isn't positive · 4e8c11b6
      Yu Liao authored
      
      Even after commit e1d7ba87 ("time: Always make sure wall_to_monotonic
      isn't positive") it is still possible to make wall_to_monotonic positive
      by running the following code:
      
          #include <time.h>

          int main(void)
          {
              struct timespec time;
      
              clock_gettime(CLOCK_MONOTONIC, &time);
              time.tv_nsec = 0;
              clock_settime(CLOCK_REALTIME, &time);
              return 0;
          }
      
      The reason is that the second parameter of timespec64_compare(), ts_delta,
      may be unnormalized because the delta is calculated with an open-coded
      subtraction, which causes the comparison of tv_sec to yield the wrong
      result:
      
        wall_to_monotonic = { .tv_sec = -10, .tv_nsec =  900000000 }
        ts_delta 	    = { .tv_sec =  -9, .tv_nsec = -900000000 }
      
      That makes timespec64_compare() claim that wall_to_monotonic < ts_delta,
      but actually the result should be wall_to_monotonic > ts_delta.
      
      After normalization, the result of timespec64_compare() is correct because
      the tv_sec comparison is no longer misleading:
      
        wall_to_monotonic = { .tv_sec = -10, .tv_nsec =  900000000 }
        ts_delta 	    = { .tv_sec = -10, .tv_nsec =  100000000 }
      
      Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the
      issue.
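
      For illustration, here is a small standalone user-space program (the names
      ts_compare and ts_normalize are invented for this sketch and are not
      kernel code) that reproduces the comparison pitfall with the exact values
      above:

        #include <stdio.h>

        struct ts { long long tv_sec; long tv_nsec; };

        /* Same semantics as timespec64_compare(): tv_sec decides first. */
        static int ts_compare(const struct ts *a, const struct ts *b)
        {
                if (a->tv_sec != b->tv_sec)
                        return a->tv_sec < b->tv_sec ? -1 : 1;
                return a->tv_nsec < b->tv_nsec ? -1 : (a->tv_nsec > b->tv_nsec ? 1 : 0);
        }

        /* What a proper subtraction guarantees: tv_nsec in [0, NSEC_PER_SEC). */
        static struct ts ts_normalize(struct ts t)
        {
                while (t.tv_nsec < 0)            { t.tv_nsec += 1000000000L; t.tv_sec--; }
                while (t.tv_nsec >= 1000000000L) { t.tv_nsec -= 1000000000L; t.tv_sec++; }
                return t;
        }

        int main(void)
        {
                struct ts wtm   = { -10,  900000000 };  /* wall_to_monotonic      */
                struct ts delta = {  -9, -900000000 };  /* open-coded subtraction */

                printf("unnormalized: %d\n", ts_compare(&wtm, &delta)); /* -1: wrong */
                delta = ts_normalize(delta);         /* becomes { -10, 100000000 } */
                printf("normalized:   %d\n", ts_compare(&wtm, &delta)); /*  1: right */
                return 0;
        }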
      
      Fixes: e1d7ba87 ("time: Always make sure wall_to_monotonic isn't positive")
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com
  6. Dec 16, 2021
    • bpf: Make 32->64 bounds propagation slightly more robust · e572ff80
      Daniel Borkmann authored
      
      Make the bounds propagation in __reg_assign_32_into_64() slightly more
      robust and readable by aligning it with what we did back in the
      __reg_combine_64_into_32() counterpart, meaning: only propagate or
      pessimize the bounds as an smin/smax pair.
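
      A hedged sketch of that pair-wise propagation (the helper name
      __reg32_bound_s64 and the exact constants are assumptions here; the real
      verifier code may differ in detail):

        /* Only pull the 32-bit signed bounds into the 64-bit ones if both fit;
         * otherwise pessimize the pair and let later tnum-based refinement
         * tighten them.
         */
        if (__reg32_bound_s64(reg->s32_min_value) &&
            __reg32_bound_s64(reg->s32_max_value)) {
                reg->smin_value = reg->s32_min_value;
                reg->smax_value = reg->s32_max_value;
        } else {
                reg->smin_value = 0;
                reg->smax_value = U32_MAX;
        }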
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix signed bounds propagation after mov32 · 3cf2b61e
      Daniel Borkmann authored
      
      For the case where both s32_{min,max}_value bounds are positive,
      __reg_assign_32_into_64() directly propagates them to their 64 bit
      counterparts; otherwise it pessimises them into the [0,u32_max] universe
      and tries to refine them later on by learning through the tnum, as per the
      comment in the mentioned function. However, that does not always happen:
      for example, a mov32 operation calls zext_32_to_64(dst_reg), which invokes
      __reg_assign_32_into_64() as is, without the subsequent bounds update done
      elsewhere, so no refinement based on the tnum takes place.
      
      Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() /
      __reg_bound_offset() triplet as we do, for example, in case of ALU ops via
      adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when
      dumping the full register state:
      
      Before fix:
      
        0: (b4) w0 = -1
        1: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
        1: (bc) w0 = w0
        2: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=0,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
      Technically, the smin_value=0 and smax_value=4294967295 bounds are not
      incorrect, but given the register is still a constant, they break assumptions
      about const scalars that smin_value == smax_value and umin_value == umax_value.
      
      After fix:
      
        0: (b4) w0 = -1
        1: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
        1: (bc) w0 = w0
        2: R0_w=invP4294967295
           (id=0,imm=ffffffff,
            smin_value=4294967295,smax_value=4294967295,
            umin_value=4294967295,umax_value=4294967295,
            var_off=(0xffffffff; 0x0),
            s32_min_value=-1,s32_max_value=-1,
            u32_min_value=-1,u32_max_value=-1)
      
      Without the smin_value == smax_value and umin_value == umax_value invariant
      being intact for const scalars, it is possible to leak out kernel pointers
      from unprivileged user space if the latter is enabled. For example, when
      such registers are involved in pointer arithmetic, adjust_ptr_min_max_vals()
      will taint the destination register into an unknown scalar, and the latter
      can be exported and stored e.g. into a BPF map value.
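
      A hedged sketch of the shape of the fix in the 32-bit mov handling
      (simplified; the actual verifier hunk may differ):

        /* mov32: after zero-extending the 32-bit subregister into the 64-bit
         * register, run the usual bounds-update triplet so the 64-bit signed
         * bounds get refined from the tnum instead of staying at the
         * pessimized [0, U32_MAX].
         */
        zext_32_to_64(dst_reg);
        __update_reg_bounds(dst_reg);
        __reg_deduce_bounds(dst_reg);
        __reg_bound_offset(dst_reg);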
      
      Fixes: 3f50f132 ("bpf: Verifier, do explicit ALU32 bounds tracking")
      Reported-by: Kuee K1r0a <liulin063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  7. Dec 15, 2021
    • audit: improve robustness of the audit queue handling · f4b3ee3c
      Paul Moore authored
      
      If the audit daemon were ever to get stuck in a stopped state, the
      kernel's kauditd_thread() could get blocked attempting to send audit
      records to the userspace audit daemon.  With the kernel thread
      blocked, the audit queue could grow unbounded, as certain audit record
      generating events must be exempt from the queue limits or the system
      would enter a deadlock state.
      
      This patch resolves this problem by lowering the kernel thread's
      socket sending timeout from MAX_SCHEDULE_TIMEOUT to HZ/10 and tweaks
      the kauditd_send_queue() function to better manage the various audit
      queues when connection problems occur between the kernel and the
      audit daemon.  With this patch, the backlog may temporarily grow
      beyond the defined limits when the audit daemon is stopped and the
      system is under heavy audit pressure, but kauditd_thread() will
      continue to make progress and drain the queues as it would for other
      connection problems.  For example, with the audit daemon put into a
      stopped state and the system configured to audit every syscall it
      was still possible to shutdown the system without a kernel panic,
      deadlock, etc.; granted, the system was slow to shutdown but that is
      to be expected given the extreme pressure of recording every syscall.
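
      Purely as an illustration of the described timeout change (one plausible
      way to express it, not necessarily the actual audit patch):

        /* A short send timeout lets kauditd_thread() regain control roughly
         * every HZ/10 and fall back to its retry/hold queue handling instead
         * of blocking indefinitely on an unresponsive daemon.
         */
        sk->sk_sndtimeo = HZ / 10;    /* previously MAX_SCHEDULE_TIMEOUT */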
      
      The timeout value of HZ/10 was chosen primarily through
      experimentation and this developer's "gut feeling".  There is likely
      no one perfect value, but as this scenario is limited in scope (root
      privileges would be needed to send SIGSTOP to the audit daemon), it
      is likely not worth exposing this as a tunable at present.  This can
      always be done at a later date if it proves necessary.
      
      Cc: stable@vger.kernel.org
      Fixes: 5b52330b ("audit: fix auditd/kernel connection state tracking")
      Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg · a82fe085
      Daniel Borkmann authored
      
      The implementation of BPF_CMPXCHG on a high level has the following parameters:
      
        .-[old-val]                                          .-[new-val]
        BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG)
                                `-[mem-loc]          `-[old-val]
      
      Given a BPF insn can only have two registers (dst, src), R0 is fixed and
      used as an auxiliary register for input (the old value) as well as output
      (returning the old value from the memory location). While the verifier
      performs a number of safety checks, it fails to reject unprivileged
      programs where R0 contains a pointer as the old value.
      
      Through brute-forcing it takes about 16 seconds on my machine to leak a
      kernel pointer with BPF_CMPXCHG. The PoC basically probes for kernel
      addresses by storing the guessed address into the map slot as a scalar,
      and using the map value pointer as R0 while SRC_REG holds a canary value
      to detect a matching address.
      
      Fix it by checking R0 for pointers, and reject if that's the case for unprivileged
      programs.
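
      A hedged sketch of what such a check could look like in the atomic-op
      handling (simplified; the actual verifier hunk may differ):

        if (insn->imm == BPF_CMPXCHG) {
                /* R0 is the implicit old-value input; an unprivileged program
                 * must not be able to smuggle a pointer through it.
                 */
                err = check_reg_arg(env, BPF_REG_0, SRC_OP);
                if (err)
                        return err;
                if (is_pointer_value(env, BPF_REG_0)) {
                        verbose(env, "R0 leaks addr into mem\n");
                        return -EACCES;
                }
        }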
      
      Fixes: 5ffa2550 ("bpf: Add instructions for atomic_[cmp]xchg")
      Reported-by: Ryota Shiga (Flatt Security)
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix kernel address leakage in atomic fetch · 7d3baf0a
      Daniel Borkmann authored
      
      The change in commit 37086bfd ("bpf: Propagate stack bounds to registers
      in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since
      this would allow for unprivileged users to leak kernel pointers. For example,
      an atomic fetch/and with -1 on a stack destination which holds a spilled
      pointer will migrate the spilled register type into a scalar, which can then
      be exported out of the program (since scalar != pointer) by dumping it into
      a map value.
      
      The original implementation of XADD prevented this situation by using a
      double call to check_mem_access(): one with BPF_READ and a subsequent one
      with BPF_WRITE, in both cases passing -1 as a placeholder value instead of
      a register, as per XADD semantics, since it didn't contain a value fetch.
      The BPF_READ path also includes a check in check_stack_read_fixed_off()
      which rejects the program if the stack slot holds a pointer value
      (__is_pointer_value()) and dst_regno < 0. The latter is to distinguish
      whether we're dealing with a regular stack spill/fill or some arithmetical
      operation which is disallowed on non-scalars; see also 6e7e63cb ("bpf:
      Forbid XADD on spilled pointers for unprivileged users") for more context
      on check_mem_access() and its handling of the placeholder value -1.
      
      One minimally intrusive option to fix the leak is for the BPF_FETCH case to
      initially check the BPF_READ case via check_mem_access() with -1 as register,
      followed by the actual load case with non-negative load_reg to propagate
      stack bounds to registers.
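
      A hedged sketch of that option (argument shapes follow check_mem_access()
      as described in the text; the real patch may differ in detail):

        /* First validate the access as a read with the placeholder register
         * (-1) so the pointer checks in check_stack_read_fixed_off() stay
         * effective, then do the real load so stack bounds still propagate
         * into load_reg.
         */
        err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
                               BPF_SIZE(insn->code), BPF_READ, -1, true);
        if (!err && load_reg >= 0)
                err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
                                       BPF_SIZE(insn->code), BPF_READ,
                                       load_reg, true);
        if (err)
                return err;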
      
      Fixes: 37086bfd ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
      Reported-by: <n4ke4mry@gmail.com>
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. Dec 10, 2021
    • bpf: Fix incorrect state pruning for <8B spill/fill · 345e004d
      Paul Chaignon authored
      
      Commit 354e8f19 ("bpf: Support <8-byte scalar spill and refill")
      introduced support in the verifier to track <8B spill/fills of scalars.
      The backtracking logic for the precision bit was however skipping
      spill/fills of less than 8B. That could cause state pruning to consider
      two states equivalent when they shouldn't be.
      
      As an example, consider the following bytecode snippet:
      
        0:  r7 = r1
        1:  call bpf_get_prandom_u32
        2:  r6 = 2
        3:  if r0 == 0 goto pc+1
        4:  r6 = 3
        ...
        8: [state pruning point]
        ...
        /* u32 spill/fill */
        10: *(u32 *)(r10 - 8) = r6
        11: r8 = *(u32 *)(r10 - 8)
        12: r0 = 0
        13: if r8 == 3 goto pc+1
        14: r0 = 1
        15: exit
      
      The verifier first walks the path with R6=3. Given the support for <8B
      spill/fills, at instruction 13, it knows the condition is true and skips
      instruction 14. At that point, the backtracking logic kicks in but stops
      at the fill instruction since it only propagates the precision bit for
      8B spill/fill. When the verifier then walks the path with R6=2, it will
      consider it safe at instruction 8 because R6 is not marked as needing
      precision. Instruction 14 is thus never walked and is then incorrectly
      removed as 'dead code'.
      
      It's also possible to lead the verifier to accept e.g. an out-of-bounds
      memory access instead of causing an incorrect dead code elimination.
      
      This regression was found via Cilium's bpf-next CI where it was causing
      a conntrack map update to be silently skipped because the code had been
      removed by the verifier.
      
      This commit fixes it by enabling support for <8B spill/fills in the
      backtracking logic. In case of a <8B spill/fill, the full 8B stack slot
      will be marked as needing precision. Then, in __mark_chain_precision,
      any tracked register spilled in a marked slot will itself be marked as
      needing precision, regardless of the spill size. This logic makes two
      assumptions: (1) only 8B-aligned spill/fill are tracked and (2) spilled
      registers are only tracked if the spill and fill sizes are equal. Commit
      ef979017 ("bpf: selftest: Add verifier tests for <8-byte scalar
      spill and refill") covers the first assumption and the next commit in
      this patchset covers the second.
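
      An illustrative pseudo-C sketch of the marking rule described above (all
      names here are hypothetical, not the verifier's actual backtracking code):

        /* Any spill to the stack, regardless of its size, refers to the whole
         * 8-byte-aligned slot; if that slot needs precision, so does the
         * register that was spilled into it.
         */
        spi = (-insn->off - 1) / BPF_REG_SIZE;     /* 8B-aligned slot index */
        if (stack_mask & (1ULL << spi)) {
                stack_mask &= ~(1ULL << spi);
                reg_mask |= 1u << insn->src_reg;   /* mark the spilled reg */
        }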
      
      Fixes: 354e8f19 ("bpf: Support <8-byte scalar spill and refill")
      Signed-off-by: Paul Chaignon <paul@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  14. Dec 03, 2021
    • bpf: Fix the off-by-two error in range markings · 2fa7d94a
      Maxim Mikityanskiy authored
      
      The first commit cited below attempts to fix the off-by-one error that
      appeared in some comparisons with an open range. Due to this error,
      arithmetically equivalent pieces of code could get different verdicts
      from the verifier, for example (pseudocode):
      
        // 1. Passes the verifier:
        if (data + 8 > data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
        // 2. Rejected by the verifier (should still pass):
        if (data + 7 >= data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
      The attempted fix, however, shifts the range by one in the wrong
      direction, so the bug not only remains, but such pieces of code also
      start failing in the verifier:
      
        // 3. Rejected by the verifier, but the check is stricter than in #1.
        if (data + 8 >= data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
      The change performed by that fix converted an off-by-one bug into an
      off-by-two one. The second commit cited below added the BPF selftests
      written to ensure that code chunks like #3 are rejected; however, they
      should be accepted.
      
      This commit fixes the off-by-two error by adjusting new_range in the
      right direction and fixes the tests by changing the range into the
      one that should actually fail.
      
      Fixes: fb2a311a ("bpf: fix off by one for range markings with L{T, E} patterns")
      Fixes: b37242c7 ("bpf: add test cases to bpf selftests to cover all access tests")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com
  18. Nov 24, 2021
    • PM: hibernate: Fix snapshot partial write lengths · 88a5045f
      Evan Green authored
      
      snapshot_write() is inappropriately limiting the amount of data that can
      be written in cases where a partial page has already been written. For
      example, one would expect to be able to write 1 byte, then 4095 bytes to
      the snapshot device, and have both of those complete fully (since now
      we're aligned to a page again). But what ends up happening is we write 1
      byte, and then only 4094 of the 4095 bytes complete successfully.
      
      The reason is that simple_write_to_buffer()'s second argument is the
      total size of the buffer, not the size of the buffer minus the offset.
      Since simple_write_to_buffer() accounts for the offset in its
      implementation, snapshot_write() can just pass the full page size
      directly down.
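
      A hedged sketch of the corrected call (argument order per
      simple_write_to_buffer(to, available, ppos, from, count); the surrounding
      variable names are assumptions and may differ):

        /* Pass the full page as 'available'; the helper subtracts the page
         * offset itself, so a preceding partial write no longer shrinks the
         * writable length.
         */
        res = simple_write_to_buffer(data_of(data->handle), PAGE_SIZE,
                                     &pg_offp, buf, count);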
      
      Signed-off-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • PM: hibernate: use correct mode for swsusp_close() · cefcf24b
      Thomas Zeitlhofer authored
      
      Commit 39fbef4b ("PM: hibernate: Get block device exclusively in
      swsusp_check()") changed the opening mode of the block device to
      (FMODE_READ | FMODE_EXCL).
      
      In the corresponding calls to swsusp_close(), the mode is still just
      FMODE_READ which triggers the warning in blkdev_flush_mapping() on
      resume from hibernate.
      
      So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
      device.
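
      A hedged sketch of the resulting call (simplified; the actual hibernate
      call sites that close the device may differ):

        /* Close with the same mode the device was opened with, so the
         * exclusive claim is released and the warning in
         * blkdev_flush_mapping() no longer triggers on resume.
         */
        swsusp_close(FMODE_READ | FMODE_EXCL);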
      
      Fixes: 39fbef4b ("PM: hibernate: Get block device exclusively in swsusp_check()")
      Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • sched/scs: Reset task stack state in bringup_cpu() · dce1ca05
      Mark Rutland authored
      
      To hot unplug a CPU, the idle task on that CPU calls a few layers of C
      code before finally leaving the kernel. When KASAN is in use, poisoned
      shadow is left around for each of the active stack frames, and when
      shadow call stacks (SCS) are in use, the task's saved SCS SP is left
      pointing at an arbitrary point within the task's shadow call stack.
      
      When a CPU is offlined then onlined back into the kernel, this stale
      state can adversely affect execution. Stale KASAN shadow can alias new
      stackframes and result in bogus KASAN warnings. A stale SCS SP is
      effectively a memory leak, and prevents a portion of the shadow call
      stack being used. Across a number of hotplug cycles the idle task's
      entire shadow call stack can become unusable.
      
      We previously fixed the KASAN issue in commit:
      
        e1b77c92 ("sched/kasan: remove stale KASAN poison after hotplug")
      
      ... by removing any stale KASAN stack poison immediately prior to
      onlining a CPU.
      
      Subsequently in commit:
      
        f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
      
      ... the refactoring left the KASAN and SCS cleanup in one-time idle
      thread initialization code rather than something invoked prior to each
      CPU being onlined, breaking both as above.
      
      We fixed SCS (but not KASAN) in commit:
      
        63acd42c ("sched/scs: Reset the shadow stack when idle_task_exit")
      
      ... but as this runs in the context of the idle task being offlined it's
      potentially fragile.
      
      To fix these consistently and more robustly, reset the SCS SP and KASAN
      shadow of a CPU's idle task immediately before we online that CPU in
      bringup_cpu(). This ensures the idle task always has a consistent state
      when it is running, and removes the need to do so when exiting an idle
      task.
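
      A hedged sketch of what that reset looks like in bringup_cpu()
      (simplified; the exact code may differ):

        struct task_struct *idle = idle_thread_get(cpu);

        /* Clear stale stack state left over from a previous hotplug cycle
         * before the idle task starts running on the incoming CPU.
         */
        scs_task_reset(idle);
        kasan_unpoison_task_stack(idle);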
      
      Whenever any thread is created, dup_task_struct() will give the task a
      stack which is free of KASAN shadow, and initialize the task's SCS SP,
      so there's no need to specially initialize either for idle thread within
      init_idle(), as this was only necessary to handle hotplug cycles.
      
      I've tested this on arm64 with:
      
      * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
      * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
      
      ... offlining and onlining CPUS with:
      
      | while true; do
      |   for C in /sys/devices/system/cpu/cpu*/online; do
      |     echo 0 > $C;
      |     echo 1 > $C;
      |   done
      | done
      
      Fixes: f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
      Reported-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Tested-by: Qian Cai <quic_qiancai@quicinc.com>
      Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
    • tracing/uprobe: Fix uprobe_perf_open probes iteration · 1880ed71
      Jiri Olsa authored
      Add the missing 'tu' variable initialization in the probes loop;
      otherwise the head 'tu' is used instead of the added probes.
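
      An illustrative sketch of the bug class described above (the names are
      simplified and not the exact uprobe code):

        list_for_each_entry(pos, probe_list, list) {
                /* The missing per-iteration initialization: derive 'tu' from
                 * the current list entry instead of reusing the head 'tu'.
                 */
                tu = container_of(pos, struct trace_uprobe, tp);
                /* ... open/link the perf event for this probe ... */
        }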
      
      Link: https://lkml.kernel.org/r/20211123142801.182530-1-jolsa@kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 99c9a923 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  19. Nov 23, 2021
    • perf: Ignore sigtrap for tracepoints destined for other tasks · 73743c3b
      Marco Elver authored
      
      syzbot reported that the warning in perf_sigtrap() fires, saying that
      the event's task does not match current:
      
       | WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
       | Modules linked in:
       | CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
       | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       | RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
       | RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
       | RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
       | ...
       | Call Trace:
       |  <IRQ>
       |  irq_work_single+0x106/0x220 kernel/irq_work.c:211
       |  irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
       |  irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
       |  __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
       |  sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
       |  </IRQ>
       |  <TASK>
       |  asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
       | RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
       | RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
       | ...
       |  coredump_task_exit kernel/exit.c:371 [inline]
       |  do_exit+0x1865/0x25c0 kernel/exit.c:771
       |  do_group_exit+0xe7/0x290 kernel/exit.c:929
       |  get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
       |  arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
       |  handle_signal_work kernel/entry/common.c:148 [inline]
       |  exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
       |  exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
       |  __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       |  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
       |  do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
       |  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This shouldn't happen on x86, which has arch_irq_work_raise().
      
      The test program sets up a perf event with sigtrap set to fire on the
      'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
      
      This happened because the 'sched_wakeup' tracepoint also takes a task
      argument passed on to perf_tp_event(), which is used to deliver the
      event to that other task.
      
      Since we cannot deliver synchronous signals to other tasks, skip an event if
      perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
      set, which will avoid ever entering perf_sigtrap() for such events.
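
      A hedged sketch of where such a skip fits in perf_tp_event()'s handling of
      events attached to another task's context (simplified; the actual hunk may
      differ):

        list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
                ...
                /* Synchronous signals cannot be delivered on behalf of another
                 * task, so skip sigtrap events requested by that other task.
                 */
                if (event->attr.sigtrap)
                        continue;
                perf_swevent_event(event, count, &data, regs);
        }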
      
      Fixes: 97ba62b2 ("perf: Add support for SIGTRAP on perf events")
      Reported-by: <syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
    • locking/rwsem: Optimize down_read_trylock() under highly contended case · 14c24048
      Muchun Song authored
      
      We found that a process with 10 thousand threads encountered a regression
      when going from Linux-v4.14 to Linux-v5.4. It is a kind of workload which
      will concurrently allocate lots of memory in different threads sometimes.
      In this case, we see down_read_trylock() as a high hotspot. Therefore, we
      suppose that rwsem has had a regression at least since Linux-v5.4. In
      order to easily debug this problem, we wrote a simple benchmark to create
      a similar situation, like the following.
      
        ```c++
        #include <sys/mman.h>
        #include <sys/time.h>
        #include <sys/resource.h>
        #include <sched.h>
        #include <pthread.h>

        #include <cstdio>
        #include <cstdlib>
        #include <cassert>
        #include <thread>
        #include <vector>
        #include <chrono>
      
        volatile int mutex;
      
        void trigger(int cpu, char* ptr, std::size_t sz)
        {
        	cpu_set_t set;
        	CPU_ZERO(&set);
        	CPU_SET(cpu, &set);
        	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
      
        	while (mutex);
      
        	for (std::size_t i = 0; i < sz; i += 4096) {
        		*ptr = '\0';
        		ptr += 4096;
        	}
        }
      
        int main(int argc, char* argv[])
        {
        	std::size_t sz = 100;
      
        	if (argc > 1)
        		sz = atoi(argv[1]);
      
        	auto nproc = std::thread::hardware_concurrency();
        	std::vector<std::thread> thr;
        	sz <<= 30;
        	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
      			 MAP_PRIVATE, -1, 0);
        	assert(ptr != MAP_FAILED);
        	char* cptr = static_cast<char*>(ptr);
        	auto run = sz / nproc;
        	run = (run >> 12) << 12;
      
        	mutex = 1;
      
        	for (auto i = 0U; i < nproc; ++i) {
        		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
        		cptr += run;
        	}
      
        	rusage usage_start;
        	getrusage(RUSAGE_SELF, &usage_start);
        	auto start = std::chrono::system_clock::now();
      
        	mutex = 0;
      
        	for (auto& t : thr)
        		t.join();
      
        	rusage usage_end;
        	getrusage(RUSAGE_SELF, &usage_end);
        	auto end = std::chrono::system_clock::now();
        	timeval utime;
        	timeval stime;
        	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
        	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
        	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
        	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
        	printf("real: %lu\n",
        	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
        	       start).count());
      
        	return 0;
        }
        ```
      
      The above program simply creates `nproc` threads, each of which tries to
      touch memory (trigger page faults) on a different CPU. We then see a
      profile similar to the following with `perf top`.
      
        25.55%  [kernel]                  [k] down_read_trylock
        14.78%  [kernel]                  [k] handle_mm_fault
        13.45%  [kernel]                  [k] up_read
         8.61%  [kernel]                  [k] clear_page_erms
         3.89%  [kernel]                  [k] __do_page_fault
      
      The hottest instruction in down_read_trylock(), accounting for about 92%,
      is the cmpxchg, as shown below.
      
        91.89 │      lock   cmpxchg %rdx,(%rdi)
      
      Since the problem was found by migrating from Linux-v4.14 to Linux-v5.4,
      we easily tracked the regression down to commit ddb20d1d ("locking/rwsem:
      Optimize down_read_trylock()"). The reason is that the commit assumes the
      rwsem is not contended at all, but that is not always true for the mmap
      lock, which can be contended by thousands of threads. As a result, most
      threads need to run the "cmpxchg" at least twice to acquire the lock. The
      overhead of the atomic operation is higher than that of non-atomic
      instructions, which caused the regression.
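
      A hedged sketch of the contention-friendly trylock shape described above
      (simplified; field and constant names may differ from the kernel's rwsem
      code):

        static inline int __down_read_trylock(struct rw_semaphore *sem)
        {
                /* Read the current count first instead of starting the cmpxchg
                 * from the "uncontended" value, so a contended counter usually
                 * needs only one cmpxchg.
                 */
                long tmp = atomic_long_read(&sem->count);

                while (!(tmp & RWSEM_READ_FAILED_MASK)) {
                        if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                                tmp + RWSEM_READER_BIAS)) {
                                rwsem_set_reader_owned(sem);
                                return 1;
                        }
                }
                return 0;
        }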
      
      By using the above benchmark, the real execution time (in milliseconds, as
      printed by the benchmark's "real" line) on an x86-64 system before and
      after the patch was:
      
                        Before Patch  After Patch
         # of Threads      real          real     reduced by
         ------------     ------        ------    ----------
               1          65,373        65,206       ~0.0%
               4          15,467        15,378       ~0.5%
              40           6,214         5,528      ~11.0%
      
      For the uncontended case, the new down_read_trylock() is the same as
      before. For the contended cases, the new down_read_trylock() is faster
      than before: the more contended the lock, the bigger the improvement.
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com