1. 16 Jan, 2019 1 commit
  2. 15 Jan, 2019 2 commits
    • seccomp: fix UAF in user-trap code · a811dc61
      Tycho Andersen authored
      On the failure path, we do an fput() of the listener fd if the filter fails
      to install (e.g. because of a lost TSYNC race, or because the thread is
      killed, etc.). fput() doesn't actually release the file right away, it just
      adds it to a work queue. The thread then proceeds to free the filter, even
      though the listener struct file still holds a reference to it.
      
      To fix this, on the failure path let's set the private data to NULL, so
      ->release() knows to ignore the filter.
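
      A minimal userspace model of the pattern the fix uses (types and names
      are made up, this is not the kernel code): the failure path clears the
      pointer the listener file holds before freeing the filter, and the
      release handler treats NULL as "nothing to do".

        #include <stdio.h>
        #include <stdlib.h>

        struct filter { int refs; };

        struct listener {
                struct filter *private_data;    /* what ->release() will look at */
        };

        /* models the release path: NULL private data means nothing to do */
        static void listener_release(struct listener *l)
        {
                if (!l->private_data)           /* failure path already detached it */
                        return;
                /* ... otherwise drop the reference on the filter here ... */
        }

        int main(void)
        {
                struct filter *f = calloc(1, sizeof(*f));
                struct listener l = { .private_data = f };

                /* failure path: detach before freeing, so the deferred release
                 * cannot touch freed memory */
                l.private_data = NULL;
                free(f);

                listener_release(&l);           /* safe: sees NULL and returns */
                printf("no use-after-free\n");
                return 0;
        }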
      
      Reported-by: syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com
      Fixes: 6a21cc50 ("seccomp: add a return code to trap to userspace")
      Signed-off-by: Tycho Andersen <tycho@tycho.ws>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create() · 8b05a3a7
      Andrea Righi authored
      It is possible to trigger a NULL pointer dereference by writing an
      incorrectly formatted string to kprobe_events (trying to create a
      kretprobe while omitting the symbol).
      
      Example:
      
       echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events
      
      That triggers this:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       #PF error: [normal kernel read fault]
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP PTI
       CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125
       Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018
       RIP: 0010:kstrtoull+0x2/0x20
       Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e
       RSP: 0018:ffffb5d482e57cb8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff82b12720
       RDX: ffffb5d482e57cf8 RSI: 0000000000000000 RDI: 0000000000000000
       RBP: ffffb5d482e57d70 R08: ffffa0c05e5a7080 R09: ffffa0c05e003980
       R10: 0000000000000000 R11: 0000000040000000 R12: ffffa0c04fe87b08
       R13: 0000000000000001 R14: 000000000000000b R15: ffffa0c058d749e1
       FS:  00007f137c7f7740(0000) GS:ffffa0c05e580000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000497d46004 CR4: 00000000003606e0
       Call Trace:
        ? trace_kprobe_create+0xb6/0x840
        ? _cond_resched+0x19/0x40
        ? _cond_resched+0x19/0x40
        ? __kmalloc+0x62/0x210
        ? argv_split+0x8f/0x140
        ? trace_kprobe_create+0x840/0x840
        ? trace_kprobe_create+0x840/0x840
        create_or_delete_trace_kprobe+0x11/0x30
        trace_run_command+0x50/0x90
        trace_parse_run_command+0xc1/0x160
        probes_write+0x10/0x20
        __vfs_write+0x3a/0x1b0
        ? apparmor_file_permission+0x1a/0x20
        ? security_file_permission+0x31/0xf0
        ? _cond_resched+0x19/0x40
        vfs_write+0xb1/0x1a0
        ksys_write+0x55/0xc0
        __x64_sys_write+0x1a/0x20
        do_syscall_64+0x5a/0x120
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix by doing the proper argument checks in trace_kprobe_create().
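
      An illustrative userspace sketch of the kind of early check the fix adds
      (names are made up; this is not the actual trace_kprobe_create() code):
      reject a definition that has no probe point before anything tries to
      parse it.

        #include <stdio.h>

        /* validate that a probe definition actually carries a symbol/address
         * argument before handing it to the parsers */
        static int create_kprobe_event(int argc, char *argv[])
        {
                if (argc < 2 || argv[1] == NULL || argv[1][0] == '\0') {
                        fprintf(stderr, "missing probe point (symbol or address)\n");
                        return -1;      /* reject instead of parsing a NULL string */
                }
                /* ... parse argv[1] as SYM[+offs] or a raw address ... */
                return 0;
        }

        int main(void)
        {
                /* "r:event_1 " splits into a single token: no probe point follows */
                char *argv_bad[] = { "r:event_1" };
                int ret = create_kprobe_event(1, argv_bad);

                printf("malformed definition %s\n", ret ? "rejected" : "accepted");
                return 0;
        }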
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Link: https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365977@kernel.org
      Link: http://lkml.kernel.org/r/20190111060113.GA22841@xps-13
      Fixes: 6212dd29 ("tracing/kprobes: Use dyn_event framework for kprobe events")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  3. 13 Jan, 2019 1 commit
    • kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore · b7285b42
      Jonathan Neuschäfer authored
      UNAME26 is a mechanism to report Linux's version as 2.6.x, for
      compatibility with old/broken software.  Due to the way it is
      implemented, it would have to be updated after 5.0, to keep the
      resulting versions unique.  Linus Torvalds argued:
      
       "Do we actually need this?
      
        I'd rather let it bitrot, and just let it return random versions. It
        will just start again at 2.4.60, won't it?
      
        Anybody who uses UNAME26 for a 5.x kernel might as well think it's
        still 4.x. The user space is so old that it can't possibly care about
        differences between 4.x and 5.x, can it?
      
        The only thing that matters is that it shows "2.4.<largeenough>",
        which it will do regardless"
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Jan, 2019 1 commit
    • umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo authored
      A UMH process created by fork_usermode_blob(), such as bpfilter, needs
      to release the members of its umh_info when the process terminates.
      But do_exit() does not release them, so a module that uses UMH needs
      its own code to detect whether the UMH process has terminated or not.
      This extra status-checking code eventually makes the module more
      complex.
      
      A new PF_UMH flag is added and used to identify UMH processes.
      exit_umh() itself does not release the members of the umh_info;
      the umh_info->cleanup callback is therefore expected to release both
      the members of the umh_info and the private data.
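
      A userspace sketch of the resulting contract, with made-up names (not
      the kernel structures): the cleanup callback is handed the umh_info and
      is responsible for releasing both its members and the private data that
      embeds it.

        #include <stdio.h>
        #include <stdlib.h>
        #include <stddef.h>

        /* stand-ins for the kernel types; layout and names are illustrative */
        struct umh_info_model {
                void *pipe_to_umh, *pipe_from_umh;
                void (*cleanup)(struct umh_info_model *info);
        };

        struct module_private {
                struct umh_info_model info;     /* umh_info embedded in private data */
                char *scratch;
        };

        #define container_of(ptr, type, member) \
                ((type *)((char *)(ptr) - offsetof(type, member)))

        /* releases the umh_info members *and* the private data around them */
        static void module_cleanup(struct umh_info_model *info)
        {
                struct module_private *priv =
                        container_of(info, struct module_private, info);

                free(info->pipe_to_umh);
                free(info->pipe_from_umh);
                free(priv->scratch);
                free(priv);
        }

        int main(void)
        {
                struct module_private *priv = calloc(1, sizeof(*priv));

                priv->info.pipe_to_umh   = malloc(1);
                priv->info.pipe_from_umh = malloc(1);
                priv->scratch            = malloc(16);
                priv->info.cleanup       = module_cleanup;

                /* what the exit routine would invoke when the UMH process dies */
                priv->info.cleanup(&priv->info);
                puts("cleanup ran exactly once, at process exit");
                return 0;
        }
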
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 11 Jan, 2019 1 commit
    • bpf: fix bpffs bitfield pretty print · 17e3ac81
      Yonghong Song authored
      Commit 9d5f9f70 ("bpf: btf: fix struct/union/fwd types
      with kind_flag") introduced kind_flag and used bitfield_size
      in the btf_member to directly pretty print member values.
      
      That commit contained a bug where incorrect parameters could be
      passed to btf_bitfield_seq_show(). The bits_offset parameter
      of that function expects a value less than 8, but the member's
      bit offset within the structure was passed instead.
      
      The below is btf_bitfield_seq_show() func signature:
        void btf_bitfield_seq_show(void *data, u8 bits_offset,
                                   u8 nr_bits, struct seq_file *m)
      Both bits_offset and nr_bits are of u8 type, so if the bitfield
      member's offset within the structure is 256 or greater it no longer
      fits in a u8 and an incorrect value will be printed.
      
      This patch fixes the issue by calculating the proper
      data offset and bits_offset, similar to the non-kind_flag case.
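
      A self-contained illustration of the bug class (not the BTF code
      itself): a total bit offset silently truncates when forced into a u8,
      while splitting it into a byte offset plus an in-byte remainder keeps
      bits_offset below 8.

        #include <stdio.h>
        #include <stdint.h>

        static void show_bitfield(const uint8_t *data, uint8_t bits_offset,
                                  uint8_t nr_bits)
        {
                /* expects bits_offset < 8; data must point at the right byte */
                printf("read %u bits starting %u bits into byte %p\n",
                       nr_bits, bits_offset, (const void *)data);
        }

        int main(void)
        {
                uint8_t buf[64] = { 0 };
                uint32_t member_bit_offset = 300;   /* > 255: does not fit in a u8 */

                /* buggy: 300 truncates to 44 when squeezed into a u8 */
                show_bitfield(buf, (uint8_t)member_bit_offset, 3);

                /* fixed: advance by whole bytes, keep the remainder below 8 */
                show_bitfield(buf + member_bit_offset / 8, member_bit_offset % 8, 3);
                return 0;
        }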
      
      Fixes: 9d5f9f70 ("bpf: btf: fix struct/union/fwd types with kind_flag")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. 10 Jan, 2019 1 commit
  7. 09 Jan, 2019 1 commit
  8. 08 Jan, 2019 1 commit
    • fork: record start_time late · 7b558513
      David Rheinsberg authored
      This changes the fork(2) syscall to record the process start_time after
      initializing the basic task structure but still before making the new
      process visible to user-space.
      
      Technically, we could record the start_time anytime during fork(2).  But
      this might lead to scenarios where a start_time is recorded long before
      a process becomes visible to user-space.  For instance, with
      userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
      for an indefinite amount of time (and will, if this causes network
      access, or similar).
      
      Recording the start_time late means it much more closely reflects the
      point in time where the process becomes live and can be observed by
      other processes.
      
      Lastly, this makes it much harder for user-space to predict and control
      the start_time they get assigned.  Previously, user-space could fork a
      process and stall it in copy_thread_tls() before its pid is allocated,
      but after its start_time is recorded.  This can be misused to later on
      cycle through PIDs and resume the stalled fork(2) yielding a process
      that has the same pid and start_time as a process that existed before.
      This can be used to circumvent security systems that identify processes
      by their pid+start_time combination.
      
      Even though user-space was always aware that start_time recording is
      flaky (but several projects are known to still rely on start_time-based
      identification), changing the start_time to be recorded late will help
      mitigate existing attacks and make it much harder for user-space to
      control the start_time a process gets assigned.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 06 Jan, 2019 3 commits
  10. 05 Jan, 2019 1 commit
  11. 04 Jan, 2019 15 commits
  12. 03 Jan, 2019 8 commits
    • bpf: prevent out of bounds speculation on pointer arithmetic · 979d63d5
      Daniel Borkmann authored
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop the CPU from speculating out of bounds memory accesses:
      b2157399 only focused on masking array map access for
      unprivileged users for tail calls and data access, such that
      the user provided index gets sanitized from the BPF program
      and syscall side. There is still a more generic form, affecting
      most maps that hold user data, that applies to BPF programs
      performing dynamic map access with unknown scalars or "slow"
      known scalars as the access offset, for example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While the </>/<=/>= variants won't
      allow us to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through an ==/!= test. ii) Another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both are accessible to unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier now
      sanitizes pointer arithmetic on the offset such that any
      out of bounds speculation is masked: the offset is masked
      into zero so that the pointer arithmetic result in the
      destination register stays unchanged, similar to the
      array_index_nospec() case. With regard to the implementation,
      three options were considered: i) a new insn for sanitation,
      ii) a push/pop insn and sanitation as inlined BPF, iii) reuse
      of the ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using up reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes, meaning the
      mitigation would see slow adoption until everyone implements
      it eventually, which is counter-productive. Options ii) and iii)
      have in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF, since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      once again requires that every JIT implement it
      first. While possible, the amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes and fewer BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
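
      The sanitation sequence shown later in the xlated dump is plain
      branch-free arithmetic; below is a small userspace model of it (load
      limit, subtract, or, negate, arithmetic shift, and). It yields the
      original offset when it is within the limit and 0 otherwise. Names and
      the example limit are illustrative, not taken from the verifier.

        #include <stdio.h>
        #include <stdint.h>

        /* mirrors the r11-based sequence in the xlated dump further below;
         * relies on two's complement wraparound and an arithmetic right
         * shift, as on common compilers */
        static uint64_t sanitize_offset(uint64_t off, uint64_t limit)
        {
                uint64_t mask;

                mask = limit - off;     /* top bit set iff off > limit */
                mask |= off;            /* also catches an oversized off itself */
                mask = -mask;
                mask = (uint64_t)((int64_t)mask >> 63); /* all-ones if ok, else 0 */
                return off & mask;
        }

        int main(void)
        {
                printf("%llu\n", (unsigned long long)sanitize_offset(10, 20479));
                printf("%llu\n", (unsigned long long)sanitize_offset(99999, 20479));
                return 0;
        }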
      
      There are various cases that need to be taken into consideration
      for the masking, e.g. such an operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
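
      A direct transcription of the limit formulas above into C, purely for
      illustration (these are not the verifier's data structures; MAX_BPF_STACK
      is the 512-byte BPF stack size):

        #include <stdio.h>

        #define MAX_BPF_STACK 512

        static long limit_map_value(int is_sub, long max_value_size,
                                    long smin_value, long umax_value, long off)
        {
                return is_sub ? umax_value + off
                              : max_value_size - (smin_value + off);
        }

        /* off here stands for ptr_reg->off + ptr_reg->var_off.value */
        static long limit_stack(int is_sub, long off)
        {
                return is_sub ? MAX_BPF_STACK + off : -off;
        }

        int main(void)
        {
                /* 64-byte map value, pointer at a constant offset of 8 */
                printf("map value, addition: %ld\n", limit_map_value(0, 64, 0, 0, 8));
                /* stack pointer currently at fp-16 */
                printf("stack, addition:     %ld\n", limit_stack(0, -16));
                printf("stack, subtraction:  %ld\n", limit_stack(1, -16));
                return 0;
        }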
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      a case, the verifier also analyzes safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If the verifier
      detects an out of bounds access under speculative execution on
      one of the possible paths that includes a truncation, it will
      reject such a program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similarly small increases. test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed, and test_l4lb_noinline.o stayed constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in the future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. The latter also seems unclear in
      terms of the prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that the test_verifier
      and test_progs suites run successfully with the interpreter, the JIT, and the
      JIT with hardening enabled on x86-64 and arm64.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix check_map_access smin_value test when pointer contains offset · b7137c4e
      Daniel Borkmann authored
      In check_map_access() we probe actual bounds through __check_map_access()
      with offset of reg->smin_value + off for lower bound and offset of
      reg->umax_value + off for the upper bound. However, even though the
      reg->smin_value could have a negative value, the final result of the
      sum with off could be positive when pointer arithmetic with known and
      unknown scalars is combined. In this case we reject the program with
      an error such as "R<x> min value is negative, either use unsigned index
      or do a if (index >=0) check." even though the access itself would be
      fine. Therefore extend the check to probe whether the actual resulting
      reg->smin_value + off is less than zero.
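
      A toy model of the changed condition (not the verifier code): with
      smin_value = -2 and off = 16 the lowest accessed offset is 14, which is
      in bounds, so the access should not be rejected merely because
      smin_value alone is negative.

        #include <stdio.h>

        static int rejected_before(long smin_value, long off)
        {
                (void)off;
                return smin_value < 0;          /* old test: sign of smin_value only */
        }

        static int rejected_after(long smin_value, long off)
        {
                return smin_value + off < 0;    /* new test: the combined lower bound */
        }

        int main(void)
        {
                long smin_value = -2, off = 16;

                printf("before: %s, after: %s\n",
                       rejected_before(smin_value, off) ? "rejected" : "accepted",
                       rejected_after(smin_value, off)  ? "rejected" : "accepted");
                return 0;
        }
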
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: restrict unknown scalars of mixed signed bounds for unprivileged · 9d7eceed
      Daniel Borkmann authored
      For unknown scalars of mixed signed bounds, meaning their smin_value is
      negative and their smax_value is positive, we need to reject arithmetic
      with pointer to map value. For unprivileged the goal is to mask every
      map pointer arithmetic and this cannot reliably be done when it is
      unknown at verification time whether the scalar value is negative or
      positive. Given this is a corner case, the likelihood of breaking should
      be very small.
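
      A toy illustration of the condition being checked (illustrative, not
      the verifier code): bounds that straddle zero cannot be reliably masked
      at verification time, so unprivileged pointer arithmetic with such a
      scalar is rejected.

        #include <stdio.h>

        static int mixed_signed_bounds(long smin_value, long smax_value)
        {
                return smin_value < 0 && smax_value > 0;
        }

        int main(void)
        {
                printf("[-16, 16]: %s\n", mixed_signed_bounds(-16, 16) ? "reject" : "allow");
                printf("[  0, 16]: %s\n", mixed_signed_bounds(0, 16)   ? "reject" : "allow");
                return 0;
        }
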
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: restrict stack pointer arithmetic for unprivileged · e4298d25
      Daniel Borkmann authored
      Restrict stack pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with
      a stack pointer as a destination we simulate a check_stack_access()
      of 1 byte on the destination, and once that fails the program is
      rejected for unprivileged program loads. This is analogous to map
      value pointer arithmetic and is needed for the masking later on.
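
      A toy model of the simulated 1-byte probe (illustrative, not the
      verifier code): valid BPF stack offsets are negative and within the
      512-byte frame, so arithmetic whose result leaves that range is
      rejected at load time for unprivileged programs.

        #include <stdio.h>

        #define MAX_BPF_STACK 512       /* size of the BPF stack frame */

        /* probe 1 byte at the offset the pointer holds right after the arithmetic */
        static int stack_probe_ok(long off)
        {
                return off < 0 && off >= -MAX_BPF_STACK;
        }

        int main(void)
        {
                printf("fp-8:   %s\n", stack_probe_ok(-8)   ? "ok" : "rejected at load");
                printf("fp-600: %s\n", stack_probe_ok(-600) ? "ok" : "rejected at load");
                printf("fp+4:   %s\n", stack_probe_ok(4)    ? "ok" : "rejected at load");
                return 0;
        }
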
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: restrict map value pointer arithmetic for unprivileged · 0d6303db
      Daniel Borkmann authored
      Restrict map value pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with a
      map value pointer as a destination it will simulate a check_map_access()
      of 1 byte on the destination and once that fails the program is rejected
      for unprivileged program loads. We use this later on for masking any
      pointer arithmetic with the remainder of the map value space. The
      likelihood of breaking any existing real-world unprivileged eBPF
      program is very small for this corner case.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: enable access to ax register also from verifier rewrite · 9b73bfdd
      Daniel Borkmann authored
      Right now we are using the BPF ax register in the JIT for constant blinding
      as well as in the interpreter as a temporary variable. The verifier is not
      able to use it simply because its use would get overridden by the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      is skipped if there is prior use of ax in either the source or destination
      register of the instruction. Taking the constraints of ax into account, the
      verifier is then free to use it in rewrites. Note that the ax register
      already has a mapping in every eBPF JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move tmp variable into ax register in interpreter · 144cd91c
      Daniel Borkmann authored
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move {prev_,}insn_idx into verifier env · c08435ec
      Daniel Borkmann authored
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment than to change all call-sites only
      to pass it along. insn_idx is useful in particular since it later
      on allows holding state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  13. 30 Dec, 2018 4 commits
    • sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654 · c40f7d74
      Linus Torvalds authored
      Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
      scheduler under high loads, starting at around the v4.18 time frame,
      and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
      manipulation.
      
      Do a (manual) revert of:
      
        a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      
      It turns out that the list_del_leaf_cfs_rq() introduced by this commit
      has a surprising property that was not considered in followup commits
      such as:
      
        9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
      
      As Vincent Guittot explains:
      
       "I think that there is a bigger problem with commit a9e7f654 and
        cfs_rq throttling:
      
        Let take the example of the following topology TG2 --> TG1 --> root:
      
         1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
            cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
            one path because it has never been used and can't be throttled so
            tmp_alone_branch will point to leaf_cfs_rq_list at the end.
      
         2) Then TG1 is throttled
      
         3) and we add TG3 as a new child of TG1.
      
         4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
            cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
      
        With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list.
        So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
        cfs_rq is removed from the list.
        Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
        but tmp_alone_branch still points to TG3 cfs_rq because its throttled
        parent can't be enqueued when the lock is released.
        tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
      
        So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
        points on another TG cfs_rq, the next TG cfs_rq that will be added,
        will be linked outside rq->leaf_cfs_rq_list - which is bad.
      
        In addition, we can break the ordering of the cfs_rq in
        rq->leaf_cfs_rq_list but this ordering is used to update and
        propagate the update from leaf down to root."
      
      Instead of trying to work through all these cases and trying to reproduce
      the very high loads that produced the lockup to begin with, simplify
      the code temporarily by reverting a9e7f654, a change which was clearly
      not thought through completely.
      
      This (hopefully) gives us a kernel that doesn't lock up so people
      can continue to enjoy their holidays without worrying about regressions. ;-)
      
      [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
      Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Reported-by: Sargun Dhillon <sargun@sargun.me>
      Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Tested-by: Sargun Dhillon <sargun@sargun.me>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.13+
      Cc: Bin Li <huawei.libin@huawei.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kdb: use bool for binary state indicators · 7faedcd4
      Nicholas Mc Guire authored
      defcmd_in_progress tracks the state of command group processing
      (whether we are within a command group or not), and usable indicates
      whether a command set is valid (allocated/non-empty), so use a bool
      for these binary indicators.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
    • kdb: Don't back trace on a cpu that didn't round up · 162bc7f5
      Douglas Anderson authored
      If you have a CPU that fails to round up and then run 'btc' you'll end
      up crashing in kdb because we dereference NULL.  Let's add a check.
      It's also wise to set the task to NULL when leaving the debugger so
      that if we fail to round up on a later entry into the debugger we
      won't backtrace a stale task.
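
      A toy model of the added check (made-up names, not the kdb code): 'btc'
      walks every CPU, so a CPU that never rounded up must be skipped instead
      of dereferencing its NULL task pointer.

        #include <stdio.h>

        #define NR_CPUS_MODEL 4

        /* per-cpu task pointer, saved when a CPU rounds up, cleared on exit */
        static const char *saved_task[NR_CPUS_MODEL] = { "task0", "task1", NULL, "task3" };

        int main(void)
        {
                for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
                        if (!saved_task[cpu]) {         /* this CPU never rounded up */
                                printf("btc: cpu %d did not round up, skipping\n", cpu);
                                continue;
                        }
                        printf("btc: backtrace of %s on cpu %d\n", saved_task[cpu], cpu);
                }
                return 0;
        }
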
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
    • kgdb: Don't round up a CPU that failed rounding up before · 87b09592
      Douglas Anderson authored
      If we're using the default implementation of kgdb_roundup_cpus() that
      uses smp_call_function_single_async() we can end up hanging
      kgdb_roundup_cpus() if we try to round up a CPU that failed to round
      up before.
      
      Specifically smp_call_function_single_async() will try to wait on the
      csd lock for the CPU that we're trying to round up.  If the previous
      round up never finished then that lock could still be held and we'll
      just sit there hanging.
      
      There's not a lot of use trying to round up a CPU that failed to round
      up before.  Let's keep a flag that indicates whether the CPU started
      but didn't finish rounding up last time.  If we see that flag set then
      we'll skip the next round up.
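
      A userspace model of that flag logic (names are illustrative, not the
      kgdb code): the flag is set before the async call is issued and only
      cleared once the target CPU actually runs the round-up callback, so a
      CPU that never responded is skipped on the next round up.

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_CPUS_MODEL 4

        static bool rounding_up[NR_CPUS_MODEL];

        static void roundup_cpu(int cpu)
        {
                if (rounding_up[cpu]) {
                        printf("cpu %d never finished the last roundup, skipping\n", cpu);
                        return;         /* don't touch a possibly still-locked csd */
                }
                rounding_up[cpu] = true;
                /* ... smp_call_function_single_async() would be issued here ... */
        }

        static void nmi_callback(int cpu)
        {
                /* runs after the csd was unlocked, so clearing the flag is safe */
                rounding_up[cpu] = false;
                /* ... enter the debugger ... */
        }

        int main(void)
        {
                roundup_cpu(3);         /* first attempt: flag set, async call issued */
                roundup_cpu(3);         /* CPU 3 never ran nmi_callback(): skipped    */
                nmi_callback(3);        /* CPU 3 finally responds                     */
                roundup_cpu(3);         /* rounded up normally again                  */
                return 0;
        }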
      
      In general we have a few goals here:
      - We never want to end up calling smp_call_function_single_async()
        when the csd is still locked.  This is accomplished because
        flush_smp_call_function_queue() unlocks the csd _before_ invoking
        the callback.  That means that when kgdb_nmicallback() runs we know
        for sure that the csd is no longer locked.  Thus when we set
        "rounding_up = false" we know for sure that the csd is unlocked.
      - If there are no timeouts rounding up we should never skip a round
        up.
      
      NOTE #1: In general trying to continue running after failing to round
      up CPUs doesn't appear to be supported in the debugger.  When I
      simulate this I find that kdb reports "Catastrophic error detected"
      when I try to continue.  I can overrule and continue anyway, but it
      should be noted that we may be entering the land of dragons here.
      Possibly the "Catastrophic error detected" was added _because_ of the
      future failure to round up, but even so this is an area of the code
      that hasn't been strongly tested.
      
      NOTE #2: I did a bit of testing before and after this change.  I
      introduced a 10 second hang in the kernel while holding a spinlock
      that I could invoke on a certain CPU with 'taskset -c 3 cat /sys/...'.
      
      Before this change if I did:
      - Invoke hang
      - Enter debugger
      - g (which warns about Catastrophic error, g again to go anyway)
      - g
      - Enter debugger
      
      ...I'd hang the rest of the 10 seconds without getting a debugger
      prompt.  After this change I end up in the debugger the 2nd time after
      only 1 second with the standard warning about 'Timed out waiting for
      secondary CPUs.'
      
      I'll also note that once the CPU finished waiting I could actually
      debug it (aka "btc" worked).
      
      I won't promise that everything works perfectly if the errant CPU
      comes back at just the wrong time (like as we're entering or exiting
      the debugger) but it certainly seems like an improvement.
      
      NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in
      kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some
      implementations override kgdb_call_nmi_hook().  It shouldn't hurt to
      have it in kgdb_nmicallback() in any case.
      
      NOTE #4: this logic is really only needed because there is no API call
      like "smp_try_call_function_single_async()" or "smp_csd_is_locked()".
      If such an API existed then we'd use it instead, but it seemed a bit
      much to add an API like this just for kgdb.
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>