  1. Jan 18, 2024
    • bpf: enforce types for __arg_ctx-tagged arguments in global subprogs · 0ba97151
      Andrii Nakryiko authored
      
      Add enforcement of expected types for context arguments tagged with
      the arg:ctx (__arg_ctx) decl tag.
      
      First, any program type will accept the generic `void *` context type
      when combined with the __arg_ctx tag.
      
      Besides accepting "canonical" struct names and `void *`, for a bunch of
      program types for which the program context is actually a named struct,
      we allow pragmatic exceptions to match real-world and expected usage:
      
        - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
          canonical context argument type, where `bpf_user_pt_regs_t` is a
          *typedef*, not a struct;
        - for kprobes, we also always accept `struct pt_regs *`, as that's
          what is actually passed as a context to any kprobe program;
        - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
          down to actual struct type and accept `struct pt_regs *`, or
          `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
          on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
          typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
          expected;
        - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
          what's expected with BPF_PROG() usage; otherwise, canonical
          `struct bpf_raw_tracepoint_args *` is expected;
        - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
          formats; both are coded as exceptions, as tp_btf is actually a TRACING
          program type, which has no canonical context type;
        - iterator programs accept `struct bpf_iter__xxx *` structs, currently
          with no further iterator-type specific enforcement;
        - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
        - classic tracepoint programs, as well as syscall and freplace
          programs allow any user-provided type.
      
      In all other cases the kernel will enforce an exact match of the struct
      name to the expected canonical type. And if the user-provided type
      doesn't match that expectation, the verifier will emit a helpful message
      with the expected type name.
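
      As a sketch of the BPF-side usage this enables (hedged: the __arg_ctx
      macro is just shorthand for the decl tag; recent bpf_helpers.h versions
      define it, so the fallback below is only for older headers):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        #ifndef __arg_ctx
        #define __arg_ctx __attribute__((btf_decl_tag("arg:ctx")))
        #endif

        struct {
                __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
                __uint(key_size, sizeof(int));
                __uint(value_size, sizeof(int));
        } events SEC(".maps");

        /* generic `void *` ctx is accepted for any program type when tagged
         * with __arg_ctx, so one global subprog can be shared by kprobe,
         * tracepoint, and perf_event programs alike
         */
        __noinline int record_hit(void *ctx __arg_ctx)
        {
                __u32 cpu = bpf_get_smp_processor_id();

                return bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                                             &cpu, sizeof(cpu));
        }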
      
      Note the slightly unnatural way the check is done, after processing all
      the arguments. This is done to avoid conflicts between the bpf and
      bpf-next trees. Once the trees converge, a small follow-up patch will
      place a simple btf_validate_prog_ctx_type() check into the proper
      ARG_PTR_TO_CTX branch (which a bpf-next tree patch has already
      refactored), removing the duplicated arg:ctx detection logic.
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract bpf_ctx_convert_map logic and make it more reusable · 66967a32
      Andrii Nakryiko authored
      
      Refactor btf_get_prog_ctx_type() a bit to allow reuse of the
      bpf_ctx_convert_map logic in more than one place. Simplify the interface
      by returning btf_type instead of btf_member (a field reference in BTF).
      
      To do the above we need to touch and start untangling
      btf_translate_to_vmlinux() implementation. We do the bare minimum to
      not regress anything for btf_translate_to_vmlinux(), but its
      implementation is very questionable for what it claims to be doing.
      Mapping kfunc argument types to kernel corresponding types conceptually
      is quite different from recognizing program context types. Fixing this
      is out of scope for this change though.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. Jan 16, 2024
    • bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS · 22c7fa17
      Hao Sun authored
      
      For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses the fixed
      offset for validation. However, variable-offset pointer ALU is not
      prohibited for this pointer kind, so the variable part of the offset is
      never checked.
      
      The following prog is accepted:
      
        func#0 @0
        0: R1=ctx() R10=fp0
        0: (bf) r6 = r1                       ; R1=ctx() R6_w=ctx()
        1: (79) r7 = *(u64 *)(r6 +144)        ; R6_w=ctx() R7_w=flow_keys()
        2: (b7) r8 = 1024                     ; R8_w=1024
        3: (37) r8 /= 1                       ; R8_w=scalar()
        4: (57) r8 &= 1024                    ; R8_w=scalar(smin=smin32=0,
        smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
        5: (0f) r7 += r8
        mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
        mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
        mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
        mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
        6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
        =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
        var_off=(0x0; 0x400))
        6: (79) r0 = *(u64 *)(r7 +0)          ; R0_w=scalar()
        7: (95) exit
      
      This prog loads flow_keys into r7, adds the variable offset r8 to r7,
      and finally causes an out-of-bounds access:
      
        BUG: unable to handle page fault for address: ffffc90014c80038
        [...]
        Call Trace:
         <TASK>
         bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
         __bpf_prog_run include/linux/filter.h:651 [inline]
         bpf_prog_run include/linux/filter.h:658 [inline]
         bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
         bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
         bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
         bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
         __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
         __do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
         __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Fix this by rejecting pointer ALU with a variable offset on flow_keys.
      With the patch applied, the program above is rejected with "R7 pointer
      arithmetic on flow_keys prohibited".
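
      A C-level sketch of the same (now rejected) pattern, hedged as an
      illustration rather than the exact reproducer:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("flow_dissector")
        int dissect(struct __sk_buff *skb)
        {
                struct bpf_flow_keys *keys = skb->flow_keys;
                /* scalar with range [0, 1024], not a constant */
                __u64 off = bpf_get_prandom_u32() & 1024;

                /* variable-offset pointer ALU on flow_keys: the fix makes
                 * the verifier reject this with "pointer arithmetic on
                 * flow_keys prohibited"
                 */
                return *(__u8 *)((void *)keys + off);
        }

        char _license[] SEC("license") = "GPL";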
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
  3. Jan 05, 2024
    • bpf: Fix re-attachment branch in bpf_tracing_prog_attach · 715d82ba
      Jiri Olsa authored
      
      The following case can cause a crash due to missing attach_btf:
      
      1) load rawtp program
      2) load fentry program with rawtp as target_fd
      3) create tracing link for fentry program with target_fd = 0
      4) repeat 3
      
      In the end we have:
      
      - prog->aux->dst_trampoline == NULL
      - tgt_prog == NULL (because we did not provide target_fd to link_create)
      - prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X)
      - the program was loaded for tgt_prog but we have no way to find out which one
      
          BUG: kernel NULL pointer dereference, address: 0000000000000058
          Call Trace:
           <TASK>
           ? __die+0x20/0x70
           ? page_fault_oops+0x15b/0x430
           ? fixup_exception+0x22/0x330
           ? exc_page_fault+0x6f/0x170
           ? asm_exc_page_fault+0x22/0x30
           ? bpf_tracing_prog_attach+0x279/0x560
           ? btf_obj_id+0x5/0x10
           bpf_tracing_prog_attach+0x439/0x560
           __sys_bpf+0x1cf4/0x2de0
           __x64_sys_bpf+0x1c/0x30
           do_syscall_64+0x41/0xf0
           entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      Return -EINVAL in this situation.
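
      In sketch form (hedged against the exact upstream diff), the guard in
      bpf_tracing_prog_attach() looks like:

        if (!prog->aux->dst_trampoline && !tgt_prog) {
                /* re-attach without an explicit target only works if the
                 * original target can be recovered from the prog itself;
                 * progs loaded with attach_prog_fd have no attach_btf, so
                 * there is nothing to re-attach them to
                 */
                if (!prog->aux->attach_btf) {
                        err = -EINVAL;
                        goto out_unlock;
                }
        }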
      
      Fixes: f3a95075 ("bpf: Allow trampoline re-attach for tracing and lsm programs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
      Acked-by: Jiri Olsa <olsajiri@gmail.com>
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
      Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Relax tracing prog recursive attach rules · 19bfcdf9
      Dmitrii Dolgov authored

      Currently, it's not allowed to attach an fentry/fexit prog to another
      fentry/fexit prog. At the same time it's not uncommon to see a tracing
      program with lots of logic in use, and this attachment limitation
      prevents using fentry/fexit for performance analysis (e.g. with the
      "bpftool prog profile" command) in such cases. One example is the
      falcosecurity libs project, which uses tp_btf tracing programs.
      
      Following the corresponding discussion [1], the reason for the
      restriction is to avoid call cycles between tracing progs without
      introducing more complex solutions. But currently it seems impossible to
      load and attach tracing programs in a way that would form such a cycle.
      The limitation comes from the fact that attach_prog_fd is specified at
      prog load time (thus making it impossible to attach to a program loaded
      after it this way), as well as from tracing progs not implementing
      link_detach.
      
      Replace "no same type" requirement with verification that no more than
      one level of attachment nesting is allowed. In this way only one
      fentry/fexit program could be attached to another fentry/fexit to cover
      profiling use case, and still no cycle could be formed. To implement,
      add a new field into bpf_prog_aux to track nested attachment for tracing
      programs.
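
      In sketch form (field name hedged against the upstream patch):

        struct bpf_prog_aux {
                /* ... */
                bool attach_tracing_prog; /* attached to another tracing prog */
                /* ... */
        };

        /* at attach-target check time: same-type attachment is fine only if
         * the target is not itself attached to a tracing program, capping
         * nesting at one level
         */
        if (tgt_prog->type == prog->type && tgt_prog->aux->attach_tracing_prog)
                return -EINVAL;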
      
      [1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/

      Acked-by: Jiri Olsa <olsajiri@gmail.com>
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
      Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. Jan 03, 2024
    • bpf: Simplify checking size of helper accesses · 8a021e7f
      Andrei Matei authored
      
      This patch simplifies the verification of size arguments associated
      with pointer arguments to helpers and kfuncs. Many helpers take a
      pointer argument followed by the size of the memory access to be
      performed through that pointer. Before this patch, the handling of the
      size argument in check_mem_size_reg() was confusing and wasteful: if
      the size register's lower bound was 0, the verification was done twice:
      once considering the size of the access to be the lower bound of the
      respective argument, and once considering the upper bound (even if the
      two are the same). The upper-bound checking is a super-set of the
      lower-bound checking(*), except that the only point of the lower-bound
      check is to handle the case where zero-sized accesses are explicitly
      not allowed and the lower bound is zero. This static condition is now
      checked explicitly, replacing a much more complex, expensive and
      confusing verification call to check_helper_mem_access().
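
      A hedged sketch of the resulting shape of check_mem_size_reg() (error
      message text abbreviated):

        /* the lower-bound special case becomes one explicit static check */
        if (reg->umin_value == 0 && !zero_size_allowed) {
                verbose(env, "R%d invalid zero-sized read\n", regno);
                return -EACCES;
        }
        /* a single check against the upper bound covers everything else */
        err = check_helper_mem_access(env, regno - 1, reg->umax_value,
                                      zero_size_allowed, meta);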
      
      Error messages change in this patch. Before, messages about illegal
      zero-size accesses depended on the type of the pointer and on other
      conditions, and sometimes the message was plain wrong: in some tests
      that changed you'll see that the old message was something like "R1 min
      value is outside of the allowed memory range", where R1 is the pointer
      register; the error wrongly claimed that the pointer was bad instead of
      the size being bad. Other times, the information that the size came
      from a register with a possible range of values was wrong, and the
      error presented the size as a fixed zero. Now the errors refer to the
      right register. However, the old error messages did contain useful
      information about the pointer register, which is now lost; recovering
      this information was deemed not important enough.
      
      (*) Besides standing to reason that the checks for a bigger size access
      are a super-set of the checks for a smaller size access, I have also
      mechanically verified this by reading the code for all types of
      pointers. I could convince myself that it's true for all but
      PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
      line-by-line does not immediately prove what we want. If anyone has any
      qualms, let me know.
      
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
  5. Dec 20, 2023
    • bpf: Use c->unit_size to select target cache during free · 7ac5c53e
      Hou Tao authored
      
      At present, the bpf memory allocator uses check_obj_size() to ensure
      that the ksize() of an allocated pointer is equal to the unit_size of
      the bpf_mem_cache that was used. Its purpose is to prevent
      bpf_mem_free() from selecting a bpf_mem_cache with a different
      unit_size than the bpf_mem_cache used for allocation. But as reported
      by lkp, the return value of ksize() or kmalloc_size_roundup() may
      change due to slab merging, which leads to the warning in
      check_obj_size().
      
      The reported warning happened as follows:
      (1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the
      object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of
      kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly.
      (2) the object_size of kmalloc-cg-96 is adjusted from 96 to 128 due to
      slab merging in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
      enabled by default for kmalloc slabs, so align is 64 and size is 128 for
      kmalloc-cg-96. SLUB has similar merge logic, but its object_size will
      not be changed, because its align is 8 on x86-64.
      (3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node),
      ksize() returns 128 instead of 96 for the returned pointer.
      (4) the warning in check_obj_size() is triggered.
      
      Considering that slab merging can happen at any time (e.g., when a slab
      is created by a new module), the following case is also possible:
      during the initialization of bpf_global_ma, there is no slab merge and
      ksize() for a 96-byte object returns 96. But after that, a new slab
      created by a kernel module is merged into kmalloc-cg-96 and the
      object_size of kmalloc-cg-96 is adjusted from 96 to 128 (which is
      possible for x86-64 + CONFIG_SLAB, because its alignment requirement is
      64 for a 96-byte slab). So sooner or later, when bpf_global_ma frees a
      96-byte-sized pointer which was allocated from a bpf_mem_cache with
      unit_size=96, bpf_mem_free() will free the pointer through a
      bpf_mem_cache whose unit_size is 128, because the return value of
      ksize() has changed. The warning for the mismatch will be triggered
      again.
      
      A feasible fix would be to introduce APIs similar to ksize() and
      kmalloc_size_roundup() that return the actually-allocated size instead
      of a size which may change due to slab merging, but that would
      introduce an unnecessary dependency on the implementation details of
      the mm subsystem.
      
      Instead, since the pointer to the bpf_mem_cache is saved in the 8-byte
      area (4-byte on 32-bit hosts) just above the returned pointer, use
      unit_size of the saved bpf_mem_cache to select the target cache instead
      of inferring the size from the pointer itself. Besides avoiding the
      extra dependency on the mm subsystem, the performance of
      bpf_mem_free_rcu() is also improved as shown below.
      
      Before applying the patch, the performance of bpf_mem_alloc() and
      bpf_mem_free_rcu() on an 8-CPU VM with one producer is as follows:
      
      kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s
      percpu  : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s
      
      After applying the patch, the performance of bpf_mem_free_rcu()
      increases by 9% and 146% for kmalloc memory and per-cpu memory,
      respectively:
      
      kmalloc: alloc 11.01 ± 0.03M/s free   32.42 ± 0.48M/s
      percpu:  alloc 12.84 ± 0.12M/s free   35.24 ± 0.23M/s
      
      After the fix, there is no need to adjust size_index to paper over the
      mismatch between allocation and free, so remove that logic as well.
      Also return NULL instead of ZERO_SIZE_PTR for zero-sized allocations in
      bpf_mem_alloc(), because there is no bpf_mem_cache pointer saved above
      ZERO_SIZE_PTR.
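
      A hedged sketch of the resulting free path (helper and constant names
      follow kernel/bpf/memalloc.c; the exact upstream code may differ):

        void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
        {
                struct bpf_mem_cache *c;
                int idx;

                if (!ptr)
                        return;

                /* the owning cache is stored in the hidden llist-node-sized
                 * header just above the object
                 */
                c = *(void **)(ptr - LLIST_NODE_SZ);
                idx = bpf_mem_cache_idx(c->unit_size);
                if (WARN_ON_ONCE(idx < 0))
                        return;

                unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
        }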
      
      Fixes: 9077fc22 ("bpf: Use kmalloc_size_roundup() to adjust size_index")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add support for passing dynptr pointer to global subprog · a64bfe61
      Andrii Nakryiko authored
      
      Add the ability to pass a pointer to a dynptr into global functions.
      This allows global subprogs to accept and work with generic dynptrs
      that are created by the caller. A dynptr argument is detected based on
      the name of the struct type: if it's "bpf_dynptr", it's assumed to be
      a proper dynptr pointer. Both actual struct and forward struct
      declaration types are supported.
      
      This is conceptually exactly the same semantics as
      bpf_user_ringbuf_drain()'s use of dynptr to pass a variable-sized
      pointer to ringbuf record. So we heavily rely on CONST_PTR_TO_DYNPTR
      bits of already existing logic in the verifier.
      
      During global subprog validation, we mark such CONST_PTR_TO_DYNPTR as
      having LOCAL type, as that's the most unassuming type of dynptr and it
      doesn't have any special helpers that can try to free or acquire extra
      references (unlike skb, xdp, or ringbuf dynptrs). So that seems like a
      safe "choice" to make from a correctness standpoint. It's still
      possible to pass any type of dynptr to such a subprog, though, because
      generic dynptr helpers, like getting data/slice pointers, read/write
      memory-copying routines, and dynptr adjustment and getter routines,
      all work correctly with any type of dynptr.
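
      A sketch of what this enables from the BPF program side (hedged; map
      and program names are illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_RINGBUF);
                __uint(max_entries, 4096);
        } rb SEC(".maps");

        /* global subprog accepting any dynptr flavor; the verifier
         * validates it as the most unassuming LOCAL dynptr
         */
        __noinline int fill_record(struct bpf_dynptr *ptr, __u32 val)
        {
                return bpf_dynptr_write(ptr, 0, &val, sizeof(val), 0);
        }

        SEC("tp/syscalls/sys_enter_nanosleep")
        int prog(void *ctx)
        {
                struct bpf_dynptr ptr;

                /* a ringbuf dynptr works too: generic dynptr helpers accept
                 * any dynptr type
                 */
                if (bpf_ringbuf_reserve_dynptr(&rb, sizeof(__u32), 0, &ptr)) {
                        bpf_ringbuf_discard_dynptr(&ptr, 0);
                        return 0;
                }
                fill_record(&ptr, 42);
                bpf_ringbuf_submit_dynptr(&ptr, 0);
                return 0;
        }

        char _license[] SEC("license") = "GPL";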
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: support 'arg:xxx' btf_decl_tag-based hints for global subprog args · 94e1c70a
      Andrii Nakryiko authored
      
      Add support for annotating global BPF subprog arguments to provide more
      information about the expected semantics of the argument. Currently,
      the verifier relies purely on the argument's BTF type information and
      supports three general use cases: scalar, pointer-to-context, and
      pointer-to-fixed-size-memory.
      
      Scalar and pointer-to-fixed-mem work well in practice and are quite
      natural to use. But pointer-to-context is a bit problematic, as typical
      BPF users don't realize that they need to use a special type name to
      signal to the verifier that the argument is not just some pointer, but
      actually a PTR_TO_CTX. Further, even if users do know which type to
      use, it is limiting in situations where the same BPF program logic is
      used across a few different program types. A common case is kprobe,
      tracepoint, and perf_event programs sharing a helper to send some data
      over the BPF perf buffer. bpf_perf_event_output() requires a `ctx`
      argument, so it's quite cumbersome to share such a global subprog
      across a few BPF programs of different types, necessitating an extra
      static subprog that is context type-agnostic.
      
      Long story short, there is a need to go beyond types and allow users to
      add hints to global subprog arguments to define expectations.
      
      This patch adds such support for two initial special tags:
        - pointer to context;
        - non-null qualifier for generic pointer arguments.
      
      All of the above came up in practice already and seem like generally
      useful additions. The non-null qualifier is an often-requested feature,
      which currently has to be worked around by having unnecessary NULL
      checks inside subprogs even if we know that the arguments are never
      NULL. Pointer to context was discussed earlier.
      
      As for the implementation, we utilize the btf_decl_tag attribute and
      set up an "arg:xxx" convention to specify argument hints. As such:
        - btf_decl_tag("arg:ctx") is a PTR_TO_CTX hint;
        - btf_decl_tag("arg:nonnull") marks a pointer argument as not allowed
          to be NULL, making a NULL check inside the global subprog
          unnecessary.
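
      As a sketch (hedged: recent bpf_helpers.h versions define these macros;
      they are spelled out here for clarity):

        #define __arg_ctx     __attribute__((btf_decl_tag("arg:ctx")))
        #define __arg_nonnull __attribute__((btf_decl_tag("arg:nonnull")))

        struct sample {
                int pid;
                long value;
        };

        /* ctx is validated as PTR_TO_CTX, and s is guaranteed non-NULL at
         * every call site, so no defensive NULL check is needed inside
         */
        __noinline long handle_sample(void *ctx __arg_ctx,
                                      struct sample *s __arg_nonnull)
        {
                return s->pid + s->value;
        }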
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: reuse subprog argument parsing logic for subprog call checks · f18c3d88
      Andrii Nakryiko authored
      
      Remove duplicated BTF parsing logic from the subprog call check.
      Instead, use (potentially cached) results of btf_prepare_func_args() to
      abstract away the expectations for each subprog argument in generic
      terms (e.g., "this is a pointer to context" or "this is a pointer to
      memory of size X"), and then use those simple high-level argument type
      expectations to validate actual register states against them.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move subprog call logic back to verifier.c · c5a72447
      Andrii Nakryiko authored
      
      Subprog call logic in btf_check_subprog_call() currently has both a lot
      of BTF parsing logic (which is, presumably, what justified putting it
      into btf.c) and a bunch of register state checks, some of which utilize
      deep verifier logic helpers that necessarily have to be exported from
      verifier.c: check_ptr_off_reg(), check_func_arg_reg_off(), and
      check_mem_reg().
      
      Going forward, btf_check_subprog_call() will have a minimum of
      BTF-related logic but will get more internal verifier logic related to
      register state manipulation. So move it into verifier.c to minimize the
      amount of verifier-specific logic exposed to btf.c.
      
      We do this move before refactoring btf_check_func_arg_match() to
      preserve as much history post-refactoring as possible.
      
      No functional changes.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: prepare btf_prepare_func_args() for handling static subprogs · e26080d0
      Andrii Nakryiko authored
      
      Generalize btf_prepare_func_args() to support both global and static
      subprogs. We are going to utilize this property in the next patch,
      reusing btf_prepare_func_args() for subprog call logic instead of
      reparsing BTF information in a completely separate implementation.
      
      btf_prepare_func_args() now detects whether a subprog is global or
      static and makes slight logic adjustments for the static func case,
      like not failing fatally (-EFAULT) for conditions that are allowable
      for static subprogs.
      
      A somewhat subtle (but major!) difference is the handling of pointer
      arguments. Both global and static functions need to handle special
      context arguments (which are pointers to predefined type names), but
      static subprogs give up on any other pointers, falling back to marking
      the subprog as "unreliable" and disabling the use of BTF type
      information altogether.
      
      For global functions, though, we assume that such pointers to
      unrecognized types are just pointers to a fixed-size memory region (or
      error out if the size cannot be established, as for `void *` pointers).
      
      This patch accommodates these small differences and sets the stage for
      refactoring in the next patch, eliminating a separate BTF-based parsing
      logic in btf_check_func_arg_match().
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: reuse btf_prepare_func_args() check for main program BTF validation · 5eccd2db
      Andrii Nakryiko authored
      
      Instead of btf_check_subprog_arg_match(), use btf_prepare_func_args()
      logic to validate the "trustworthiness" of the main BPF program's BTF
      information, if it is present.
      
      We ignored the results of the original BTF check anyway, oftentimes
      producing the confusing and ominous-sounding "reg type unsupported for
      arg#0 function" message, which has no apparent effect on program
      correctness or the verification process.
      
      All the -EFAULT-returning sanity checks are already performed in
      check_btf_info_early(), so there is zero reason to have this
      duplication of logic between btf_check_subprog_call() and
      btf_check_subprog_arg_match(). Dropping btf_check_subprog_arg_match()
      simplifies btf_check_func_arg_match() further, removing the
      `bool processing_call` flag.
      
      One subtle bit that was done by btf_check_subprog_arg_match() was
      potentially marking the main program's BTF as unreliable. We do this
      explicitly now with a dedicated simple check, preserving the original
      behavior, but now based on the well-factored btf_prepare_func_args()
      logic.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: abstract away global subprog arg preparation logic from reg state setup · 4ba1d0f2
      Andrii Nakryiko authored
      
      btf_prepare_func_args() is used to understand the expectations and
      restrictions on global subprog arguments. But the current
      implementation is hard to extend, as it intermixes BTF-based func
      prototype parsing and interpretation logic with setting up register
      state at subprog entry.

      Worse still, those registers are not completely set up inside
      btf_prepare_func_args(), requiring some more logic later in
      do_check_common(), like calling mark_reg_unknown() and similar
      initialization operations.
      
      This intermixing of BTF interpretation and register state setup is
      problematic. First, it causes duplication of BTF parsing logic between
      global subprog verification (to set up the initial state of the global
      subprog) and global subprog call site analysis (when we need to check
      that whatever is being passed into the global subprog matches
      expectations), performed in btf_check_subprog_call().
      
      Given we want to extend global func arguments with tags later, this
      duplication is problematic. So refactor btf_prepare_func_args() to do
      only BTF-based func proto and args parsing, returning high-level
      argument "expectations" only, with no regard to the specifics of
      register state. I.e., if it's a context argument, instead of setting
      the register state to PTR_TO_CTX, we return the ARG_PTR_TO_CTX enum for
      that argument as "an argument specification" for further processing
      inside do_check_common(). Similarly for SCALAR arguments, PTR_TO_MEM,
      etc.
      
      This allows reusing btf_prepare_func_args() in the following patches at
      global subprog call site analysis time. It also keeps register setup
      code consistently in one place, do_check_common().
      
      Besides all this, we cache this argument spec information inside
      env->subprog_info, eliminating the need to redo these potentially
      expensive BTF traversals, especially if the BPF program's BTF is big
      and/or there are lots of global subprog calls.
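
      A hedged sketch of the cached per-argument spec (struct and field names
      may differ slightly from what landed in bpf_verifier.h):

        /* high-level expectation for one global subprog argument */
        struct bpf_subprog_arg_info {
                enum bpf_arg_type arg_type;  /* e.g. ARG_PTR_TO_CTX */
                union {
                        u32 mem_size;        /* for fixed-size memory args */
                };
        };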
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231215011334.2307144-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: make the verifier track the "not equal" for regs · d028f875
      Menglong Dong authored
      
      We can derive some new information for BPF_JNE in regs_refine_cond_op().
      Take following code for example:
      
        /* The type of "a" is u32 */
        if (a > 0 && a < 100) {
          /* the range of the register for a is [0, 99], not [1, 99],
           * and will cause the following error:
           *
           *   invalid zero-sized read
           *
           * as a can be 0.
           */
          bpf_skb_store_bytes(skb, xx, xx, a, 0);
        }
      
      In the code above, "a > 0" will be compiled to "jmp xxx if a == 0". In
      the TRUE branch, the dst_reg will be marked as a known 0. However, in
      the fallthrough (FALSE) branch, the dst_reg is not handled, which
      leaves the [min, max] range for a as [0, 99] instead of [1, 99].
      
      For BPF_JNE, we can reduce the range of the dst reg if the src reg is a
      const and is exactly the edge of the dst reg.
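
      In sketch form (hedged: helper names follow the verifier's range
      refinement code, and only the unsigned 64-bit edges are shown):

        case BPF_JNE:
                if (!is_reg_const(reg2, is_jmp32))
                        swap(reg1, reg2);
                if (!is_reg_const(reg2, is_jmp32))
                        break;
                /* a constant sitting exactly on an edge of the other reg's
                 * range can be shaved off thanks to the "not equal" fact
                 */
                val = reg_const_value(reg2, is_jmp32);
                if (reg1->umin_value == val)
                        reg1->umin_value++;
                else if (reg1->umax_value == val)
                        reg1->umax_value--;
                break;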
      
      Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231219134800.1550388-2-menglong8.dong@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  6. Dec 15, 2023
    • bpf: xdp: Register generic_kfunc_set with XDP programs · 7489723c
      Daniel Xu authored
      
      Registering generic_kfunc_set with XDP programs enables some of the
      newer BPF features inside XDP -- namely, tree-based data structures and
      BPF exceptions.

      The current motivation for this commit is to enable assertions inside
      XDP BPF progs. Assertions are a standard and useful tool to encode
      intent.
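
      A sketch of such an assertion (hedged: bpf_assert() comes from the
      selftests' bpf_experimental.h header and aborts the program via a BPF
      exception when the condition is false):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include "bpf_experimental.h"

        SEC("xdp")
        int xdp_assert_example(struct xdp_md *ctx)
        {
                void *data = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;

                /* encode intent directly instead of threading error
                 * handling through every caller
                 */
                bpf_assert(data + sizeof(struct ethhdr) <= data_end);
                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";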
      
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: support symbolic BPF FS delegation mount options · c5707b21
      Andrii Nakryiko authored
      
      Besides the already supported special "any" value and hex bit masks,
      support string-based parsing of delegation masks based on exact
      enumerator names. Utilize the BTF information of the `enum bpf_cmd`,
      `enum bpf_map_type`, `enum bpf_prog_type`, and `enum bpf_attach_type`
      types to find supported symbolic names (ignoring __MAX_xxx guard values
      and stripping repetitive prefixes like BPF_ for cmd and attach types,
      BPF_MAP_TYPE_ for maps, and BPF_PROG_TYPE_ for prog types). Matching is
      case-insensitive, but values are normalized to lower case in mount
      option output. So "PROG_LOAD", "prog_load", and "MAP_create" are all
      valid values for the delegate_cmds option, "array" is among the
      supported map types, etc.
      
      Besides supporting string values, we also support multiple values
      specified at the same time, using a colon (':') as the separator.
      
      There are corresponding changes on the bpf_show_options side to use the
      known values to print them in human-readable form, falling back to hex
      mask printing if there are any unrecognized bits. This shouldn't be
      necessary when enum BTF information is present, but in general we
      should always be able to fall back to this even if the kernel was built
      without BTF. As mentioned, emitted symbolic names are normalized to all
      lower case.
      
      Example below shows various ways to specify delegate_cmds options
      through mount command and how mount options are printed back:
      
        $ sudo mkdir -p /sys/fs/bpf/token
        $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
                     -o delegate_cmds=prog_load:MAP_CREATE \
                     -o delegate_progs=kprobe \
                     -o delegate_attachs=xdp
        $ mount | grep token
        bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
      
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix a race condition between btf_put() and map_free() · 59e5791f
      Yonghong Song authored

      When running `./test_progs -j` in my local VM with the latest kernel,
      I once hit a KASAN error like the one below:
      
        [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
        [ 1887.186498]
        [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
        [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
        [ 1887.190341] Call Trace:
        [ 1887.190666]  <TASK>
        [ 1887.190949]  dump_stack_lvl+0xac/0xe0
        [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
        [ 1887.192019]  ? panic+0x3c0/0x3c0
        [ 1887.192449]  print_report+0x14f/0x720
        [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
        [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.194572]  kasan_report+0xc3/0x100
        [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
        [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
        [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
        [ 1887.198883]  array_map_free+0x1a3/0x260
        [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
        [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
        [ 1887.200549]  worker_thread+0x633/0x890
        [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
        [ 1887.201574]  ? kthread+0x102/0x1d0
        [ 1887.202020]  kthread+0x1ab/0x1d0
        [ 1887.202447]  ? pr_cont_work+0x270/0x270
        [ 1887.202954]  ? kthread_blkcg+0x50/0x50
        [ 1887.203444]  ret_from_fork+0x34/0x50
        [ 1887.203914]  ? kthread_blkcg+0x50/0x50
        [ 1887.204397]  ret_from_fork_asm+0x11/0x20
        [ 1887.204913]  </TASK>
        [ 1887.205209]
        [ 1887.205416] Allocated by task 2197:
        [ 1887.205881]  kasan_set_track+0x3f/0x60
        [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
        [ 1887.206856]  __kmalloc+0xac/0x1a0
        [ 1887.207293]  btf_parse_fields+0xa15/0x1480
        [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
        [ 1887.208387]  btf_new_fd+0x294/0x4d0
        [ 1887.208851]  __sys_bpf+0x4ba/0x600
        [ 1887.209292]  __x64_sys_bpf+0x41/0x50
        [ 1887.209762]  do_syscall_64+0x4c/0xf0
        [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
        [ 1887.210868]
        [ 1887.211074] Freed by task 36:
        [ 1887.211460]  kasan_set_track+0x3f/0x60
        [ 1887.211951]  kasan_save_free_info+0x28/0x40
        [ 1887.212485]  ____kasan_slab_free+0x101/0x180
        [ 1887.213027]  __kmem_cache_free+0xe4/0x210
        [ 1887.213514]  btf_free+0x5b/0x130
        [ 1887.213918]  rcu_core+0x638/0xcc0
        [ 1887.214347]  __do_softirq+0x114/0x37e
      
      The error happens at bpf_rb_root_free+0x1f8/0x2b0:
      
        00000000000034c0 <bpf_rb_root_free>:
        ; {
          34c0: f3 0f 1e fa                   endbr64
          34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
          34c9: 55                            pushq   %rbp
          34ca: 48 89 e5                      movq    %rsp, %rbp
        ...
        ;       if (rec && rec->refcount_off >= 0 &&
          36aa: 4d 85 ed                      testq   %r13, %r13
          36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
          36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
          36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                              <==== kasan function
          36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                              <==== use-after-free load
          36bc: 45 85 ff                      testl   %r15d, %r15d
          36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>
      
      So the problem is at rec->refcount_off in the above.
      
      I did some source code analysis and found the reason.
                                        CPU A                        CPU B
        bpf_map_put:
          ...
          btf_put with rcu callback
          ...
          bpf_map_free_deferred
            with system_unbound_wq
          ...                          ...                           ...
          ...                          btf_free_rcu:                 ...
          ...                          ...                           bpf_map_free_deferred:
          ...                          ...
          ...         --------->       btf_struct_metas_free()
          ...         | race condition ...
          ...         --------->                                     map->ops->map_free()
          ...
          ...                          btf->struct_meta_tab = NULL
      
      In the above, map_free() corresponds to array_map_free(), eventually
      calling bpf_rb_root_free(), which calls:
        ...
        __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
        ...
      
      Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the
      following code:
      
        meta = btf_find_struct_meta(btf, btf_id);
        if (!meta)
          return -EFAULT;
        rec->fields[i].graph_root.value_rec = meta->record;
      
      So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
      And it is possible that that particular record has been freed by
      btf_struct_metas_free(), and hence we get the KASAN error here.
      
      Actually it is very hard to reproduce the failure with the current
      bpf/bpf-next code; I only got the above error once. To make it easier
      to reproduce, I added a delay in bpf_map_free_deferred() to delay
      map->ops->map_free(), which significantly increased reproducibility.
      
        diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
        index 5e43ddd1b83f..aae5b5213e93 100644
        --- a/kernel/bpf/syscall.c
        +++ b/kernel/bpf/syscall.c
        @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
              struct bpf_map *map = container_of(work, struct bpf_map, work);
              struct btf_record *rec = map->record;
      
        +     mdelay(100);
              security_bpf_map_free(map);
              bpf_map_release_memcg(map);
              /* implementation dependent freeing */
      
      Hou also provided test cases ([1]) for easily reproducing the above issue.
      
      There are two ways to fix the issue: the v1 of the patch ([2]), moving
      btf_put() after the map_free callback, and the v5 of the patch ([3]),
      using a kptr-style fix which tries to take a btf reference during
      map_check_btf(). Each approach has its pros and cons. The first
      approach delays freeing the btf, while the second approach needs to
      acquire the reference depending on the context, which makes the logic
      not very elegant and may complicate things with future new data
      structures. Alexei suggested in [4] going back to v1, which is what
      this patch does.
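
      In sketch form (hedged against the exact upstream diff), the fix keeps
      the map's btf reference alive until after map_free has run:

        static void bpf_map_free_deferred(struct work_struct *work)
        {
                struct bpf_map *map = container_of(work, struct bpf_map, work);
                struct btf_record *rec = map->record;
                struct btf *btf = map->btf;

                security_bpf_map_free(map);
                bpf_map_release_memcg(map);
                /* implementation dependent freeing */
                map->ops->map_free(map);
                btf_record_free(rec);
                /* delay btf_put() until after map_free(), since the callback
                 * may still look at struct_meta info owned by this btf
                 */
                btf_put(btf);
        }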
      
      I reran './test_progs -j' with the above mdelay() hack a couple of
      times and didn't observe the error for the above rb_root test cases.
      Running Hou's test ([1]) is also successful.
      
        [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
        [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
        [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
        [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/

      Cc: Hou Tao <houtao@huaweicloud.com>
      Fixes: 958cf2e2 ("bpf: Introduce bpf_obj_new")
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>