  1. Feb 01, 2024
    • tracing/timerlat: Move hrtimer_init to timerlat_fd open() · 1389358b
      Daniel Bristot de Oliveira authored
      Currently, the timerlat's hrtimer is initialized at the first read of
      timerlat_fd, and destroyed at close(). It works, but it causes a NULL
      pointer dereference if the user program open()s and close()s the file
      without reading from it.
      
      Here's an example:
      
       # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
       # echo timerlat > /sys/kernel/debug/tracing/current_tracer
      
       # cat <<EOF > ./timerlat_load.py
       #!/usr/bin/env python3
      
       timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
       timerlat_fd.close();
       EOF
      
       # ./taskset -c 0 ./timerlat_load.py
      <BOOM>
      
       BUG: kernel NULL pointer dereference, address: 0000000000000010
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
       RIP: 0010:hrtimer_active+0xd/0x50
       Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
       RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
       RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
       RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
       R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
       FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
       PKRU: 55555554
       Call Trace:
        <TASK>
        ? __die+0x23/0x70
        ? page_fault_oops+0x171/0x4e0
        ? srso_alias_return_thunk+0x5/0x7f
        ? avc_has_extended_perms+0x237/0x520
        ? exc_page_fault+0x7f/0x180
        ? asm_exc_page_fault+0x26/0x30
        ? hrtimer_active+0xd/0x50
        hrtimer_cancel+0x15/0x40
        timerlat_fd_release+0x48/0xe0
        __fput+0xf5/0x290
        __x64_sys_close+0x3d/0x80
        do_syscall_64+0x60/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? __x64_sys_ioctl+0x72/0xd0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? exit_to_user_mode_prepare+0x142/0x1f0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
       RIP: 0033:0x7f2ffb321594
       Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
       RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
       RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
       RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
       R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
       R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
        </TASK>
       CR2: 0000000000000010
       ---[ end trace 0000000000000000 ]---
      
      Move hrtimer_init to timerlat_fd open() to avoid this problem.
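
      A minimal sketch of the shape of the fix (the helper name
      this_cpu_tmr_var(), the timerlat_irq callback and the hrtimer mode
      flags are assumptions from context, not quotes from the patch).
      Initializing the hrtimer at open() guarantees that close() can always
      safely call hrtimer_cancel(), even when no read() ever happened:

        static int timerlat_fd_open(struct inode *inode, struct file *file)
        {
        	struct timerlat_variables *tlat = this_cpu_tmr_var();

        	/* initialize at open() instead of at the first read() */
        	hrtimer_init(&tlat->timer, CLOCK_MONOTONIC,
        		     HRTIMER_MODE_ABS_PINNED_HARD);
        	tlat->timer.function = timerlat_irq;
        	return 0;
        }

        static int timerlat_fd_release(struct inode *inode, struct file *file)
        {
        	struct timerlat_variables *tlat = this_cpu_tmr_var();

        	/* safe even if the file was never read */
        	hrtimer_cancel(&tlat->timer);
        	return 0;
        }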
      
      Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: stable@vger.kernel.org
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  2. Jan 22, 2024
    • tracing: Ensure visibility when inserting an element into tracing_map · 2b447606
      Petr Pavlu authored
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check, which carries an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
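
      Condensed, the pairing looks as follows (a sketch of the logic, not
      the literal patch):

        /* insertion side: order the key write before publishing the element */
        elt = get_free_elt(map);
        memcpy(elt->key, key, map->key_size);
        smp_wmb();	/* make elt->key visible before entry->val */
        WRITE_ONCE(entry->val, elt);

        /* lookup side: the address dependency from READ_ONCE orders val->key */
        val = READ_ONCE(entry->val);
        if (val && keys_match(key, val->key, map->key_size))
        	return val;	/* the key is guaranteed to be fully written */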
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com

      Fixes: c193707d ("tracing: Remove code which merges duplicates")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  3. Dec 30, 2023
    • ftrace: Fix modification of direct_function hash while in use · d05cb470
      Steven Rostedt (Google) authored
      Masami Hiramatsu reported a memory leak in register_ftrace_direct() that
      occurs when the number of new entries added is large enough to cause two
      allocations in the loop:
      
              for (i = 0; i < size; i++) {
                      hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
                              new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
                              if (!new)
                                      goto out_remove;
                              entry->direct = addr;
                      }
              }
      
      Where ftrace_add_rec_direct() has:
      
              if (ftrace_hash_empty(direct_functions) ||
                  direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
                      struct ftrace_hash *new_hash;
                      int size = ftrace_hash_empty(direct_functions) ? 0 :
                              direct_functions->count + 1;
      
                      if (size < 32)
                              size = 32;
      
                      new_hash = dup_hash(direct_functions, size);
                      if (!new_hash)
                              return NULL;
      
                      *free_hash = direct_functions;
                      direct_functions = new_hash;
              }
      
      The "*free_hash = direct_functions;" can happen twice, losing the previous
      allocation of direct_functions.
      
      But this also exposed a more serious bug.
      
      The modification of direct_functions above is not safe. As
      direct_functions can be referenced at any time to find what direct caller
      it should call, the time between:
      
                      new_hash = dup_hash(direct_functions, size);
       and
                      direct_functions = new_hash;
      
      can have a race with another CPU (or even this one if it gets interrupted),
      and the entries being moved to the new hash are not referenced.
      
      That's because "dup_hash()" is misnamed; it is really a "move_hash()".
      It moves the entries from the old hash to the new one.
      
      Even if that were changed, this code would still not be proper, as
      direct_functions should not be updated until the end. That is the best
      way to handle function reference changes, and it is the way other parts
      of ftrace handle this.
      
      The following is done:
      
       1. Change add_hash_entry() to return the entry it created and inserted
          into the hash, and not just return success or not.
      
       2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
          the former.
      
       3. Allocate a "new_hash" at the start that is made for holding both the
          new hash entries as well as the existing entries in direct_functions.
      
       4. Copy (not move) the direct_function entries over to the new_hash.
      
       5. Copy the entries of the added hash to the new_hash.
      
       6. If everything succeeds, then use rcu_assign_pointer() to update the
          direct_functions with the new_hash (see the sketch below).
      
      This simplifies the code and fixes both the memory leak as well as the
      race condition mentioned above.
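
      Condensed into a sketch, the resulting pattern is the following, where
      copy_hash_entries() is a hypothetical stand-in for the copy loops; only
      the structure matters:

        struct ftrace_hash *new_hash;

        /* 3. allocate a hash big enough for both sets of entries up front */
        new_hash = alloc_ftrace_hash(size_bits);
        if (!new_hash)
        	return -ENOMEM;

        /* 4. and 5. copy (not move) the existing and the added entries */
        if (copy_hash_entries(new_hash, direct_functions) ||
            copy_hash_entries(new_hash, added_hash)) {
        	free_ftrace_hash(new_hash);
        	return -ENOMEM;
        }

        /* 6. publish only after the new hash is fully populated */
        rcu_assign_pointer(direct_functions, new_hash);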
      
      Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
      Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home

      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Fixes: 763e34e7 ("ftrace: Add register_ftrace_direct()")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  4. Dec 29, 2023
    • tracing: Fix blocked reader of snapshot buffer · 39a7dc23
      Steven Rostedt (Google) authored
      If an application blocks on the snapshot or snapshot_raw files, expecting
      to be woken up when a snapshot occurs, the wakeup will not happen. Or it
      may happen with an unexpected result.
      
      That result is that the application will be reading the main buffer
      instead of the snapshot buffer. That is because when the snapshot occurs,
      the main and snapshot buffers are swapped. But the reader has a descriptor
      still pointing to the buffer that it originally connected to.
      
      This is fine for the main buffer readers, as they may be blocked waiting
      for a watermark to be hit, and when a snapshot occurs, the data that the
      main readers want is now on the snapshot buffer.
      
      But for waiters of the snapshot buffer, they are waiting for an event to
      occur that will trigger the snapshot and they can then consume it quickly
      to save the snapshot before the next snapshot occurs. But to do this, they
      need to read the new snapshot buffer, not the old one that is now
      receiving new data.
      
      Also, it does not make sense to have a watermark "buffer_percent" on the
      snapshot buffer, as the snapshot buffer is static and does not receive new
      data except all at once.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home

      Cc: stable@vger.kernel.org
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Fixes: debdd57f ("tracing: Make a snapshot feature available from userspace")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Fix wake ups when buffer_percent is set to 100 · 623b1f89
      Steven Rostedt (Google) authored
      The tracefs file "buffer_percent" is to allow user space to set a
      water-mark on how much of the tracing ring buffer needs to be filled in
      order to wake up a blocked reader.
      
       0 - wait until there is any data in the buffer
       1 - wait until 1% of the sub-buffers are filled
       50 - wait until half of the sub-buffers are filled with data
       100 - do not wake the waiter until the ring buffer is completely full
      
      Unfortunately the test for being full was:
      
      	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
      	return (dirty * 100) > (full * nr_pages);
      
      Where "full" is the value for "buffer_percent".
      
      There are two issues with the above when full == 100.
      
      1. dirty * 100 > 100 * nr_pages will never be true
         That is, the above is basically saying that if the user sets
         buffer_percent to 100, more pages need to be dirty than exist in the
         ring buffer!
      
      2. The page that the writer is on is never considered dirty, as dirty
         pages are only those that are full. When the writer goes to a new
         sub-buffer, it clears the contents of that sub-buffer.
      
      That is, even if the check were ">=", it would still never be equal, as
      the most pages that can be considered "dirty" is nr_pages - 1.
      
      To fix this, add one to dirty and use ">=" in the compare.
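
      That is, the corrected test has the following shape:

        	dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;
        	return (dirty * 100) >= (full * nr_pages);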
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home

      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Fixes: 03329f99 ("tracing: Add tracefs file buffer_percentage")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  5. Dec 21, 2023
    • ring-buffer: Use subbuf_order for buffer page masking · 3cb30911
      Steven Rostedt (Google) authored
      The comparisons to PAGE_SIZE were all converted to use the
      buffer->subbuf_order, but the use of PAGE_MASK was missed.
      
      Convert all the PAGE_MASK usages over to:
      
        (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1
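
      For example (a sketch; since PAGE_MASK is ~(PAGE_SIZE - 1), the
      sub-buffer analogue needs the same negation wherever it masked off the
      low bits):

        	/* before: assumes one-page sub-buffers */
        	addr &= PAGE_MASK;

        	/* after: mask by the actual sub-buffer size */
        	addr &= ~((PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1);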
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: 139f8400 ("ring-buffer: Page size per ring buffer")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Update subbuffer with kilobytes not page order · 2f84b39f
      Steven Rostedt (Google) authored
      Using the page order to decide the size of the ring buffer's sub-buffers
      exposes a bit too much of the implementation. Although the sub-buffers
      are only allocated in orders of pages, allow the user to specify the
      minimum size of each sub-buffer via kilobytes, like they can with the
      buffer size itself.
      
      If the user specifies 3 via:

        echo 3 > buffer_subbuf_size_kb

      Then the sub-buffer size will round up to 4KB (on a 4KB page size
      system).

      If they specify:

        echo 6 > buffer_subbuf_size_kb

      The sub-buffer size will become 8KB, and so on.
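
      The rounding can be pictured with a small sketch (subbuf_kb_to_order()
      is a hypothetical helper, not the actual function in the kernel):

        static int subbuf_kb_to_order(unsigned long kb)
        {
        	unsigned long pages = DIV_ROUND_UP(kb * 1024, PAGE_SIZE);

        	/* 3KB -> 1 page -> order 0 (4KB); 6KB -> 2 pages -> order 1 (8KB) */
        	return get_count_order(pages);
        }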
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Just update the subbuffers when changing their allocation order · 8e7b58c2
      Steven Rostedt (Google) authored
      The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
      cpu_buffers with new subbuffers of the updated order, and if they were
      all successfully created, then the ring_buffer's per_cpu buffers would
      be freed and replaced by the new ones.
      
      The problem is that the freed per_cpu buffers contains state that would be
      lost. Running the following commands:
      
      1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
      2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
      3. # echo 1 > /sys/kernel/tracing/snapshot
      4. # echo ff > /sys/kernel/tracing/tracing_cpumask
      5. # echo test > /sys/kernel/tracing/trace_marker
      
      Would result in:
      
       -bash: echo: write error: Bad file descriptor
      
      That's because the state of the per_cpu buffers of the snapshot buffer is
      lost when the order is changed (the order of a freed snapshot buffer goes
      to 0 to save memory, and when the snapshot buffer is allocated again, it
      goes back to what the main buffer is).
      
      In operation 2, the snapshot buffers were set to "disable" (as all the
      ring buffers CPUs were disabled).
      
      In operation 3, the snapshot is allocated and a call to
      ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
      "record_disable" count.
      
      When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
      decrementing a zero, setting it to -1. Writing 1 into the snapshot would
      swap the snapshot buffer with the main buffer, so now the main buffer is
      "disabled", and nothing can write to the ring buffer anymore.
      
      Instead of creating new per_cpu buffers and losing the state of the old
      buffers, basically do what the resize does: just allocate new subbuf
      pages into the new_pages linked list of the per_cpu buffer, and if they
      all succeed, then replace the old sub buffers with the new ones. This
      keeps the per_cpu buffer descriptor intact and, by doing so, keeps its
      state.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: f9b94daa ("ring-buffer: Set new size of the ring buffer sub page")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Keep the same size when updating the order · 353cc219
      Steven Rostedt (Google) authored
      The function ring_buffer_subbuf_order_set() just updated the sub-buffers
      to the new size, but in doing so it also changed the total size of the
      buffer, as that size is determined by nr_pages * subbuf_size. If the
      subbuf_size is increased without decreasing the nr_pages, the total size
      of the buffer increases.
      
      This broke the latency tracers as the snapshot needs to be the same size
      as the main buffer. The size of the snapshot buffer is only expanded when
      needed, and because the order is still the same, the size becomes out of
      sync with the main buffer, as the main buffer increased in size without
      the tracing system knowing.
      
      Calculate the nr_pages to allocate with the new subbuf_size to be
      buffer_size / new_subbuf_size.
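
      In sketch form:

        	/* keep the total size constant across the order change */
        	buffer_size = nr_pages * old_subbuf_size;
        	nr_pages = buffer_size / new_subbuf_size;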
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: f9b94daa ("ring-buffer: Set new size of the ring buffer sub page")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Stop the tracing while changing the ring buffer subbuf size · fa4b54af
      Steven Rostedt (Google) authored
      Some tracers require the main buffer and the snapshot buffer to be the
      same size; otherwise they fail and disable all tracing. Stop the tracers
      while updating the sub-buffer sizes so that they always see the main and
      snapshot buffers with the same sub-buffer size.
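
      A sketch of the resulting sequence (assuming tracing_stop_tr() and
      tracing_start_tr() helpers around the order update; the exact call
      sites are in the subbuf-size write handler):

        	tracing_stop_tr(tr);

        	ring_buffer_subbuf_order_set(tr->array_buffer.buffer, order);
        #ifdef CONFIG_TRACER_MAX_TRACE
        	ring_buffer_subbuf_order_set(tr->max_buffer.buffer, order);
        #endif

        	tracing_start_tr(tr);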
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: 2808e31e ("ring-buffer: Add interface for configuring trace sub buffer size")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Update snapshot order along with main buffer order · aa067682
      Steven Rostedt (Google) authored
      When updating the order of the sub buffers for the main buffer, make sure
      that if the snapshot buffer exists, that it gets its order updated as
      well.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Make sure the spare sub buffer used for reads has same size · 4e958db3
      Steven Rostedt (Google) authored
      Now that the ring buffer specifies the size of its sub buffers, they all
      need to be the same size. When doing a read, a swap is done with a spare
      page. Make sure they are the same size before doing the swap, otherwise
      the read will fail.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Do no swap cpu buffers if order is different · b81e03a2
      Steven Rostedt (Google) authored
      As all the subbuffer order (subbuffer sizes) must be the same throughout
      the ring buffer, check the order of the buffers that are doing a CPU
      buffer swap in ring_buffer_swap_cpu() to make sure they are the same.
      
      If they are not the same, then fail to do the swap; otherwise the ring
      buffer would think the CPU buffer has a specific subbuffer size when it
      does not.
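
      The guard amounts to a few lines in ring_buffer_swap_cpu() (a sketch;
      the error value is an assumption):

        	if (buffer_a->subbuf_order != buffer_b->subbuf_order)
        		return -EINVAL;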
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure · 22887dfb
      Steven Rostedt (Google) authored
      On failure to allocate ring buffer pages, the pointer to the CPU buffer
      pages is freed, but the pages that were allocated previously were not.
      Make sure they are freed too.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: f9b94daa ("tracing: Set new size of the ring buffer sub page")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing / synthetic: Disable events after testing in synth_event_gen_test_init() · 88b30c7f
      Steven Rostedt (Google) authored
      The synth_event_gen_test module can be built in, if someone wants to run
      the tests at boot up and not have to load them.
      
      The synth_event_gen_test_init() function creates and enables the synthetic
      events and runs its tests.
      
      The synth_event_gen_test_exit() disables the events it created and
      destroys the events.
      
      If the module is builtin, the events are never disabled. The issue is
      that the events should be disabled after the tests are run. This could
      be an issue if the rest of the boot up tests are enabled, as they expect
      the events to be in a known state before testing. That known state
      happens to be disabled.
      
      When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
      a warning will trigger:
      
       Running tests on trace events:
       Testing event create_synth_test:
       Enabled event during self test!
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480
       Modules linked in:
       CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
       RIP: 0010:event_trace_self_tests+0x1c2/0x480
       Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
       RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246
       RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64
       RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a
       R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090
       R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078
       FS:  0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0
       Call Trace:
        <TASK>
        ? __warn+0xa5/0x200
        ? event_trace_self_tests+0x1c2/0x480
        ? report_bug+0x1f6/0x220
        ? handle_bug+0x6f/0x90
        ? exc_invalid_op+0x17/0x50
        ? asm_exc_invalid_op+0x1a/0x20
        ? tracer_preempt_on+0x78/0x1c0
        ? event_trace_self_tests+0x1c2/0x480
        ? __pfx_event_trace_self_tests_init+0x10/0x10
        event_trace_self_tests_init+0x27/0xe0
        do_one_initcall+0xd6/0x3c0
        ? __pfx_do_one_initcall+0x10/0x10
        ? kasan_set_track+0x25/0x30
        ? rcu_is_watching+0x38/0x60
        kernel_init_freeable+0x324/0x450
        ? __pfx_kernel_init+0x10/0x10
        kernel_init+0x1f/0x1e0
        ? _raw_spin_unlock_irq+0x33/0x50
        ret_from_fork+0x34/0x60
        ? __pfx_kernel_init+0x10/0x10
        ret_from_fork_asm+0x1b/0x30
        </TASK>
      
      This is because the synth_event_gen_test_init() left the synthetic events
      that it created enabled. By having it disable them after testing, the
      other selftests will run fine.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home

      Cc: stable@vger.kernel.org
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Tom Zanussi <zanussi@kernel.org>
      Fixes: 9fe41efa ("tracing: Add synth event generation test module")
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Reported-by: Alexander Graf <graf@amazon.com>
      Tested-by: Alexander Graf <graf@amazon.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  6. Dec 19, 2023
    • Revert BPF token-related functionality · d17aff80
      Andrii Nakryiko authored
      This patch includes the following reverts (one conflicting BPF FS patch
      and three token patch sets, represented by merge commits):
        - revert 0f5d5454 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
        - revert 750e7857 "bpf: Support uid and gid when mounting bpffs";
        - revert 73376328 "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
        - revert c35919dc "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".
      
      Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • ring-buffer: Check if absolute timestamp goes backwards · f50345b4
      Steven Rostedt (Google) authored
      The check_buffer() function, which (when enabled) checks the timestamps
      of the ring buffer sub-buffer page, only verifies that adding the deltas
      of the events to the last absolute timestamp, or to the timestamp of the
      sub-buffer page, adds up to the timestamp of the current event.
      
      What it does not check is if the absolute timestamp causes the time of the
      events to go backwards, as that can cause issues elsewhere.
      
      Test for the timestamp going backwards too.
      
      This also fixes a slight issue where, if the warning triggers at boot up
      (because of the resetting of the tsc), it will disable all further
      checks, even those that are after boot. Have it continue checking if the
      warning was ignored during boot up.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Add interrupt information to dump of data sub-buffer · d40dbb61
      Steven Rostedt (Google) authored
      When the ring buffer timestamp verifier triggers, it dumps the content of
      the sub-buffer. But currently it only dumps the timestamps and the offset
      of the data as well as the deltas. It would be even more informative if
      the event data also showed the interrupt context level it was in.
      
      That is, each event would show whether it was written in normal,
      softirq, irq or NMI context. That gives a better idea of how the events
      may have interrupted each other.
      
      As the payload of the ring buffer is really a black box to the ring
      buffer itself, just assume that if the payload is larger than a trace
      entry, it is a trace entry. As trace entries have the interrupt context
      information saved in a flags field, look at that location and report the
      output of the flags.

      If the payload is not a trace entry, there's no way to really know, and
      the information will be garbage. But that's OK, because this is for
      debugging only (this output is not used in production, as the buffer
      check that calls it adds a huge overhead to the tracing). This
      information, when available, is crucial for debugging timestamp issues.
      If it's garbage, it will also be pretty obvious that it's garbage.
      
      As this output usually happens in kselftests of the tracing code, the user
      will know what the payload is at the time.
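
      A sketch of the heuristic (the flag names are from
      include/linux/trace_events.h; the actual dump code may differ):

        	if (len > sizeof(struct trace_entry)) {
        		struct trace_entry *e = payload;

        		pr_warn(" context:%s\n",
        			(e->flags & TRACE_FLAG_NMI)     ? "nmi" :
        			(e->flags & TRACE_FLAG_HARDIRQ) ? "irq" :
        			(e->flags & TRACE_FLAG_SOFTIRQ) ? "softirq" :
        							  "normal");
        	}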
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Remove 32bit timestamp logic · c84897c0
      Steven Rostedt (Google) authored
      Each event has a 27 bit timestamp delta that is used to hold the delta
      from the last event. If the time between events is greater than 2^27, then
      a timestamp is added that holds a 59 bit absolute timestamp.
      
      Until a389d86f ("ring-buffer: Have nested events still record running
      time stamp"), if an interrupt interrupted an event in progress, all the
      interrupting events' deltas would be zero, to avoid dealing with the
      races that would otherwise need to be handled. Commit a389d86f changed
      that to handle the races, so that all events, even those that preempt
      other events, still have an accurate timestamp.
      
      To handle those races requires performing 64-bit cmpxchg on the
      timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
      very slow. To try to deal with this the timestamp logic was broken into
      two and then three 32-bit cmpxchgs, with the thought that two (or three)
      32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
      architectures.
      
      Part of the problem with this is that I didn't have any 32-bit
      architectures to test on. After hitting several subtle bugs in this code,
      an effort was made to try and see if three 32-bit cmpxchgs are indeed
      faster than a single 64-bit. After a few people brushed off the dust of
      their old 32-bit machines, tests were done, and even though a 32-bit
      cmpxchg was faster than a single 64-bit one, it was on the order of 50%
      faster at best, not 300%.
      
      After some more refactoring of the code, all four 64-bit cmpxchg calls
      were removed:
      
       https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
       https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
       https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
       https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/
      
      With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be
      removed.
      
      The 32-bit and 64-bit logic is now exactly the same.
      
      Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Increase size of trace_marker_raw to max ring buffer entry · 76ca20c7
      Steven Rostedt (Google) authored
      There's no reason to give an arbitrary limit to the size of a raw trace
      marker. Just let it be as big as the size that is allowed by the ring
      buffer itself.
      
      And there's also no reason to artificially break up the write to
      TRACE_BUF_SIZE, as that's not even used.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home

      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Have trace_marker break up by lines by size of trace_seq · 9482341d
      Steven Rostedt (Google) authored
      If a trace_marker write is bigger than what trace_seq can hold, then it
      will print a "LINE TOO BIG" message instead of what was written.

      Instead, check if the write is bigger than the trace_seq and break it
      up by that size (see the sketch below).

      Ideally, we could make trace_seq dynamic so that it could hold this. But
      that's for another time.
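
      One possible shape of the chunking, as a sketch of the idea only (the
      bound "max" stands for whatever the trace_seq buffer can hold; the
      actual patch also splits at line boundaries, per the subject line):

        	size_t max = TRACE_SEQ_BUFFER_SIZE - 1;

        	while (len > 0) {
        		size_t chunk = min(len, max);

        		trace_seq_printf(s, "%.*s", (int)chunk, buf);
        		buf += chunk;
        		len -= chunk;
        	}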
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home

      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • trace_seq: Increase the buffer size to almost two pages · 40fc60e3
      Steven Rostedt (Google) authored
      Now that trace_marker can hold more than a 1KB string, and can write as
      much as the ring buffer can hold, the trace_seq is not big enough to
      hold such writes:
      
       ~# a="1234567890"
       ~# cnt=4080
       ~# s=""
       ~# while [ $cnt -gt 10 ]; do
       ~#	s="${s}${a}"
       ~#	cnt=$((cnt-10))
       ~# done
       ~# echo $s > trace_marker
       ~# cat trace
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 2/2   #P:8
       #
       #                                _-----=> irqs-off/BH-disabled
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              ||| / _-=> migrate-disable
       #                              |||| /     delay
       #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
       #              | |         |   |||||     |         |
                  <...>-860     [002] .....   105.543465: tracing_mark_write[LINE TOO BIG]
                  <...>-860     [002] .....   105.543496: tracing_mark_write: 789012345678901234567890
      
      By increasing the trace_seq buffer to almost two pages, it can now print
      out the first line.
      
      This also subtracts the rest of the trace_seq fields from the buffer, so
      that the entire trace_seq is now PAGE_SIZE aligned.
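
      The sizing idea, sketched (field names simplified; the constant makes
      sizeof(struct trace_seq) land exactly on two pages):

        #define TRACE_SEQ_BUFFER_SIZE	(PAGE_SIZE * 2 - \
        	(sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int)))

        struct trace_seq {
        	char		buffer[TRACE_SEQ_BUFFER_SIZE];
        	struct seq_buf	seq;
        	size_t		readpos;
        	int		full;
        };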
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home

      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Allow for max buffer data size trace_marker writes · 8ec90be7
      Steven Rostedt (Google) authored
      Allow a trace write to be as big as the ring buffer tracing data will
      allow. Currently, it only allows writes of 1KB in size, but there's no
      reason that it cannot allow what the ring buffer can hold.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Add offset of events in dump on mismatch · 0b9036ef
      Steven Rostedt (Google) authored
      On bugs that cause the ring buffer timestamp to get out of sync, the
      config option CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS checks for it, and
      if a mismatch is detected, it causes a dump of the bad sub-buffer.

      It shows each event and its timestamp, as well as the delta in the
      event. But it's also good to see the offset of each event into the
      sub-buffer, to know how close to the end it is.
      
      Also print where the last event actually ended compared to where it was
      expected to end.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home

      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Allow creating instances with specified system events · d2356997
      Steven Rostedt (Google) authored
      A trace instance may only need to enable specific events. As the eventfs
      directory of an instance currently creates all events, which adds
      overhead, allow internal instances to be created with just the events in
      the systems that they care about. This currently only deals with systems
      and not individual events, but it should bring down the overhead of
      creating instances for specific use cases quite a bit.
      
      The trace_array_get_by_name() now has another parameter "systems". This
      parameter is a const string pointer of a comma/space separated list of
      event systems that should be created by the trace_array. (Note if the
      trace_array already exists, this parameter is ignored).
      
      The list of systems is saved, and if a module is loaded, its events will
      not be added unless the system for those events also matches the systems
      string.
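
      For example (a sketch; the instance name and system list are
      illustrative, and passing NULL is expected to keep the old
      create-everything behavior):

        	struct trace_array *tr;

        	/* eventfs for this instance only gets the sched and timer systems */
        	tr = trace_array_get_by_name("my_instance", "sched,timer");
        	if (!tr)
        		return -ENODEV;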
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: Arun Easi <aeasi@marvell.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Tested-by: Dmytro Maluka <dmaluka@chromium.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • ring-buffer: Fix slowpath of interrupted event · b803d7c6
      Steven Rostedt (Google) authored
      To synchronize the timestamps with the ring buffer reservation, there are
      two timestamps that are saved in the buffer meta data.
      
      1. before_stamp
      2. write_stamp
      
      When the two are equal, the write_stamp is considered valid, as in, it may
      be used to calculate the delta of the next event as the write_stamp is the
      timestamp of the previous reserved event on the buffer.
      
      This is done by the following:
      
       /*A*/	w = current position on the ring buffer
      	before = before_stamp
      	after = write_stamp
      	ts = read current timestamp
      
      	if (before != after) {
      		write_stamp is not valid, force adding an absolute
      		timestamp.
      	}
      
       /*B*/	before_stamp = ts
      
       /*C*/	write = local_add_return(event length, position on ring buffer)
      
      	if (w == write - event length) {
      		/* Nothing interrupted between A and C */
       /*E*/		write_stamp = ts;
      		delta = ts - after
      		/*
      		 * If nothing interrupted again,
      		 * before_stamp == write_stamp and write_stamp
      		 * can be used to calculate the delta for
      		 * events that come in after this one.
      		 */
      	} else {
      
      		/*
      		 * The slow path!
      		 * Was interrupted between A and C.
      		 */
      
      This is the place where the bug is. We currently have:
      
      		after = write_stamp
      		ts = read current timestamp
      
       /*F*/		if (write == current position on the ring buffer &&
      		    after < ts && cmpxchg(write_stamp, after, ts)) {
      
      			delta = ts - after;
      
      		} else {
      			delta = 0;
      		}
      
      The assumption is that if the current position on the ring buffer hasn't
      moved between C and F, then it also was not interrupted, and that the
      last event written has a timestamp that matches the write_stamp. That
      is, the write_stamp is valid.
      
      But this may not be the case:
      
      If a task context event was interrupted by softirq between B and C.
      
      And the softirq wrote an event that got interrupted by a hard irq between
      C and E.
      
      And the hard irq wrote an event (it does not need to be interrupted).
      
      We have:
      
       /*B*/ before_stamp = ts of normal context
      
         ---> interrupted by softirq
      
      	/*B*/ before_stamp = ts of softirq context
      
      	  ---> interrupted by hardirq
      
      		/*B*/ before_stamp = ts of hard irq context
      		/*E*/ write_stamp = ts of hard irq context
      
      		/* matches and write_stamp valid */
      	  <----
      
      	/*E*/ write_stamp = ts of softirq context
      
      	/* No longer matches before_stamp, write_stamp is not valid! */
      
         <---
      
       w != write - length, go to slow path
      
      // Right now the order of events in the ring buffer is:
      //
      // |-- softirq event --|-- hard irq event --|-- normal context event --|
      //
      
       after = write_stamp (this is the ts of softirq)
       ts = read current timestamp
      
       if (write == current position on the ring buffer [true] &&
           after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
      
      	delta = ts - after  [Wrong!]
      
      The delta should be between the hard irq event and the normal context
      event, but the above logic made the delta between the softirq event and
      the normal context event, with the hard irq event between the two. This
      will shift all the remaining event timestamps on the sub-buffer
      incorrectly.
      
      The write_stamp is only valid if it matches the before_stamp. The cmpxchg
      does nothing to help this.
      
      Instead, the following logic can be done to fix this:
      
      	before = before_stamp
      	ts = read current timestamp
      	before_stamp = ts
      
      	after = write_stamp
      
      	if (write == current position on the ring buffer &&
      	    after == before && after < ts) {
      
      		delta = ts - after
      
      	} else {
      		delta = 0;
      	}
      
      The above will only use the write_stamp if it still matches before_stamp
      and was tested to not have changed since C.
      
      As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
      
      This means the 32-bit rb_time_t workaround can finally be removed. But
      that's for a later time.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home

      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Fixes: dd939425 ("ring-buffer: Do not try to put back write_stamp")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>