- Feb 01, 2024
-
-
Daniel Bristot de Oliveira authored
Currently, the timerlat's hrtimer is initialized at the first read of timerlat_fd, and destroyed at close(). It works, but it causes an error if the user program open() and close() the file without reading. Here's an example: # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options # echo timerlat > /sys/kernel/debug/tracing/current_tracer # cat <<EOF > ./timerlat_load.py # !/usr/bin/env python3 timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r') timerlat_fd.close(); EOF # ./taskset -c 0 ./timerlat_load.py <BOOM> BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:hrtimer_active+0xd/0x50 Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286 RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08 RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08 R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000 FS: 00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x171/0x4e0 ? srso_alias_return_thunk+0x5/0x7f ? avc_has_extended_perms+0x237/0x520 ? exc_page_fault+0x7f/0x180 ? asm_exc_page_fault+0x26/0x30 ? hrtimer_active+0xd/0x50 hrtimer_cancel+0x15/0x40 timerlat_fd_release+0x48/0xe0 __fput+0xf5/0x290 __x64_sys_close+0x3d/0x80 do_syscall_64+0x60/0x90 ? srso_alias_return_thunk+0x5/0x7f ? __x64_sys_ioctl+0x72/0xd0 ? srso_alias_return_thunk+0x5/0x7f ? syscall_exit_to_user_mode+0x2b/0x40 ? srso_alias_return_thunk+0x5/0x7f ? do_syscall_64+0x6c/0x90 ? srso_alias_return_thunk+0x5/0x7f ? exit_to_user_mode_prepare+0x142/0x1f0 ? srso_alias_return_thunk+0x5/0x7f ? syscall_exit_to_user_mode+0x2b/0x40 ? srso_alias_return_thunk+0x5/0x7f ? do_syscall_64+0x6c/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f2ffb321594 Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000 R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003 R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668 </TASK> CR2: 0000000000000010 ---[ end trace 0000000000000000 ]--- Move hrtimer_init to timerlat_fd open() to avoid this problem. Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Fixes: e88ed227 ("tracing/timerlat: Add user-space interface") Signed-off-by:
Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Jan 31, 2024
-
-
Vincent Donnefort authored
The return type for ring_buffer_poll_wait() is __poll_t. This is behind the scenes an unsigned where we can set event bits. In case of a non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us, this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so user-space at least is aware something went wrong. Nonetheless, this is an incorrect code. Replace that -EINVAL with a proper EPOLLERR to clean that output. As this doesn't change the behaviour, there's no need to treat this change as a bug fix. Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com Cc: stable@vger.kernel.org Fixes: 6721cb60 ("ring-buffer: Do not poll non allocated cpu buffers") Signed-off-by:
Vincent Donnefort <vdonnefort@google.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Jan 26, 2024
-
-
Masami Hiramatsu (Google) authored
Fix register_snapshot_trigger() to return error code if it failed to allocate a snapshot instead of 0 (success). Unless that, it will register snapshot trigger without an error. Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2 Fixes: 0bbe7f71 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation") Cc: stable@vger.kernel.org Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Jan 22, 2024
-
-
Petr Pavlu authored
Running the following two commands in parallel on a multi-processor AArch64 machine can sporadically produce an unexpected warning about duplicate histogram entries: $ while true; do echo hist:key=id.syscall:val=hitcount > \ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist sleep 0.001 done $ stress-ng --sysbadaddr $(nproc) The warning looks as follows: [ 2911.172474] ------------[ cut here ]------------ [ 2911.173111] Duplicates detected: 1 [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408 [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E) [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1 [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01 [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018 [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408 [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408 [ 2911.185310] sp : ffff8000a1513900 [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001 [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008 [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180 [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8 [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731 [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8 [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000 [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480 [ 2911.194259] Call trace: [ 2911.194626] tracing_map_sort_entries+0x3e0/0x408 [ 2911.195220] hist_show+0x124/0x800 [ 2911.195692] seq_read_iter+0x1d4/0x4e8 [ 2911.196193] seq_read+0xe8/0x138 [ 2911.196638] vfs_read+0xc8/0x300 [ 2911.197078] ksys_read+0x70/0x108 [ 2911.197534] __arm64_sys_read+0x24/0x38 [ 2911.198046] invoke_syscall+0x78/0x108 [ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8 [ 2911.199157] do_el0_svc+0x28/0x40 [ 2911.199613] el0_svc+0x40/0x178 [ 2911.200048] el0t_64_sync_handler+0x13c/0x158 [ 2911.200621] el0t_64_sync+0x1a8/0x1b0 [ 2911.201115] ---[ end trace 0000000000000000 ]--- The problem appears to be caused by CPU reordering of writes issued from __tracing_map_insert(). The check for the presence of an element with a given key in this function is: val = READ_ONCE(entry->val); if (val && keys_match(key, val->key, map->key_size)) ... The write of a new entry is: elt = get_free_elt(map); memcpy(elt->key, key, map->key_size); entry->val = elt; The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;" stores may become visible in the reversed order on another CPU. This second CPU might then incorrectly determine that a new key doesn't match an already present val->key and subsequently insert a new element, resulting in a duplicate. Fix the problem by adding a write barrier between "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for good measure, also use WRITE_ONCE(entry->val, elt) for publishing the element. The sequence pairs with the mentioned "READ_ONCE(entry->val);" and the "val->key" check which has an address dependency. The barrier is placed on a path executed when adding an element for a new key. Subsequent updates targeting the same key remain unaffected. From the user's perspective, the issue was introduced by commit c193707d ("tracing: Remove code which merges duplicates"), which followed commit cbf4100e ("tracing: Add support to detect and avoid duplicates"). The previous code operated differently; it inherently expected potential races which result in duplicates but merged them later when they occurred. Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com Fixes: c193707d ("tracing: Remove code which merges duplicates") Signed-off-by:
Petr Pavlu <petr.pavlu@suse.com> Acked-by:
Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Jan 10, 2024
-
-
Steven Rostedt (Google) authored
It's been 11 years since the ring_buffer_size() function was updated to use the nr_pages from the buffer->buffers[cpu] structure instead of using the buffer->nr_pages that no longer exists. The comment in the code is more of what a change log should have and is pretty much useless for development. It's saying how things worked back in 2012 that bares no purpose on today's code. Remove it. Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/ Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reported-by:
David Laight <David.Laight@ACULAB.COM> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org>
-
- Jan 08, 2024
-
-
Steven Rostedt (Google) authored
The parse_actions() function uses 'len = str_has_prefix()' to test which action is in the string being parsed. But then it goes and repeats the logic for each different action. This logic can be simplified and duplicate code can be removed as 'len' contains the length of the found prefix which should be used for all actions. Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 30, 2023
-
-
Steven Rostedt (Google) authored
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where if the number of new entries are added is large enough to cause two allocations in the loop: for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { new = ftrace_add_rec_direct(entry->ip, addr, &free_hash); if (!new) goto out_remove; entry->direct = addr; } } Where ftrace_add_rec_direct() has: if (ftrace_hash_empty(direct_functions) || direct_functions->count > 2 * (1 << direct_functions->size_bits)) { struct ftrace_hash *new_hash; int size = ftrace_hash_empty(direct_functions) ? 0 : direct_functions->count + 1; if (size < 32) size = 32; new_hash = dup_hash(direct_functions, size); if (!new_hash) return NULL; *free_hash = direct_functions; direct_functions = new_hash; } The "*free_hash = direct_functions;" can happen twice, losing the previous allocation of direct_functions. But this also exposed a more serious bug. The modification of direct_functions above is not safe. As direct_functions can be referenced at any time to find what direct caller it should call, the time between: new_hash = dup_hash(direct_functions, size); and direct_functions = new_hash; can have a race with another CPU (or even this one if it gets interrupted), and the entries being moved to the new hash are not referenced. That's because the "dup_hash()" is really misnamed and is really a "move_hash()". It moves the entries from the old hash to the new one. Now even if that was changed, this code is not proper as direct_functions should not be updated until the end. That is the best way to handle function reference changes, and is the way other parts of ftrace handles this. The following is done: 1. Change add_hash_entry() to return the entry it created and inserted into the hash, and not just return success or not. 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove the former. 3. Allocate a "new_hash" at the start that is made for holding both the new hash entries as well as the existing entries in direct_functions. 4. Copy (not move) the direct_function entries over to the new_hash. 5. Copy the entries of the added hash to the new_hash. 6. If everything succeeds, then use rcu_pointer_assign() to update the direct_functions with the new_hash. This simplifies the code and fixes both the memory leak as well as the race condition mentioned above. Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/ Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 763e34e7 ("ftrace: Add register_ftrace_direct()") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 29, 2023
-
-
Steven Rostedt (Google) authored
If an application blocks on the snapshot or snapshot_raw files, expecting to be woken up when a snapshot occurs, it will not happen. Or it may happen with an unexpected result. That result is that the application will be reading the main buffer instead of the snapshot buffer. That is because when the snapshot occurs, the main and snapshot buffers are swapped. But the reader has a descriptor still pointing to the buffer that it originally connected to. This is fine for the main buffer readers, as they may be blocked waiting for a watermark to be hit, and when a snapshot occurs, the data that the main readers want is now on the snapshot buffer. But for waiters of the snapshot buffer, they are waiting for an event to occur that will trigger the snapshot and they can then consume it quickly to save the snapshot before the next snapshot occurs. But to do this, they need to read the new snapshot buffer, not the old one that is now receiving new data. Also, it does not make sense to have a watermark "buffer_percent" on the snapshot buffer, as the snapshot buffer is static and does not receive new data except all at once. Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: debdd57f ("tracing: Make a snapshot feature available from userspace") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The tracefs file "buffer_percent" is to allow user space to set a water-mark on how much of the tracing ring buffer needs to be filled in order to wake up a blocked reader. 0 - is to wait until any data is in the buffer 1 - is to wait for 1% of the sub buffers to be filled 50 - would be half of the sub buffers are filled with data 100 - is not to wake the waiter until the ring buffer is completely full Unfortunately the test for being full was: dirty = ring_buffer_nr_dirty_pages(buffer, cpu); return (dirty * 100) > (full * nr_pages); Where "full" is the value for "buffer_percent". There is two issues with the above when full == 100. 1. dirty * 100 > 100 * nr_pages will never be true That is, the above is basically saying that if the user sets buffer_percent to 100, more pages need to be dirty than exist in the ring buffer! 2. The page that the writer is on is never considered dirty, as dirty pages are only those that are full. When the writer goes to a new sub-buffer, it clears the contents of that sub-buffer. That is, even if the check was ">=" it would still not be equal as the most pages that can be considered "dirty" is nr_pages - 1. To fix this, add one to dirty and use ">=" in the compare. Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 03329f99 ("tracing: Add tracefs file buffer_percentage") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 21, 2023
-
-
Steven Rostedt (Google) authored
The comparisons to PAGE_SIZE were all converted to use the buffer->subbuf_order, but the use of PAGE_MASK was missed. Convert all the PAGE_MASK usages over to: (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1 Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: 139f8400 ("ring-buffer: Page size per ring buffer") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Using page order for deciding what the size of the ring buffer sub buffers are is exposing a bit too much of the implementation. Although the sub buffers are only allocated in orders of pages, allow the user to specify the minimum size of each sub-buffer via kilobytes like they can with the buffer size itself. If the user specifies 3 via: echo 3 > buffer_subbuf_size_kb Then the sub-buffer size will round up to 4kb (on a 4kb page size system). If they specify: echo 6 > buffer_subbuf_size_kb The sub-buffer size will become 8kb. and so on. Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu cpu_buffers with the new subbuffers with the updated order, and if they all successfully were created, then they the ring_buffer's per_cpu buffers would be freed and replaced by them. The problem is that the freed per_cpu buffers contains state that would be lost. Running the following commands: 1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order 2. # echo 0 > /sys/kernel/tracing/tracing_cpumask 3. # echo 1 > /sys/kernel/tracing/snapshot 4. # echo ff > /sys/kernel/tracing/tracing_cpumask 5. # echo test > /sys/kernel/tracing/trace_marker Would result in: -bash: echo: write error: Bad file descriptor That's because the state of the per_cpu buffers of the snapshot buffer is lost when the order is changed (the order of a freed snapshot buffer goes to 0 to save memory, and when the snapshot buffer is allocated again, it goes back to what the main buffer is). In operation 2, the snapshot buffers were set to "disable" (as all the ring buffers CPUs were disabled). In operation 3, the snapshot is allocated and a call to ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the "record_disable" count. When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was decrementing a zero, setting it to -1. Writing 1 into the snapshot would swap the snapshot buffer with the main buffer, so now the main buffer is "disabled", and nothing can write to the ring buffer anymore. Instead of creating new per_cpu buffers and losing the state of the old buffers, basically do what the resize does and just allocate new subbuf pages into the new_pages link list of the per_cpu buffer and if they all succeed, then replace the old sub buffers with the new ones. This keeps the per_cpu buffer descriptor in tact and by doing so, keeps its state. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The function ring_buffer_subbuf_order_set() just updated the sub-buffers to the new size, but this also changes the size of the buffer in doing so. As the size is determined by nr_pages * subbuf_size. If the subbuf_size is increased without decreasing the nr_pages, this causes the total size of the buffer to increase. This broke the latency tracers as the snapshot needs to be the same size as the main buffer. The size of the snapshot buffer is only expanded when needed, and because the order is still the same, the size becomes out of sync with the main buffer, as the main buffer increased in size without the tracing system knowing. Calculate the nr_pages to allocate with the new subbuf_size to be buffer_size / new_subbuf_size. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Because the main buffer and the snapshot buffer need to be the same for some tracers, otherwise it will fail and disable all tracing, the tracers need to be stopped while updating the sub buffer sizes so that the tracers see the main and snapshot buffers with the same sub buffer size. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: 2808e31e ("ring-buffer: Add interface for configuring trace sub buffer size") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
When updating the order of the sub buffers for the main buffer, make sure that if the snapshot buffer exists, that it gets its order updated as well. Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Now that the ring buffer specifies the size of its sub buffers, they all need to be the same size. When doing a read, a swap is done with a spare page. Make sure they are the same size before doing the swap, otherwise the read will fail. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
As all the subbuffer order (subbuffer sizes) must be the same throughout the ring buffer, check the order of the buffers that are doing a CPU buffer swap in ring_buffer_swap_cpu() to make sure they are the same. If the are not the same, then fail to do the swap, otherwise the ring buffer will think the CPU buffer has a specific subbuffer size when it does not. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
On failure to allocate ring buffer pages, the pointer to the CPU buffer pages is freed, but the pages that were allocated previously were not. Make sure they are freed too. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa ("tracing: Set new size of the ring buffer sub page") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The synth_event_gen_test module can be built in, if someone wants to run the tests at boot up and not have to load them. The synth_event_gen_test_init() function creates and enables the synthetic events and runs its tests. The synth_event_gen_test_exit() disables the events it created and destroys the events. If the module is builtin, the events are never disabled. The issue is, the events should be disable after the tests are run. This could be an issue if the rest of the boot up tests are enabled, as they expect the events to be in a known state before testing. That known state happens to be disabled. When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y a warning will trigger: Running tests on trace events: Testing event create_synth_test: Enabled event during self test! ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480 Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:event_trace_self_tests+0x1c2/0x480 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246 RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64 RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090 R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078 FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0 Call Trace: <TASK> ? __warn+0xa5/0x200 ? event_trace_self_tests+0x1c2/0x480 ? report_bug+0x1f6/0x220 ? handle_bug+0x6f/0x90 ? exc_invalid_op+0x17/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? tracer_preempt_on+0x78/0x1c0 ? event_trace_self_tests+0x1c2/0x480 ? __pfx_event_trace_self_tests_init+0x10/0x10 event_trace_self_tests_init+0x27/0xe0 do_one_initcall+0xd6/0x3c0 ? __pfx_do_one_initcall+0x10/0x10 ? kasan_set_track+0x25/0x30 ? rcu_is_watching+0x38/0x60 kernel_init_freeable+0x324/0x450 ? __pfx_kernel_init+0x10/0x10 kernel_init+0x1f/0x1e0 ? _raw_spin_unlock_irq+0x33/0x50 ret_from_fork+0x34/0x60 ? __pfx_kernel_init+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> This is because the synth_event_gen_test_init() left the synthetic events that it created enabled. By having it disable them after testing, the other selftests will run fine. Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 9fe41efa ("tracing: Add synth event generation test module") Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Reported-by:
Alexander Graf <graf@amazon.com> Tested-by:
Alexander Graf <graf@amazon.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 20, 2023
-
-
Randy Dunlap authored
scripts/kernel-doc warns about using @args: for variadic arguments to functions. Documentation/doc-guide/kernel-doc.rst says that this should be written as @...: instead, so update the source code to match that, preventing the warnings. trace_events_synth.c:1165: warning: Excess function parameter 'args' description in '__synth_event_gen_cmd_start' trace_events_synth.c:1714: warning: Excess function parameter 'args' description in 'synth_event_trace' Link: https://lore.kernel.org/linux-trace-kernel/20231220061226.30962-1-rdunlap@infradead.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 35ca5207 ("tracing: Add synthetic event command generation functions") Fixes: 8dcc53ad ("tracing: Add synth_event_trace() and related functions") Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Tzvetomir Stoyanov (VMware) authored
As the size of the ring sub buffer page can be changed dynamically, the logic that reads and writes to the buffer should be fixed to take that into account. Some internal ring buffer APIs are changed: ring_buffer_alloc_read_page() ring_buffer_free_read_page() ring_buffer_read_page() A new API is introduced: ring_buffer_read_page_data() Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.875145995@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> [ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ] Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Tzvetomir Stoyanov (VMware) authored
There are two approaches when changing the size of the ring buffer sub page: 1. Destroying all pages and allocating new pages with the new size. 2. Allocating new pages, copying the content of the old pages before destroying them. The first approach is easier, it is selected in the proposed implementation. Changing the ring buffer sub page size is supposed to not happen frequently. Usually, that size should be set only once, when the buffer is not in use yet and is supposed to be empty. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.588995543@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Tzvetomir Stoyanov (VMware) authored
The trace ring buffer sub page size can be configured, per trace instance. A new ftrace file "buffer_subbuf_order" is added to get and set the size of the ring buffer sub page for current trace instance. The size must be an order of system page size, that's why the new interface works with system page order, instead of absolute page size: 0 means the ring buffer sub page is equal to 1 system page and so forth: 0 - 1 system page 1 - 2 system pages 2 - 4 system pages ... The ring buffer sub page size is limited between 1 and 128 system pages. The default value is 1 system page. New ring buffer APIs are introduced: ring_buffer_subbuf_order_set() ring_buffer_subbuf_order_get() ring_buffer_subbuf_size_get() Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.298324722@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Tzvetomir Stoyanov (VMware) authored
Currently the size of one sub buffer page is global for all buffers and it is hard coded to one system page. In order to introduce configurable ring buffer sub page size, the internal logic should be refactored to work with sub page size per ring buffer. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.009147038@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Tzvetomir Stoyanov (VMware) authored
In order to introduce sub-buffer size per ring buffer, some internal refactoring is needed. As ring_buffer_print_page_header() will depend on the trace_buffer structure, it is moved after the structure definition. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185627.723857541@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by:
Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 19, 2023
-
-
Andrii Nakryiko authored
This patch includes the following revert (one conflicting BPF FS patch and three token patch sets, represented by merge commits): - revert 0f5d5454 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'"; - revert 750e7857 "bpf: Support uid and gid when mounting bpffs"; - revert 73376328 "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'"; - revert c35919dc "Merge branch 'bpf-token-and-bpf-fs-based-delegation'". Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com Signed-off-by:
Andrii Nakryiko <andrii@kernel.org>
-
Steven Rostedt (Google) authored
The check_buffer() which checks the timestamps of the ring buffer sub-buffer page, when enabled, only checks if the adding of deltas of the events from the last absolute timestamp or the timestamp of the sub-buffer page adds up to the current event. What it does not check is if the absolute timestamp causes the time of the events to go backwards, as that can cause issues elsewhere. Test for the timestamp going backwards too. This also fixes a slight issue where if the warning triggers at boot up (because of the resetting of the tsc), it will disable all further checks, even those that are after boot Have it continue checking if the warning was ignored during boot up. Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
When the ring buffer timestamp verifier triggers, it dumps the content of the sub-buffer. But currently it only dumps the timestamps and the offset of the data as well as the deltas. It would be even more informative if the event data also showed the interrupt context level it was in. That is, if each event showed that the event was written in normal, softirq, irq or NMI context. Then a better idea about how the events may have been interrupted from each other. As the payload of the ring buffer is really a black box of the ring buffer, just assume that if the payload is larger than a trace entry, that it is a trace entry. As trace entries have the interrupt context information saved in a flags field, look at that location and report the output of the flags. If the payload is not a trace entry, there's no way to really know, and the information will be garbage. But that's OK, because this is for debugging only (this output is not used in production as the buffer check that calls it causes a huge overhead to the tracing). This information, when available, is crucial for debugging timestamp issues. If it's garbage, it will also be pretty obvious that its garbage too. As this output usually happens in kselftests of the tracing code, the user will know what the payload is at the time. Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by:
Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Each event has a 27 bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59 bit absolute timestamp. Until a389d86f ("ring-buffer: Have nested events still record running time stamp"), if an interrupt interrupted an event in progress, all the events delta would be zero to not deal with the races that need to be handled. The commit a389d86f changed that to handle the races giving all events, even those that preempt other events, still have an accurate timestamp. To handle those races requires performing 64-bit cmpxchg on the timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered very slow. To try to deal with this the timestamp logic was broken into two and then three 32-bit cmpxchgs, with the thought that two (or three) 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit architectures. Part of the problem with this is that I didn't have any 32-bit architectures to test on. After hitting several subtle bugs in this code, an effort was made to try and see if three 32-bit cmpxchgs are indeed faster than a single 64-bit. After a few people brushed off the dust of their old 32-bit machines, tests were done, and even though 32-bit cmpxchg was faster than a single 64-bit, it was in the order of 50% at best, not 300%. After some more refactoring of the code, all 4 64-bit cmpxchg were removed: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/ With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be removed. The 32-bit and 64-bit logic is now exactly the same. Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
There's no reason to give an arbitrary limit to the size of a raw trace marker. Just let it be as big as the size that is allowed by the ring buffer itself. And there's also no reason to artificially break up the write to TRACE_BUF_SIZE, as that's not even used. Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
If a trace_marker write is bigger than what trace_seq can hold, then it will print "LINE TOO BIG" message and not what was written. Instead, check if the write is bigger than the trace_seq and break it up by that size. Ideally, we could make the trace_seq dynamic that could hold this. But that's for another time. Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Now that trace_marker can hold more than 1KB string, and can write as much as the ring buffer can hold, the trace_seq is not big enough to hold writes: ~# a="1234567890" ~# cnt=4080 ~# s="" ~# while [ $cnt -gt 10 ]; do ~# s="${s}${a}" ~# cnt=$((cnt-10)) ~# done ~# echo $s > trace_marker ~# cat trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <...>-860 [002] ..... 105.543465: tracing_mark_write[LINE TOO BIG] <...>-860 [002] ..... 105.543496: tracing_mark_write: 789012345678901234567890 By increasing the trace_seq buffer to almost two pages, it can now print out the first line. This also subtracts the rest of the trace_seq fields from the buffer, so that the entire trace_seq is now PAGE_SIZE aligned. Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
On bugs that have the ring buffer timestamp get out of sync, the config CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS, that checks for it and if it is detected it causes a dump of the bad sub buffer. It shows each event and their timestamp as well as the delta in the event. But it's also good to see the offset into the subbuffer for that event to know if how close to the end it is. Also print where the last event actually ended compared to where it was expected to end. Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
A trace instance may only need to enable specific events. As the eventfs directory of an instance currently creates all events which adds overhead, allow internal instances to be created with just the events in systems that they care about. This currently only deals with systems and not individual events, but this should bring down the overhead of creating instances for specific use cases quite bit. The trace_array_get_by_name() now has another parameter "systems". This parameter is a const string pointer of a comma/space separated list of event systems that should be created by the trace_array. (Note if the trace_array already exists, this parameter is ignored). The list of systems is saved and if a module is loaded, its events will not be added unless the system for those events also match the systems string. Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: Arun Easi <aeasi@marvell.com> Cc: Daniel Wagner <dwagner@suse.de> Tested-by:
Dmytro Maluka <dmaluka@chromium.org> Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
To synchronize the timestamps with the ring buffer reservation, there are two timestamps that are saved in the buffer meta data. 1. before_stamp 2. write_stamp When the two are equal, the write_stamp is considered valid, as in, it may be used to calculate the delta of the next event as the write_stamp is the timestamp of the previous reserved event on the buffer. This is done by the following: /*A*/ w = current position on the ring buffer before = before_stamp after = write_stamp ts = read current timestamp if (before != after) { write_stamp is not valid, force adding an absolute timestamp. } /*B*/ before_stamp = ts /*C*/ write = local_add_return(event length, position on ring buffer) if (w == write - event length) { /* Nothing interrupted between A and C */ /*E*/ write_stamp = ts; delta = ts - after /* * If nothing interrupted again, * before_stamp == write_stamp and write_stamp * can be used to calculate the delta for * events that come in after this one. */ } else { /* * The slow path! * Was interrupted between A and C. */ This is the place that there's a bug. We currently have: after = write_stamp ts = read current timestamp /*F*/ if (write == current position on the ring buffer && after < ts && cmpxchg(write_stamp, after, ts)) { delta = ts - after; } else { delta = 0; } The assumption is that if the current position on the ring buffer hasn't moved between C and F, then it also was not interrupted, and that the last event written has a timestamp that matches the write_stamp. That is the write_stamp is valid. But this may not be the case: If a task context event was interrupted by softirq between B and C. And the softirq wrote an event that got interrupted by a hard irq between C and E. and the hard irq wrote an event (does not need to be interrupted) We have: /*B*/ before_stamp = ts of normal context ---> interrupted by softirq /*B*/ before_stamp = ts of softirq context ---> interrupted by hardirq /*B*/ before_stamp = ts of hard irq context /*E*/ write_stamp = ts of hard irq context /* matches and write_stamp valid */ <---- /*E*/ write_stamp = ts of softirq context /* No longer matches before_stamp, write_stamp is not valid! */ <--- w != write - length, go to slow path // Right now the order of events in the ring buffer is: // // |-- softirq event --|-- hard irq event --|-- normal context event --| // after = write_stamp (this is the ts of softirq) ts = read current timestamp if (write == current position on the ring buffer [true] && after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) { delta = ts - after [Wrong!] The delta is to be between the hard irq event and the normal context event, but the above logic made the delta between the softirq event and the normal context event, where the hard irq event is between the two. This will shift all the remaining event timestamps on the sub-buffer incorrectly. The write_stamp is only valid if it matches the before_stamp. The cmpxchg does nothing to help this. Instead, the following logic can be done to fix this: before = before_stamp ts = read current timestamp before_stamp = ts after = write_stamp if (write == current position on the ring buffer && after == before && after < ts) { delta = ts - after } else { delta = 0; } The above will only use the write_stamp if it still matches before_stamp and was tested to not have changed since C. As a bonus, with this logic we do not need any 64-bit cmpxchg() at all! This means the 32-bit rb_time_t workaround can finally be removed. But that's for a later time. Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: dd939425 ("ring-buffer: Do not try to put back write_stamp") Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Dec 18, 2023
-
-
Jiri Olsa authored
Currently the __uprobe_register will return 0 (success) when called with negative offset. The reason is that the call to register_for_each_vma and then build_map_info won't return error for negative offset. They just won't do anything - no matching vma is found so there's no registered breakpoint for the uprobe. I don't think we can change the behaviour of __uprobe_register and fail for negative uprobe offset, because apps might depend on that already. But I think we can still make the change and check for it on bpf multi link syscall level. Also moving the __get_user call and check for the offsets to the top of loop, to fail early without extra __get_user calls for ref_ctr_offset and cookie arrays. Signed-off-by:
Jiri Olsa <jolsa@kernel.org> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org
-
- Dec 17, 2023
-
-
Naveen N Rao authored
Trying to probe update_sd_lb_stats() using perf results in the below message in the kernel log: trace_kprobe: Could not probe notrace function _text This is because 'perf probe' specifies the kprobe location as an offset from '_text': $ sudo perf probe -D update_sd_lb_stats p:probe/update_sd_lb_stats _text+1830728 However, the error message is misleading and doesn't help convey the actual notrace function that is being probed. Fix this by looking up the actual function name that is being probed. With this fix, we now get the below message in the kernel log: trace_kprobe: Could not probe notrace function update_sd_lb_stats.constprop.0 Link: https://lore.kernel.org/all/20231214051702.1687300-1-naveen@kernel.org/ Signed-off-by:
Naveen N Rao <naveen@kernel.org> Acked-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org>
-
- Dec 15, 2023
-
-
Hou Tao authored
An abnormally big cnt may also be assigned to kprobe_multi.cnt when attaching multiple kprobes. It will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of kprobes in bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG. Fixes: 0dcac272 ("bpf: Add multi kprobe link") Signed-off-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Jiri Olsa <jolsa@kernel.org> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-3-houtao@huaweicloud.com
-
Hou Tao authored
An abnormally big cnt may be passed to link_create.uprobe_multi.cnt, and it will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of uprobes in bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG. Fixes: 89ae89f5 ("bpf: Add multi uprobe link") Reported-by:
Xingwei Lee <xrivendell7@gmail.com> Signed-off-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Jiri Olsa <jolsa@kernel.org> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Closes: https://lore.kernel.org/bpf/CABOYnLwwJY=yFAGie59LFsUsBAgHfroVqbzZ5edAXbFE3YiNVA@mail.gmail.com Link: https://lore.kernel.org/bpf/20231215100708.2265609-2-houtao@huaweicloud.com
-