- Sep 27, 2024
-
-
Tejun Heo authored
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path were gated by the scx_enabled() test and thus __scx_ops_enabled had to be turned on before the first scx_ops_init_task() loop in scx_ops_enable(). However, if an external entity causes a sched_class switch before the loop is complete, tasks which are not initialized could be switched to SCX. The following can be reproduced by running a program which keeps toggling a process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2). sched_ext: Invalid task state transition 0 -> 3 for fish[1623] WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200 ... Sched_ext: simple (enabling) RIP: 0010:scx_ops_enable_task+0x1a1/0x200 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x850/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix it by gating scx_ops_init_task() separately using scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are finished with scx_ops_init_task(). Signed-off-by:
Tejun Heo <tj@kernel.org>
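For illustration, the shape of this fix might look like the following sketch; scx_ops_init_task_enabled and __scx_ops_enabled are named in the message, while the hook structure around them is an assumption, not the exact kernel code:

    /* Sketch only, not the exact kernel code. */
    static bool scx_ops_init_task_enabled;

    int scx_fork(struct task_struct *p)
    {
            /* fork path: gate on the dedicated flag, not scx_enabled() */
            if (READ_ONCE(scx_ops_init_task_enabled))
                    return scx_ops_init_task(p);
            return 0;
    }

    /* in scx_ops_enable(): */
    WRITE_ONCE(scx_ops_init_task_enabled, true);
    /* ... first loop: scx_ops_init_task() on every existing task ... */
    static_branch_enable(&__scx_ops_enabled); /* tasks may switch to SCX only now */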
-
Tejun Heo authored
scx_ops_enable() has two task iteration loops. The first one calls scx_ops_init_task() on every task and the latter switches the eligible ones into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the second loop switched them to READY before switching each task into SCX. The distinction between INIT and READY is only meaningful in the fork path, where it's used to tell whether the task finished forking so that ops.exit_task() can be called accordingly. Leaving tasks in the INIT state between the two loops is inconsistent with the fork path and incorrect. The following can be triggered by running a program which keeps toggling a task between SCHED_OTHER and SCHED_EXT while a BPF scheduler is being enabled: sched_ext: Invalid task state transition 1 -> 3 for fish[1526] WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200 ... Sched_ext: qmap (enabling+all) RIP: 0010:scx_ops_enable_task+0x1a1/0x200 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x850/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix it by transitioning to READY in the first loop right after scx_ops_init_task() succeeds. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com>
-
Tejun Heo authored
scx_ops_enable() used preempt_disable() around the task iteration loop to switch tasks into SCX to guarantee forward progress of the task which is running scx_ops_enable(). However, in the gap between setting __scx_ops_enabled and preempt_disable(), an external entity can put tasks including the enabling one into SCX prematurely, which can lead to malfunctions including stalls. The bypass mode can wrap the entire enabling operation and guarantee forward progress no matter what the BPF scheduler does. Use the bypass mode instead to guarantee forward progress while enabling. While at it, release and regrab scx_tasks_lock between the two task iteration loops in scx_ops_enable() for clarity, as there is no reason to keep holding the lock between them. Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used anywhere and only adds confusion. Drop SCX_OPS_PREPPING. Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
check_hotplug_seq() is used to detect CPU hotplug events which occurred while the BPF scheduler is being loaded so that initialization can be retried if CPU hotplug events take place before the CPU hotplug callbacks are online. As such, the best place to call it is in the same cpus_read_lock() section that enables the CPU hotplug ops. Currently, it is called in the next cpus_read_lock() block in scx_ops_enable(). The side effect of this placement is a small window in which hotplug sequence detection can trigger unnecessarily, which isn't critical. Move the check_hotplug_seq() invocation to the same cpus_read_lock() block as the hotplug operation enablement to close the window and get the invocation out of the way for planned locking updates. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com>
-
- Sep 26, 2024
-
-
Tejun Heo authored
While bypassing, tasks are scheduled in FIFO order which favors tasks that hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While bypassing, guaranteeing timely forward progress is the main goal. There's no point in giving long slices. Shorten the time slice used while bypassing from 20ms to 5ms. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
David Vernet <void@manifault.com>
-
Tejun Heo authored
In the bypass mode, the global DSQ is used to schedule all tasks in simple FIFO order. All tasks are queued into the global DSQ and all CPUs try to execute tasks from it. This creates a lot of cross-node cacheline accesses and scheduling across the node boundaries, and can lead to livelock conditions where the system takes tens of minutes to disable the BPF scheduler while executing in the bypass mode. Split the global DSQ per NUMA node. Each node has its own global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to the task's CPU and all CPUs in a node only consume their node-local global DSQ. This resolves a livelock condition which could be reliably triggered on a 2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling and disabling a SCX scheduler. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
David Vernet <void@manifault.com>
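A minimal sketch of the split; find_global_dsq() is named in the neighboring commits, while the array and its indexing are assumptions:

    /* Sketch: one global DSQ per NUMA node, indexed by node id. */
    static struct scx_dispatch_q **global_dsqs;

    static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
    {
            return global_dsqs[cpu_to_node(task_cpu(p))];
    }

    /* Dispatches to SCX_DSQ_GLOBAL resolve to the task's node-local DSQ
     * and each CPU consumes only its own node's global DSQ, keeping the
     * FIFO traffic within a node. */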
-
Tejun Heo authored
To prepare for the addition of find_global_dsq(). No functional changes. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
David Vernet <void@manifault.com>
-
Tejun Heo authored
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and is consumed implicitly, but all BPF DSQ related kfuncs could be used on it. SCX_DSQ_GLOBAL will be split per-node for scalability and those operations won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers. As scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ, has already been updated, this shouldn't affect any schedulers. This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq(). Open code and remove find_non_local_dsq(). Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
David Vernet <void@manifault.com>
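The added guard might look like this sketch; SCX_DSQ_FLAG_BUILTIN marks built-in DSQ ids in sched_ext, while the helper name and the error value are illustrative:

    /* Sketch: reject built-in DSQ ids such as SCX_DSQ_GLOBAL. */
    static bool dsq_id_is_user(u64 dsq_id)
    {
            return !(dsq_id & SCX_DSQ_FLAG_BUILTIN);
    }

    __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
    {
            if (!dsq_id_is_user(dsq_id))
                    return -EINVAL; /* SCX_DSQ_GLOBAL no longer accepted */
            /* ... look up the user DSQ and return its nr_queued ... */
    }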
-
- Sep 24, 2024
-
-
Tejun Heo authored
move_remote_task_to_local_dsq() is only defined on SMP configs but scx_dispatch_from_dsq() was calling move_remote_task_to_local_dsq() on UP configs too, causing build failures. Add a dummy move_remote_task_to_local_dsq() which triggers a warning. Signed-off-by:
Tejun Heo <tj@kernel.org> Fixes: 4c30f5ce ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/
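The UP stub might look like this sketch; the exact signature is an assumption:

    #else   /* !CONFIG_SMP */
    static inline void
    move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
                                  struct rq *src_rq, struct rq *dst_rq)
    {
            WARN_ON_ONCE(1);        /* remote moves cannot happen on UP */
    }
    #endif  /* CONFIG_SMP */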
-
- Sep 23, 2024
-
-
Andrea Righi authored
As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded. This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have run) custom BPF schedulers. Example: arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 0 arighi@virtme-ng~> sudo scx_simple local=1 global=0 ^CEXIT: unregistered from user space arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 1 In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports. Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com> Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Phil Auld <pauld@redhat.com> Signed-off-by:
Andrea Righi <andrea.righi@linux.dev> Signed-off-by:
Tejun Heo <tj@kernel.org>
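A sketch of the counter and its sysfs attribute per the message above; the implementation details are assumptions:

    /* Sketch: bumped once per successful scheduler enable. */
    static atomic_long_t scx_enable_seq = ATOMIC_LONG_INIT(0);

    /* in scx_ops_enable(), once the BPF scheduler is live: */
    atomic_long_inc(&scx_enable_seq);

    static ssize_t scx_attr_enable_seq_show(struct kobject *kobj,
                                            struct kobj_attribute *ka,
                                            char *buf)
    {
            return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_enable_seq));
    }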
-
Tejun Heo authored
a2f4b16e ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix build when !CONFIG_STACKTRACE. Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
-
Pat Somaru authored
Disable the rq empty path when scx is enabled. SCX must consult the BPF scheduler (via the dispatch path in balance) to determine if rq is empty. This fixes stalls when scx is enabled. Signed-off-by:
Pat Somaru <patso@likewhatevs.io> Fixes: 3dcac251 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()") Signed-off-by:
Tejun Heo <tj@kernel.org>
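The gist of the change, sketched under assumptions about the surrounding __schedule() code:

    /* Sketch of the SM_IDLE fast path in __schedule(). */
    if (sched_mode == SM_IDLE) {
            /* SCX must consult the BPF scheduler through the balance/
             * dispatch path even when rq->nr_running == 0, so the
             * empty-rq shortcut is skipped while scx is enabled. */
            if (!rq->nr_running && !scx_enabled()) {
                    next = prev;
                    goto picked;    /* label name illustrative */
            }
    }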
-
Yu Liao authored
When built with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED, the idle member is not defined: kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle' 3701 | if (!tg->idle) | ^~ Fix this by putting 'idle' under the new CONFIG_GROUP_SCHED_WEIGHT. tj: Move the idle field upward to avoid breaking up the CONFIG_FAIR_GROUP_SCHED block. Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT") Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/ Signed-off-by:
Yu Liao <liaoyu15@huawei.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
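The resulting layout, sketched per the message and tj's note; the surrounding fields are abbreviated:

    struct task_group {
            /* ... */
    #ifdef CONFIG_GROUP_SCHED_WEIGHT
            /* visible to both fair and ext group scheduling */
            int                     idle;
    #endif
    #ifdef CONFIG_FAIR_GROUP_SCHED
            /* fair-only fields remain one contiguous block */
    #endif
            /* ... */
    };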
-
Yu Liao authored
Fix the following error when building with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED: kernel/sched/core.c:9634:15: error: implicit declaration of function 'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration] 9634 | ret = sched_group_set_idle(css_tg(css), idle); | ^~~~~~~~~~~~~~~~~~~~ | scx_group_set_idle Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT") Reported-by:
kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/ Signed-off-by:
Yu Liao <liaoyu15@huawei.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
-
Leon Romanovsky authored
While using the IOMMU DMA path, the dma_addressing_limited() function checks the ops struct, which doesn't exist in the IOMMU case. This causes a kernel panic while loading the AMDGPU driver. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 611 Comm: (udev-worker) Tainted: G T 6.11.0-clang-07154-g726e2d0cf2bb #257 Tainted: [T]=RANDSTRUCT Hardware name: ASUS System Product Name/ROG STRIX Z690-G GAMING WIFI, BIOS 3701 07/03/2024 RIP: 0010:dma_addressing_limited+0x53/0xa0 Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55 RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202 RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040 R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8 FS: 00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x65/0xc0 ? page_fault_oops+0x3b9/0x450 ? _prb_read_valid+0x212/0x390 ? do_user_addr_fault+0x608/0x680 ? exc_page_fault+0x4e/0xa0 ? asm_exc_page_fault+0x26/0x30 ? dma_addressing_limited+0x53/0xa0 amdgpu_ttm_init+0x56/0x4b0 [amdgpu] gmc_v8_0_sw_init+0x561/0x670 [amdgpu] amdgpu_device_ip_init+0xf5/0x570 [amdgpu] amdgpu_device_init+0x1a57/0x1ea0 [amdgpu] ? _raw_spin_unlock_irqrestore+0x1a/0x40 ? pci_conf1_read+0xc0/0xe0 ? pci_bus_read_config_word+0x52/0xa0 amdgpu_driver_load_kms+0x15/0xa0 [amdgpu] amdgpu_pci_probe+0x1b7/0x4c0 [amdgpu] pci_device_probe+0x1c5/0x260 really_probe+0x130/0x470 __driver_probe_device+0x77/0x150 driver_probe_device+0x19/0x120 __driver_attach+0xb1/0x1e0 ? __cfi___driver_attach+0x10/0x10 bus_for_each_dev+0x115/0x170 bus_add_driver+0x192/0x2d0 driver_register+0x5c/0xf0 ? __cfi_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x128/0x380 ? idr_alloc_cyclic+0x139/0x1d0 ? security_kernfs_init_security+0x42/0x140 ? __kernfs_new_node+0x1be/0x250 ? sysvec_apic_timer_interrupt+0xb6/0xc0 ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 ? _raw_spin_unlock+0x11/0x30 ? free_unref_page+0x283/0x650 ? kfree+0x274/0x3a0 ? kfree+0x274/0x3a0 ? kfree+0x274/0x3a0 ? load_module+0xf2e/0x1130 ? __kmalloc_cache_noprof+0x12a/0x2e0 do_init_module+0x7d/0x240 __se_sys_init_module+0x19e/0x220 do_syscall_64+0x8a/0x150 ?
__irq_exit_rcu+0x5e/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fe6bb5980ee Code: 48 8b 0d 3d ed 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0a ed 12 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd462219d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 0000556caf0d0670 RCX: 00007fe6bb5980ee RDX: 0000556caf0d3080 RSI: 0000000002893458 RDI: 00007fe6b3400010 RBP: 0000000000020000 R08: 0000000000020010 R09: 0000000000000080 R10: c26073c166186e00 R11: 0000000000000206 R12: 0000556caf0d3430 R13: 0000556caf0d0670 R14: 0000556caf0d3080 R15: 0000556caf0ce700 </TASK> Modules linked in: amdgpu(+) i915(+) drm_suballoc_helper intel_gtt drm_exec drm_buddy iTCO_wdt i2c_algo_bit intel_pmc_bxt drm_display_helper iTCO_vendor_support gpu_sched drm_ttm_helper cec ttm amdxcp video backlight pinctrl_alderlake nct6775 hwmon_vid nct6775_core coretemp CR2: 00000000000000a0 ---[ end trace 0000000000000000 ]--- RIP: 0010:dma_addressing_limited+0x53/0xa0 Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55 RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202 RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040 R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8 FS: 00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0 PKRU: 55555554 Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219292 Reported-by:
Niklāvs Koļesņikovs <pinkflames.linux@gmail.com> Signed-off-by:
Leon Romanovsky <leon@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
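A sketch of the fix's shape: bail out for the IOMMU path before touching the absent ops struct. use_dma_iommu() comes from the blamed commit; the rest of the body is abbreviated and may differ from mainline in detail:

    bool dma_addressing_limited(struct device *dev)
    {
            if (min_not_zero(dma_get_mask(dev), dev->bus_dma_limit) <
                dma_get_required_mask(dev))
                    return true;
            if (unlikely(use_dma_iommu(dev)))
                    return false;   /* no dma_map_ops to consult here */
            return !dma_direct_all_ram_mapped(dev);
    }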
-
- Sep 22, 2024
-
-
Christoph Hellwig authored
Commit b5c58b2f ("dma-mapping: direct calls for dma-iommu") switched to use direct calls to dma-iommu, but missed the dma_vmap_noncontiguous, dma_vunmap_noncontiguous and dma_mmap_noncontiguous behavior keyed off the presence of the alloc_noncontiguous method. Fix this by removing the now unused alloc_noncontiguous and free_noncontiguous methods and moving the vmapping and mmaping of the noncontiguous allocations into the iommu code, as it is the only provider of actually noncontiguous allocations. Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu") Reported-by:
Xi Ruoyao <xry111@xry111.site> Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Leon Romanovsky <leon@kernel.org> Tested-by:
Xi Ruoyao <xry111@xry111.site>
-
Kan Liang authored
The below warning is triggered when building with arm multi_v7_defconfig. kernel/events/core.c: In function 'perf_event_setup_cpumask': kernel/events/core.c:14012:13: warning: the comparison will always evaluate as 'true' for the address of 'thread_sibling' will never be NULL [-Waddress] 14012 | if (!topology_sibling_cpumask(cpu)) { perf_event_init_cpu() may be invoked at the early boot stage, while the topology_*_cpumask hasn't been initialized yet. The check is there to specially handle this case and initialize the perf_online_<domain>_masks on the boot CPU. X86 uses a per-cpu cpumask pointer, which could be NULL at the early boot stage. However, ARM uses a global variable, which is never NULL. Use perf_online_mask as an indicator instead. Only initialize the perf_online_<domain>_masks when perf_online_mask is empty. Fix a typo as well. Fixes: 4ba4f1af ("perf: Generic hotplug support for a PMU with a scope") Reported-by:
Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20240911153854.240bbc1f@canb.auug.org.au/ Reported-by:
Steven Price <steven.price@arm.com> Closes: https://lore.kernel.org/lkml/1835eb6d-3e05-47f3-9eae-507ce165c3bf@arm.com/ Signed-off-by:
Kan Liang <kan.liang@linux.intel.com> Tested-by:
Steven Price <steven.price@arm.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Sep 20, 2024
-
-
Jinjie Ruan authored
On a RISCV64 QEMU machine with 512MB memory, the cmdline "crashkernel=500M,high" will cause a system stall as below: Zone ranges: DMA32 [mem 0x0000000080000000-0x000000009fffffff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000008005ffff] node 0: [mem 0x0000000080060000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] (stall here) commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop bug") fixed this on 32-bit architectures. However, the problem is not completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on a 64-bit architecture, for example when system memory is equal to CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur: -> reserve_crashkernel_generic() and high is true -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX). As Catalin suggested, do not remove the ",high" reservation fallback-to-",low" logic, which would change arm64's kdump behavior, but fix it by skipping the above situation, similar to commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop"). After this patch, it prints: cannot allocate crashkernel (size:0x1f400000) Signed-off-by:
Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by:
Catalin Marinas <catalin.marinas@arm.com> Reviewed-by:
Catalin Marinas <catalin.marinas@arm.com> Acked-by:
Baoquan He <bhe@redhat.com> Link: https://lore.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com Signed-off-by:
Palmer Dabbelt <palmer@rivosinc.com>
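A sketch of the skip, under assumptions about reserve_crashkernel_generic()'s retry structure:

    /* Fall back from ",high" to the low range only when the two
     * ranges actually differ; otherwise a retry would loop forever. */
    if (!crash_base && high &&
        CRASH_ADDR_LOW_MAX != CRASH_ADDR_HIGH_MAX) {
            search_end = CRASH_ADDR_LOW_MAX;
            goto retry;
    }
    if (!crash_base) {
            pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
                    crash_size);
            return;
    }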
-
- Sep 17, 2024
-
-
Oleg Nesterov authored
Now that xol_mapping has its own ->fault() method we no longer need xol_area->pages[1] == NULL, we need a single page. Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
Currently each xol_area has its own instance of vm_special_mapping, this is suboptimal and ugly. Kill xol_area->xol_mapping and add a single global instance of vm_special_mapping, the ->fault() method can use area->pages rather than xol_mapping->pages. As a side effect this fixes the problem introduced by the recent commit 223febc6 ("mm: add optional close() to struct vm_special_mapping"), if special_mapping_close() is called from the __mmput() paths, it will use vma->vm_private_data = &area->xol_mapping freed by uprobe_clear_state(). Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping") Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Reported-by:
Sven Schnelle <svens@linux.ibm.com> Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/ Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
This reverts commit 08e28de1. A malicious application can munmap() its "[uprobes]" vma and in this case xol_mapping.close == uprobe_clear_state() will free the memory which can be used by another thread, or the same thread when it hits the uprobe bp afterwards. Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
Patch series "resource: Fix region_intersects() vs add_memory_driver_managed()", v3. The patchset fixes a bug of region_intersects() for systems with CXL memory. The details of the bug can be found in [1/3]. To avoid similar bugs in the future. A kunit test case for region_intersects() is added in [3/3]. [2/3] is a preparation patch for [3/3]. This patch (of 3): region_intersects() is important because it's used for /dev/mem permission checking. To avoid possible bug of region_intersects() in the future, a kunit test case for region_intersects() is added. Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
During development of a kunit test case for region_intersects(), some fake resources need to be inserted into iomem_resource. To do that, a resource hole needs to be found first in iomem_resource. However, alloc_free_mem_region() cannot work for iomem_resource now, because the start address to check cannot be 0 to detect address wrapping 0 in gfr_continue(), while iomem_resource.start == 0. To make alloc_free_mem_region() work for iomem_resource, gfr_start() is changed to avoid returning 0 even if base->start == 0. We don't need to check 0 as a start address. Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
On a system with CXL memory, the resource tree (/proc/iomem) related to CXL memory may look as follows. 490000000-50fffffff : CXL Window 0 490000000-50fffffff : region0 490000000-50fffffff : dax0.0 490000000-50fffffff : System RAM (kmem) This is because drivers/dax/kmem.c calls add_memory_driver_managed() while onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL Window X". This confuses region_intersects(), which expects all "System RAM" resources to be at the top level of iomem_resource. This can lead to bugs. For example, when the following command line is executed to write some memory in the CXL memory range via /dev/mem, $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1 dd: error writing '/dev/mem': Bad address 1+0 records in 0+0 records out 0 bytes copied, 0.0283507 s, 0.0 kB/s the command fails as expected. However, the error code is wrong. It should be "Operation not permitted" instead of "Bad address". More seriously, the /dev/mem permission checking in devmem_is_allowed() passes incorrectly. Although the access is prevented later because ioremap() isn't allowed to map system RAM, it is a potential security issue. During command execution, the following warning is reported in the kernel log for calling ioremap() on system RAM. ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d Call Trace: memremap+0xcb/0x184 xlate_dev_mem_ptr+0x25/0x2f write_mem+0x94/0xfb vfs_write+0x128/0x26d ksys_write+0xac/0xfe do_syscall_64+0x9a/0xfd entry_SYSCALL_64_after_hwframe+0x4b/0x53 The details of the command execution process are as follows. In the above resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a top level resource. So, region_intersects() will incorrectly report no System RAM resources in the CXL memory region, because it only checks the top level resources. Consequently, devmem_is_allowed() will incorrectly return 1 (allow access via /dev/mem) for the CXL memory region. Fortunately, ioremap() doesn't allow mapping System RAM and rejects the access. So, region_intersects() needs to be fixed to work correctly with a resource tree where "System RAM" is not at the top level, as above. To fix it, if we find an unmatched resource at the top level, we continue to search for matched resources among its descendants, so we no longer miss any matched resources in the resource tree. In the new implementation, an example resource tree |------------- "CXL Window 0" ------------| |-- "System RAM" --| will behave similarly to the following fake resource tree for region_intersects(, IORESOURCE_SYSTEM_RAM, ), |-- "System RAM" --||-- "CXL Window 0a" --| where "CXL Window 0a" is the part of the original "CXL Window 0" that isn't covered by "System RAM". Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com Fixes: c221c0b0 ("device-dax: "Hotplug" persistent memory for use like normal RAM") Signed-off-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
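The idea in sketch form; resource_overlaps_range() and type_matches() are hypothetical helper names, not the kernel's:

    /* Sketch: recurse into unmatched resources instead of treating
     * them as opaque, so nested "System RAM" is still found. */
    static bool region_matches(struct resource *parent, resource_size_t start,
                               resource_size_t end, unsigned long flags)
    {
            struct resource *p;

            for (p = parent->child; p; p = p->sibling) {
                    if (!resource_overlaps_range(p, start, end)) /* hypothetical */
                            continue;
                    if (type_matches(p, flags))                  /* hypothetical */
                            return true;
                    /* e.g. "System RAM (kmem)" under "CXL Window 0" */
                    if (region_matches(p, start, end, flags))
                            return true;
            }
            return false;
    }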
-
- Sep 13, 2024
-
-
Hou Tao authored
Call the missed kfree() in btf_parse_struct_metas() when there is no special field in the btf, otherwise we get the following kmemleak report: unreferenced object 0xffff888101033620 (size 8): comm "test_progs", pid 604, jiffies 4295127011 ...... backtrace (crc e77dc444): [<00000000186f90f3>] kmemleak_alloc+0x4b/0x80 [<00000000ac8e9c4d>] __kmalloc_cache_noprof+0x2a1/0x310 [<00000000d99d68d6>] btf_new_fd+0x72d/0xe90 [<00000000f010b7f8>] __sys_bpf+0xec3/0x2410 [<00000000e077ed6f>] __x64_sys_bpf+0x1f/0x30 [<00000000a12f9e55>] x64_sys_call+0x199/0x9f0 [<00000000f3029ea6>] do_syscall_64+0x3b/0xc0 [<000000005640913a>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 7a851ecb ("bpf: Search for kptrs in prog BTF structs") Signed-off-by:
Hou Tao <houtao1@huawei.com> Acked-by:
Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240912012845.3458483-3-houtao@huaweicloud.com Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
When security_bpf_map_create() in map_create() fails, map_create() will call btf_put() and the ->map_free() callback to free the map. It doesn't free the btf_record of the map value, so add the missed btf_record_free() when map creation fails. However, btf_record_free() needs to be called after ->map_free(), just like bpf_map_free_deferred() did, because ->map_free() may use the btf_record to free the special fields in preallocated map values. So factor out a bpf_map_free() helper to free the map, btf_record, and btf orderly, and use the helper in both map_create() and bpf_map_free_deferred(). Signed-off-by:
Hou Tao <houtao1@huawei.com> Acked-by:
Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240912012845.3458483-2-houtao@huaweicloud.com Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
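The factored-out helper, sketched from the description; the ordering is the point, since ->map_free() may still consult the btf_record:

    static void bpf_map_free(struct bpf_map *map)
    {
            struct btf_record *rec = map->record;
            struct btf *btf = map->btf;

            /* ->map_free() may use rec to free special fields in
             * preallocated map values, so free rec and btf afterwards. */
            map->ops->map_free(map);
            btf_record_free(rec);
            btf_put(btf);
    }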
-
Daniel Borkmann authored
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input arguments, zero the value for the case of an error as otherwise it could leak memory. For tracing, it is not needed given CAP_PERFMON can already read all kernel memory anyway, hence bpf_get_func_arg() and bpf_get_func_ret() are skipped here. Also, the MTU helpers' mtu_len pointer value is being written but also read. Technically, the MEM_UNINIT should not be there in order to always force init. Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now implies two things actually: i) write into memory, ii) memory does not have to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory, ii) memory must be initialized. This means that for bpf_*_check_mtu() we're re-adding the issue we're trying to fix, that is, it would then be able to write back into things like .rodata BPF maps. Follow-up work will rework the MEM_UNINIT semantics such that the intent can be better expressed. For now just clear the *mtu_len on the error path, which can be lifted later again. Fixes: 8a67f2de ("bpf: expose bpf_strtol and bpf_strtoul to all program types") Fixes: d7a4cb9b ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
When checking malformed helper function signatures, also take other argument types into account aside from just ARG_PTR_TO_UNINIT_MEM. This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can be passed there, too. The func proto sanity check goes back to commit 435faee1 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos which had more than just one MEM_UNINIT-tagged type as arguments. The reason more than one is currently not supported is as we mark stack slots with STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to allow uninitialized stack memory to be passed to helpers when they just write into the buffer. Probing for base type as well as MEM_UNINIT tagging ensures that other types do not get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}). Fixes: 57c3bb72 ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by:
Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Lonial found an issue that despite a user- and BPF-side frozen BPF map (like in the case of .rodata), it was still possible to write into it from the BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments. When the argument is as mentioned, meta->raw_mode is never set in check_func_arg(). Later, in check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, BPF_READ is assumed for the subsequent call to check_map_access_type(), and given the BPF map is read-only it succeeds. The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM, just with an additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via *<ptr> = val. The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>). MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write *<ptr> = val. Fixes: 57c3bb72 ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by:
Lonial Con <kongln9170@gmail.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
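What the reworked annotation looks like for e.g. bpf_strtol()'s result pointer, sketched per the message with the other fields abbreviated:

    const struct bpf_func_proto bpf_strtol_proto = {
            .func           = bpf_strtol,
            /* ... buf, buf_len and flags arguments elided ... */
            /* was: .arg4_type = ARG_PTR_TO_LONG */
            .arg4_type      = ARG_PTR_TO_FIXED_SIZE_MEM | MEM_UNINIT | MEM_ALIGNED,
            .arg4_size      = sizeof(s64),
    };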
-
Daniel Borkmann authored
Both the bpf_strtol() and bpf_strtoul() helpers passed a temporary "long long" respectively "unsigned long long" to __bpf_strtoll() / __bpf_strtoull(). Later, the result was checked for truncation via _res != ({unsigned,} long)_res, as the destination buffer for the BPF helpers was of type {unsigned,} long, which is 32 bits on 32-bit architectures. Given the latter was a bug in the helper signatures where the destination buffer got adjusted to {s,u}64, the truncation check can now be removed. Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240913191754.13290-2-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit: The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long", and therefore always considered fixed 64bit no matter if the underlying architecture is 64 or 32bit. This contract breaks in the case of the two mentioned helpers since their BPF_CALL definitions were added with {unsigned,}long *res. Meaning, the transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper) breaks here. Both helpers call __bpf_strtoll() with "long long" correctly, but later assign the result into a 32-bit "*(long *)" on 32bit architectures. From a BPF program point of view, this means upper bits will be seen as uninitialised. Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation. Now, changing also the uapi/bpf.h helper documentation which generates bpf_helper_defs.h for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger compiler warnings (incompatible pointer types passing 'long *' to parameter of type '__s64 *' (aka 'long long *')) for existing BPF programs. Leaving the signatures as-is would be fine as from a BPF program point of view it is still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying architectures. Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue. Fixes: d7a4cb9b ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Reported-by:
Alexei Starovoitov <ast@kernel.org> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
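After the series, the helper body might look like this sketch (the truncation check is gone per the follow-up commit above; the error-path zeroing from the companion fix is omitted):

    /* Sketch: res is BPF-side "long", i.e. 64-bit on every arch. */
    BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
               s64 *, res)  /* was: long *, res */
    {
            long long _res;
            int err;

            err = __bpf_strtoll(buf, buf_len, flags, &_res);
            if (err < 0)
                    return err;
            *res = _res;    /* no 32-bit truncation on 32-bit arches */
            return err;
    }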
-
Yonghong Song authored
Zac Ecob reported a problem where a bpf program may cause a kernel crash due to the following error: Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI The failure is due to the below signed divide: LLONG_MIN/-1 where LLONG_MIN equals -9,223,372,036,854,775,808. LLONG_MIN/-1 is supposed to give the positive number 9,223,372,036,854,775,808, but that is impossible since, for a 64-bit system, the maximum positive number is 9,223,372,036,854,775,807. On x86_64, LLONG_MIN/-1 will cause a kernel exception. On arm64, the result for LLONG_MIN/-1 is LLONG_MIN. Further investigation found that all the following sdiv/smod cases may trigger an exception when a bpf program runs on the x86_64 platform: - LLONG_MIN/-1 for 64bit operation - INT_MIN/-1 for 32bit operation - LLONG_MIN%-1 for 64bit operation - INT_MIN%-1 for 32bit operation where -1 can be an immediate or in a register. On arm64, there are no exceptions: - LLONG_MIN/-1 = LLONG_MIN - INT_MIN/-1 = INT_MIN - LLONG_MIN%-1 = 0 - INT_MIN%-1 = 0 where -1 can be an immediate or in a register. Insn patching is needed to handle the above cases, and the patched code produces results aligned with the arm64 results above. Below is the pseudo code to handle the sdiv/smod exceptions, covering both divisor -1 and divisor 0, with the divisor stored in a register:

  sdiv:
        tmp = rX
        tmp += 1                /* [-1, 0] -> [0, 1] */
        if tmp >(unsigned) 1 goto L2
        if tmp == 0 goto L1
        rY = 0
  L1:
        rY = -rY
        goto L3
  L2:
        rY /= rX
  L3:

  smod:
        tmp = rX
        tmp += 1                /* [-1, 0] -> [0, 1] */
        if tmp >(unsigned) 1 goto L1
        if tmp == 1 (is64 ? goto L2 : goto L3)
        rY = 0
        goto L2
  L1:
        rY %= rX
  L2:
        goto L4                 // only when !is64
  L3:
        wY = wY                 // only when !is64
  L4:

[1] https://lore.kernel.org/bpf/tPJLTEh7S_DxFEqAI2Ji5MBSoZVg7_G-Py2iaZpAaWtM961fFTWtsnlzwvTbzBzaUzwQAoNATXKUlt0LZOFgnDcIyKCswAnAGdUF3LBrhGQ=@protonmail.com/ Reported-by:
Zac Ecob <zacecob@protonmail.com> Signed-off-by:
Yonghong Song <yonghong.song@linux.dev> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240913150326.1187788-1-yonghong.song@linux.dev Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- Sep 12, 2024
-
-
Christoph Hellwig authored
dma_supported has become too much spaghetti for my taste. Reflow it to remove the duplicate use_dma_iommu condition and make the main path more obvious. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Leon Romanovsky <leon@kernel.org>
-
Christian Brauner authored
When I expanded uidgid mappings I intended for a struct uid_gid_map to fit into a single cacheline on x86 as they tend to be pretty performance sensitive (idmapped mounts etc). But a 4 byte hole was added that brought it over 64 bytes. Fix that by adding the static extent array and the extent counter into a substruct. C's type punning for unions guarantees that we can access ->nr_extents even if the last written-to member wasn't within the same object. This is also what we rely on in struct_group() and friends. This of course relies on non-strict aliasing, which the kernel builds with anyway (-fno-strict-aliasing). 99) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Christian Brauner <brauner@kernel.org>
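The repacked layout, sketched from the description; member names beyond those given are assumptions:

    /* Sketch: fits one x86 cacheline. nr_extents sits past the two
     * pointers, so stores to forward/reverse never touch its bytes,
     * which is what makes the type punning safe. */
    struct uid_gid_map {
            union {
                    struct {
                            struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
                            u32 nr_extents;
                    };
                    struct {
                            struct uid_gid_extent *forward;
                            struct uid_gid_extent *reverse;
                    };
            };
    };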
-
Leon Romanovsky authored
If the DMA IOMMU path is going to be used, the appropriate check should return that DMA is supported. Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu") Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI Signed-off-by:
Leon Romanovsky <leonro@nvidia.com> Reviewed-by:
Robin Murphy <robin.murphy@arm.com> Tested-by:
Nícolas F. R. A. Prado <nfraprado@collabora.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
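A sketch of the check's shape (assumed):

    static bool dma_supported(struct device *dev, u64 mask)
    {
            /* the IOMMU DMA path supports the device's mask */
            if (use_dma_iommu(dev))
                    return true;
            /* ... fall through to the ops / dma-direct checks ... */
    }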
-
Tejun Heo authored
96fd6c65 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") added update_other_load_avgs() in kernel/sched/syscalls.c right above effective_cpu_util(). This location didn't fit that well in the first place, and with 5d871a63 ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") moving effective_cpu_util() to kernel/sched/fair.c, it looks even more out of place. Relocate the function to kernel/sched/pelt.c where all its callees are. No functional changes. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>
-
Lai Jiangshan authored
Marc Hartmayer reported: [ 23.133876] Unable to handle kernel pointer dereference in virtual kernel address space [ 23.133950] Failing address: 0000000000000000 TEID: 0000000000000483 [ 23.133954] Fault in home space mode while using kernel ASCE. [ 23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d [ 23.134207] Oops: 0004 ilc:2 [#1] SMP (snip) [ 23.134516] Call Trace: [ 23.134520] [<0000024e326caf28>] worker_thread+0x48/0x430 [ 23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430) [ 23.134528] [<0000024e326d3a3e>] kthread+0x11e/0x130 [ 23.134533] [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60 [ 23.134536] [<0000024e333fb37a>] ret_from_fork+0xa/0x38 [ 23.134552] Last Breaking-Event-Address: [ 23.134553] [<0000024e333f4c04>] mutex_unlock+0x24/0x30 [ 23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops With debugging and analysis, worker_thread() accesses the nullified worker->pool when the newly created worker is destroyed before being woken up, in which case worker_thread() can see the result of detach_worker() resetting worker->pool to NULL at the beginning. Move the code "worker->pool = NULL;" out from detach_worker() to fix the problem. worker->pool had been designed to be constant for regular workers and changeable for rescuers. To share attaching/detaching code for regular and rescuer workers, and to avoid worker->pool being accessed inadvertently when the worker has been detached, worker->pool is reset to NULL when detached no matter whether the worker is a rescuer or not. To maintain worker->pool being reset after detach, move the code "worker->pool = NULL;" into the worker thread context after detaching. It is either in the regular worker thread context after PF_WQ_WORKER is cleared, or in the rescuer worker thread context with wq_pool_attach_mutex held. So it is safe to do so. Cc: Marc Hartmayer <mhartmay@linux.ibm.com> Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/ Reported-by:
Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: f4b7b53c ("workqueue: Detach workers directly in idle_cull_fn()") Cc: stable@vger.kernel.org # v6.11+ Signed-off-by:
Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
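The move, sketched; detach_worker()'s other contents are assumptions:

    static void detach_worker(struct worker *worker)
    {
            unbind_worker(worker);
            list_del(&worker->node);
            /* worker->pool is no longer cleared here: the new worker's
             * thread may not have run yet and still needs to read it. */
    }

    /* in worker_thread() and rescuer_thread(), in the worker's own
     * context after detaching: */
    worker->pool = NULL;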
-
- Sep 11, 2024
-
-
Yonghong Song authored
Salvatore Benedetto reported an issue that when doing syscall tracepoint tracing the kernel stack is empty. For example, using the following command line bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }' the output for both commands is === Kernel Stack === Further analysis shows that the pt_regs used for bpf syscall tracepoint tracing is the one constructed during the user->kernel transition. The call stack looks like perf_syscall_enter+0x88/0x7c0 trace_sys_enter+0x41/0x80 syscall_trace_enter+0x100/0x160 do_syscall_64+0x38/0xf0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The ip address stored in pt_regs is from user space, hence no kernel stack is printed. To fix the issue, a kernel address from pt_regs is required. In the kernel repo, there are already a few cases like this. For example, in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) instances are used to supply the ip address or use the ip address to construct the call stack. Instead of allocating fake_regs on the stack, which may consume a lot of bytes, the function perf_trace_buf_alloc() in perf_syscall_{enter,exit}() is leveraged to create fake_regs, which will be passed to perf_call_bpf_{enter,exit}(). For the above bpftrace script, I got the following output with this patch: for tracepoint:syscalls:sys_enter_read === Kernel Stack syscall_trace_enter+407 syscall_trace_enter+407 do_syscall_64+74 entry_SYSCALL_64_after_hwframe+75 === and for tracepoint:syscalls:sys_exit_read === Kernel Stack syscall_exit_work+185 syscall_exit_work+185 syscall_exit_to_user_mode+305 do_syscall_64+118 entry_SYSCALL_64_after_hwframe+75 === Reported-by:
Salvatore Benedetto <salvabenedetto@meta.com> Suggested-by:
Andrii Nakryiko <andrii@kernel.org> Signed-off-by:
Yonghong Song <yonghong.song@linux.dev> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
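A sketch of the approach in perf_syscall_enter(), per the message; perf_trace_buf_alloc() already hands back a pt_regs to fill, and the perf_call_bpf_enter() argument order here is illustrative:

    struct pt_regs *fake_regs = NULL;
    struct syscall_trace_enter *rec;

    rec = perf_trace_buf_alloc(size, &fake_regs, &rctx);
    if (!rec)
            return;
    /* capture a kernel-side ip/frame instead of the user pt_regs */
    perf_fetch_caller_regs(fake_regs);
    perf_call_bpf_enter(sys_data, fake_regs, rec);  /* arg order illustrative */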
-
Tao Chen authored
Percpu maps are often used, but the map value size limit is often ignored, as in this issue: https://github.com/iovisor/bcc/issues/2519 . Actually, the percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we can first check whether the value size exceeds PCPU_MIN_UNIT_SIZE, like the percpu map of local_storage does. The error message also seems clearer compared with "cannot allocate memory". Signed-off-by:
Jinke Han <jinkehan@didiglobal.com> Signed-off-by:
Tao Chen <chen.dylane@gmail.com> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Acked-by:
Jiri Olsa <jolsa@kernel.org> Acked-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
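A sketch of the early check in map_create(); the placement and the exact list of map types are assumptions:

    /* Per-cpu chunks bound per-cpu allocations, so fail early with a
     * clearer error than the allocator's late -ENOMEM. */
    if ((attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
         attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
         attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) &&
        attr->value_size > PCPU_MIN_UNIT_SIZE)
            return -E2BIG;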
-