  1. Sep 27, 2024
• sched_ext: Enable scx_ops_init_task() separately · 4269c603
      Tejun Heo authored
      
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by the scx_enabled() test and thus __scx_ops_enabled had to be
turned on before the first scx_ops_init_task() loop in scx_ops_enable().
However, if an external entity causes a sched_class switch before the loop is
complete, tasks which have not been initialized yet could be switched to SCX.
      
      The following can be reproduced by running a program which keeps toggling a
      process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
      
        sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
        WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: simple (enabling)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by gating scx_ops_init_task() separately using
      scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
      finished with scx_ops_init_task().
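
As a rough sketch of the new gating (assuming a fork-path hook shaped like
scx_fork(); only scx_ops_init_task_enabled and scx_ops_init_task() are from
this patch):

  /* sketch: the fork path now keys off the new flag, not scx_enabled() */
  int scx_fork(struct task_struct *p)
  {
  	if (scx_ops_init_task_enabled)
  		return scx_ops_init_task(p, task_group(p), false);
  	return 0;
  }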
      
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable() · 9753358a
      Tejun Heo authored
      
scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter switches the eligible ones
into SCX. The first loop left the tasks in the SCX_TASK_INIT state and the
second loop then switched them into READY before switching each task into
SCX.

The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task has finished forking so that
ops.exit_task() can be called accordingly. Leaving tasks in the INIT state
between the two loops is inconsistent with the fork path and incorrect. The
following can be triggered by running a program which keeps toggling a task
between SCHED_OTHER and SCHED_EXT while a BPF scheduler is being enabled:
      
        sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
        WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: qmap (enabling+all)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by transitioning to READY in the first loop right after
      scx_ops_init_task() succeeds.
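
A minimal sketch of the fixed first loop (the iterator and
scx_set_task_state() helper names are assumptions; only the immediate
READY transition is from this patch):

  /* sketch: first task iteration loop in scx_ops_enable() */
  scx_task_iter_start(&sti);
  while ((p = scx_task_iter_next_locked(&sti))) {
  	ret = scx_ops_init_task(p, task_group(p), false);
  	if (ret)
  		goto err;
  	/* move to READY immediately, matching the fork path */
  	scx_set_task_state(p, SCX_TASK_READY);
  }
  scx_task_iter_stop(&sti);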
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
• sched_ext: Initialize in bypass mode · 8c2090c5
      Tejun Heo authored
      
scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preempt_disable(), an external entity can put tasks
including the enabling one into SCX prematurely, which can lead to
malfunctions including stalls.
      
      The bypass mode can wrap the entire enabling operation and guarantee forward
      progress no matter what the BPF scheduler does. Use the bypass mode instead
      to guarantee forward progress while enabling.
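
In sketch form, the enable path becomes (scx_ops_bypass() is the existing
bypass toggle; everything in between is elided):

  /* sketch: wrap the whole enable sequence in bypass mode */
  scx_ops_bypass(true);
  /* init all tasks, then switch the eligible ones into SCX */
  scx_ops_bypass(false);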
      
While at it, release and regrab scx_tasks_lock between the two task
iteration loops in scx_ops_enable() for clarity, as there is no reason to
keep holding the lock between them.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched_ext: Remove SCX_OPS_PREPPING · fc1fcebe
      Tejun Heo authored
      
      The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
      anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable() · 1bbcfe62
      Tejun Heo authored
      
check_hotplug_seq() is used to detect CPU hotplug events which occur while
the BPF scheduler is being loaded, so that initialization can be retried if
CPU hotplug events take place before the CPU hotplug callbacks are online.

As such, the best place to call it is in the same cpus_read_lock() section
that enables the CPU hotplug ops. Currently, it is called in the next
cpus_read_lock() block in scx_ops_enable(). The side effect of this
placement is a small window in which hotplug sequence detection can trigger
unnecessarily, which isn't critical.
      
      Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
      the hotplug operation enablement to close the window and get the invocation
      out of the way for planned locking updates.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
  2. Sep 26, 2024
• sched_ext: Use shorter slice while bypassing · 6f34d8d3
      Tejun Heo authored
      
      While bypassing, tasks are scheduled in FIFO order which favors tasks that
      hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
      bypassing, guaranteeing timely forward progress is the main goal. There's no
      point in giving long slices. Shorten the time slice used while bypassing
      from 20ms to 5ms.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
• sched_ext: Split the global DSQ per NUMA node · b7b3b2db
      Tejun Heo authored
      
      In the bypass mode, the global DSQ is used to schedule all tasks in simple
      FIFO order. All tasks are queued into the global DSQ and all CPUs try to
      execute tasks from it. This creates a lot of cross-node cacheline accesses
      and scheduling across the node boundaries, and can lead to live-lock
      conditions where the system takes tens of minutes to disable the BPF
      scheduler while executing in the bypass mode.
      
      Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
      task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
      the task's CPU and all CPUs in a node only consume its node-local global
      DSQ.
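
A sketch of the resulting lookup, reusing the find_global_dsq() helper
named elsewhere in this series (the per-node array is an assumption):

  /* sketch: pick the global DSQ of the task's NUMA node */
  static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
  {
  	return global_dsqs[cpu_to_node(task_cpu(p))];
  }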
      
This resolves a livelock condition which could be reliably triggered on a
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
      `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
      and disabling a SCX scheduler.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
• sched_ext: Relocate find_user_dsq() · bba26bf3
      Tejun Heo authored
      
      To prepare for the addition of find_global_dsq(). No functional changes.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
• sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() · 63fb3ec8
Tejun Heo authored
      
      SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
      is consumed implicitly, but all BPF DSQ related kfuncs could be used on it.
      SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
      won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
      scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
      SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
      
scx_flatcg, which was using SCX_DSQ_GLOBAL as its fallback DSQ, has been
updated accordingly, so this shouldn't affect any schedulers.
      
      This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq().
      Open code and remove find_non_local_dsq().
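
Conceptually, the affected kfuncs now reject anything but a user DSQ; a
sketch (SCX_DSQ_FLAG_BUILTIN marks built-in DSQ IDs such as
SCX_DSQ_GLOBAL):

  /* sketch: built-in DSQ IDs are no longer accepted here */
  if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
  	return -EINVAL;
  dsq = find_user_dsq(dsq_id);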
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
3. Sep 23, 2024
• sched_ext: Provide a sysfs enable_seq counter · 431844b6
      Andrea Righi authored
      
      As discussed during the distro-centric session within the sched_ext
      Microconference at LPC 2024, introduce a sequence counter that is
      incremented every time a BPF scheduler is loaded.
      
This feature can help distributions in diagnosing potential performance
regressions by identifying systems where users are running (or have run)
custom BPF schedulers.
      
      Example:
      
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       0
       arighi@virtme-ng~> sudo scx_simple
       local=1 global=0
       ^CEXIT: unregistered from user space
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       1
      
      In this way user-space tools (such as Ubuntu's apport and similar) are
      able to gather and include this information in bug reports.
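
A plausible sketch of the attribute's read side (the counter variable name
is an assumption):

  /* sketch: read-only counter bumped on every scheduler load */
  static ssize_t scx_attr_enable_seq_show(struct kobject *kobj,
  					struct kobj_attribute *ka, char *buf)
  {
  	return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_enable_seq));
  }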
      
      Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
      Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
      Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
      Cc: Phil Auld <pauld@redhat.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched_ext: Fix build when !CONFIG_STACKTRACE · 62d3726d
      Tejun Heo authored
      
a2f4b16e ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
fixing the build when !CONFIG_STACKTRACE but didn't do so fully. Also put
stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
the build when !CONFIG_STACKTRACE.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
• sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled() · edf1c586
      Pat Somaru authored
      
Disable the rq empty path when scx is enabled. SCX must consult the BPF
scheduler (via the dispatch path in balance) to determine whether the rq is
empty.

This fixes stalls when scx is enabled.
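
The change boils down to one extra condition; a sketch, assuming the fast
path shape introduced by the cited commit:

  /* sketch: in __schedule(), skip the rq-empty fast path under SCX */
  if (sched_mode == SM_IDLE && !rq->nr_running && !scx_enabled()) {
  	next = prev;
  	goto picked;
  }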
      
Signed-off-by: Pat Somaru <patso@likewhatevs.io>
Fixes: 3dcac251 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT · 7ebd84d6
      Yu Liao authored
      
When built with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
the idle member is not defined:
      
      kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
        3701 |         if (!tg->idle)
             |                ^~
      
      Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
      
      tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
• sched: Add dummy version of sched_group_set_idle() · bdeb868c
      Yu Liao authored
      
Fix the following error when building with CONFIG_GROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED:
      
      kernel/sched/core.c:9634:15: error: implicit declaration of function
      'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
        9634 |         ret = sched_group_set_idle(css_tg(css), idle);
             |               ^~~~~~~~~~~~~~~~~~~~
             |               scx_group_set_idle
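
The conventional fix is a static-inline stub; a sketch of what the dummy
might look like (the zero return value is an assumption):

  /* sketch: no-op fallback when !CONFIG_FAIR_GROUP_SCHED */
  static inline int sched_group_set_idle(struct task_group *tg, long idle)
  {
  	return 0;
  }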
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
• dma-mapping: report unlimited DMA addressing in IOMMU DMA path · b348b6d1
      Leon Romanovsky authored
While using the IOMMU DMA path, the dma_addressing_limited() function
checks the ops struct, which doesn't exist in the IOMMU case. This causes
a kernel panic while loading the AMDGPU driver.
      
      BUG: kernel NULL pointer dereference, address: 00000000000000a0
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 10 UID: 0 PID: 611 Comm: (udev-worker) Tainted: G                T  6.11.0-clang-07154-g726e2d0cf2bb #257
      Tainted: [T]=RANDSTRUCT
      Hardware name: ASUS System Product Name/ROG STRIX Z690-G GAMING WIFI, BIOS 3701 07/03/2024
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x65/0xc0
       ? page_fault_oops+0x3b9/0x450
       ? _prb_read_valid+0x212/0x390
       ? do_user_addr_fault+0x608/0x680
       ? exc_page_fault+0x4e/0xa0
       ? asm_exc_page_fault+0x26/0x30
       ? dma_addressing_limited+0x53/0xa0
       amdgpu_ttm_init+0x56/0x4b0 [amdgpu]
       gmc_v8_0_sw_init+0x561/0x670 [amdgpu]
       amdgpu_device_ip_init+0xf5/0x570 [amdgpu]
       amdgpu_device_init+0x1a57/0x1ea0 [amdgpu]
       ? _raw_spin_unlock_irqrestore+0x1a/0x40
       ? pci_conf1_read+0xc0/0xe0
       ? pci_bus_read_config_word+0x52/0xa0
       amdgpu_driver_load_kms+0x15/0xa0 [amdgpu]
       amdgpu_pci_probe+0x1b7/0x4c0 [amdgpu]
       pci_device_probe+0x1c5/0x260
       really_probe+0x130/0x470
       __driver_probe_device+0x77/0x150
       driver_probe_device+0x19/0x120
       __driver_attach+0xb1/0x1e0
       ? __cfi___driver_attach+0x10/0x10
       bus_for_each_dev+0x115/0x170
       bus_add_driver+0x192/0x2d0
       driver_register+0x5c/0xf0
       ? __cfi_init_module+0x10/0x10 [amdgpu]
       do_one_initcall+0x128/0x380
       ? idr_alloc_cyclic+0x139/0x1d0
       ? security_kernfs_init_security+0x42/0x140
       ? __kernfs_new_node+0x1be/0x250
       ? sysvec_apic_timer_interrupt+0xb6/0xc0
       ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
       ? _raw_spin_unlock+0x11/0x30
       ? free_unref_page+0x283/0x650
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? load_module+0xf2e/0x1130
       ? __kmalloc_cache_noprof+0x12a/0x2e0
       do_init_module+0x7d/0x240
       __se_sys_init_module+0x19e/0x220
       do_syscall_64+0x8a/0x150
       ? __irq_exit_rcu+0x5e/0x100
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x7fe6bb5980ee
      Code: 48 8b 0d 3d ed 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0a ed 12 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd462219d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
      RAX: ffffffffffffffda RBX: 0000556caf0d0670 RCX: 00007fe6bb5980ee
      RDX: 0000556caf0d3080 RSI: 0000000002893458 RDI: 00007fe6b3400010
      RBP: 0000000000020000 R08: 0000000000020010 R09: 0000000000000080
      R10: c26073c166186e00 R11: 0000000000000206 R12: 0000556caf0d3430
      R13: 0000556caf0d0670 R14: 0000556caf0d3080 R15: 0000556caf0ce700
       </TASK>
      Modules linked in: amdgpu(+) i915(+) drm_suballoc_helper intel_gtt drm_exec drm_buddy iTCO_wdt i2c_algo_bit intel_pmc_bxt drm_display_helper iTCO_vendor_support gpu_sched drm_ttm_helper cec ttm amdxcp video backlight pinctrl_alderlake nct6775 hwmon_vid nct6775_core coretemp
      CR2: 00000000000000a0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
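
The fix is to recognize the IOMMU path before dereferencing the (absent)
ops; roughly (a sketch, not the verbatim patch):

  /* sketch: bail out for IOMMU DMA instead of touching ops */
  bool dma_addressing_limited(struct device *dev)
  {
  	if (min_not_zero(dma_get_mask(dev), dev->bus_dma_limit) <
  	    dma_get_required_mask(dev))
  		return true;
  	if (unlikely(get_dma_ops(dev)) || use_dma_iommu(dev))
  		return false;
  	return !dma_direct_all_ram_mapped(dev);
  }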
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219292

Reported-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
4. Sep 20, 2024
• crash: Fix riscv64 crash memory reserve dead loop · b3f835cd
      Jinjie Ruan authored
      
On a RISCV64 QEMU machine with 512MB memory, the cmdline "crashkernel=500M,high"
causes the system to stall as below:
      
      	 Zone ranges:
      	   DMA32    [mem 0x0000000080000000-0x000000009fffffff]
      	   Normal   empty
      	 Movable zone start for each node
      	 Early memory node ranges
      	   node   0: [mem 0x0000000080000000-0x000000008005ffff]
      	   node   0: [mem 0x0000000080060000-0x000000009fffffff]
      	 Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
      	(stall here)
      
commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
bug") fixed this on 32-bit architectures. However, the problem is not
completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on a
64-bit architecture, for example, when system memory is equal to
CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur:
      
      	-> reserve_crashkernel_generic() and high is true
      	   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
      	      -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
      	         (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).
      
As Catalin suggested, do not remove the ",high" reservation fallback to
",low" logic, which would change arm64's kdump behavior; instead, fix it by
skipping the above situation, similar to commit d2f32f23190b ("crash: fix
x86_32 crash memory reserve dead loop").
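
In sketch form, with names borrowed from the generic reservation code
(the exact placement is an assumption):

  /* sketch: ",high" failed; fall back to the low range only if it
   * actually differs from the high range that was just tried */
  if (CRASH_ADDR_LOW_MAX == CRASH_ADDR_HIGH_MAX)
  	goto out;	/* retrying the same range would loop forever */
  crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
  					0, CRASH_ADDR_LOW_MAX);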
      
After this patch, it prints:
	cannot allocate crashkernel (size:0x1f400000)
      
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com

Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
5. Sep 17, 2024
• uprobes: turn xol_area->pages[2] into xol_area->page · 2abbcc09
      Oleg Nesterov authored
Now that xol_mapping has its own ->fault() method we no longer need
xol_area->pages[1] == NULL; a single page suffices.
      
Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• uprobes: introduce the global struct vm_special_mapping xol_mapping · 6d27a31e
      Oleg Nesterov authored
Currently each xol_area has its own instance of vm_special_mapping; this
is suboptimal and ugly.  Kill xol_area->xol_mapping and add a single
global instance of vm_special_mapping, so the ->fault() method can use
area->pages rather than xol_mapping->pages.
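
A sketch of the single global instance (the fault handler name is an
assumption):

  /* sketch: one global special mapping shared by all xol_areas */
  static struct vm_special_mapping xol_mapping = {
  	.name	= "[uprobes]",
  	.fault	= xol_fault,
  };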
      
As a side effect this fixes the problem introduced by the recent commit
223febc6 ("mm: add optional close() to struct vm_special_mapping"): if
special_mapping_close() is called from the __mmput() paths, it will use
vma->vm_private_data = &area->xol_mapping which has been freed by
uprobe_clear_state().
      
Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com

Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/

      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• Revert "uprobes: use vm_special_mapping close() functionality" · ed8d5b0c
      Oleg Nesterov authored
      This reverts commit 08e28de1.
      
      A malicious application can munmap() its "[uprobes]" vma and in this case
      xol_mapping.close == uprobe_clear_state() will free the memory which can
      be used by another thread, or the same thread when it hits the uprobe bp
      afterwards.
      
Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• resource, kunit: add test case for region_intersects() · 99185c10
      Huang Ying authored
      Patch series "resource: Fix region_intersects() vs
      add_memory_driver_managed()", v3.
      
The patchset fixes a bug in region_intersects() for systems with CXL
memory.  The details of the bug can be found in [1/3].  To avoid similar
bugs in the future, a kunit test case for region_intersects() is added in
[3/3]; [2/3] is a preparation patch for [3/3].
      
      
      This patch (of 3):
      
region_intersects() is important because it's used for /dev/mem permission
checking.  To avoid possible bugs in region_intersects() in the future, a
kunit test case for it is added.
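
A hypothetical flavor of such a case (ram_start and the expectation are
illustrative only):

  /* sketch: probe a fake "System RAM" range inserted by the test */
  static void region_intersects_test(struct kunit *test)
  {
  	KUNIT_EXPECT_EQ(test,
  			region_intersects(ram_start, SZ_4K,
  					  IORESOURCE_SYSTEM_RAM,
  					  IORES_DESC_NONE),
  			REGION_INTERSECTS);
  }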
      
Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• resource: make alloc_free_mem_region() works for iomem_resource · bacf9c3c
      Huang Ying authored
      During developing a kunit test case for region_intersects(), some fake
      resources need to be inserted into iomem_resource.  To do that, a resource
      hole needs to be found first in iomem_resource.
      
However, alloc_free_mem_region() cannot currently work on iomem_resource,
because the start address to check cannot be 0 (gfr_continue() uses a wrap
to 0 to detect address overflow) while iomem_resource.start == 0.  To make
alloc_free_mem_region() work for iomem_resource, change gfr_start() to
avoid returning 0 even if base->start == 0; 0 doesn't need to be considered
as a candidate start address.
      
Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
• resource: fix region_intersects() vs add_memory_driver_managed() · b4afe418
      Huang Ying authored
      On a system with CXL memory, the resource tree (/proc/iomem) related to
      CXL memory may look like something as follows.
      
      490000000-50fffffff : CXL Window 0
        490000000-50fffffff : region0
          490000000-50fffffff : dax0.0
            490000000-50fffffff : System RAM (kmem)
      
This is because drivers/dax/kmem.c calls add_memory_driver_managed() when
onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
Window X".  This confuses region_intersects(), which expects all "System
RAM" resources to be at the top level of iomem_resource.  This can lead to
bugs.
      
      For example, when the following command line is executed to write some
      memory in CXL memory range via /dev/mem,
      
       $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
       dd: error writing '/dev/mem': Bad address
       1+0 records in
       0+0 records out
       0 bytes copied, 0.0283507 s, 0.0 kB/s
      
the command fails as expected.  However, the error code is wrong.  It
should be "Operation not permitted" instead of "Bad address".  More
seriously, the /dev/mem permission checking in devmem_is_allowed() passes
incorrectly.  Although the access is prevented later because ioremap()
isn't allowed to map system RAM, it is a potential security issue.  During
command execution, the following warning is reported in the kernel log for
calling ioremap() on system RAM.
      
       ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
       WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
       Call Trace:
        memremap+0xcb/0x184
        xlate_dev_mem_ptr+0x25/0x2f
        write_mem+0x94/0xfb
        vfs_write+0x128/0x26d
        ksys_write+0xac/0xfe
        do_syscall_64+0x9a/0xfd
        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      
The details of the command execution are as follows.  In the above
resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
top-level resource.  So, region_intersects() incorrectly reports no System
RAM resources in the CXL memory region, because it only checks the top
level resources.  Consequently, devmem_is_allowed() incorrectly returns 1
(allow access via /dev/mem) for the CXL memory region.  Fortunately,
ioremap() doesn't allow mapping System RAM and rejects the access.

So, region_intersects() needs to be fixed to work correctly with resource
trees where "System RAM" is not at the top level, as above.  To fix it, if
an unmatched resource is found at the top level, continue to search for
matched resources among its descendants.  This way, no matched resources
in the resource tree are missed anymore.
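
In sketch form (the helper names are assumptions; the real patch folds
this into __region_intersects()):

  /* sketch: an overlapping top-level resource of the wrong type no
   * longer ends the search; its descendants are examined too */
  for (res = iomem_resource.child; res; res = res->sibling) {
  	if (!resource_overlaps(res, &request))
  		continue;
  	if (is_type_match(res, flags, desc) ||
  	    children_match(res, flags, desc))
  		is_match = true;	/* e.g. "System RAM" under
  					 * "CXL Window 0" */
  }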
      
      In the new implementation, an example resource tree
      
      |------------- "CXL Window 0" ------------|
      |-- "System RAM" --|
      
will behave similarly to the following fake resource tree for
region_intersects(, IORESOURCE_SYSTEM_RAM, ),
      
      |-- "System RAM" --||-- "CXL Window 0a" --|
      
      Where "CXL Window 0a" is part of the original "CXL Window 0" that
      isn't covered by "System RAM".
      
Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com

Fixes: c221c0b0 ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6. Sep 12, 2024
• dma-mapping: reflow dma_supported · a5fb217f
      Christoph Hellwig authored
      
      dma_supported has become too much spaghetti for my taste.  Reflow it to
      remove the duplicate use_dma_iommu condition and make the main path more
      obvious.
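
The reflowed function roughly takes this shape (a sketch, not the verbatim
patch):

  /* sketch: one obvious main path instead of nested special cases */
  static bool dma_supported(struct device *dev, u64 mask)
  {
  	const struct dma_map_ops *ops = get_dma_ops(dev);

  	if (use_dma_iommu(dev))
  		return true;
  	if (ops) {
  		if (!ops->dma_supported)
  			return true;
  		return ops->dma_supported(dev, mask);
  	}
  	return dma_direct_supported(dev, mask);
  }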
      
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
• uidgid: make sure we fit into one cacheline · 2077006d
      Christian Brauner authored
When I expanded uidgid mappings I intended for a struct uid_gid_map to
fit into a single cacheline on x86 as they tend to be pretty
performance sensitive (idmapped mounts etc). But a 4 byte hole was added
that brought it over 64 bytes. Fix that by moving the static extent
array and the extent counter into a substruct. C's type punning for
unions guarantees that we can access ->nr_extents even if the last
written to member wasn't within the same object. This is also what we
rely on in struct_group() and friends. This of course relies on
non-strict aliasing, which the kernel uses anyway (it is built with
-fno-strict-aliasing).
      
      99) If the member used to read the contents of a union object is not the
          same as the member last used to store a value in the object, the
          appropriate part of the object representation of the value is
          reinterpreted as an object representation in the new type as
          described in 6.2.6 (a process sometimes called "type punning").
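
A self-contained illustration of the layout trick (field and constant
names mirror the kernel's; the exact sizes assume x86-64):

  #include <stdint.h>
  #include <assert.h>

  struct uid_gid_extent {			/* 12 bytes */
  	uint32_t first, lower_first, count;
  };

  #define UID_GID_MAP_MAX_BASE_EXTENTS 5

  struct uid_gid_map {			/* 64 bytes -- one cacheline */
  	union {
  		struct {
  			struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
  			uint32_t nr_extents;	/* fills the former 4-byte hole */
  		};
  		struct {
  			struct uid_gid_extent *forward;
  			struct uid_gid_extent *reverse;
  		};
  	};
  };

  /* ->nr_extents stays readable via union type punning even when the
   * pointer pair was the member last stored to */
  static_assert(sizeof(struct uid_gid_map) == 64, "one cacheline");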
      
Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
• dma-mapping: reliably inform about DMA support for IOMMU · f45cfab2
      Leon Romanovsky authored
      If the DMA IOMMU path is going to be used, the appropriate check should
      return that DMA is supported.
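
The essence of the fix, as a sketch inside dma_supported():

  /* sketch: the IOMMU DMA path always supports DMA */
  if (use_dma_iommu(dev))
  	return true;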
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano

Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
• sched: Move update_other_load_avgs() to kernel/sched/pelt.c · 902d67a2
      Tejun Heo authored
      
      96fd6c65 ("sched: Factor out update_other_load_avgs() from
      __update_blocked_others()") added update_other_load_avgs() in
      kernel/sched/syscalls.c right above effective_cpu_util(). This location
      didn't fit that well in the first place, and with 5d871a63 ("sched/fair:
      Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
      effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
      place.
      
      Relocate the function to kernel/sched/pelt.c where all its callees are.
      
      No functional changes.
      
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
• workqueue: Clear worker->pool in the worker thread context · 73613840
      Lai Jiangshan authored
      Marc Hartmayer reported:
              [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
              [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
              [   23.133954] Fault in home space mode while using kernel ASCE.
              [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
              [   23.134207] Oops: 0004 ilc:2 [#1] SMP
      	(snip)
              [   23.134516] Call Trace:
              [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
              [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
              [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
              [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
              [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
              [   23.134552] Last Breaking-Event-Address:
              [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
              [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops
      
With debugging and analysis, worker_thread() accesses the nullified
worker->pool when the newly created worker is destroyed before being
woken up, in which case worker_thread() can see the result of
detach_worker() resetting worker->pool to NULL at the beginning.

Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.

worker->pool had been designed to be constant for regular workers and
changeable for the rescuer. To share attaching/detaching code between
regular and rescuer workers, and to avoid worker->pool being accessed
inadvertently when the worker has been detached, worker->pool is reset to
NULL when detached, no matter whether the worker is a rescuer or not.

To keep worker->pool being reset after detaching, move the code
"worker->pool = NULL;" into the worker thread context after detaching.

It is then either in the regular worker thread context after PF_WQ_WORKER
is cleared, or in the rescuer worker thread context with
wq_pool_attach_mutex held. So it is safe to do so.
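
A minimal sketch of the regular-worker exit path after the change (locking
and surrounding steps elided or assumed):

  /* sketch: WORKER_DIE path in worker_thread(); the worker was
   * already detached by idle_cull_fn()/destroy_worker() */
  if (worker->flags & WORKER_DIE) {
  	worker->pool = NULL;	/* reset only in our own context */
  	kfree(worker);
  	return 0;
  }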
      
      Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/

Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>