  1. Oct 03, 2024
    • sched: psi: fix bogus pressure spikes from aggregation race · 3840cbe2
      Johannes Weiner authored
      
      Brandon reports sporadic, nonsensical spikes in cumulative pressure
      time (total=) when reading cpu.pressure at a high rate. This is due to
      a race condition between reader aggregation and tasks changing states.
      
      While it affects all states and all resources captured by PSI, in
      practice it most likely triggers with CPU pressure, since scheduling
      events are so frequent compared to other resource events.
      
      The race context is the live snooping of ongoing stalls during a
      pressure read. The read aggregates per-cpu records for stalls that
      have concluded, but will also incorporate ad-hoc the duration of any
      active state that hasn't been recorded yet. This is important to get
      timely measurements of ongoing stalls. Those ad-hoc samples are
      calculated on-the-fly up to the current time on that CPU; since the
      stall hasn't concluded, it's expected that this is the minimum amount
      of stall time that will enter the per-cpu records once it does.
      
      The problem is that the path that concludes the state uses a CPU clock
      read that is not synchronized against aggregators; the clock is read
      outside of the seqlock protection. This allows aggregators to race and
      snoop a stall with a longer duration than will actually be recorded.
      
      With the recorded stall time being less than the last snapshot
      remembered by the aggregator, a subsequent sample will underflow and
      observe a bogus delta value, resulting in an erratic jump in pressure.
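      The underflow can be reproduced in isolation with a tiny userspace demo
      (a hedged illustration of the arithmetic only, not kernel code; the
      numbers are made up):
      
      ```
      #include <stdio.h>
      #include <stdint.h>
      
      int main(void)
      {
      	/* aggregator's last live-snooped snapshot, which raced ahead */
      	uint64_t snapshot = 1000;
      	/* stall time that actually entered the per-cpu record */
      	uint64_t recorded = 990;
      
      	/* unsigned subtraction wraps around: a huge bogus delta */
      	uint64_t delta = recorded - snapshot;
      
      	printf("bogus delta: %llu\n", (unsigned long long)delta);
      	return 0;
      }
      ```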
      
      Fix this by moving the clock read of the state change into the seqlock
      protection. This ensures no aggregation can snoop live stalls past the
      time that's recorded when the state concludes.
      
      Reported-by: Brandon Duffany <brandon@buildbuddy.io>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
      Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
      
      Fixes: df774306 ("psi: Reduce calls to sched_clock() in psi")
      Cc: stable@vger.kernel.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing/hwlat: Fix a race during cpuhp processing · 2a13ca2e
      Wei Li authored
      The cpuhp online/offline processing race also exists in the percpu-mode
      hwlat tracer in theory; apply the same fix there. That is:
      
          T1                       | T2
          [CPUHP_ONLINE]           | cpu_device_down()
           hwlat_hotplug_workfn()  |
                                   |     cpus_write_lock()
                                   |     takedown_cpu(1)
                                   |     cpus_write_unlock()
          [CPUHP_OFFLINE]          |
              cpus_read_lock()     |
              start_kthread(1)     |
              cpus_read_unlock()   |
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
      
      Fixes: ba998f7d ("trace/hwlat: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Fix a race during cpuhp processing · 829e0c9f
      Wei Li authored
      Another exception was found where the "timerlat/1" thread was scheduled
      on CPU0, eventually leading to timer corruption:
      
      ```
      ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
      WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
      Modules linked in:
      CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:debug_print_object+0x7d/0xb0
      ...
      Call Trace:
       <TASK>
       ? __warn+0x7c/0x110
       ? debug_print_object+0x7d/0xb0
       ? report_bug+0xf1/0x1d0
       ? prb_read_valid+0x17/0x20
       ? handle_bug+0x3f/0x70
       ? exc_invalid_op+0x13/0x60
       ? asm_exc_invalid_op+0x16/0x20
       ? debug_print_object+0x7d/0xb0
       ? debug_print_object+0x7d/0xb0
       ? __pfx_timerlat_irq+0x10/0x10
       __debug_object_init+0x110/0x150
       hrtimer_init+0x1d/0x60
       timerlat_main+0xab/0x2d0
       ? __pfx_timerlat_main+0x10/0x10
       kthread+0xb7/0xe0
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x2d/0x40
       ? __pfx_kthread+0x10/0x10
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      ```
      
      After tracing the scheduling events, it was discovered that the migration
      of the "timerlat/1" thread was performed during thread creation. Further
      analysis confirmed that this is because the CPU online processing for
      osnoise is implemented through workers, which are asynchronous with the
      offline processing. By the time the worker is scheduled to create a
      thread, the CPU may have already been removed from the cpu_online_mask
      during the offline process, resulting in the inability to select the
      right CPU:
      
      T1                       | T2
      [CPUHP_ONLINE]           | cpu_device_down()
      osnoise_hotplug_workfn() |
                               |     cpus_write_lock()
                               |     takedown_cpu(1)
                               |     cpus_write_unlock()
      [CPUHP_OFFLINE]          |
          cpus_read_lock()     |
          start_kthread(1)     |
          cpus_read_unlock()   |
      
      To fix this, skip online processing if the CPU is already offline.
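      A minimal sketch of such a guard; the exact placement inside
      osnoise_hotplug_workfn() (and the hwlat equivalent above) is an
      assumption, not a verbatim diff:
      
      ```
      	/* inside the hotplug worker, with the target CPU in hand */
      	cpus_read_lock();
      	if (!cpu_online(cpu))	/* CPU went offline before the worker ran */
      		goto out_unlock;
      	start_kthread(cpu);
      out_unlock:
      	cpus_read_unlock();
      ```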
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
      
      Fixes: c8895e27 ("trace/osnoise: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Drop interface_lock in stop_kthread() · b484a02c
      Wei Li authored
      stop_kthread() is the offline callback for "trace/osnoise:online". Since
      commit 5bfbcd1e ("tracing/timerlat: Add interface_lock around clearing
      of kthread in stop_kthread()"), the following ABBA deadlock scenario is
      introduced:
      
      T1                            | T2 [BP]               | T3 [AP]
      osnoise_hotplug_workfn()      | work_for_cpu_fn()     | cpuhp_thread_fun()
                                    |   _cpu_down()         |   osnoise_cpu_die()
        mutex_lock(&interface_lock) |                       |     stop_kthread()
                                    |     cpus_write_lock() |       mutex_lock(&interface_lock)
        cpus_read_lock()            |     cpuhp_kick_ap()   |
      
      As the interface_lock here is just for protecting the "kthread" field of
      osn_var, use xchg() instead to fix this issue. Also switch back to
      for_each_online_cpu() in stop_per_cpu_kthreads(), as it can take
      cpus_read_lock() again.
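      A hedged sketch of the lockless clearing in stop_kthread(); the variable
      naming follows the description above, not the exact diff:
      
      ```
      	struct task_struct *kthread;
      
      	/* take ownership of the kthread pointer without interface_lock */
      	kthread = xchg(&osn_var->kthread, NULL);
      	if (kthread)
      		kthread_stop(kthread);
      ```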
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com
      
      Fixes: 5bfbcd1e ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline · 0bb0a5c1
      Wei Li authored
      osnoise_hotplug_workfn() is the asynchronous online callback for
      "trace/osnoise:online". It may become congested when a CPU goes online
      and offline repeatedly, and can then be invoked multiple times after a
      given online event.
      
      This leads to a kthread leak and timer corruption. Add a check in
      start_kthread() to prevent this situation.
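      A minimal sketch of the check; its exact form in start_kthread() is an
      assumption:
      
      ```
      	/* in start_kthread(cpu): a kthread already exists, don't create another */
      	if (per_cpu(per_cpu_osnoise_var, cpu).kthread)
      		return 0;
      ```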
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
      
      Fixes: c8895e27 ("trace/osnoise: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Fix trace_check_vprintf() when tp_printk is used · 50a3242d
      Steven Rostedt authored
      When the tp_printk kernel command line option is used, trace events go
      directly to printk(). The output is still checked via the
      trace_check_vprintf() function to make sure the pointers of the trace
      event are legit.
      
      The addition of reading buffers from previous boots required adding a
      delta between the addresses of the previous boot and the current boot so
      that the pointers in the old buffer can still be used. But this required
      adding a trace_array pointer to acquire the delta offsets.
      
      The tp_printk code does not provide a trace_array (tr) pointer, so when
      the offsets were examined, a NULL pointer dereference happened and the
      kernel crashed.
      
      If the trace_array does not exist, just default the delta offsets to zero,
      as that also means the trace event is not being read from a previous boot.
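      A hedged sketch of the fallback described above (the delta fields exist
      on struct trace_array; the exact hunk shape is illustrative):
      
      ```
      	/* tp_printk provides no trace_array; treat the deltas as zero */
      	unsigned long text_delta = 0, data_delta = 0;
      
      	if (iter->tr) {
      		text_delta = iter->tr->text_delta;
      		data_delta = iter->tr->data_delta;
      	}
      ```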
      
      Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
      
      Fixes: 07714b4b ("tracing: Handle old buffer mappings for event strings and functions")
      Reported-by: Alison Schofield <alison.schofield@intel.com>
      Tested-by: Alison Schofield <alison.schofield@intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  2. Oct 02, 2024
    • move asm/unaligned.h to linux/unaligned.h · 5f60d5f6
      Al Viro authored
      asm/unaligned.h is always an include of asm-generic/unaligned.h;
      might as well move that thing to linux/unaligned.h and include
      that - there's nothing arch-specific in that header.
      
      auto-generated by the following:
      
      for i in `git grep -l -w asm/unaligned.h`; do
      	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
      done
      for i in `git grep -l -w asm-generic/unaligned.h`; do
      	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
      done
      git mv include/asm-generic/unaligned.h include/linux/unaligned.h
      git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
      sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
      sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
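      After the move, users simply pull the header from its new home; the
      helpers themselves are unchanged (illustrative before/after):
      
      ```
      -#include <asm/unaligned.h>
      +#include <linux/unaligned.h>
      
       u32 v = get_unaligned_le32(p);	/* works exactly as before */
      ```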
  3. Sep 30, 2024
    • close_range(): fix the logics in descriptor table trimming · 678379e1
      Al Viro authored
      
      Cloning a descriptor table picks the size that would cover all currently
      opened files.  That's fine for clone() and unshare(), but for close_range()
      there's an additional twist - we clone before we close, and it would be
      a shame to have
      	close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
      leave us with a huge descriptor table when we are not going to keep
      anything past stderr, just because some large file descriptor used to
      be open before our call has taken it out.
      
      Unfortunately, it had been dealt with in an inherently racy way -
      sane_fdtable_size() gets a "don't copy anything past that" argument
      (passed via unshare_fd() and dup_fd()), close_range() decides how much
      should be trimmed and passes that to unshare_fd().
      
      The problem is, a range that used to extend to the end of descriptor
      table back when close_range() had looked at it might very well have stuff
      grown after it by the time dup_fd() has allocated a new files_struct
      and started to figure out the capacity of fdtable to be attached to that.
      
      That leads to interesting pathological cases; at the very least it's a
      QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
      a snapshot of descriptor table one might have observed at some point.
      Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
      of unshare(CLONE_FILES) with plain close_range(), ending up with a
      weird state that would never occur with unshare(2) is confusing, to put
      it mildly.
      
      It's not hard to get rid of - all it takes is passing both ends of the
      range down to sane_fdtable_size().  There we are under ->files_lock,
      so the race is trivially avoided.
      
      So we do the following:
      	* switch close_files() from calling unshare_fd() to calling
      dup_fd().
      	* undo the calling convention change done to unshare_fd() in
      60997c3d "close_range: add CLOSE_RANGE_UNSHARE"
      	* introduce struct fd_range, pass a pointer to that to dup_fd()
      and sane_fdtable_size() instead of "trim everything past that point"
      they are currently getting.  NULL means "we are not going to be punching
      any holes"; NR_OPEN_MAX is gone.
      	* make sane_fdtable_size() use find_last_bit() instead of
      open-coding it; it's easier to follow that way.
      	* while we are at it, have dup_fd() report errors by returning
      ERR_PTR(), no need to use a separate int *errorp argument.
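      A hedged sketch of the resulting interface (shapes follow the description
      above; bodies elided, not a verbatim diff):
      
      ```
      struct fd_range {
      	unsigned int from, to;		/* descriptor range being closed */
      };
      
      /* NULL punch_hole means "we are not going to be punching any holes";
       * errors are reported via ERR_PTR() */
      struct files_struct *dup_fd(struct files_struct *oldf,
      			    struct fd_range *punch_hole);
      ```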
      
      Fixes: 60997c3d "close_range: add CLOSE_RANGE_UNSHARE"
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. Sep 27, 2024
    • sched_ext: Remove redundant p->nr_cpus_allowed checker · 95b87369
      Zhang Qiao authored
      
      select_rq_task() has already checked that 'p->nr_cpus_allowed > 1', so the
      'p->nr_cpus_allowed == 1' check in scx_select_cpu_dfl() is redundant.
      
      Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Decouple locks in scx_ops_enable() · efe231d9
      Tejun Heo authored
      
      The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
      cpus_read_lock. Currently, the locks are grabbed together which is prone to
      locking order problems.
      
      For example, currently, there is a possible deadlock involving
      scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
      scx_fork_rwsem due to locking order existing in other subsystems. However,
      there exists a dependency in the other direction during hotplug if hotplug
      needs to fork a new task, which happens in some cases. This leads to the
      following deadlock:
      
         scx_ops_enable()                     hotplug
      
                                               percpu_down_write(&cpu_hotplug_lock)
         percpu_down_write(&scx_fork_rwsem)
         block on cpu_hotplug_lock
                                               kthread_create() waits for kthreadd
                                               kthreadd blocks on scx_fork_rwsem
      
      Note that this doesn't trigger lockdep because the hotplug side dependency
      bounces through kthreadd.
      
      With the preceding scx_cgroup_enabled change, this can be solved by
      decoupling cpus_read_lock, which is needed for static_key manipulations,
      from the other two locks.
      
      - Move the first block of static_key manipulations outside of scx_fork_rwsem
        and scx_cgroup_rwsem. This is now safe with the preceding
        scx_cgroup_enabled change.
      
      - Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
        blocks so that __scx_ops_enabled static_key enabling is outside the two
        rwsems.
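      A hedged sketch of the resulting ordering in scx_ops_enable(); the lock
      and key names are real, the body is illustrative:
      
      ```
      	/* static_key updates only need cpus_read_lock() */
      	cpus_read_lock();
      	/* ... enable hotplug ops, flip per-op static keys ... */
      	cpus_read_unlock();
      
      	percpu_down_write(&scx_fork_rwsem);
      	/* ... first loop: scx_ops_init_task() on every task ... */
      	percpu_up_write(&scx_fork_rwsem);
      
      	static_branch_enable(&__scx_ops_enabled);	/* outside both rwsems */
      
      	percpu_down_write(&scx_fork_rwsem);
      	/* ... second loop: switch eligible tasks into SCX ... */
      	percpu_up_write(&scx_fork_rwsem);
      ```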
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
      Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
    • sched_ext: Decouple locks in scx_ops_disable_workfn() · 16021656
      Tejun Heo authored
      
      The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
      cpus_read_lock. Currently, the locks are grabbed together which is prone to
      locking order problems. With the preceding scx_cgroup_enabled change, we can
      decouple them:
      
      - As cgroup disabling no longer requires modifying a static_key which
        requires cpus_read_lock(), no need to grab cpus_read_lock() before
        grabbing scx_cgroup_rwsem.
      
      - cgroup can now be independently disabled before tasks are moved back to
        the fair class.
      
      Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
      now unnecessary cpus_read_lock() and move static_key operations out of
      scx_fork_rwsem. This decouples all three locks in the disable path.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
      Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
    • sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online() · 568894ed
      Tejun Heo authored
      
      If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
      didn't set SCX_TG_INITED, which meant that ops.cgroup_exit(), even if
      implemented, wouldn't be called from scx_tg_offline(). This is because
      SCX_HAS_OP(cgroup_init) is used to test both whether SCX cgroup operations
      are enabled and whether ops.cgroup_init() exists.
      
      Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
      operations and use SCX_HAS_OP(cgroup_init) only to test whether
      ops.cgroup_init() exists. Make all cgroup operations consistently use
      scx_cgroup_enabled to test whether cgroup operations are enabled.
      scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
      locking updates.
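      A hedged sketch of the gating in scx_tg_online(); scx_call_cgroup_init()
      is a hypothetical wrapper standing in for the ops.cgroup_init() call:
      
      ```
      static bool scx_cgroup_enabled;	/* true while cgroup ops are in effect */
      
      int scx_tg_online(struct task_group *tg)
      {
      	int ret = 0;
      
      	if (scx_cgroup_enabled && SCX_HAS_OP(cgroup_init))
      		ret = scx_call_cgroup_init(tg);	/* hypothetical wrapper */
      
      	if (!ret)
      		tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED;	/* INITED always set */
      
      	return ret;
      }
      ```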
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Enable scx_ops_init_task() separately · 4269c603
      Tejun Heo authored
      
      scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
      were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned
      on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
      if an external entity causes sched_class switch before the loop is complete,
      tasks which are not initialized could be switched to SCX.
      
      The following can be reproduced by running a program which keeps toggling a
      process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
      
        sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
        WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: simple (enabling)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by gating scx_ops_init_task() separately using
      scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
      finished with scx_ops_init_task().
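      A hedged sketch of the fork-path gating described above:
      
      ```
      static bool scx_ops_init_task_enabled;	/* set before the init loop runs */
      
      int scx_fork(struct task_struct *p)
      {
      	/* key off the dedicated flag rather than scx_enabled() */
      	if (scx_ops_init_task_enabled)
      		return scx_ops_init_task(p, task_group(p), true);
      	return 0;
      }
      ```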
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable() · 9753358a
      Tejun Heo authored
      
      scx_ops_enable() has two task iteration loops. The first one calls
      scx_ops_init_task() on every task and the latter switches the eligible ones
      into SCX. The first loop left tasks in SCX_TASK_INIT state, and the second
      loop then switched them into READY before switching each task into SCX.
      
      The distinction between INIT and READY is only meaningful in the fork path,
      where it's used to tell whether the task finished forking so that we can
      tell ops.exit_task() accordingly. Leaving tasks in INIT state between the
      two loops is inconsistent with the fork path and incorrect. The following
      can be triggered by running a program which keeps toggling a task between
      SCHED_OTHER and SCHED_EXT while the scheduler is being enabled:
      
        sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
        WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: qmap (enabling+all)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by transitioning to READY in the first loop right after
      scx_ops_init_task() succeeds.
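      A hedged sketch of the first loop after the fix (task iteration details
      elided):
      
      ```
      	/* for each task: */
      	ret = scx_ops_init_task(p, task_group(p), false);
      	if (ret) {
      		/* ... abort the enable path ... */
      	}
      	scx_set_task_state(p, SCX_TASK_READY);	/* no longer left in INIT */
      ```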
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
    • sched_ext: Initialize in bypass mode · 8c2090c5
      Tejun Heo authored
      
      scx_ops_enable() used preempt_disable() around the task iteration loop to
      switch tasks into SCX to guarantee forward progress of the task which is
      running scx_ops_enable(). However, in the gap between setting
      __scx_ops_enabled and preempt_disable(), an external entity can put tasks
      including the enabling one into SCX prematurely, which can lead to
      malfunctions including stalls.
      
      The bypass mode can wrap the entire enabling operation and guarantee forward
      progress no matter what the BPF scheduler does. Use the bypass mode instead
      to guarantee forward progress while enabling.
      
      While at it, release and regrab scx_tasks_lock between the two task
      iteration loops in scx_ops_enable() for clarity, as there is no reason to
      keep holding the lock between them.
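      A hedged sketch of the wrapping described above:
      
      ```
      	scx_ops_bypass(true);	/* forward progress regardless of the BPF scheduler */
      
      	/* ... first loop: scx_ops_init_task() on all tasks ... */
      	/* ... second loop: switch eligible tasks into SCX ... */
      
      	scx_ops_bypass(false);
      ```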
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Remove SCX_OPS_PREPPING · fc1fcebe
      Tejun Heo authored
      
      The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
      anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable() · 1bbcfe62
      Tejun Heo authored
      
      check_hotplug_seq() is used to detect CPU hotplug events which occur while
      the BPF scheduler is being loaded so that initialization can be retried if
      CPU hotplug events take place before the CPU hotplug callbacks are online.
      
      As such, the best place to call it is in the same cpus_read_lock() section
      that enables the CPU hotplug ops. Currently, it is called in the next
      cpus_read_lock() block in scx_ops_enable(). The side effect of this
      placement is a small window in which hotplug sequence detection can trigger
      unnecessarily, which isn't critical.
      
      Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
      the hotplug operation enablement to close the window and get the invocation
      out of the way for planned locking updates.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
    • [tree-wide] finally take no_llseek out · cb787f4a
      Al Viro authored
      
      no_llseek had been defined to NULL two years ago, in commit 868941b1
      ("fs: remove no_llseek")
      
      To quote that commit,
      
        At -rc1 we'll need do a mechanical removal of no_llseek -
      
        git grep -l -w no_llseek | grep -v porting.rst | while read i; do
      	sed -i '/\<no_llseek\>/d' $i
        done
      
        would do it.
      
      Unfortunately, that hadn't been done.  Linus, could you do that now, so
      that we could finally put that thing to rest? All instances are of the
      form
      	.llseek = no_llseek,
      so it's obviously safe.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Sep 23, 2024
    • sched_ext: Provide a sysfs enable_seq counter · 431844b6
      Andrea Righi authored
      
      As discussed during the distro-centric session within the sched_ext
      Microconference at LPC 2024, introduce a sequence counter that is
      incremented every time a BPF scheduler is loaded.
      
      This feature can help distributions in diagnosing potential performance
      regressions by identifying systems where users are running (or have run)
      custom BPF schedulers.
      
      Example:
      
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       0
       arighi@virtme-ng~> sudo scx_simple
       local=1 global=0
       ^CEXIT: unregistered from user space
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       1
      
      In this way user-space tools (such as Ubuntu's apport and similar) are
      able to gather and include this information in bug reports.
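      For example, a reporting tool could read the counter like this (userspace
      sketch, not part of the patch):
      
      ```
      #include <stdio.h>
      
      int main(void)
      {
      	unsigned long long seq = 0;
      	FILE *f = fopen("/sys/kernel/sched_ext/enable_seq", "r");
      
      	if (f) {
      		if (fscanf(f, "%llu", &seq) != 1)
      			seq = 0;
      		fclose(f);
      	}
      	printf("BPF schedulers loaded so far: %llu\n", seq);
      	return 0;
      }
      ```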
      
      Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
      Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
      Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
      Cc: Phil Auld <pauld@redhat.com>
      Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Fix build when !CONFIG_STACKTRACE · 62d3726d
      Tejun Heo authored
      
      a2f4b16e ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried to
      fix the build when !CONFIG_STACKTRACE but didn't do so fully. Also put
      stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
      the build when !CONFIG_STACKTRACE.
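      The shape of the fix is the usual ifdef guard (placement and the bt/depth
      names are illustrative):
      
      ```
      #ifdef CONFIG_STACKTRACE
      	depth = stack_trace_save(bt, ARRAY_SIZE(bt), 1);
      	stack_trace_print(bt, depth, 1);
      #endif	/* CONFIG_STACKTRACE */
      ```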
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
    • sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled() · edf1c586
      Pat Somaru authored
      
      Disable the rq empty path when scx is enabled. SCX must consult the BPF
      scheduler (via the dispatch path in balance) to determine if the rq is empty.
      
      This fixes stalls when scx is enabled.
      
      Signed-off-by: Pat Somaru <patso@likewhatevs.io>
      Fixes: 3dcac251 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT · 7ebd84d6
      Yu Liao authored
      
      When built with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
      the idle member is not defined:
      
      kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
        3701 |         if (!tg->idle)
             |                ^~
      
      Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
      
      tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
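      A hedged sketch of the resulting field placement (surrounding members
      elided):
      
      ```
      struct task_group {
      	/* ... */
      #ifdef CONFIG_GROUP_SCHED_WEIGHT
      	/* A positive value indicates that this is a SCHED_IDLE group. */
      	int			idle;
      #endif
      	/* ... */
      };
      ```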
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
      
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Add dummy version of sched_group_set_idle() · bdeb868c
      Yu Liao authored
      
      Fix the following error when building with CONFIG_GROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED:
      
      kernel/sched/core.c:9634:15: error: implicit declaration of function
      'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
        9634 |         ret = sched_group_set_idle(css_tg(css), idle);
             |               ^~~~~~~~~~~~~~~~~~~~
             |               scx_group_set_idle
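      A hedged sketch of the dummy version; the signature mirrors the real
      sched_group_set_idle():
      
      ```
      #ifndef CONFIG_FAIR_GROUP_SCHED
      static inline int sched_group_set_idle(struct task_group *tg, long idle)
      {
      	return 0;	/* weight-only builds have no fair-group idle state */
      }
      #endif
      ```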
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
      
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • dma-mapping: report unlimited DMA addressing in IOMMU DMA path · b348b6d1
      Leon Romanovsky authored
      While using the IOMMU DMA path, the dma_addressing_limited() function
      checks the ops struct, which doesn't exist in the IOMMU case. This causes
      a kernel panic while loading the AMDGPU driver.
      
      BUG: kernel NULL pointer dereference, address: 00000000000000a0
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 10 UID: 0 PID: 611 Comm: (udev-worker) Tainted: G                T  6.11.0-clang-07154-g726e2d0cf2bb #257
      Tainted: [T]=RANDSTRUCT
      Hardware name: ASUS System Product Name/ROG STRIX Z690-G GAMING WIFI, BIOS 3701 07/03/2024
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x65/0xc0
       ? page_fault_oops+0x3b9/0x450
       ? _prb_read_valid+0x212/0x390
       ? do_user_addr_fault+0x608/0x680
       ? exc_page_fault+0x4e/0xa0
       ? asm_exc_page_fault+0x26/0x30
       ? dma_addressing_limited+0x53/0xa0
       amdgpu_ttm_init+0x56/0x4b0 [amdgpu]
       gmc_v8_0_sw_init+0x561/0x670 [amdgpu]
       amdgpu_device_ip_init+0xf5/0x570 [amdgpu]
       amdgpu_device_init+0x1a57/0x1ea0 [amdgpu]
       ? _raw_spin_unlock_irqrestore+0x1a/0x40
       ? pci_conf1_read+0xc0/0xe0
       ? pci_bus_read_config_word+0x52/0xa0
       amdgpu_driver_load_kms+0x15/0xa0 [amdgpu]
       amdgpu_pci_probe+0x1b7/0x4c0 [amdgpu]
       pci_device_probe+0x1c5/0x260
       really_probe+0x130/0x470
       __driver_probe_device+0x77/0x150
       driver_probe_device+0x19/0x120
       __driver_attach+0xb1/0x1e0
       ? __cfi___driver_attach+0x10/0x10
       bus_for_each_dev+0x115/0x170
       bus_add_driver+0x192/0x2d0
       driver_register+0x5c/0xf0
       ? __cfi_init_module+0x10/0x10 [amdgpu]
       do_one_initcall+0x128/0x380
       ? idr_alloc_cyclic+0x139/0x1d0
       ? security_kernfs_init_security+0x42/0x140
       ? __kernfs_new_node+0x1be/0x250
       ? sysvec_apic_timer_interrupt+0xb6/0xc0
       ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
       ? _raw_spin_unlock+0x11/0x30
       ? free_unref_page+0x283/0x650
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? load_module+0xf2e/0x1130
       ? __kmalloc_cache_noprof+0x12a/0x2e0
       do_init_module+0x7d/0x240
       __se_sys_init_module+0x19e/0x220
       do_syscall_64+0x8a/0x150
       ? __irq_exit_rcu+0x5e/0x100
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x7fe6bb5980ee
      Code: 48 8b 0d 3d ed 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0a ed 12 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd462219d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
      RAX: ffffffffffffffda RBX: 0000556caf0d0670 RCX: 00007fe6bb5980ee
      RDX: 0000556caf0d3080 RSI: 0000000002893458 RDI: 00007fe6b3400010
      RBP: 0000000000020000 R08: 0000000000020010 R09: 0000000000000080
      R10: c26073c166186e00 R11: 0000000000000206 R12: 0000556caf0d3430
      R13: 0000556caf0d0670 R14: 0000556caf0d3080 R15: 0000556caf0ce700
       </TASK>
      Modules linked in: amdgpu(+) i915(+) drm_suballoc_helper intel_gtt drm_exec drm_buddy iTCO_wdt i2c_algo_bit intel_pmc_bxt drm_display_helper iTCO_vendor_support gpu_sched drm_ttm_helper cec ttm amdxcp video backlight pinctrl_alderlake nct6775 hwmon_vid nct6775_core coretemp
      CR2: 00000000000000a0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
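      A hedged sketch of the guard the subject line describes; its exact
      placement inside dma_addressing_limited() is an assumption:
      
      ```
      	/* in dma_addressing_limited(): */
      	if (use_dma_iommu(dev))
      		return false;	/* IOMMU DMA path: no dma_map_ops, unlimited addressing */
      ```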
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219292
      
      Reported-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>