  1. Feb 01, 2024
    • tracing/timerlat: Move hrtimer_init to timerlat_fd open() · 1389358b
      Daniel Bristot de Oliveira authored
      Currently, the timerlat's hrtimer is initialized at the first read of
      timerlat_fd, and destroyed at close(). It works, but it causes an error
      if the user program opens and closes the file without reading from it.
      
      Here's an example:
      
       # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
       # echo timerlat > /sys/kernel/debug/tracing/current_tracer
      
       # cat <<EOF > ./timerlat_load.py
 #!/usr/bin/env python3
      
       timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
 timerlat_fd.close()
       EOF
      
       # ./taskset -c 0 ./timerlat_load.py
      <BOOM>
      
       BUG: kernel NULL pointer dereference, address: 0000000000000010
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
       RIP: 0010:hrtimer_active+0xd/0x50
       Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
       RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
       RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
       RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
       R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
       FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
       PKRU: 55555554
       Call Trace:
        <TASK>
        ? __die+0x23/0x70
        ? page_fault_oops+0x171/0x4e0
        ? srso_alias_return_thunk+0x5/0x7f
        ? avc_has_extended_perms+0x237/0x520
        ? exc_page_fault+0x7f/0x180
        ? asm_exc_page_fault+0x26/0x30
        ? hrtimer_active+0xd/0x50
        hrtimer_cancel+0x15/0x40
        timerlat_fd_release+0x48/0xe0
        __fput+0xf5/0x290
        __x64_sys_close+0x3d/0x80
        do_syscall_64+0x60/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? __x64_sys_ioctl+0x72/0xd0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? exit_to_user_mode_prepare+0x142/0x1f0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
       RIP: 0033:0x7f2ffb321594
       Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
       RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
       RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
       RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
       R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
       R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
        </TASK>
       CR2: 0000000000000010
       ---[ end trace 0000000000000000 ]---
      
      Move hrtimer_init to timerlat_fd open() to avoid this problem.
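
      A minimal sketch of the resulting shape (assuming the existing
      timerlat_fd file operations; the actual change moves the
      initialization out of the read handler into the open handler):

        static int timerlat_fd_open(struct inode *inode, struct file *file)
        {
                struct timerlat_variables *tlat;
                ...
                tlat = this_cpu_tmr_var();
                /* Initialize here so close() always finds a valid timer. */
                hrtimer_init(&tlat->timer, CLOCK_MONOTONIC,
                             HRTIMER_MODE_ABS_PINNED_HARD);
                tlat->timer.function = timerlat_irq;
                ...
        }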
      
      Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: stable@vger.kernel.org
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  2. Jan 31, 2024
  3. Jan 26, 2024
  4. Jan 25, 2024
    • tick/sched: Preserve number of idle sleeps across CPU hotplug events · 9a574ea9
      Tim Chen authored
      
      Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
      CPU hotplug") preserved total idle sleep time and iowait sleeptime across
      CPU hotplug events.
      
      Similar reasoning applies to the number of idle calls and idle sleeps to
      get the proper average of sleep time per idle invocation.
      
      Preserve those fields too.
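
      A minimal sketch, extending the save/restore that 71fee48f added to
      tick_cancel_sched_timer() (field names from struct tick_sched):

        struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
        unsigned long idle_calls = ts->idle_calls;
        unsigned long idle_sleeps = ts->idle_sleeps;
        ktime_t idle_sleeptime = ts->idle_sleeptime;
        ktime_t iowait_sleeptime = ts->iowait_sleeptime;

        memset(ts, 0, sizeof(*ts));

        ts->idle_sleeptime = idle_sleeptime;
        ts->iowait_sleeptime = iowait_sleeptime;
        ts->idle_calls = idle_calls;            /* preserved by this patch */
        ts->idle_sleeps = idle_sleeps;          /* preserved by this patch */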
      
      Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
    • clocksource: Skip watchdog check for large watchdog intervals · 64464955
      Jiri Wiesner authored
      
      There have been reports of the watchdog marking clocksources unstable on
      machines with 8 NUMA nodes:
      
        clocksource: timekeeping watchdog on CPU373:
        Marking clocksource 'tsc' as unstable because the skew is too large:
        clocksource:   'hpet' wd_nsec: 14523447520
        clocksource:   'tsc'  cs_nsec: 14524115132
      
      The measured clocksource skew - the absolute difference between cs_nsec
      and wd_nsec - was 668 microseconds:
      
        cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
      
      The kernel used 200 microseconds for the uncertainty_margin of both the
      clocksource and watchdog, resulting in a threshold of 400 microseconds (the
      md variable). Both the cs_nsec and the wd_nsec value indicate that the
      readout interval was circa 14.5 seconds.  The observed behaviour is that
      watchdog checks failed for large readout intervals on 8 NUMA node
      machines. This indicates that the size of the skew was directly proportional
      to the length of the readout interval on those machines. The measured
      clocksource skew, 668 microseconds, was evaluated against a threshold (the
      md variable) that is suited for readout intervals of roughly
      WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
      
      The intention of 2e27e793 ("clocksource: Reduce clocksource-skew
      threshold") was to tighten the threshold for evaluating skew and set the
      lower bound for the uncertainty_margin of clocksources to twice
      WATCHDOG_MAX_SKEW. Later in c37e85c1 ("clocksource: Loosen clocksource
      watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
      125 microseconds to fit the limit of NTP, which is able to use a
      clocksource that suffers from up to 500 microseconds of skew per second.
      Both the TSC and the HPET use default uncertainty_margin. When the
      readout interval gets stretched the default uncertainty_margin is no
      longer a suitable lower bound for evaluating skew - it imposes a limit
      that is far stricter than the skew with which NTP can deal.
      
      The root causes of the skew being directly proportional to the length of
      the readout interval are:
      
        * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
        * the conversion to nanoseconds is imprecise for large readout intervals
      
      Prevent this by skipping the current watchdog check if the readout
      interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
      interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
      (of the TSC and HPET) corresponds to a limit on clocksource skew of 250
      ppm (microseconds of skew per second).  To keep the limit imposed by NTP
      (500 microseconds of skew per second) for all possible readout intervals,
      the margins would have to be scaled so that the threshold value is
      proportional to the length of the actual readout interval.
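
      A sketch of the skip logic in clocksource_watchdog() (constant and
      variable names assumed from the patch; bookkeeping elided):

        #define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))

        /* A stretched readout cannot be judged against the fixed margin;
         * skip this pass and let the next, hopefully timely, one decide. */
        interval = max(cs_nsec, wd_nsec);
        if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS))
                continue;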
      
      As for why the readout interval may get stretched: Since the watchdog is
      executed in softirq context the expiration of the watchdog timer can get
      severely delayed on account of a ksoftirqd thread not getting to run in a
      timely manner. Surely, a system with such belated softirq execution is not
      working well and the scheduling issue should be looked into but the
      clocksource watchdog should be able to deal with it accordingly.
      
      Fixes: 2e27e793 ("clocksource: Reduce clocksource-skew threshold")
      Suggested-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240122172350.GA740@incl
  5. Jan 24, 2024
    • exec: Distinguish in_execve from in_exec · 90383cc0
      Kees Cook authored
      
      Just to help distinguish the fs->in_exec flag from the current->in_execve
      flag, add comments in check_unsafe_exec() and copy_fs() for more
      context. Also note that in_execve is only used by TOMOYO now.
      
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • rcu: Defer RCU kthreads wakeup when CPU is dying · e787644c
      Frederic Weisbecker authored
      
      When the CPU goes idle for the last time during the CPU down hotplug
      process, RCU reports a final quiescent state for the current CPU. If
      this quiescent state propagates up to the top, some tasks may then be
      woken up to complete the grace period: the main grace period kthread
      and/or the expedited main workqueue (or kworker).
      
      If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
      arm the RT bandwidth timer on the local offline CPU. Since this happens
      after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
      timer gets ignored. Therefore if the RCU kthreads are waiting for RT
      bandwidth to be available, they may never be actually scheduled.
      
      This triggers TREE03 rcutorture hangs:
      
      	 rcu: INFO: rcu_preempt self-detected stall on CPU
      	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
      	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
      	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
      	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
      	 rcu: RCU grace-period kthread stack dump:
      	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
      	 Call Trace:
      	  <TASK>
      	  __schedule+0x2eb/0xa80
      	  schedule+0x1f/0x90
      	  schedule_timeout+0x163/0x270
      	  ? __pfx_process_timeout+0x10/0x10
      	  rcu_gp_fqs_loop+0x37c/0x5b0
      	  ? __pfx_rcu_gp_kthread+0x10/0x10
      	  rcu_gp_kthread+0x17c/0x200
      	  kthread+0xde/0x110
      	  ? __pfx_kthread+0x10/0x10
      	  ret_from_fork+0x2b/0x40
      	  ? __pfx_kthread+0x10/0x10
      	  ret_from_fork_asm+0x1b/0x30
      	  </TASK>
      
      The situation can't be solved with just unpinning the timer. The hrtimer
      infrastructure and the nohz heuristics involved in finding the best
      remote target for an unpinned timer would then also need to handle
      enqueues from an offline CPU in the most horrendous way.
      
      So fix this on the RCU side instead and defer the wake up to an online
      CPU if it's too late for the local one.
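
      A sketch of the deferral (helper names as in the patch; the IPI
      handler, which just performs the wakeup, is omitted):

        static void swake_up_one_online(struct swait_queue_head *wqh)
        {
                int cpu = get_cpu();

                /* Too late for the local CPU: its hrtimers are gone. */
                if (unlikely(cpu_is_offline(cpu))) {
                        int target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
                                                     cpu_online_mask);

                        smp_call_function_single(target, swake_up_one_online_ipi,
                                                 wqh, 0);
                        put_cpu();
                } else {
                        put_cpu();
                        swake_up_one(wqh);
                }
        }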
      
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Fixes: 5c0930cc ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
    • genirq: Initialize resend_node hlist for all interrupt descriptors · b184c8c2
      Dawei Li authored
      
      For a CONFIG_SPARSE_IRQ=n kernel, early_irq_init() is supposed to
      initialize all interrupt descriptors.
      
      It does, except for irq_desc::resend_node, which is only initialized for the
      first descriptor.
      
      Use the indexed descriptor and not the base pointer to address that.
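
      A sketch of the one-line fix in the early_irq_init() loop (other
      per-descriptor initializers elided):

        for (i = 0; i < count; i++) {
                ...
                desc_set_defaults(i, &desc[i], node, NULL, NULL);
                irq_resend_init(&desc[i]);      /* was: irq_resend_init(desc) */
        }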
      
      Fixes: bc06a9e0 ("genirq: Use hlist for managing resend handlers")
      Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240122085716.2999875-5-dawei.li@shingroup.cn
  6. Jan 22, 2024
    • tracing: Ensure visibility when inserting an element into tracing_map · 2b447606
      Petr Pavlu authored
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check which has an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
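
      A sketch of the resulting writer-side publication pattern (the reader
      side is as quoted above):

        elt = get_free_elt(map);
        memcpy(elt->key, key, map->key_size);

        /* Order the key bytes before the store that publishes the element;
         * pairs with the READ_ONCE() and the val->key address dependency
         * on the reader side. */
        smp_wmb();
        WRITE_ONCE(entry->val, elt);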
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
      Fixes: c193707d ("tracing: Remove code which merges duplicates")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  7. Jan 19, 2024
    • tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug · 71fee48f
      Heiko Carstens authored
      
      When offlining and onlining CPUs the overall reported idle and iowait
      times as reported by /proc/stat jump backward and forward:
      
      cpu  132 0 176 225249 47 6 6 21 0 0
      cpu0 80 0 115 112575 33 3 4 18 0 0
      cpu1 52 0 60 112673 13 3 1 2 0 0
      
      cpu  133 0 177 226681 47 6 6 21 0 0
      cpu0 80 0 116 113387 33 3 4 18 0 0
      
      cpu  133 0 178 114431 33 6 6 21 0 0 <---- jump backward
      cpu0 80 0 116 114247 33 3 4 18 0 0
      cpu1 52 0 61 183 0 3 1 2 0 0        <---- idle + iowait start with 0
      
      cpu  133 0 178 228956 47 6 6 21 0 0 <---- jump forward
      cpu0 81 0 117 114929 33 3 4 18 0 0
      
      The reason for this is that get_idle_time() in fs/proc/stat.c has different
      sources for both values depending on if a CPU is online or offline:
      
      - if a CPU is online the values may be taken from its per cpu
        tick_cpu_sched structure
      
      - if a CPU is offline the values are taken from its per cpu cpustat
        structure
      
      The problem is that the per cpu tick_cpu_sched structure is set to zero on
      CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.
      
      Therefore when a CPU is brought offline and online afterwards both its idle
      and iowait sleeptime will be zero, causing a jump backward in total system
      idle and iowait sleeptime. In a similar way if a CPU is then brought
      offline again the total idle and iowait sleeptimes will jump forward.
      
      It looks like this behavior was introduced with commit 4b0c0f29
      ("tick: Cleanup NOHZ per cpu data on cpu down").
      
      This was only noticed now on s390, since we switched to generic idle time
      reporting with commit be76ea61 ("s390/idle: remove arch_cpu_idle_time()
      and corresponding code").
      
      Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
      members of the per-cpu tick_sched structure on CPU hotplug.
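
      A minimal sketch of the approach in tick_cancel_sched_timer() (field
      names from struct tick_sched; the hrtimer cancellation is elided):

        struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
        ktime_t idle_sleeptime = ts->idle_sleeptime;
        ktime_t iowait_sleeptime = ts->iowait_sleeptime;

        memset(ts, 0, sizeof(*ts));             /* the wipe that lost the totals */

        ts->idle_sleeptime = idle_sleeptime;    /* put them back */
        ts->iowait_sleeptime = iowait_sleeptime;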
      
      Fixes: 4b0c0f29 ("tick: Cleanup NOHZ per cpu data on cpu down")
      Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com
    • futex: Prevent the reuse of stale pi_state · e626cb02
      Sebastian Andrzej Siewior authored
      
      Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
      a lock operation for a PI futex. It requires that the lock operation is
      interrupted by a timeout or signal:
      
        T1 Owns the futex in user space.
      
        T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
           pi_state and attaches itself to it.
      
        T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
           rtmutex lock and tries to acquire the hash bucket lock to remove
           the futex_q. The lock is contended and T2 schedules out.
      
        T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
           rt_waiter. Unlocks the futex (do_uncontended) and makes it available
           to user space.
      
        T3 Acquires the futex in user space.
      
        T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
           existing futex_q of T2 and tries to attach itself to the existing
           pi_state.  This (attach_to_pi_state()) fails with -EINVAL because uval
           contains the TID of T3 but pi_state points to T1.
      
      It's incorrect to unlock the futex and make it available for user space to
      acquire as long as there is still an existing state attached to it in the
      kernel.
      
      T1 cannot hand over the futex to T2 because T2 already gave up and started
      to clean up and is blocked on the hash bucket lock, so T2's futex_q with
      the pi_state pointing to T1 is still queued.
      
      T1 observes the futex_q, but ignores it as there is no waiter on the
      corresponding rt_mutex and takes the uncontended path, which allows the
      subsequent caller of futex_lock_pi() (T4) to observe that stale state.
      
      To prevent this the unlock path must dequeue all futex_q entries which
      point to the same pi_state when there is no waiter on the rt mutex. This
      obviously requires making the dequeue conditional in the locking path to
      prevent a double dequeue. With that it's guaranteed that user space cannot
      observe an uncontended futex which has kernel state attached.
      
      Fixes: fbeb558b ("futex/pi: Fix recursive rt_mutex waiter state")
      Reported-by: Jiri Slaby <jirislaby@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Slaby <jirislaby@kernel.org>
      Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de
      Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
  8. Jan 18, 2024
    • bpf: enforce types for __arg_ctx-tagged arguments in global subprogs · 0ba97151
      Andrii Nakryiko authored
      
      Add enforcement of expected types for context arguments tagged with
      arg:ctx (__arg_ctx) tag.
      
      First, any program type will accept generic `void *` context type when
      combined with __arg_ctx tag.
      
      Besides accepting "canonical" struct names and `void *`, for a bunch of
      program types for which the program context is actually a named struct, we
      allow a set of pragmatic exceptions to match real-world and expected
      usage:
      
        - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
          canonical context argument type, where `bpf_user_pt_regs_t` is a
          *typedef*, not a struct;
        - for kprobes, we also always accept `struct pt_regs *`, as that's what
          actually is passed as a context to any kprobe program;
        - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
          down to actual struct type and accept `struct pt_regs *`, or
          `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
          on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
          typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
          expected;
        - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
          what's expected with BPF_PROG() usage; otherwise, canonical
          `struct bpf_raw_tracepoint_args *` is expected;
        - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
      formats, both are coded as exceptions as tp_btf is actually a TRACING
          program type, which has no canonical context type;
        - iterator programs accept `struct bpf_iter__xxx *` structs, currently
          with no further iterator-type specific enforcement;
        - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
        - classic tracepoint programs, as well as syscall and freplace
          programs allow any user-provided type.
      
      In all other cases the kernel will enforce an exact match of the struct
      name to the expected canonical type. If the user-provided type doesn't
      match that expectation, the verifier will emit a helpful message with the
      expected type name.
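
      For illustration, a hypothetical kprobe program with a tagged global
      subprog (the __arg_ctx definition matches libbpf's bpf_helpers.h;
      function and section names are made up):

        #define __arg_ctx __attribute__((btf_decl_tag("arg:ctx")))

        /* Global (non-static) subprog: the verifier checks that the tagged
         * argument has an acceptable context type for the program type,
         * e.g. struct pt_regs * or bpf_user_pt_regs_t * for kprobes. */
        __noinline int handle_ctx(struct pt_regs *regs __arg_ctx)
        {
                return 0;
        }

        SEC("kprobe/do_nanosleep")
        int prog(void *ctx)
        {
                return handle_ctx(ctx);
        }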
      
      Note the somewhat unnatural way the check is done after processing all the
      arguments. This is done to avoid a conflict between the bpf and bpf-next
      trees. Once trees converge, a small follow up patch will place a simple
      btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
      (which bpf-next tree patch refactored already), removing duplicated
      arg:ctx detection logic.
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract bpf_ctx_convert_map logic and make it more reusable · 66967a32
      Andrii Nakryiko authored
      
      Refactor btf_get_prog_ctx_type() a bit to allow reuse of
      bpf_ctx_convert_map logic in more than one place. Simplify the interface by
      returning btf_type instead of btf_member (field reference in BTF).
      
      To do the above we need to touch and start untangling
      btf_translate_to_vmlinux() implementation. We do the bare minimum to
      not regress anything for btf_translate_to_vmlinux(), but its
      implementation is very questionable for what it claims to be doing.
      Mapping kfunc argument types to kernel corresponding types conceptually
      is quite different from recognizing program context types. Fixing this
      is out of scope for this change though.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. Jan 17, 2024
  10. Jan 16, 2024
    • bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS · 22c7fa17
      Hao Sun authored
      
      For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off
      for validation. However, variable offset ptr alu is not prohibited
      for this ptr kind. So the variable offset is not checked.
      
      The following prog is accepted:
      
        func#0 @0
        0: R1=ctx() R10=fp0
        0: (bf) r6 = r1                       ; R1=ctx() R6_w=ctx()
        1: (79) r7 = *(u64 *)(r6 +144)        ; R6_w=ctx() R7_w=flow_keys()
        2: (b7) r8 = 1024                     ; R8_w=1024
        3: (37) r8 /= 1                       ; R8_w=scalar()
        4: (57) r8 &= 1024                    ; R8_w=scalar(smin=smin32=0,
        smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
        5: (0f) r7 += r8
        mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
        mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
        mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
        mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
        6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
        =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
        var_off=(0x0; 0x400))
        6: (79) r0 = *(u64 *)(r7 +0)          ; R0_w=scalar()
        7: (95) exit
      
      This prog loads flow_keys into r7, adds the variable offset r8
      to r7, and finally causes an out-of-bounds access:
      
        BUG: unable to handle page fault for address: ffffc90014c80038
        [...]
        Call Trace:
         <TASK>
         bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
         __bpf_prog_run include/linux/filter.h:651 [inline]
         bpf_prog_run include/linux/filter.h:658 [inline]
         bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
         bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
         bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
         bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
         __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
         __do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
         __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Fix this by rejecting ptr alu with variable offset on flow_keys.
      Applying the patch rejects the program with "R7 pointer arithmetic
      on flow_keys prohibited".
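
      A sketch of where the rejection lands in the verifier's pointer
      arithmetic path, adjust_ptr_min_max_vals() (placement assumed):

        switch (ptr_reg->type) {
        case PTR_TO_FLOW_KEYS:
                if (known)
                        break;          /* fixed offsets remain allowed */
                fallthrough;
        ...
        default:
                verbose(env, "R%d pointer arithmetic on %s prohibited\n",
                        dst, reg_type_str(env, ptr_reg->type));
                return -EACCES;
        }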
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
    • sched/fair: Fix frequency selection for non-invariant case · e37617c8
      Vincent Guittot authored and Ingo Molnar committed
      
      Linus reported a ~50% performance regression on single-threaded
      workloads on his AMD Ryzen system, and bisected it to:
      
        9c0b4bb7 ("sched/cpufreq: Rework schedutil governor performance estimation")
      
      When frequency invariance is not enabled, get_capacity_ref_freq(policy)
      is supposed to return the current frequency, and the performance margin
      applied by map_util_perf() enables the utilization to go above the
      maximum compute capacity and to select a higher frequency than the
      current one.
      
      After the changes in 9c0b4bb7, the performance margin was applied
      earlier in the path to take utilization clamping into account, so the
      utilization could no longer exceed the maximum compute capacity and the
      CPU remained 'stuck' at lower frequencies.
      
      To fix this, we must use a frequency above the current frequency to
      get a chance to select a higher OPP when the current one becomes fully used.
      Apply the same margin and return a frequency 25% higher than the current
      one in order to switch to the next OPP before we fully use the CPU
      at the current one.
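
      A sketch of the fixed non-invariant branch of get_capacity_ref_freq()
      (shape of the fix; the invariant branches are unchanged):

        /* Apply the usual ~25% margin to the current frequency so the
         * next OPP can be selected before this one is fully used. */
        return policy->cur + (policy->cur >> 2);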
      
      [ mingo: Clarified the changelog. ]
      
      Fixes: 9c0b4bb7 ("sched/cpufreq: Rework schedutil governor performance estimation")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Wyes Karny <wkarny@gmail.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Wyes Karny <wkarny@gmail.com>
      Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
  11. Jan 15, 2024
    • dma-debug: fix kernel-doc warnings · 7c65aa3c
      Randy Dunlap authored
      
      Update the kernel-doc comments to catch up with the code changes and
      fix the kernel-doc warnings:
      
      debug.c:83: warning: Excess struct member 'stacktrace' description in 'dma_debug_entry'
      debug.c:83: warning: Function parameter or struct member 'stack_len' not described in 'dma_debug_entry'
      debug.c:83: warning: Function parameter or struct member 'stack_entries' not described in 'dma_debug_entry'
      
      Fixes: 746017ed ("dma/debug: Simplify stracktrace retrieval")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: iommu@lists.linux.dev
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  12. Jan 12, 2024
    • kernel/crash_core.c: make __crash_hotplug_lock static · 4e87ff59
      Andrew Morton authored
      
      sparse warnings:
      kernel/crash_core.c:749:1: sparse: sparse: symbol '__crash_hotplug_lock' was not declared. Should it be static?
      
      Fixes: e2a8f20d ("Crash: add lock to serialize crash hotplug handling")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202401080654.IjjU5oK7-lkp@intel.com/
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kexec: do syscore_shutdown() in kernel_kexec · 7bb94380
      James Gowans authored
      syscore_shutdown() runs driver and module callbacks to get the system into
      a state where it can be correctly shut down.  In commit 6f389a8f ("PM
      / reboot: call syscore_shutdown() after disable_nonboot_cpus()")
      syscore_shutdown() was removed from kernel_restart_prepare() and hence got
      (incorrectly?) removed from the kexec flow.  This was innocuous until
      commit 6735150b ("KVM: Use syscore_ops instead of reboot_notifier to
      hook restart/shutdown") changed the way that KVM registered its shutdown
      callbacks, switching from reboot notifiers to syscore_ops.shutdown.  As
      syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
      and virtualisation is left enabled on the boot CPU which results in triple
      faults when switching to the new kernel on Intel x86 VT-x with VMXE
      enabled.
      
      Fix this by adding syscore_shutdown() to the kexec sequence.  In terms of
      where to add it, it is being added after migrating the kexec task to the
      boot CPU, but before APs are shut down.  It is not totally clear if this
      is the best place: in commit 6f389a8f ("PM / reboot: call
      syscore_shutdown() after disable_nonboot_cpus()") it is stated that
      "syscore_ops operations should be carried with one CPU on-line and
      interrupts disabled." APs are only offlined later in machine_shutdown(),
      so this syscore_shutdown() is being run while APs are still online.  This
      seems to be the correct place as it matches where syscore_shutdown() is
      run in the reboot and halt flows - they also run it before APs are shut
      down.  The assumption is that the commit message in commit 6f389a8f
      ("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
      no longer valid.
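
      In code-flow terms, the call lands roughly here in kernel_kexec()
      (a sketch of the sequence, not the literal diff):

        migrate_to_reboot_cpu();        /* kexec task now on the boot CPU */
        syscore_shutdown();             /* run syscore_ops.shutdown hooks, e.g. kvm_shutdown */
        machine_shutdown();             /* offlines the remaining APs */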
      
      KVM has been discussed here as it is what broke loudly by not having
      syscore_shutdown() in kexec, but this change impacts more than just KVM;
      all drivers/modules which register a syscore_ops.shutdown callback will
      now be invoked in the kexec flow.  Looking at some of them, like x86 MCE,
      it is probably more correct to also shut these down during kexec.
      Maintainers of all drivers which use syscore_ops.shutdown are added on CC
      for visibility.  They are:
      
      arch/powerpc/platforms/cell/spu_base.c  .shutdown = spu_shutdown,
      arch/x86/kernel/cpu/mce/core.c	        .shutdown = mce_syscore_shutdown,
      arch/x86/kernel/i8259.c                 .shutdown = i8259A_shutdown,
      drivers/irqchip/irq-i8259.c	        .shutdown = i8259A_shutdown,
      drivers/irqchip/irq-sun6i-r.c	        .shutdown = sun6i_r_intc_shutdown,
      drivers/leds/trigger/ledtrig-cpu.c	.shutdown = ledtrig_cpu_syscore_shutdown,
      drivers/power/reset/sc27xx-poweroff.c	.shutdown = sc27xx_poweroff_shutdown,
      kernel/irq/generic-chip.c	        .shutdown = irq_gc_shutdown,
      virt/kvm/kvm_main.c	                .shutdown = kvm_shutdown,
      
      This has been tested by doing a kexec on x86_64 and aarch64.
      
      Link: https://lkml.kernel.org/r/20231213064004.2419447-1-jgowans@amazon.com
      Fixes: 6735150b ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
      Signed-off-by: James Gowans <jgowans@amazon.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Chen-Yu Tsai <wens@csie.org>
      Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
      Cc: Samuel Holland <samuel@sholland.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Orson Zhai <orsonzhai@gmail.com>
      Cc: Alexander Graf <graf@amazon.de>
      Cc: Jan H. Schoenherr <jschoenh@amazon.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kdump: defer the insertion of crashkernel resources · 4a693ce6
      Huacai Chen authored
      In /proc/iomem, sub-regions should be inserted after their parent,
      otherwise the insertion of the parent resource fails.  But after the
      generic crashkernel reservation was applied, on both RISC-V and ARM64
      (LoongArch will also use the generic reservation later on), crashkernel
      resources are inserted before their parent, which causes the parent to
      disappear from /proc/iomem.  So we defer the insertion of crashkernel
      resources to an early_initcall(), as sketched below.
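
      A sketch of the deferral (the shape of the fix in kernel/crash_core.c;
      the exact validity checks may differ):

        static int __init insert_crashkernel_resources(void)
        {
                if (crashk_res.start < crashk_res.end)
                        insert_resource(&iomem_resource, &crashk_res);

                if (crashk_low_res.start < crashk_low_res.end)
                        insert_resource(&iomem_resource, &crashk_low_res);

                return 0;
        }
        early_initcall(insert_crashkernel_resources);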
      
      1. Without 'crashkernel' parameter:
      
       100d0100-100d01ff : LOON0001:00
         100d0100-100d01ff : LOON0001:00 LOON0001:00
       100e0000-100e0bff : LOON0002:00
         100e0000-100e0bff : LOON0002:00 LOON0002:00
       1fe001e0-1fe001e7 : serial
       90400000-fa17ffff : System RAM
         f6220000-f622ffff : Reserved
         f9ee0000-f9ee3fff : Reserved
         fa120000-fa17ffff : Reserved
       fa190000-fe0bffff : System RAM
         fa190000-fa1bffff : Reserved
       fe4e0000-47fffffff : System RAM
         43c000000-441ffffff : Reserved
         47ff98000-47ffa3fff : Reserved
         47ffa4000-47ffa7fff : Reserved
         47ffa8000-47ffabfff : Reserved
         47ffac000-47ffaffff : Reserved
         47ffb0000-47ffb3fff : Reserved
      
      2. With 'crashkernel' parameter, before this patch:
      
       100d0100-100d01ff : LOON0001:00
         100d0100-100d01ff : LOON0001:00 LOON0001:00
       100e0000-100e0bff : LOON0002:00
         100e0000-100e0bff : LOON0002:00 LOON0002:00
       1fe001e0-1fe001e7 : serial
       e6200000-f61fffff : Crash kernel
       fa190000-fe0bffff : System RAM
         fa190000-fa1bffff : Reserved
       fe4e0000-47fffffff : System RAM
         43c000000-441ffffff : Reserved
         47ff98000-47ffa3fff : Reserved
         47ffa4000-47ffa7fff : Reserved
         47ffa8000-47ffabfff : Reserved
         47ffac000-47ffaffff : Reserved
         47ffb0000-47ffb3fff : Reserved
      
      3. With 'crashkernel' parameter, after this patch:
      
       100d0100-100d01ff : LOON0001:00
         100d0100-100d01ff : LOON0001:00 LOON0001:00
       100e0000-100e0bff : LOON0002:00
         100e0000-100e0bff : LOON0002:00 LOON0002:00
       1fe001e0-1fe001e7 : serial
       90400000-fa17ffff : System RAM
         e6200000-f61fffff : Crash kernel
         f6220000-f622ffff : Reserved
         f9ee0000-f9ee3fff : Reserved
         fa120000-fa17ffff : Reserved
       fa190000-fe0bffff : System RAM
         fa190000-fa1bffff : Reserved
       fe4e0000-47fffffff : System RAM
         43c000000-441ffffff : Reserved
         47ff98000-47ffa3fff : Reserved
         47ffa4000-47ffa7fff : Reserved
         47ffa8000-47ffabfff : Reserved
         47ffac000-47ffaffff : Reserved
         47ffb0000-47ffb3fff : Reserved
      
      Link: https://lkml.kernel.org/r/20231229080213.2622204-1-chenhuacai@loongson.cn
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
      Fixes: 0ab97169 ("crash_core: add generic function to do reservation")
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. Jan 10, 2024
  14. Jan 09, 2024
  15. Jan 08, 2024
  16. Jan 05, 2024
  17. Jan 04, 2024
  18. Jan 03, 2024
    • bpf: Simplify checking size of helper accesses · 8a021e7f
      Andrei Matei authored
      
      This patch simplifies the verification of size arguments associated with
      pointer arguments to helpers and kfuncs. Many helpers take a pointer
      argument followed by the size of the memory access to be performed
      through that pointer. Before this patch, the handling of the size
      argument in check_mem_size_reg() was confusing and wasteful: if the
      size register's lower bound was 0, then the verification was done twice:
      once considering the size of the access to be the lower-bound of the
      respective argument, and once considering the upper bound (even if the
      two are the same). The upper-bound checking is a super-set of the
      lower-bound checking(*), with one exception: the only point of the
      lower-bound check is to handle the case where zero-sized accesses are
      explicitly not allowed and the lower bound is zero. This static condition is now
      checked explicitly, replacing a much more complex, expensive and
      confusing verification call to check_helper_mem_access().
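
      A sketch of the simplified logic in check_mem_size_reg() (paraphrased;
      message text abridged):

        /* Reject the statically-known bad case up front... */
        if (reg->umin_value == 0 && !zero_size_allowed) {
                verbose(env, "R%d invalid zero-sized read\n", regno);
                return -EACCES;
        }

        /* ...after which a single check against the upper bound suffices. */
        err = check_helper_mem_access(env, regno - 1, reg->umax_value,
                                      zero_size_allowed, meta);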
      
      Error messages change in this patch. Before, messages about illegal
      zero-size accesses depended on the type of the pointer and on other
      conditions, and sometimes the message was plain wrong: in some tests
      that changed you'll see that the old message was something like "R1 min
      value is outside of the allowed memory range", where R1 is the pointer
      register; the error was wrongly claiming that the pointer was bad
      instead of the size being bad. Other times the information that the size
      came for a register with a possible range of values was wrong, and the
      error presented the size as a fixed zero. Now the errors refer to the
      right register. However, the old error messages did contain useful
      information about the pointer register which is now lost; recovering
      this information was deemed not important enough.
      
      (*) Besides standing to reason that the checks for a bigger size access
      are a super-set of the checks for a smaller size access, I have also
      mechanically verified this by reading the code for all types of
      pointers. I could convince myself that it's true for all but
      PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
      line-by-line does not immediately prove what we want. If anyone has any
      qualms, let me know.
      
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
    • async: Introduce async_schedule_dev_nocall() · 7d4b5d7a
      Rafael J. Wysocki authored
      
      In preparation for subsequent changes, introduce a specialized variant
      of async_schedule_dev() that will not invoke the argument function
      synchronously when it cannot be scheduled for asynchronous execution.
      
      The new function, async_schedule_dev_nocall(), will be used for fixing
      possible deadlocks in the system-wide power management core code.
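
      A hypothetical caller-side sketch (assuming the new variant returns
      whether the function was actually queued; async_func_t takes a data
      pointer and a cookie):

        if (!async_schedule_dev_nocall(async_resume, dev)) {
                /* Not queued: the caller picks the fallback itself, e.g.
                 * run synchronously outside the problematic locking. */
                async_resume(dev, 0);
        }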
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series.
      Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
      Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>