  1. Oct 03, 2024
    • tracing/hwlat: Fix a race during cpuhp processing · 2a13ca2e
      Wei Li authored
      In theory, the cpuhp online/offline processing race also exists in the
      percpu-mode hwlat tracer, so apply the same fix there. That is:
      
          T1                       | T2
          [CPUHP_ONLINE]           | cpu_device_down()
           hwlat_hotplug_workfn()  |
                                   |     cpus_write_lock()
                                   |     takedown_cpu(1)
                                   |     cpus_write_unlock()
          [CPUHP_OFFLINE]          |
              cpus_read_lock()     |
              start_kthread(1)     |
              cpus_read_unlock()   |
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
      
      
      Fixes: ba998f7d ("trace/hwlat: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Fix a race during cpuhp processing · 829e0c9f
      Wei Li authored
      Another exception was found: the "timerlat/1" thread was scheduled on
      CPU0, which eventually led to timer corruption:
      
      ```
      ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
      WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
      Modules linked in:
      CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:debug_print_object+0x7d/0xb0
      ...
      Call Trace:
       <TASK>
       ? __warn+0x7c/0x110
       ? debug_print_object+0x7d/0xb0
       ? report_bug+0xf1/0x1d0
       ? prb_read_valid+0x17/0x20
       ? handle_bug+0x3f/0x70
       ? exc_invalid_op+0x13/0x60
       ? asm_exc_invalid_op+0x16/0x20
       ? debug_print_object+0x7d/0xb0
       ? debug_print_object+0x7d/0xb0
       ? __pfx_timerlat_irq+0x10/0x10
       __debug_object_init+0x110/0x150
       hrtimer_init+0x1d/0x60
       timerlat_main+0xab/0x2d0
       ? __pfx_timerlat_main+0x10/0x10
       kthread+0xb7/0xe0
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x2d/0x40
       ? __pfx_kthread+0x10/0x10
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      ```
      
      After tracing the scheduling event, it was discovered that the migration
      of the "timerlat/1" thread was performed during thread creation. Further
      analysis confirmed that it is because the CPU online processing for
      osnoise is implemented through workers, which is asynchronous with the
      offline processing. When the worker was scheduled to create a thread, the
      CPU may have already been removed from the cpu_online_mask during the offline
      process, resulting in the inability to select the right CPU:
      
      T1                       | T2
      [CPUHP_ONLINE]           | cpu_device_down()
      osnoise_hotplug_workfn() |
                               |     cpus_write_lock()
                               |     takedown_cpu(1)
                               |     cpus_write_unlock()
      [CPUHP_OFFLINE]          |
          cpus_read_lock()     |
          start_kthread(1)     |
          cpus_read_unlock()   |
      
      To fix this, skip online processing if the CPU is already offline.
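
The check described above can be sketched in user space as follows. This is only a model under stated assumptions: `cpu_online_model`, `kthread_running`, and `hotplug_work_model()` are hypothetical stand-ins for cpu_online_mask, the per-CPU kthread state, and the deferred online worker, and are not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-ins for cpu_online_mask and the per-CPU kthread state. */
static bool cpu_online_model[NR_CPUS];
static bool kthread_running[NR_CPUS];

/* Models the deferred online worker; returns true if a thread was started. */
static bool hotplug_work_model(unsigned int cpu)
{
	/* The CPU may have gone offline between queueing and running the work. */
	if (cpu >= NR_CPUS || !cpu_online_model[cpu])
		return false;

	kthread_running[cpu] = true;
	return true;
}
```

Because the worker runs asynchronously, the online state has to be re-checked at the time the work executes rather than at the time it was queued.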
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
      
      
      Fixes: c8895e27 ("trace/osnoise: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Drop interface_lock in stop_kthread() · b484a02c
      Wei Li authored
      stop_kthread() is the offline callback for "trace/osnoise:online". Since
      commit 5bfbcd1e ("tracing/timerlat: Add interface_lock around clearing
      of kthread in stop_kthread()"), the following ABBA deadlock scenario has
      been possible:
      
      T1                            | T2 [BP]               | T3 [AP]
      osnoise_hotplug_workfn()      | work_for_cpu_fn()     | cpuhp_thread_fun()
                                    |   _cpu_down()         |   osnoise_cpu_die()
        mutex_lock(&interface_lock) |                       |     stop_kthread()
                                    |     cpus_write_lock() |       mutex_lock(&interface_lock)
        cpus_read_lock()            |     cpuhp_kick_ap()   |
      
      As the interface_lock here is just for protecting the "kthread" field of
      the osn_var, use xchg() instead to fix this issue. Also go back to using
      for_each_online_cpu() in stop_per_cpu_kthreads(), as it can take
      cpus_read_lock() again.
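
The idea of replacing the lock with an atomic exchange can be illustrated with C11 atomics. This is a user-space sketch, not the kernel's xchg(); `task_model`, `osn_var_model`, and `claim_kthread()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct task_model { int id; };   /* hypothetical stand-in for task_struct */

struct osn_var_model {
	_Atomic(struct task_model *) kthread;
};

/*
 * Atomically take the pointer and leave NULL behind. If two callers race,
 * exactly one of them receives the non-NULL pointer, so no mutex is needed
 * just to decide who stops the thread.
 */
static struct task_model *claim_kthread(struct osn_var_model *var)
{
	return atomic_exchange(&var->kthread, (struct task_model *)NULL);
}
```

Since no lock is taken on this path, the offline callback no longer takes part in a lock-ordering cycle with the CPU hotplug lock.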
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com
      
      
      Fixes: 5bfbcd1e ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline · 0bb0a5c1
      Wei Li authored
      osnoise_hotplug_workfn() is the asynchronous online callback for
      "trace/osnoise:online". Its work may pile up when a CPU goes online and
      offline repeatedly, so it can be invoked multiple times after a single
      online event.
      
      This will lead to kthread leak and timer corruption. Add a check
      in start_kthread() to prevent this situation.
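
A minimal sketch of such a guard, assuming a per-CPU slot that records whether a thread already exists. `kthread_slot`, `threads_created`, and `start_kthread_model()` are hypothetical names for this model and do not reproduce the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int kthread_slot[NR_CPUS];   /* 0 means "no thread" in this model */
static int threads_created;

/* Returns true only when a new thread was created for this CPU. */
static bool start_kthread_model(unsigned int cpu)
{
	if (cpu >= NR_CPUS)
		return false;

	/* A repeated online callback finds the slot occupied and backs off. */
	if (kthread_slot[cpu] != 0)
		return false;

	kthread_slot[cpu] = ++threads_created;
	return true;
}
```

With the check in place, a second invocation for the same CPU is a no-op instead of overwriting the slot and leaking the first thread.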
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
      
      
      Fixes: c8895e27 ("trace/osnoise: Support hotplug operations")
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • tracing: Fix trace_check_vprintf() when tp_printk is used · 50a3242d
      Steven Rostedt authored
      When the tp_printk kernel command line is used, the trace events go
      directly to printk(). It is still checked via the trace_check_vprintf()
      function to make sure the pointers of the trace event are legit.
      
      The addition of reading buffers from previous boots required adding a
      delta between the addresses of the previous boot and the current boot so
      that the pointers in the old buffer can still be used. But this required
      adding a trace_array pointer to acquire the delta offsets.
      
      The tp_printk code does not provide a trace_array (tr) pointer, so when
      the offsets were examined, a NULL pointer dereference happened and the
      kernel crashed.
      
      If the trace_array does not exist, just default the delta offsets to zero,
      as that also means the trace event is not being read from a previous boot.
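
The defaulting described above amounts to a NULL-safe accessor. The sketch below is an illustration only; `trace_array_model`, its fields, and the helper names are assumptions made for this example rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-instance address deltas. */
struct trace_array_model {
	long text_delta;
	long data_delta;
};

/* With no trace_array (the tp_printk case), the event is from this boot. */
static long text_delta_of(const struct trace_array_model *tr)
{
	return tr ? tr->text_delta : 0;
}

static long data_delta_of(const struct trace_array_model *tr)
{
	return tr ? tr->data_delta : 0;
}
```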
      
      Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
      
      
      Fixes: 07714b4b ("tracing: Handle old buffer mappings for event strings and functions")
      Reported-by: Alison Schofield <alison.schofield@intel.com>
      Tested-by: Alison Schofield <alison.schofield@intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  2. Sep 27, 2024
    • [tree-wide] finally take no_llseek out · cb787f4a
      Al Viro authored
      
      no_llseek had been defined to NULL two years ago, in commit 868941b1
      ("fs: remove no_llseek")
      
      To quote that commit,
      
        At -rc1 we'll need do a mechanical removal of no_llseek -
      
        git grep -l -w no_llseek | grep -v porting.rst | while read i; do
      	sed -i '/\<no_llseek\>/d' $i
        done
      
        would do it.
      
      Unfortunately, that hadn't been done.  Linus, could you do that now, so
      that we could finally put that thing to rest? All instances are of the
      form
      	.llseek = no_llseek,
      so it's obviously safe.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Sep 26, 2024
  4. Sep 25, 2024
  5. Sep 23, 2024
    • sched_ext: Provide a sysfs enable_seq counter · 431844b6
      Andrea Righi authored
      
      As discussed during the distro-centric session within the sched_ext
      Microconference at LPC 2024, introduce a sequence counter that is
      incremented every time a BPF scheduler is loaded.
      
      This feature can help distributions in diagnosing potential performance
      regressions by identifying systems where users are running (or have run)
      custom BPF schedulers.
      
      Example:
      
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       0
       arighi@virtme-ng~> sudo scx_simple
       local=1 global=0
       ^CEXIT: unregistered from user space
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       1
      
      In this way user-space tools (such as Ubuntu's apport and similar) are
      able to gather and include this information in bug reports.
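
A user-space tool could read the counter along these lines. This is a sketch, not code from any existing tool: the helper names are hypothetical, and the only thing taken from the commit is the path /sys/kernel/sched_ext/enable_seq and its plain decimal content.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text content of the counter; returns -1 on malformed input. */
static long parse_enable_seq(const char *text)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (errno != 0 || end == text || val < 0)
		return -1;
	return val;
}

/* Read the counter from sysfs; returns -1 if the file is missing or unreadable. */
static long read_enable_seq(const char *path)
{
	char buf[32];
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		val = parse_enable_seq(buf);
	fclose(f);
	return val;
}
```

A bug-reporting tool might call `read_enable_seq("/sys/kernel/sched_ext/enable_seq")` and treat any value greater than zero as a hint that a BPF scheduler has been loaded since boot.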
      
      Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
      Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
      Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
      Cc: Phil Auld <pauld@redhat.com>
      Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Fix build when !CONFIG_STACKTRACE · 62d3726d
      Tejun Heo authored
      
      a2f4b16e ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
      fixing the build when !CONFIG_STACKTRACE but didn't do so fully. Also put
      stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
      the build when !CONFIG_STACKTRACE.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
    • sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled() · edf1c586
      Pat Somaru authored
      
      Disable the rq empty path when scx is enabled. SCX must consult the BPF
      scheduler (via the dispatch path in balance) to determine if rq is empty.
      
      This fixes stalls when scx is enabled.
      
      Signed-off-by: Pat Somaru <patso@likewhatevs.io>
      Fixes: 3dcac251 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT · 7ebd84d6
      Yu Liao authored
      
      When building with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
      the idle member is not defined:
      
      kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
        3701 |         if (!tg->idle)
             |                ^~
      
      Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
      
      tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Add dummy version of sched_group_set_idle() · bdeb868c
      Yu Liao authored
      
      Fix the following error when building with CONFIG_GROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED:
      
      kernel/sched/core.c:9634:15: error: implicit declaration of function
      'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
        9634 |         ret = sched_group_set_idle(css_tg(css), idle);
             |               ^~~~~~~~~~~~~~~~~~~~
             |               scx_group_set_idle
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • dma-mapping: report unlimited DMA addressing in IOMMU DMA path · b348b6d1
      Leon Romanovsky authored
      While using the IOMMU DMA path, the dma_addressing_limited() function
      checks the ops struct, which doesn't exist in the IOMMU case. This causes
      a kernel panic while loading the AMDGPU driver.
      
      BUG: kernel NULL pointer dereference, address: 00000000000000a0
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 10 UID: 0 PID: 611 Comm: (udev-worker) Tainted: G                T  6.11.0-clang-07154-g726e2d0cf2bb #257
      Tainted: [T]=RANDSTRUCT
      Hardware name: ASUS System Product Name/ROG STRIX Z690-G GAMING WIFI, BIOS 3701 07/03/2024
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x65/0xc0
       ? page_fault_oops+0x3b9/0x450
       ? _prb_read_valid+0x212/0x390
       ? do_user_addr_fault+0x608/0x680
       ? exc_page_fault+0x4e/0xa0
       ? asm_exc_page_fault+0x26/0x30
       ? dma_addressing_limited+0x53/0xa0
       amdgpu_ttm_init+0x56/0x4b0 [amdgpu]
       gmc_v8_0_sw_init+0x561/0x670 [amdgpu]
       amdgpu_device_ip_init+0xf5/0x570 [amdgpu]
       amdgpu_device_init+0x1a57/0x1ea0 [amdgpu]
       ? _raw_spin_unlock_irqrestore+0x1a/0x40
       ? pci_conf1_read+0xc0/0xe0
       ? pci_bus_read_config_word+0x52/0xa0
       amdgpu_driver_load_kms+0x15/0xa0 [amdgpu]
       amdgpu_pci_probe+0x1b7/0x4c0 [amdgpu]
       pci_device_probe+0x1c5/0x260
       really_probe+0x130/0x470
       __driver_probe_device+0x77/0x150
       driver_probe_device+0x19/0x120
       __driver_attach+0xb1/0x1e0
       ? __cfi___driver_attach+0x10/0x10
       bus_for_each_dev+0x115/0x170
       bus_add_driver+0x192/0x2d0
       driver_register+0x5c/0xf0
       ? __cfi_init_module+0x10/0x10 [amdgpu]
       do_one_initcall+0x128/0x380
       ? idr_alloc_cyclic+0x139/0x1d0
       ? security_kernfs_init_security+0x42/0x140
       ? __kernfs_new_node+0x1be/0x250
       ? sysvec_apic_timer_interrupt+0xb6/0xc0
       ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
       ? _raw_spin_unlock+0x11/0x30
       ? free_unref_page+0x283/0x650
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? load_module+0xf2e/0x1130
       ? __kmalloc_cache_noprof+0x12a/0x2e0
       do_init_module+0x7d/0x240
       __se_sys_init_module+0x19e/0x220
       do_syscall_64+0x8a/0x150
       ? __irq_exit_rcu+0x5e/0x100
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x7fe6bb5980ee
      Code: 48 8b 0d 3d ed 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0a ed 12 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd462219d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
      RAX: ffffffffffffffda RBX: 0000556caf0d0670 RCX: 00007fe6bb5980ee
      RDX: 0000556caf0d3080 RSI: 0000000002893458 RDI: 00007fe6b3400010
      RBP: 0000000000020000 R08: 0000000000020010 R09: 0000000000000080
      R10: c26073c166186e00 R11: 0000000000000206 R12: 0000556caf0d3430
      R13: 0000556caf0d0670 R14: 0000556caf0d3080 R15: 0000556caf0ce700
       </TASK>
      Modules linked in: amdgpu(+) i915(+) drm_suballoc_helper intel_gtt drm_exec drm_buddy iTCO_wdt i2c_algo_bit intel_pmc_bxt drm_display_helper iTCO_vendor_support gpu_sched drm_ttm_helper cec ttm amdxcp video backlight pinctrl_alderlake nct6775 hwmon_vid nct6775_core coretemp
      CR2: 00000000000000a0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219292
      
      
      Reported-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
  6. Sep 22, 2024
  7. Sep 20, 2024
    • crash: Fix riscv64 crash memory reserve dead loop · b3f835cd
      Jinjie Ruan authored
      
      On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high"
      will cause system stall as below:
      
      	 Zone ranges:
      	   DMA32    [mem 0x0000000080000000-0x000000009fffffff]
      	   Normal   empty
      	 Movable zone start for each node
      	 Early memory node ranges
      	   node   0: [mem 0x0000000080000000-0x000000008005ffff]
      	   node   0: [mem 0x0000000080060000-0x000000009fffffff]
      	 Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
      	(stall here)
      
      Commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
      bug") fixed this on 32-bit architectures. However, the problem is not
      completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on a 64-bit
      architecture, for example, when system memory is equal to
      CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur:
      
      	-> reserve_crashkernel_generic() and high is true
      	   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
      	      -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
      	         (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).
      
      As Catalin suggested, do not remove the ",high" reservation fallback to
      ",low" logic which will change arm64's kdump behavior, but fix it by
      skipping the above situation similar to commit d2f32f23190b ("crash: fix
      x86_32 crash memory reserve dead loop").
      
      After this patch, it prints:
      	cannot allocate crashkernel (size:0x1f400000)
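
The termination condition can be illustrated with a simplified user-space model of the ",high" retry logic. This is not the actual patch: `reserve_high_model()`, `alloc_fn`, and the two allocator stubs are hypothetical, and the model only shows why the fallback must not repeat when the low and high limits coincide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical allocator hook: true if [base, end) can hold the reservation. */
typedef bool (*alloc_fn)(unsigned long long base, unsigned long long end);

static bool always_fail(unsigned long long base, unsigned long long end)
{
	(void)base; (void)end;
	return false;
}

static bool low_only(unsigned long long base, unsigned long long end)
{
	(void)end;
	return base == 0;   /* succeeds only for the low-memory search */
}

/*
 * Try high memory first, then fall back to low memory exactly once.
 * Returns the number of attempts on success, or its negative on failure.
 */
static int reserve_high_model(unsigned long long low_max,
			      unsigned long long high_max, alloc_fn alloc)
{
	unsigned long long base = low_max, end = high_max;
	int attempts = 0;

	for (;;) {
		attempts++;
		if (alloc(base, end))
			return attempts;
		/* Fall back only if the search that just failed was the high one. */
		if (base == low_max && base != 0) {
			base = 0;
			end = low_max;
			continue;
		}
		return -attempts;
	}
}
```

Keying the fallback on the search base rather than the search end is what makes the loop finish when both limits are equal: after the low-memory attempt, the base is 0 and no further retry is made.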
      
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Link: https://lore.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com
      
      
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
  8. Sep 17, 2024
    • uprobes: turn xol_area->pages[2] into xol_area->page · 2abbcc09
      Oleg Nesterov authored
      Now that xol_mapping has its own ->fault() method, we no longer need
      xol_area->pages[1] == NULL; a single page is enough.
      
      Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com
      
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • uprobes: introduce the global struct vm_special_mapping xol_mapping · 6d27a31e
      Oleg Nesterov authored
      Currently each xol_area has its own instance of vm_special_mapping, this
      is suboptimal and ugly.  Kill xol_area->xol_mapping and add a single
      global instance of vm_special_mapping, the ->fault() method can use
      area->pages rather than xol_mapping->pages.
      
      As a side effect this fixes the problem introduced by the recent commit
      223febc6 ("mm: add optional close() to struct vm_special_mapping"), if
      special_mapping_close() is called from the __mmput() paths, it will use
      vma->vm_private_data = &area->xol_mapping freed by uprobe_clear_state().
      
      Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com
      
      
      Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sven Schnelle <svens@linux.ibm.com>
      Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/
      
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "uprobes: use vm_special_mapping close() functionality" · ed8d5b0c
      Oleg Nesterov authored
      This reverts commit 08e28de1.
      
      A malicious application can munmap() its "[uprobes]" vma and in this case
      xol_mapping.close == uprobe_clear_state() will free the memory which can
      be used by another thread, or the same thread when it hits the uprobe bp
      afterwards.
      
      Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com
      
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource, kunit: add test case for region_intersects() · 99185c10
      Huang Ying authored
      Patch series "resource: Fix region_intersects() vs
      add_memory_driver_managed()", v3.
      
      The patchset fixes a bug of region_intersects() for systems with CXL
      memory.  The details of the bug can be found in [1/3].  To avoid similar
      bugs in the future, a kunit test case for region_intersects() is added in
      [3/3].  [2/3] is a preparation patch for [3/3].
      
      
      This patch (of 3):
      
      region_intersects() is important because it's used for /dev/mem permission
      checking.  To avoid possible bugs in region_intersects() in the future, a
      kunit test case for region_intersects() is added.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource: make alloc_free_mem_region() works for iomem_resource · bacf9c3c
      Huang Ying authored
      While developing a kunit test case for region_intersects(), some fake
      resources need to be inserted into iomem_resource.  To do that, a resource
      hole needs to be found first in iomem_resource.

      However, alloc_free_mem_region() cannot work for iomem_resource now,
      because gfr_continue() cannot tell a start address of 0 apart from an
      address that has wrapped around to 0, while iomem_resource.start == 0.
      To make alloc_free_mem_region() work for iomem_resource, gfr_start() is
      changed to avoid returning 0 even if base->start == 0.  We don't need to
      check 0 as a start address.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource: fix region_intersects() vs add_memory_driver_managed() · b4afe418
      Huang Ying authored
      On a system with CXL memory, the resource tree (/proc/iomem) related to
      CXL memory may look something like the following.
      
      490000000-50fffffff : CXL Window 0
        490000000-50fffffff : region0
          490000000-50fffffff : dax0.0
            490000000-50fffffff : System RAM (kmem)
      
      This is because drivers/dax/kmem.c calls add_memory_driver_managed() when
      onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
      Window X".  This confuses region_intersects(), which expects all "System
      RAM" resources to be at the top level of iomem_resource.  This can lead to
      bugs.
      
      For example, when the following command line is executed to write some
      memory in CXL memory range via /dev/mem,
      
       $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
       dd: error writing '/dev/mem': Bad address
       1+0 records in
       0+0 records out
       0 bytes copied, 0.0283507 s, 0.0 kB/s
      
      the command fails as expected.  However, the error code is wrong.  It
      should be "Operation not permitted" instead of "Bad address".  More
      seriously, the /dev/mem permission checking in devmem_is_allowed() passes
      incorrectly.  Although the access is prevented later because ioremap()
      isn't allowed to map system RAM, it is a potential security issue.  During
      command execution, the following warning is reported in the kernel log for
      calling ioremap() on system RAM.
      
       ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
       WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
       Call Trace:
        memremap+0xcb/0x184
        xlate_dev_mem_ptr+0x25/0x2f
        write_mem+0x94/0xfb
        vfs_write+0x128/0x26d
        ksys_write+0xac/0xfe
        do_syscall_64+0x9a/0xfd
        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      
      The details of the command execution process are as follows.  In the above
      resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
      top level resource.  So, region_intersects() will incorrectly report no
      System RAM resources in the CXL memory region, because it only checks the
      top level resources.  Consequently, devmem_is_allowed() will incorrectly
      return 1 (allow access via /dev/mem) for the CXL memory region.
      Fortunately, ioremap() doesn't allow mapping System RAM and rejects the
      access.
      
      So, region_intersects() needs to be fixed to work correctly with a
      resource tree in which "System RAM" is not at the top level, as above.  To
      fix it, if we find an unmatched resource at the top level, we continue to
      search for matched resources among its descendants.  So, we will not miss
      any matched resources in the resource tree anymore.
      
      In the new implementation, an example resource tree
      
      |------------- "CXL Window 0" ------------|
      |-- "System RAM" --|
      
      will behave similarly to the following fake resource tree for
      region_intersects(, IORESOURCE_SYSTEM_RAM, ),
      
      |-- "System RAM" --||-- "CXL Window 0a" --|
      
      Where "CXL Window 0a" is part of the original "CXL Window 0" that
      isn't covered by "System RAM".
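
The descendant search can be sketched with a small user-space resource tree. This is a simplified model under stated assumptions: `res_model`, `RES_SYSRAM`, and `overlaps_matching()` are hypothetical, and the function only answers whether any matching resource overlaps the range, whereas the kernel function also distinguishes mixed and disjoint results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RES_SYSRAM 0x1u   /* hypothetical flag standing in for IORESOURCE_SYSTEM_RAM */

struct res_model {
	unsigned long long start, end;   /* inclusive range */
	unsigned int flags;
	struct res_model *child, *sibling;
};

/* True if any resource carrying `flags` overlaps [start, end]. */
static bool overlaps_matching(const struct res_model *r,
			      unsigned long long start, unsigned long long end,
			      unsigned int flags)
{
	for (; r; r = r->sibling) {
		if (r->end < start || r->start > end)
			continue;   /* this subtree does not overlap the range */
		if ((r->flags & flags) == flags)
			return true;   /* matching resource found */
		/* Unmatched resource: keep looking among its descendants. */
		if (overlaps_matching(r->child, start, end, flags))
			return true;
	}
	return false;
}
```

Applied to the CXL tree shown earlier, the walk descends from "CXL Window 0" through "region0" and "dax0.0" until it reaches "System RAM (kmem)", which a top-level-only scan would never visit.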
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com
      
      
      Fixes: c221c0b0 ("device-dax: "Hotplug" persistent memory for use like normal RAM")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. Sep 13, 2024
  10. Sep 12, 2024