  1. Sep 27, 2024
    • [tree-wide] finally take no_llseek out · cb787f4a
      Al Viro authored
      
      no_llseek had been defined to NULL two years ago, in commit 868941b1
      ("fs: remove no_llseek").
      
      To quote that commit,
      
        At -rc1 we'll need do a mechanical removal of no_llseek -
      
        git grep -l -w no_llseek | grep -v porting.rst | while read i; do
      	sed -i '/\<no_llseek\>/d' $i
        done
      
        would do it.
      
      Unfortunately, that hadn't been done.  Linus, could you do that now, so
      that we could finally put that thing to rest? All instances are of the
      form
      	.llseek = no_llseek,
      so it's obviously safe.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Sep 26, 2024
  3. Sep 25, 2024
  4. Sep 23, 2024
    • sched_ext: Provide a sysfs enable_seq counter · 431844b6
      Andrea Righi authored
      
      As discussed during the distro-centric session within the sched_ext
      Microconference at LPC 2024, introduce a sequence counter that is
      incremented every time a BPF scheduler is loaded.
      
      This feature can help distributions in diagnosing potential performance
      regressions by identifying systems where users are running (or have run)
      custom BPF schedulers.
      
      Example:
      
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       0
       arighi@virtme-ng~> sudo scx_simple
       local=1 global=0
       ^CEXIT: unregistered from user space
       arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
       1
      
      In this way user-space tools (such as Ubuntu's apport and similar) are
      able to gather and include this information in bug reports.
      
      Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
      Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
      Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
      Cc: Phil Auld <pauld@redhat.com>
      Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Fix build when !CONFIG_STACKTRACE · 62d3726d
      Tejun Heo authored
      
      a2f4b16e ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
      to fix the build when !CONFIG_STACKTRACE but didn't do so fully. Also put
      stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
      the build when !CONFIG_STACKTRACE.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
    • sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled() · edf1c586
      Pat Somaru authored
      
      Disable the rq empty path when scx is enabled. SCX must consult the BPF
      scheduler (via the dispatch path in balance) to determine if rq is empty.
      
      This fixes stalls when scx is enabled.
      
      Signed-off-by: Pat Somaru <patso@likewhatevs.io>
      Fixes: 3dcac251 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT · 7ebd84d6
      Yu Liao authored
      
      When building with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
      the idle member is not defined:
      
      kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
        3701 |         if (!tg->idle)
             |                ^~
      
      Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
      
      tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Add dummy version of sched_group_set_idle() · bdeb868c
      Yu Liao authored
      
      Fix the following error when building with CONFIG_GROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED:
      
      kernel/sched/core.c:9634:15: error: implicit declaration of function
      'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
        9634 |         ret = sched_group_set_idle(css_tg(css), idle);
             |               ^~~~~~~~~~~~~~~~~~~~
             |               scx_group_set_idle
      
      Fixes: e179e80c ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/

      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • dma-mapping: report unlimited DMA addressing in IOMMU DMA path · b348b6d1
      Leon Romanovsky authored
      While using the IOMMU DMA path, the dma_addressing_limited() function
      checks the ops struct, which doesn't exist in the IOMMU case. This causes
      a kernel panic while loading the AMDGPU driver.
      
      BUG: kernel NULL pointer dereference, address: 00000000000000a0
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 10 UID: 0 PID: 611 Comm: (udev-worker) Tainted: G                T  6.11.0-clang-07154-g726e2d0cf2bb #257
      Tainted: [T]=RANDSTRUCT
      Hardware name: ASUS System Product Name/ROG STRIX Z690-G GAMING WIFI, BIOS 3701 07/03/2024
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x65/0xc0
       ? page_fault_oops+0x3b9/0x450
       ? _prb_read_valid+0x212/0x390
       ? do_user_addr_fault+0x608/0x680
       ? exc_page_fault+0x4e/0xa0
       ? asm_exc_page_fault+0x26/0x30
       ? dma_addressing_limited+0x53/0xa0
       amdgpu_ttm_init+0x56/0x4b0 [amdgpu]
       gmc_v8_0_sw_init+0x561/0x670 [amdgpu]
       amdgpu_device_ip_init+0xf5/0x570 [amdgpu]
       amdgpu_device_init+0x1a57/0x1ea0 [amdgpu]
       ? _raw_spin_unlock_irqrestore+0x1a/0x40
       ? pci_conf1_read+0xc0/0xe0
       ? pci_bus_read_config_word+0x52/0xa0
       amdgpu_driver_load_kms+0x15/0xa0 [amdgpu]
       amdgpu_pci_probe+0x1b7/0x4c0 [amdgpu]
       pci_device_probe+0x1c5/0x260
       really_probe+0x130/0x470
       __driver_probe_device+0x77/0x150
       driver_probe_device+0x19/0x120
       __driver_attach+0xb1/0x1e0
       ? __cfi___driver_attach+0x10/0x10
       bus_for_each_dev+0x115/0x170
       bus_add_driver+0x192/0x2d0
       driver_register+0x5c/0xf0
       ? __cfi_init_module+0x10/0x10 [amdgpu]
       do_one_initcall+0x128/0x380
       ? idr_alloc_cyclic+0x139/0x1d0
       ? security_kernfs_init_security+0x42/0x140
       ? __kernfs_new_node+0x1be/0x250
       ? sysvec_apic_timer_interrupt+0xb6/0xc0
       ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
       ? _raw_spin_unlock+0x11/0x30
       ? free_unref_page+0x283/0x650
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? kfree+0x274/0x3a0
       ? load_module+0xf2e/0x1130
       ? __kmalloc_cache_noprof+0x12a/0x2e0
       do_init_module+0x7d/0x240
       __se_sys_init_module+0x19e/0x220
       do_syscall_64+0x8a/0x150
       ? __irq_exit_rcu+0x5e/0x100
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x7fe6bb5980ee
      Code: 48 8b 0d 3d ed 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0a ed 12 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd462219d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
      RAX: ffffffffffffffda RBX: 0000556caf0d0670 RCX: 00007fe6bb5980ee
      RDX: 0000556caf0d3080 RSI: 0000000002893458 RDI: 00007fe6b3400010
      RBP: 0000000000020000 R08: 0000000000020010 R09: 0000000000000080
      R10: c26073c166186e00 R11: 0000000000000206 R12: 0000556caf0d3430
      R13: 0000556caf0d0670 R14: 0000556caf0d3080 R15: 0000556caf0ce700
       </TASK>
      Modules linked in: amdgpu(+) i915(+) drm_suballoc_helper intel_gtt drm_exec drm_buddy iTCO_wdt i2c_algo_bit intel_pmc_bxt drm_display_helper iTCO_vendor_support gpu_sched drm_ttm_helper cec ttm amdxcp video backlight pinctrl_alderlake nct6775 hwmon_vid nct6775_core coretemp
      CR2: 00000000000000a0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:dma_addressing_limited+0x53/0xa0
      Code: 8b 93 48 02 00 00 48 39 d1 49 89 d6 4c 0f 42 f1 48 85 d2 4c 0f 44 f1 f6 83 fc 02 00 00 40 75 0a 48 89 df e8 1f 09 00 00 eb 24 <4c> 8b 1c 25 a0 00 00 00 4d 85 db 74 17 48 89 df 41 ba 8b 84 2d 55
      RSP: 0018:ffffa8d2c12cf740 EFLAGS: 00010202
      RAX: 00000000ffffffff RBX: ffff8948820220c8 RCX: 000000ffffffffff
      RDX: 0000000000000000 RSI: ffffffffc124dc6d RDI: ffff8948820220c8
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff894883c3f040
      R13: ffff89488dac8828 R14: 000000ffffffffff R15: ffff8948820220c8
      FS:  00007fe6ba881900(0000) GS:ffff894fdf700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 0000000111984000 CR4: 0000000000f50ef0
      PKRU: 55555554
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219292
      
      
      Reported-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
  5. Sep 22, 2024
  6. Sep 20, 2024
    • crash: Fix riscv64 crash memory reserve dead loop · b3f835cd
      Jinjie Ruan authored
      
      On a RISCV64 QEMU machine with 512MB of memory, the cmdline
      "crashkernel=500M,high" causes the system to stall as below:
      
      	 Zone ranges:
      	   DMA32    [mem 0x0000000080000000-0x000000009fffffff]
      	   Normal   empty
      	 Movable zone start for each node
      	 Early memory node ranges
      	   node   0: [mem 0x0000000080000000-0x000000008005ffff]
      	   node   0: [mem 0x0000000080060000-0x000000009fffffff]
      	 Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
      	(stall here)
      
      Commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
      bug") fixed this on 32-bit architectures. However, the problem is not
      completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on a 64-bit
      architecture, for example, when system memory is equal to
      CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur:
      
      	-> reserve_crashkernel_generic() and high is true
      	   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
      	      -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
      	         (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).
      
      As Catalin suggested, do not remove the logic that falls back from a
      ",high" reservation to ",low", since that would change arm64's kdump
      behavior. Instead, fix it by skipping the above situation, similar to
      commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop").
      
      After this patch, it prints:
      	cannot allocate crashkernel (size:0x1f400000)
      
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Link: https://lore.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com

      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
  7. Sep 17, 2024
    • uprobes: turn xol_area->pages[2] into xol_area->page · 2abbcc09
      Oleg Nesterov authored
      Now that xol_mapping has its own ->fault() method we no longer need
      xol_area->pages[1] == NULL, we need a single page.
      
      Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com
      
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • uprobes: introduce the global struct vm_special_mapping xol_mapping · 6d27a31e
      Oleg Nesterov authored
      Currently each xol_area has its own instance of vm_special_mapping; this
      is suboptimal and ugly.  Kill xol_area->xol_mapping and add a single
      global instance of vm_special_mapping; the ->fault() method can use
      area->pages rather than xol_mapping->pages.

      As a side effect this fixes the problem introduced by the recent commit
      223febc6 ("mm: add optional close() to struct vm_special_mapping"): if
      special_mapping_close() is called from the __mmput() paths, it will use
      vma->vm_private_data = &area->xol_mapping freed by uprobe_clear_state().
      
      Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com
      
      
      Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sven Schnelle <svens@linux.ibm.com>
      Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/
      
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "uprobes: use vm_special_mapping close() functionality" · ed8d5b0c
      Oleg Nesterov authored
      This reverts commit 08e28de1.
      
      A malicious application can munmap() its "[uprobes]" vma and in this case
      xol_mapping.close == uprobe_clear_state() will free the memory which can
      be used by another thread, or the same thread when it hits the uprobe bp
      afterwards.
      
      Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com
      
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource, kunit: add test case for region_intersects() · 99185c10
      Huang Ying authored
      Patch series "resource: Fix region_intersects() vs
      add_memory_driver_managed()", v3.
      
      The patchset fixes a bug in region_intersects() for systems with CXL
      memory.  The details of the bug can be found in [1/3].  To avoid similar
      bugs in the future, a kunit test case for region_intersects() is added in
      [3/3].  [2/3] is a preparation patch for [3/3].
      
      
      This patch (of 3):
      
      region_intersects() is important because it's used for /dev/mem permission
      checking.  To guard against future bugs in region_intersects(), a kunit
      test case for it is added.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource: make alloc_free_mem_region() works for iomem_resource · bacf9c3c
      Huang Ying authored
      While developing a kunit test case for region_intersects(), some fake
      resources need to be inserted into iomem_resource.  To do that, a resource
      hole needs to be found first in iomem_resource.

      However, alloc_free_mem_region() currently cannot work for iomem_resource,
      because the start address being checked cannot be 0 (gfr_continue()
      relies on that to detect address wrapping), while
      iomem_resource.start == 0.  To make alloc_free_mem_region() work for
      iomem_resource, gfr_start() is changed to avoid returning 0 even if
      base->start == 0.  We don't need to check 0 as a start address.
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • resource: fix region_intersects() vs add_memory_driver_managed() · b4afe418
      Huang Ying authored
      On a system with CXL memory, the resource tree (/proc/iomem) related to
      CXL memory may look something like the following.
      
      490000000-50fffffff : CXL Window 0
        490000000-50fffffff : region0
          490000000-50fffffff : dax0.0
            490000000-50fffffff : System RAM (kmem)
      
      This is because drivers/dax/kmem.c calls add_memory_driver_managed() when
      onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
      Window X".  This confuses region_intersects(), which expects all "System
      RAM" resources to be at the top level of iomem_resource.  This can lead to
      bugs.

      For example, when the following command line is executed to write to
      memory in the CXL memory range via /dev/mem,
      
       $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
       dd: error writing '/dev/mem': Bad address
       1+0 records in
       0+0 records out
       0 bytes copied, 0.0283507 s, 0.0 kB/s
      
      the command fails as expected.  However, the error code is wrong.  It
      should be "Operation not permitted" instead of "Bad address".  More
      seriously, the /dev/mem permission checking in devmem_is_allowed() passes
      incorrectly.  Although the access is prevented later because ioremap()
      isn't allowed to map system RAM, it is a potential security issue.  While
      the command executes, the following warning is reported in the kernel log
      for calling ioremap() on system RAM.
      
       ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
       WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
       Call Trace:
        memremap+0xcb/0x184
        xlate_dev_mem_ptr+0x25/0x2f
        write_mem+0x94/0xfb
        vfs_write+0x128/0x26d
        ksys_write+0xac/0xfe
        do_syscall_64+0x9a/0xfd
        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      
      The details of the command execution are as follows.  In the above
      resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
      top level resource.  So, region_intersects() incorrectly reports no System
      RAM resources in the CXL memory region, because it only checks the top
      level resources.  Consequently, devmem_is_allowed() incorrectly returns 1
      (allow access via /dev/mem) for the CXL memory region.  Fortunately,
      ioremap() doesn't allow mapping System RAM and rejects the access.

      So, region_intersects() needs to be fixed to work correctly with a
      resource tree where "System RAM" is not at the top level, as above.  To fix
      it, if we find an unmatched resource at the top level, we continue to
      search for matched resources among its descendants.  So, we will not miss
      any matched resources in the resource tree anymore.
      
      In the new implementation, an example resource tree
      
      |------------- "CXL Window 0" ------------|
      |-- "System RAM" --|
      
      will behave similarly to the following fake resource tree for
      region_intersects(, IORESOURCE_SYSTEM_RAM, ),
      
      |-- "System RAM" --||-- "CXL Window 0a" --|
      
      Where "CXL Window 0a" is part of the original "CXL Window 0" that
      isn't covered by "System RAM".
      
      Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com
      
      
      Fixes: c221c0b0 ("device-dax: "Hotplug" persistent memory for use like normal RAM")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. Sep 13, 2024
  9. Sep 12, 2024
    • dma-mapping: reflow dma_supported · a5fb217f
      Christoph Hellwig authored
      
      dma_supported has become too much spaghetti for my taste.  Reflow it to
      remove the duplicate use_dma_iommu condition and make the main path more
      obvious.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Leon Romanovsky <leon@kernel.org>
    • uidgid: make sure we fit into one cacheline · 2077006d
      Christian Brauner authored
      When I expanded uidgid mappings I intended for a struct uid_gid_map to
      fit into a single cacheline on x86 as they tend to be pretty
      performance sensitive (idmapped mounts etc). But a 4 byte hole was added
      that brought it over 64 bytes. Fix that by adding the static extent
      array and the extent counter into a substruct. C's type punning for
      unions guarantees that we can access ->nr_extents even if the last
      written to member wasn't within the same object. This is also what we
      rely on in struct_group() and friends. This of course relies on not
      using strict aliasing, which we don't.
      
      99) If the member used to read the contents of a union object is not the
          same as the member last used to store a value in the object, the
          appropriate part of the object representation of the value is
          reinterpreted as an object representation in the new type as
          described in 6.2.6 (a process sometimes called "type punning").
      
      Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
      Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
      
      
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • dma-mapping: reliably inform about DMA support for IOMMU · f45cfab2
      Leon Romanovsky authored
      If the DMA IOMMU path is going to be used, the appropriate check should
      return that DMA is supported.
      
      Fixes: b5c58b2f ("dma-mapping: direct calls for dma-iommu")
      Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano
      
      
      Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • sched: Move update_other_load_avgs() to kernel/sched/pelt.c · 902d67a2
      Tejun Heo authored
      
      96fd6c65 ("sched: Factor out update_other_load_avgs() from
      __update_blocked_others()") added update_other_load_avgs() in
      kernel/sched/syscalls.c right above effective_cpu_util(). This location
      didn't fit that well in the first place, and with 5d871a63 ("sched/fair:
      Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
      effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
      place.
      
      Relocate the function to kernel/sched/pelt.c where all its callees are.
      
      No functional changes.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
    • workqueue: Clear worker->pool in the worker thread context · 73613840
      Lai Jiangshan authored
      Marc Hartmayer reported:
              [   23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
              [   23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
              [   23.133954] Fault in home space mode while using kernel ASCE.
              [   23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
              [   23.134207] Oops: 0004 ilc:2 [#1] SMP
      	(snip)
              [   23.134516] Call Trace:
              [   23.134520]  [<0000024e326caf28>] worker_thread+0x48/0x430
              [   23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
              [   23.134528]  [<0000024e326d3a3e>] kthread+0x11e/0x130
              [   23.134533]  [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
              [   23.134536]  [<0000024e333fb37a>] ret_from_fork+0xa/0x38
              [   23.134552] Last Breaking-Event-Address:
              [   23.134553]  [<0000024e333f4c04>] mutex_unlock+0x24/0x30
              [   23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Debugging and analysis show that worker_thread() accesses the nullified
      worker->pool when the newly created worker is destroyed before being
      woken up, in which case worker_thread() can see the result of
      detach_worker() resetting worker->pool to NULL at the beginning.
      
      Move the code "worker->pool = NULL;" out from detach_worker() to fix the
      problem.
      
      worker->pool had been designed to be constant for regular workers and
      changeable for rescuers. To share attaching/detaching code for regular
      and rescuer workers and to avoid worker->pool being accessed inadvertently
      when the worker has been detached, worker->pool is reset to NULL when
      detached, regardless of whether the worker is a rescuer.

      To keep worker->pool being reset after detaching, move the code
      "worker->pool = NULL;" into the worker thread context after detaching.

      That is either the regular worker thread context after PF_WQ_WORKER
      is cleared or the rescuer worker thread context with wq_pool_attach_mutex
      held. So it is safe to do so.
      
      Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
      Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
      
      
      Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
      Fixes: f4b7b53c ("workqueue: Detach workers directly in idle_cull_fn()")
      Cc: stable@vger.kernel.org # v6.11+
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. Sep 11, 2024
    • bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing · 376bd59e
      Yonghong Song authored
      
      Salvatore Benedetto reported an issue that when doing syscall tracepoint
      tracing the kernel stack is empty. For example, using the following
      command line
        bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
        bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'
      the output for both commands is
      ===
        Kernel Stack
      ===
      
      Further analysis shows that pt_regs used for bpf syscall tracepoint
      tracing is from the one constructed during user->kernel transition.
      The call stack looks like
        perf_syscall_enter+0x88/0x7c0
        trace_sys_enter+0x41/0x80
        syscall_trace_enter+0x100/0x160
        do_syscall_64+0x38/0xf0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      The ip address stored in pt_regs is from user space hence no kernel
      stack is printed.
      
      To fix the issue, a kernel address in pt_regs is required.
      In the kernel repo, there are already a few cases like this. For example,
      in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
      instances are used to supply the ip address or to use the ip address to
      construct the call stack.

      Instead of allocating fake_regs on the stack, which may consume
      a lot of bytes, the function perf_trace_buf_alloc() in
      perf_syscall_{enter, exit}() is leveraged to create fake_regs,
      which will be passed to perf_call_bpf_{enter,exit}().
      
      For the above bpftrace script, I got the following output with this patch:
      for tracepoint:syscalls:sys_enter_read
      ===
        Kernel Stack
      
              syscall_trace_enter+407
              syscall_trace_enter+407
              do_syscall_64+74
              entry_SYSCALL_64_after_hwframe+75
      ===
      and for tracepoint:syscalls:sys_exit_read
      ===
      Kernel Stack
      
              syscall_exit_work+185
              syscall_exit_work+185
              syscall_exit_to_user_mode+305
              do_syscall_64+118
              entry_SYSCALL_64_after_hwframe+75
      ===
      
      Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev