  1. Jan 24, 2024
    • exec: Distinguish in_execve from in_exec · 90383cc0
      Kees Cook authored
      
      Just to help distinguish the fs->in_exec flag from the current->in_execve
      flag, add comments in check_unsafe_exec() and copy_fs() for more
      context. Also note that in_execve is only used by TOMOYO now.
      
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      90383cc0
  2. Jan 05, 2024
  3. Dec 27, 2023
  4. Dec 21, 2023
  5. Dec 20, 2023
  6. Dec 12, 2023
  7. Dec 11, 2023
    • arch: remove ARCH_TASK_STRUCT_ALLOCATOR · 3888750e
      Heiko Carstens authored
      IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ALLOCATOR.
      IA-64 was removed with commit cf8e8658 ("arch: Remove Itanium (IA-64)
      architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR
      as well.
      
      Link: https://lkml.kernel.org/r/20231116133638.1636277-3-hca@linux.ibm.com
      
      
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3888750e
    • arch: remove ARCH_THREAD_STACK_ALLOCATOR · f72709ab
      Heiko Carstens authored
      Patch series "Remove unused code after IA-64 removal".
      
      While looking into something different I noticed that there are a couple
      of Kconfig options which were only selected by IA-64 and which are now
      unused.
      
      So remove them and simplify the code a bit.
      
      
      This patch (of 3):
      
      IA-64 was the only architecture which selected ARCH_THREAD_STACK_ALLOCATOR.
      IA-64 was removed with commit cf8e8658 ("arch: Remove Itanium (IA-64)
      architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as
      well.
      
      Link: https://lkml.kernel.org/r/20231116133638.1636277-1-hca@linux.ibm.com
      Link: https://lkml.kernel.org/r/20231116133638.1636277-2-hca@linux.ibm.com
      
      
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f72709ab
    • fork: use __mt_dup() to duplicate maple tree in dup_mmap() · d2406291
      Peng Zhang authored
      In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
      directly replacing the entries of VMAs in the new maple tree can result in
      better performance.  __mt_dup() uses DFS pre-order to duplicate the maple
      tree, so it is efficient.
      
      The average time complexity of __mt_dup() is O(n), where n is the number
      of VMAs.  The proof of the time complexity is provided in the commit log
      that introduces __mt_dup().  After duplicating the maple tree, each
      element is traversed and replaced (ignoring the cases of deletion, which
      are rare).  Since it is only a replacement operation for each element,
      this process is also O(n).
      
      Analyzing the exact time complexity of the previous algorithm is
      challenging because each insertion can involve appending to a node,
      pushing data to adjacent nodes, or even splitting nodes.  The frequency of
      each action is difficult to calculate.  The worst-case scenario for a
      single insertion is when the tree undergoes splitting at every level.  If
      we consider each insertion as the worst-case scenario, we can determine
      that the upper bound of the time complexity is O(n*log(n)), although this
      is a loose upper bound.  However, based on the test data, it appears that
      the actual time complexity is likely to be O(n).
      
      As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
      fails, there will be a portion of VMAs that have not been duplicated in
      the maple tree.  To handle this, we mark the failure point with
      XA_ZERO_ENTRY.  In exit_mmap(), when this marker is encountered, we stop
      releasing VMAs, since everything after that point was never duplicated.
      
      There is a "spawn" in byte-unixbench[1], which can be used to test the
      performance of fork().  I modified it slightly to make it work with
      different number of VMAs.
      
      Below are the test results.  The first row shows the number of VMAs.  The
      next two rows show the number of fork() calls per ten seconds for
      next-20231006 (base) and this patchset (patched); the last row is the gain.  The
      test results were obtained with CPU binding to avoid scheduler load
      balancing that could cause unstable results.  There are still some
      fluctuations in the test results, but at least they are better than the
      original performance.
      
      VMAs      21     121   221    421    821    1621   3221   6421   12821  25621  51221
      base      112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
      patched   114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
      gain      2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
      
      [1] https://github.com/kdlucas/byte-unixbench/tree/master
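
      For reference, a minimal stand-in for such a test might look like the C
      sketch below; the alternating-protection mmap trick and the ten-second
      window are assumptions of this sketch, not details of the actual
      byte-unixbench modification used for the numbers above.

        /*
         * Minimal fork()-rate micro-benchmark over a configurable number of
         * VMAs. Illustrative stand-in only.
         * Build: cc -O2 -o forkbench forkbench.c
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <time.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            long nr_vmas = argc > 1 ? atol(argv[1]) : 1000;
            long pagesz = sysconf(_SC_PAGESIZE);

            /* Alternate protections so the kernel cannot merge the mappings
             * into a single VMA. */
            for (long i = 0; i < nr_vmas; i++) {
                int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;

                if (mmap(NULL, pagesz, prot, MAP_PRIVATE | MAP_ANONYMOUS,
                         -1, 0) == MAP_FAILED) {
                    perror("mmap");
                    return 1;
                }
            }

            /* Count fork()+wait() round trips completed in ten seconds. */
            long forks = 0;
            time_t end = time(NULL) + 10;

            while (time(NULL) < end) {
                pid_t pid = fork();

                if (pid == 0)
                    _exit(0);
                if (pid < 0) {
                    perror("fork");
                    return 1;
                }
                waitpid(pid, NULL, 0);
                forks++;
            }
            printf("%ld VMAs: %ld forks in 10s\n", nr_vmas, forks);
            return 0;
        }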
      
      Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
      
      
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Mike Christie <michael.christie@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2406291
  8. Oct 19, 2023
    • file: convert to SLAB_TYPESAFE_BY_RCU · 0ede61d8
      Christian Brauner authored
      
      In recent discussions around some performance improvements in the file
      handling area we discussed switching the file cache to rely on
      SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
      freeing for files completely. This is a pretty sensitive change overall
      but it might actually be worth doing.
      
      The main downside is the subtlety. The other one is that we should
      really wait for Jann's patch to land that enables KASAN to handle
      SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
      exists.
      
      With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
      which requires a few changes. So it isn't sufficient anymore to just
      acquire a reference to the file in question under rcu using
      atomic_long_inc_not_zero() since the file might have already been
      recycled and someone else might have bumped the reference.
      
      In other words, callers might see reference count bumps from newer
      users. For this reason it is necessary to verify that the pointer is the
      same before and after the reference count increment. This pattern can be
      seen in get_file_rcu() and __files_get_rcu().
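
      To make the pattern concrete, here is a compact userspace model of the
      check-after-increment idiom (hypothetical types and names; the RCU
      read-side protection and slab details behind the real get_file_rcu()
      are deliberately not modelled):

        /*
         * Userspace model of "take a reference, then re-check the slot".
         * Hypothetical types and names, not the kernel implementation.
         */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        struct obj {
            atomic_long refcount;           /* 0 means free / reusable */
        };

        /* Like atomic_long_inc_not_zero(): never resurrect a dead object. */
        static bool get_ref_not_zero(struct obj *o)
        {
            long old = atomic_load(&o->refcount);

            while (old != 0)
                if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
                    return true;
            return false;
        }

        static void put_ref(struct obj *o)
        {
            atomic_fetch_sub(&o->refcount, 1);  /* freeing path omitted */
        }

        /* Acquire a stable reference to whatever *slot currently points to. */
        static struct obj *get_obj_from_slot(struct obj *_Atomic *slot)
        {
            for (;;) {
                struct obj *o = atomic_load(slot);

                if (!o)
                    return NULL;
                if (!get_ref_not_zero(o))
                    continue;               /* object died; reload the slot */
                /*
                 * The object may have been freed and recycled between the
                 * load and the increment, so the count we bumped might
                 * belong to a new user. Only if the slot still points at
                 * the same object is the reference the one we wanted.
                 */
                if (atomic_load(slot) == o)
                    return o;
                put_ref(o);                 /* recycled under us; retry */
            }
        }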
      
      In addition, it isn't possible to access or check fields in struct file
      without first acquiring a reference on it. Not doing that was always
      very dodgy and it was only usable for non-pointer data in struct file.
      With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
      reference under rcu or they must hold the files_lock of the fdtable.
      Failing to do either one of these is a bug.
      
      Thanks to Jann for pointing out that we need to ensure memory ordering
      between reallocations and pointer check by ensuring that all subsequent
      loads have a dependency on the second load in get_file_rcu() and
      providing a fixup that was folded into this patch.
      
      Cc: Jann Horn <jannh@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      0ede61d8
  9. Oct 18, 2023
    • mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
      Lorenzo Stoakes authored
      Patch series "permit write-sealed memfd read-only shared mappings", v4.
      
      The man page for fcntl() describing memfd file seals states the following
      about F_SEAL_WRITE:-
      
          Furthermore, trying to create new shared, writable memory-mappings via
          mmap(2) will also fail with EPERM.
      
      With emphasis on 'writable'.  It turns out, in fact, that currently the
      kernel simply disallows all new shared memory mappings for a memfd with
      F_SEAL_WRITE applied, rendering this documentation inaccurate.
      
      This matters because users are therefore unable to obtain any shared mapping
      to a memfd after write sealing, which limits its usefulness.
      This was reported in the discussion thread [1] originating from a bug
      report [2].
      
      This is a product of both using the struct address_space->i_mmap_writable
      atomic counter to determine whether writing may be permitted, and the
      kernel adjusting this counter when any VM_SHARED mapping is performed and
      more generally implicitly assuming VM_SHARED implies writable.
      
      It seems sensible that we should only update this mapping if VM_MAYWRITE
      is specified, i.e.  whether it is possible that this mapping could at any
      point be written to.
      
      If we do so then all we need to do to permit write seals to function as
      documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
      this functionality already exists for F_SEAL_FUTURE_WRITE - we can
      therefore simply adapt this logic to do the same for F_SEAL_WRITE.
      
      We then hit a chicken and egg situation in mmap_region() where the check
      for VM_MAYWRITE occurs before we are able to clear this flag.  To work
      around this, perform this check after we invoke call_mmap(), with careful
      consideration of error paths.
      
      Thanks to Andy Lutomirski for the suggestion!
      
      [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
      [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
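
      As a self-contained illustration of the user-visible behaviour at stake
      (a demo, not part of the series): per the description above, the final
      read-only MAP_SHARED mmap() below fails with EPERM on kernels without
      these patches and succeeds with them.

        /* Demo: map a write-sealed memfd read-only and shared. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            const char msg[] = "hello";
            int fd = memfd_create("sealed", MFD_ALLOW_SEALING);

            if (fd < 0 || write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
                perror("memfd setup");
                return 1;
            }

            /* Forbid any further writes through any fd or mapping. */
            if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0) {
                perror("F_ADD_SEALS");
                return 1;
            }

            /* Read-only shared mapping of the sealed memfd. */
            void *p = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap(PROT_READ, MAP_SHARED)");  /* EPERM pre-series */
                return 1;
            }
            printf("mapped write-sealed memfd read-only: %s\n", (char *)p);
            return 0;
        }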
      
      
      This patch (of 3):
      
      There is a general assumption that VMAs with the VM_SHARED flag set are
      writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
      case.
      
      Update those checks which affect the struct address_space->i_mmap_writable
      field to explicitly test for this by introducing
      [vma_]is_shared_maywrite() helper functions.
      
      This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
      that the VMA cannot be written to.
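
      The new predicates are presumably thin wrappers over the VMA flags, along
      these lines (a sketch inferred from the description, not the patch
      verbatim):

        /* Sketch only: shape inferred from the description above. */
        static inline bool is_shared_maywrite(vm_flags_t vm_flags)
        {
            return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
                   (VM_SHARED | VM_MAYWRITE);
        }

        static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
        {
            return is_shared_maywrite(vma->vm_flags);
        }

      Call sites that account against i_mmap_writable would then test
      vma_is_shared_maywrite() rather than VM_SHARED alone.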
      
      Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
      
      
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e8e17ee9
  10. Oct 10, 2023
  11. Oct 06, 2023
  12. Oct 04, 2023
  13. Sep 11, 2023
    • arch: Remove Itanium (IA-64) architecture · cf8e8658
      Ard Biesheuvel authored
      The Itanium architecture is obsolete, and an informal survey [0] reveals
      that any residual use of Itanium hardware in production is mostly HP-UX
      or OpenVMS based. The use of Linux on Itanium appears to be limited to
      enthusiasts that occasionally boot a fresh Linux kernel to see whether
      things are still working as intended, and perhaps to churn out some
      distro packages that are rarely used in practice.
      
      None of the original companies behind Itanium still produce or support
      any hardware or software for the architecture, and it is listed as
      'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
      that contributed on behalf of those companies (nor anyone else, for that
      matter) have been willing to support or maintain the architecture
      upstream or even be responsible for applying the odd fix. The Intel
      firmware team removed all IA-64 support from the Tianocore/EDK2
      reference implementation of EFI in 2018. (Itanium is the original
      architecture for which EFI was developed, and the way Linux supports it
      deviates significantly from other architectures.) Some distros, such as
      Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
      dropped support years ago.
      
      While the argument is being made [1] that there is a 'for the common
      good' angle to being able to build and run existing projects such as the
      Grid Community Toolkit [2] on Itanium for interoperability testing, the
      fact remains that none of those projects are known to be deployed on
      Linux/ia64, and very few people actually have access to such a system in
      the first place. Even if there were ways imaginable in which Linux/ia64
      could be put to good use today, what matters is whether anyone is
      actually doing that, and this does not appear to be the case.
      
      There are no emulators widely available, and so boot testing Itanium is
      generally infeasible for ordinary contributors. GCC still supports IA-64
      but its compile farm [3] no longer has any IA-64 machines. GLIBC would
      like to get rid of IA-64 [4] too because it would permit some overdue
      code cleanups. In summary, the benefits to the ecosystem of having IA-64
      be part of it are mostly theoretical, whereas the maintenance overhead
      of keeping it supported is real.
      
      So let's rip off the band aid, and remove the IA-64 arch code entirely.
      This follows the timeline proposed by the Debian/ia64 maintainer [5],
      which removes support in a controlled manner, leaving IA-64 in a known
      good state in the most recent LTS release. Other projects will follow
      once the kernel support is removed.
      
      [0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
      [1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
      [2] https://gridcf.org/gct-docs/latest/index.html
      [3] https://cfarm.tetaneutral.net/machines/list/
      [4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
      [5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/
      
      
      
      Acked-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      cf8e8658
  14. Aug 25, 2023
  15. Aug 21, 2023
    • kernel/fork: stop playing lockless games for exe_file replacement · a7031f14
      Mateusz Guzik authored
      xchg originated in 6e399cd1 ("prctl: avoid using mmap_sem for exe_file
      serialization").  While the commit message does not explain *why* the
      change, I found the original submission [1] which ultimately claims it
      cleans things up by removing dependency of exe_file on the semaphore.
      
      However, fe69d560 ("kernel/fork: always deny write access to current
      MM exe_file") added a semaphore up/down cycle to synchronize the state of
      exe_file against fork, defeating the point of the original change.
      
      This is on top of semaphore trips already present both in the replacing
      function and prctl (the only consumer).
      
      Normally replacing exe_file does not happen for busy processes, thus
      write-locking is not an impediment to performance in the intended use
      case.  If someone keeps invoking the routine for a busy process, they are
      trying to play dirty and that's another reason to avoid any trickery.
      
      As such I think the atomic here only adds complexity for no benefit.
      
      Just write-lock around the replacement.
      
      I also note that replacement races against the mapping check loop, as
      nothing synchronizes actual assignment with said checks, but I am not
      addressing it in this patch.  (Is the loop of any use to begin with?)
      
      Link: https://lore.kernel.org/linux-mm/1424979417.10344.14.camel@stgolabs.net/ [1]
      Link: https://lkml.kernel.org/r/20230814172140.1777161-1-mjguzik@gmail.com
      
      
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7031f14
  16. Jul 13, 2023
    • kernel/fork: beware of __put_task_struct() calling context · d243b344
      Wander Lairson Costa authored
      
      Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
      locks. Therefore, it can't be called from a non-preemptible context.
      
      One practical example is a splat inside inactive_task_timer(), which is
      called in interrupt context:
      
        CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
         Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
         Call Trace:
         dump_stack_lvl+0x57/0x7d
         mark_lock_irq.cold+0x33/0xba
         mark_lock+0x1e7/0x400
         mark_usage+0x11d/0x140
         __lock_acquire+0x30d/0x930
         lock_acquire.part.0+0x9c/0x210
         rt_spin_lock+0x27/0xe0
         refill_obj_stock+0x3d/0x3a0
         kmem_cache_free+0x357/0x560
         inactive_task_timer+0x1ad/0x340
         __run_hrtimer+0x8a/0x1a0
         __hrtimer_run_queues+0x91/0x130
         hrtimer_interrupt+0x10f/0x220
         __sysvec_apic_timer_interrupt+0x7b/0xd0
         sysvec_apic_timer_interrupt+0x4f/0xd0
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         RIP: 0033:0x7fff196bf6f5
      
      Instead of calling __put_task_struct() directly, we defer it using
      call_rcu(). A more natural approach would use a workqueue, but since
      we can't allocate dynamic memory from atomic context under PREEMPT_RT,
      the code would become more complex because we would need to put the
      work_struct instance in the task_struct and initialize it when we
      allocate a new task_struct.
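
      The resulting helper is roughly shaped like the sketch below (illustrative
      of the approach described above; the guards in the merged patch may
      differ):

        /* Sketch of the deferral described above; not the verbatim patch. */
        static void __put_task_struct_rcu_cb(struct rcu_head *rhp)
        {
            struct task_struct *task = container_of(rhp, struct task_struct, rcu);

            __put_task_struct(task);
        }

        static inline void put_task_struct(struct task_struct *t)
        {
            if (!refcount_dec_and_test(&t->usage))
                return;

            if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible())
                __put_task_struct(t);   /* safe: sleeping is allowed here */
            else
                call_rcu(&t->rcu, __put_task_struct_rcu_cb);  /* defer */
        }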
      
      The issue is reproducible with stress-ng:
      
        while true; do
            stress-ng --sched deadline --sched-period 1000000000 \
      	      --sched-runtime 800000000 --sched-deadline \
      	      1000000000 --mmapfork 23 -t 20
        done
      
      Reported-by: Hu Chunyu <chuhu@redhat.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Suggested-by: Valentin Schneider <vschneid@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Wander Lairson Costa <wander@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com
      d243b344
  17. Jul 08, 2023
  18. Jun 10, 2023
    • fork: optimize memcg_charge_kernel_stack() a bit · 4e2f6342
      Haifeng Xu authored
      Since commit f1c1a9ee ("fork: Move memcg_charge_kernel_stack()
      into CONFIG_VMAP_STACK"), memcg_charge_kernel_stack() has been moved
      into the CONFIG_VMAP_STACK block, so the CONFIG_VMAP_STACK check can be
      removed.
      
      Furthermore, memcg_charge_kernel_stack() is now only invoked by
      alloc_thread_stack_node(), not by dup_task_struct(). If
      memcg_kmem_charge_page() fails, the uncharge process is handled in
      memcg_charge_kernel_stack() itself instead of free_thread_stack(),
      so remove the incorrect comments.
      
      If memcg_charge_kernel_stack() fails to charge the pages used by a kernel
      stack, only the pages that were actually charged need to be uncharged.
      It's unnecessary to uncharge pages whose memory cgroup pointer is NULL.
      
      [akpm@linux-foundation.org: remove assertion that PAGE_SIZE is a multiple of 1k]
      Link: https://lkml.kernel.org/r/20230508064458.32855-1-haifeng.xu@shopee.com
      
      
      Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e2f6342
  19. Jun 02, 2023
  20. Jun 01, 2023
    • fork, vhost: Use CLONE_THREAD to fix freezer/ps regression · f9010dbd
      Mike Christie authored
      
      When switching from kthreads to vhost_tasks two bugs were added:
      1. The vhost worker tasks now show up as processes, so scripts doing
         ps or ps a would now incorrectly detect the vhost task as another
         process.
      2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost
         tasks didn't disable freezing or add support for it.
      
      To fix both bugs, this switches the vhost task to be a thread in the
      process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
      get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
      SIGKILL/STOP support is required because CLONE_THREAD requires
      CLONE_SIGHAND which requires those 2 signals to be supported.
      
      This is a modified version of the patch written by Mike Christie
      <michael.christie@oracle.com>, which was a modified version of a patch
      originally written by Linus.
      
      Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER,
      including ignoring signals, setting up the register state, and having
      get_signal return instead of calling do_group_exit.
      
      Tidied up the vhost_task abstraction so that the definition of
      vhost_task only needs to be visible inside of vhost_task.c, making
      it easier to review the code and tell what needs to be done where.
      As part of this the main loop has been moved from vhost_worker into
      vhost_task_fn.  vhost_worker now returns true if work was done.
      
      The main loop has been updated to call get_signal which handles
      SIGSTOP, freezing, and collects the message that tells the thread to
      exit as part of process exit.  This collection clears
      __fatal_signal_pending.  This collection is not guaranteed to
      clear signal_pending() so clear that explicitly so the schedule()
      sleeps.
      
      For now the vhost thread continues to exist and run work until the
      last file descriptor is closed and the release function is called as
      part of freeing struct file.  To avoid hangs in the coredump
      rendezvous and when killing threads in a multi-threaded exec, the
      coredump code and de_thread have been modified to ignore vhost threads.
      
      Removing the special case for exec appears to require teaching
      vhost_dev_flush how to directly complete transactions in case
      the vhost thread is no longer running.
      
      Removing the special case for coredump rendezvous requires either the
      above fix needed for exec or moving the coredump rendezvous into
      get_signal.
      
      Fixes: 6e890c5d ("vhost: use vhost_tasks for worker threads")
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Co-developed-by: Mike Christie <michael.christie@oracle.com>
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9010dbd
  21. Apr 21, 2023
    • sched: Fix performance regression introduced by mm_cid · 223baf9d
      Mathieu Desnoyers authored
      
      Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
      sysbench regression reported by Aaron Lu.
      
      Keep track of the currently allocated mm_cid for each mm/cpu rather than
      freeing them immediately on context switch. This eliminates most atomic
      operations when context switching back and forth between threads
      belonging to different memory spaces in multi-threaded scenarios (many
      processes, each with many threads). The per-mm/per-cpu mm_cid values are
      serialized by their respective runqueue locks.
      
      Thread migration is handled by introducing a call to
      sched_mm_cid_migrate_to() (with destination runqueue lock held) in
      activate_task() for migrating tasks. If the destination cpu's mm_cid is
      unset, and if the source runqueue is not actively using its mm_cid, then
      the source cpu's mm_cid is moved to the destination cpu on migration.
      
      Introduce a task-work executed periodically, similarly to NUMA work,
      which delays reclaim of cid values when they are unused for a period of
      time.
      
      Keep track of the allocation time for each per-cpu cid, and let the task
      work clear them when they are observed to be older than
      SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
      mm_cids which are greater or equal to the Hamming weight of the mm
      cidmask to keep concurrency ids compact.
      
      Because we want to ensure the mm_cid converges towards the smaller
      values as migrations happen, the prior optimization that was done when
      context switching between threads belonging to the same mm is removed,
      because it could delay the lazy release of the destination runqueue
      mm_cid after it has been replaced by a migration. Removing this prior
      optimization is not an issue performance-wise because the introduced
      per-mm/per-cpu mm_cid tracking also covers this more specific case.
      
      Fixes: af7f588d ("sched: Introduce per-memory-map concurrency ID")
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
      223baf9d
  22. Apr 18, 2023
  23. Apr 06, 2023
    • sched/numa: apply the scan delay to every new vma · ef6a22b7
      Mel Gorman authored
      Patch series "sched/numa: Enhance vma scanning", v3.
      
      The patchset proposes one of the enhancements to numa vma scanning
      suggested by Mel.  This is a continuation of [3].
      
      Reposting the rebased patchset to the akpm mm-unstable tree (March 1).
      
      The existing scan-period mechanism derives the scan period from
      per-thread stats.  Process Adaptive autoNUMA [1] proposed gathering NUMA
      fault stats at the per-process level to capture application behaviour better.
      
      During that course of discussion, Mel proposed several ideas to enhance
      current numa balancing.  One of the suggestions was the following:
      
      Track what threads access a VMA.  The suggestion was to use an unsigned
      long pid_mask and use the lower bits to tag approximately what threads
      access a VMA.  Skip VMAs that did not trap a fault.  This would be
      approximate because of PID collisions but would reduce scanning of areas
      the thread is not interested in.  The above suggestion intends not to
      penalize threads that has no interest in the vma, thus reduce scanning
      overhead.
      
      V3 changes are mostly based on PeterZ comments (details below in changes)
      
      Summary of patchset:
      
      Current patchset implements:
      
      1. Delay the vma scanning logic for newly created VMA's so that
         additional overhead of scanning is not incurred for short lived tasks
         (implementation by Mel)
      
      2. Store the information of tasks accessing VMA in 2 windows.  It is
         regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval. 
         The above time is derived from experimenting (Suggested by PeterZ) to
         balance between frequent clearing vs obsolete access data
      
      3. hash_32 used to encode task index accessing VMA information
      
       4. The VMA's access information is used to skip scanning for tasks
          which had not accessed the VMA
      
      Changes since V2:
      patch1: 
       - Renaming of structure, macro to function,
       - Add explanation to heuristics
       - Adding more details from result (PeterZ)
       Patch2:
       - Usage of test and set bit (PeterZ)
       - Move storing access PID info to numa_migrate_prep()
       - Add a note on fairness among tasks allowed to scan
         (PeterZ)
       Patch3:
       - Maintain two windows of access PID information
        (PeterZ supported the implementation and gave the idea to extend
         to N if needed)
       Patch4:
       - Apply hash_32 function to track VMA accessing PIDs (PeterZ)
      
      Changes since RFC V1:
       - Include Mel's vma scan delay patch
       - Change the accessing pid store logic (Thanks Mel)
       - Fencing structure / code to NUMA_BALANCING (David, Mel)
       - Adding clearing access PID logic (Mel)
       - Descriptive change log (Mike Rapoport)
      
      Things to ponder over:
      ==========================================
      
      - Improvement to clearing accessing PIDs logic (discussed in detail in
        patch3 itself; done in this patchset by implementing a 2-window history)
      
      - Current scan period is not changed in the patchset, so we do see
        frequent tries to scan.  Relaxing scan period dynamically could improve
        results further.
      
      [1] sched/numa: Process Adaptive autoNUMA 
       Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
      
      [2] RFC V1 Link: 
        https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/
      
      [3] V2 Link:
        https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/
      
      
      Results:
      Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement 
      is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
      (dbench had huge std deviation to post)
      
      kernbench
      ===========
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
      Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
      Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*
      
      Duration User       66017.43    67959.84
      Duration System     30503.15    24657.03
      Duration Elapsed      504.61      493.12
      
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Ops NUMA alloc hit                1738835089.00  1738780310.00
      Ops NUMA alloc local              1738834448.00  1738779711.00
      Ops NUMA base-page range updates      477310.00      392566.00
      Ops NUMA PTE updates                  477310.00      392566.00
      Ops NUMA hint faults                   96817.00       87555.00
      Ops NUMA hint local faults %           10150.00        2192.00
      Ops NUMA hint local percent               10.48           2.50
      Ops NUMA pages migrated                86660.00       85363.00
      Ops AutoNUMA cost                        489.07         442.14
      
      autonumabench
      ===============
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
      Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
      Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
      Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
      Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
      Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
      Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
      Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*
      
      Duration User      396433.47   324835.96
      Duration System      2808.70      376.66
      Duration Elapsed     2258.61     2258.12
      
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Ops NUMA alloc hit                  59921806.00    49623489.00
      Ops NUMA alloc miss                        0.00           0.00
      Ops NUMA interleave hit                    0.00           0.00
      Ops NUMA alloc local                59920880.00    49622594.00
      Ops NUMA base-page range updates   152259275.00       50075.00
      Ops NUMA PTE updates               152259275.00       50075.00
      Ops NUMA PMD updates                       0.00           0.00
      Ops NUMA hint faults               154660352.00       39014.00
      Ops NUMA hint local faults %       138550501.00       23139.00
      Ops NUMA hint local percent               89.58          59.31
      Ops NUMA pages migrated              8179067.00       14147.00
      Ops AutoNUMA cost                     774522.98         195.69
      
      
      This patch (of 4):
      
      Currently whenever a new task is created we wait for
      sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead. 
      Extend the same logic to new or very short-lived VMAs.
      
      [raghavendra.kt@amd.com: add initialization in vm_area_dup()]
      Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
      Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
      
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Disha Talreja <dishaa.talreja@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef6a22b7
    • mm: separate vma->lock from vm_area_struct · c7f8f31c
      Suren Baghdasaryan authored
      vma->lock being part of the vm_area_struct causes performance regression
      during page faults because during contention its count and owner fields
      are constantly updated and having other parts of vm_area_struct used
      during page fault handling next to them causes constant cache line
      bouncing.  Fix that by moving the lock outside of the vm_area_struct.
      
      All attempts to keep vma->lock inside vm_area_struct in a separate cache
      line still produce a performance regression, especially on NUMA machines.
      The smallest regression was achieved when the lock is placed in the fourth
      cache line, but that bloats vm_area_struct to 256 bytes.
      
      Considering performance and memory impact, separate lock looks like the
      best option.  It increases memory footprint of each VMA but that can be
      optimized later if the new size causes issues.  Note that after this
      change vma_init() does not allocate or initialize vma->lock anymore.  A
      number of drivers allocate a pseudo VMA on the stack but they never use
      the VMA's lock, therefore it does not need to be allocated.  The future
      drivers which might need the VMA lock should use
      vm_area_alloc()/vm_area_free() to allocate the VMA.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-34-surenb@google.com
      
      
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c7f8f31c
    • mm/mmap: free vm_area_struct without call_rcu in exit_mmap · 0d2ebf9c
      Suren Baghdasaryan authored
      call_rcu() can take a long time when callback offloading is enabled.  Its
      use in vm_area_free() can cause regressions in the exit path when
      multiple VMAs are being freed.
      
      Because exit_mmap() is called only after the last mm user drops its
      refcount, the page fault handlers can't be racing with it.  Any other
      possible users like the oom-reaper or process_mrelease are already synchronized
      using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
      the use of call_rcu().
      
      Expose __vm_area_free() and use it from exit_mmap() to avoid possible
      call_rcu() floods and performance regressions caused by it.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
      
      
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0d2ebf9c
    • kernel/fork: assert no VMA readers during its destruction · f2e13784
      Suren Baghdasaryan authored
      Assert there are no holders of VMA lock for reading when it is about to be
      destroyed.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-21-surenb@google.com
      
      
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f2e13784
    • mm: add per-VMA lock and helper functions to control it · 5e31275c
      Suren Baghdasaryan authored
      Introduce per-VMA locking.  The lock implementation relies on per-vma
      and per-mm sequence counters to note exclusive locking (a toy model of
      the scheme follows the list below):
      
        - read lock - (implemented by vma_start_read) requires the vma
          (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
          If they match then there must be a vma exclusive lock held somewhere.
        - read unlock - (implemented by vma_end_read) is a trivial vma->lock
          unlock.
        - write lock - (vma_start_write) requires the mmap_lock to be held
          exclusively and the current mm counter is assigned to the vma counter.
          This will allow multiple vmas to be locked under a single mmap_lock
          write lock (e.g. during vma merging). The vma counter is modified
          under exclusive vma lock.
        - write unlock - (vma_end_write_all) is a batch release of all vma
          locks held. It doesn't pair with a specific vma_start_write! It is
          done before exclusive mmap_lock is released by incrementing mm
          sequence counter (mm_lock_seq).
        - write downgrade - if the mmap_lock is downgraded to the read lock, all
          vma write locks are released as well (effectively the same as write
          unlock).
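
      The toy userspace model below mirrors the scheme just described
      (hypothetical names; the memory-ordering and lockdep details of the real
      implementation are omitted):

        #include <pthread.h>
        #include <stdbool.h>

        struct toy_mm {
            pthread_rwlock_t mmap_lock;
            int mm_lock_seq;            /* bumped by "write unlock all" */
        };

        struct toy_vma {
            struct toy_mm *mm;
            pthread_rwlock_t lock;
            int vm_lock_seq;            /* == mm_lock_seq while write-locked */
        };

        /* Read lock: fail (caller falls back to mmap_lock) if write-locked. */
        static bool vma_start_read(struct toy_vma *vma)
        {
            if (vma->vm_lock_seq == vma->mm->mm_lock_seq)
                return false;                   /* exclusively locked */
            if (pthread_rwlock_tryrdlock(&vma->lock) != 0)
                return false;
            if (vma->vm_lock_seq == vma->mm->mm_lock_seq) {
                pthread_rwlock_unlock(&vma->lock);  /* raced with a writer */
                return false;
            }
            return true;
        }

        static void vma_end_read(struct toy_vma *vma)
        {
            pthread_rwlock_unlock(&vma->lock);
        }

        /* Write lock: caller must hold vma->mm->mmap_lock exclusively. */
        static void vma_start_write(struct toy_vma *vma)
        {
            if (vma->vm_lock_seq == vma->mm->mm_lock_seq)
                return;                         /* already write-locked */
            pthread_rwlock_wrlock(&vma->lock);
            vma->vm_lock_seq = vma->mm->mm_lock_seq;
            pthread_rwlock_unlock(&vma->lock);
        }

        /* Batch write unlock, done before dropping the exclusive mmap_lock. */
        static void vma_end_write_all(struct toy_mm *mm)
        {
            mm->mm_lock_seq++;      /* invalidates every vm_lock_seq match */
        }

      There is deliberately no per-vma write unlock here: releasing all write
      locks at once by bumping mm_lock_seq matches the batch-release design
      described above.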
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
      
      
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e31275c
    • mm: rcu safe VMA freeing · 20cce633
      Michel Lespinasse authored
      This prepares for page faults handling under VMA lock, looking up VMAs
      under protection of an rcu read lock, instead of the usual mmap read lock.
      
      Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
      
      
      Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      20cce633
    • mm: enable maple tree RCU mode by default · 3dd44325
      Liam R. Howlett authored
      Use the maple tree in RCU mode for VMA tracking.
      
      The maple tree tracks the stack and is able to update the pivot
      (lower/upper boundary) in-place to allow the page fault handler to write
      to the tree while holding just the mmap read lock.  This is safe as the
      writes to the stack have a guard VMA which ensures there will always be a
      NULL in the direction of the growth and thus will only update a pivot.
      
      It is possible, but not recommended, to have VMAs that grow up/down
      without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
      to grow and consume the empty space.  Overwriting the entire NULL entry
      causes the tree to be altered in a way that is not safe for concurrent
      readers; the readers may see a node being rewritten or one that does not
      match the maple state they are using.
      
      Enabling RCU mode allows the concurrent readers to see a stable node and
      will return the expected result.
      
      [Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
      Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
      Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
      
      
      Fixes: d4af56c5 ("mm: start tracking VMAs with maple tree")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: <syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3dd44325
  24. Apr 03, 2023
    • fork: use pidfd_prepare() · ca7707f5
      Christian Brauner authored
      
      Stop open-coding get_unused_fd_flags() and anon_inode_getfile(). That's
      brittle just for keeping the flags between both calls in sync. Use the
      dedicated helper.
      
      Message-Id: <20230327-pidfd-file-api-v1-2-5c0e9a3158e4@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      ca7707f5
    • pid: add pidfd_prepare() · 6ae930d9
      Christian Brauner authored
      
      Add a new helper that allows reserving a pidfd and allocating a new
      pidfd file that stashes the provided struct pid. This will allow us to
      remove places that either open-code this functionality or that call
      pidfd_create() but then have to call close_fd() because there are still
      failure points after pidfd_create() has been called.
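
      For context, the object these helpers hand out is the same pidfd that
      user space obtains from clone()/clone3() with CLONE_PIDFD or from
      pidfd_open(); a minimal consumer looks roughly like this (a demo of the
      public API only, unrelated to the internal refactor):

        #define _GNU_SOURCE
        #include <poll.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #ifndef SYS_pidfd_open
        #define SYS_pidfd_open 434      /* same number on all architectures */
        #endif

        int main(void)
        {
            pid_t pid = fork();

            if (pid < 0) {
                perror("fork");
                return 1;
            }
            if (pid == 0) {             /* child: exit after a short nap */
                sleep(1);
                _exit(0);
            }

            /* Older glibc has no pidfd_open() wrapper, so use syscall(). */
            int pidfd = syscall(SYS_pidfd_open, pid, 0);
            if (pidfd < 0) {
                perror("pidfd_open");
                return 1;
            }

            /* A pidfd becomes readable once the process has terminated. */
            struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
            if (poll(&pfd, 1, -1) < 0) {
                perror("poll");
                return 1;
            }
            printf("child %d exited (pidfd readable)\n", (int)pid);

            waitpid(pid, NULL, 0);      /* reap the child */
            close(pidfd);
            return 0;
        }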
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      6ae930d9
  25. Mar 31, 2023
  26. Mar 29, 2023
  27. Mar 28, 2023
    • lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme · 2655421a
      Nicholas Piggin authored
      On big systems, the mm refcount can become highly contended when doing a
      lot of context switching with threaded applications.  user<->idle switch
      is one of the important cases.  Abandoning lazy tlb entirely slows this
      switching down quite a bit in the common uncontended case, so that is not
      viable.
      
      Implement a scheme where lazy tlb mm references do not contribute to the
      refcount, instead they get explicitly removed when the refcount reaches
      zero.
      
      The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
      switch away from this mm to init_mm if it was being used as the lazy tlb
      mm.  Enabling the shoot lazies option therefore requires that the arch
      ensures that mm_cpumask contains all CPUs that could possibly be using mm.
      A DEBUG_VM option IPIs every CPU in the system after this to ensure there
      are no references remaining before the mm is freed.
      
      The cost of shootdown IPIs could be an issue, but they have not been observed to
      be a serious problem with this scheme, because short-lived processes tend
      not to migrate CPUs much, therefore they don't get much chance to leave
      lazy tlb mm references on remote CPUs.  There are a lot of options to
      reduce them if necessary, described in comments.
      
      The near-worst-case can be benchmarked with will-it-scale:
      
        context_switch1_threads -t $(($(nproc) / 2))
      
      This will create nproc threads (nproc / 2 switching pairs) all sharing the
      same mm that spread over all CPUs so each CPU does thread->idle->thread
      switching.
      
      [ Rik came up with basically the same idea a few years ago, so credit
        to him for that. ]
      
      Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
      Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
      Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
      
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2655421a
  28. Mar 19, 2023
  29. Mar 12, 2023