  4. Jan 11, 2024
• riscv: Add support for BATCHED_UNMAP_TLB_FLUSH · 54d7431a
  Alexandre Ghiti authored
      
Allow deferring the flushing of the TLB when unmapping pages, which
reduces both the number of IPIs and the number of sfence.vma instructions.
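
For context, this is driven by the generic batched-unmap machinery: the
reclaim path queues an invalidation per unmapped page and issues a single
flush for the whole batch. A minimal sketch of the arch hooks involved
(the signatures follow the generic API; the bodies are illustrative
stand-ins, not the actual riscv implementation):

        /*
         * Illustrative sketch of the hooks behind ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
         * The bodies are simplified stand-ins, not the riscv code.
         */
        static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
        {
                return true;    /* defer whenever batching is possible */
        }

        /* Stage 1: called per unmapped page; just record it, no IPI or fence yet. */
        static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                     struct mm_struct *mm,
                                                     unsigned long uaddr)
        {
                cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
        }

        /* Stage 2: called once, when reclaim must be sure stale entries are gone. */
        static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
        {
                flush_tlb_all_on(&batch->cpumask);      /* hypothetical helper: one remote flush for the batch */
                cpumask_clear(&batch->cpumask);
        }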
      
The microbenchmark used in commit 43b3dfdd ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration"), made
multithreaded to force the use of IPIs, shows a good performance
improvement on all platforms:
      
      * Unmatched: ~34%
      * TH1520   : ~78%
      * Qemu     : ~81%
      
In addition, perf on QEMU reports a significant decrease in the time
spent handling IPIs:
      
      Before:  68.17%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call
      After :   8.64%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call
      
      * Benchmark:
      
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE (1 * 1024 * 1024)  /* same SIZE as the benchmark in commit 43b3dfdd */

/* Pin the calling thread to the given core so unmaps really trigger
   cross-CPU invalidations. */
int stick_this_thread_to_core(int core_id) {
        int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (core_id < 0 || core_id >= num_cores)
                return EINVAL;

        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core_id, &cpuset);

        pthread_t current_thread = pthread_self();
        return pthread_setaffinity_np(current_thread,
                                      sizeof(cpu_set_t), &cpuset);
}
      
static void *fn_thread (void *p_data)
{
        stick_this_thread_to_core((int)(long)p_data);

        /* Keep the thread alive on its core so it stays in the mm's CPU
           mask and has to be notified of every TLB invalidation. */
        while (1) {
                sleep(1);
        }

        return NULL;
}
      
      int main()
      {
              volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              pthread_t threads[4];
              int ret;
      
        for (int i = 0; i < 4; ++i) {
                /* pass the core id through the pointer argument */
                ret = pthread_create(&threads[i], NULL, fn_thread, (void *)(long)i);
                      if (ret)
                      {
                              printf("%s", strerror (ret));
                      }
              }
      
              memset(p, 0x88, SIZE);
      
              for (int k = 0; k < 10000; k++) {
                      /* swap in */
                      for (int i = 0; i < SIZE; i += 4096) {
                              (void)p[i];
                      }
      
                      /* swap out */
                      madvise(p, SIZE, MADV_PAGEOUT);
              }
      
              for (int i = 0; i < 4; i++)
              {
                      pthread_cancel(threads[i]);
              }
      
              for (int i = 0; i < 4; i++)
              {
                      pthread_join(threads[i], NULL);
              }
      
              return 0;
      }
      
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Tested-by: Jisheng Zhang <jszhang@kernel.org> # Tested on TH1520
Tested-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20240108193640.344929-1-alexghiti@rivosinc.com

Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
  6. Sep 06, 2023
• LoongArch: Add KASAN (Kernel Address Sanitizer) support · 5aa4ac64
  Qing Zhang authored
      
1/8 of the kernel address space is reserved for shadow memory. But for
LoongArch, there are a lot of holes between the different segments, and
the valid address space (256T available) is insufficient to map all
these segments to KASAN shadow memory with the common formula provided
by the KASAN core, namely
(addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
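
In C, this common mapping is essentially the generic kasan_mem_to_shadow()
helper, shown here as a sketch of the formula above (not the LoongArch
variant):

        /* Generic shadow mapping: one shadow byte covers 8 bytes of memory. */
        static inline void *kasan_mem_to_shadow(const void *addr)
        {
                return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                        + KASAN_SHADOW_OFFSET;
        }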
      
So LoongArch uses an arch-specific mapping formula: the different
segments are mapped individually, and only a limited length of each of
these specific segments is mapped to shadow.
      
At the early boot stage, the whole shadow region is populated with just
one physical page (kasan_early_shadow_page). Later, this page is reused
as a read-only zero shadow for memory that KASAN does not currently
track. After the physical memory has been mapped, pages for the shadow
memory are allocated and mapped.
      
Functions like memset()/memcpy()/memmove() do a lot of memory accesses.
If a bad pointer is passed to one of these functions, it is important
that it be caught. The compiler's instrumentation cannot do this since
these functions are written in assembly.

KASAN replaces these memory functions with manually instrumented
variants. The original functions are declared as weak symbols so that
the strong definitions in mm/kasan/kasan.c can replace them. The
original functions also have aliases with a '__' prefix in their names,
so the non-instrumented variants can be called if needed.
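
A minimal sketch of that interception scheme (illustrative and
simplified, not the exact mm/kasan code):

        /* The assembly memset is declared weak and also exported as __memset,
         * so the instrumented C version below becomes the strong definition. */
        void *__memset(void *s, int c, size_t n);

        void *memset(void *s, int c, size_t n)
        {
                /* validate the whole destination range against the shadow */
                kasan_check_write(s, n);
                return __memset(s, c, n);
        }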
      
Signed-off-by: Qing Zhang <zhangqing@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
• LoongArch: Allow building with kcov coverage · 2363088e
  Feiyang Chen authored
      
Add ARCH_HAS_KCOV and HAVE_GCC_PLUGINS to the LoongArch Kconfig, and
disable instrumentation of the vDSO.
      
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
• LoongArch: Add basic KGDB & KDB support · e14dd076
  Qing Zhang authored
      
KGDB is intended to be used as a source-level debugger for the Linux
kernel. It is used along with gdb to debug a Linux kernel. GDB can be
used to "break in" to the kernel to inspect memory, variables and
registers, similar to the way an application developer would use GDB to
debug an application. KDB is a frontend of KGDB which is similar to GDB.

For now, in addition to the generic KGDB features, the LoongArch KGDB
implements the following:
- Hardware breakpoints/watchpoints;
- Software single-step support for KDB.
      
      Signed-off-by: Qing Zhang <zhangqing@loongson.cn>   # Framework & CoreFeature
      Signed-off-by: Binbin Zhou <zhoubinbin@loongson.cn> # BreakPoint & SingleStep
      Signed-off-by: Hui Li <lihui@loongson.cn>           # Some Minor Improvements
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org> # Some Build Error Fixes
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
  7. Aug 18, 2023
• arm64: support batched/deferred tlb shootdown during page reclamation/migration · 43b3dfdd
  Barry Song authored

On x86, batched and deferred tlb shootdown has led to a 90% performance
increase in tlb shootdown. On arm64, the HW can do tlb shootdown without
a software IPI, but a sync tlbi is still quite expensive.

Even running the simplest program which requires swapout can prove this
is true:
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/mman.h>
       #include <string.h>
      
       int main()
       {
       #define SIZE (1 * 1024 * 1024)
               volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
               memset(p, 0x88, SIZE);
      
               for (int k = 0; k < 10000; k++) {
                       /* swap in */
                       for (int i = 0; i < SIZE; i += 4096) {
                               (void)p[i];
                       }
      
                       /* swap out */
                       madvise(p, SIZE, MADV_PAGEOUT);
               }
       }
      
Perf result on a Snapdragon 888 with 8 cores, using zRAM as the swap
block device:
      
       ~ # perf record taskset -c 4 ./a.out
       [ perf record: Woken up 10 times to write data ]
       [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
       ~ # perf report
       # To display the perf.data header info, please use --header/--header-only options.
       #
       #
       # Total Lost Samples: 0
       #
       # Samples: 60K of event 'cycles'
       # Event count (approx.): 35706225414
       #
       # Overhead  Command  Shared Object      Symbol
       # ........  .......  .................  ......
       #
          21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
           8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
           6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
           6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
           5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
           3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
           3.49%  a.out    [kernel.kallsyms]  [k] memset64
           1.63%  a.out    [kernel.kallsyms]  [k] clear_page
           1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
           1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
           1.23%  a.out    [kernel.kallsyms]  [k] xas_load
           1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
      
ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the page is
mapped by multiple processes, typically more than 100 on a phone, the
overhead would be much higher, as we have to run the tlb flush 100 times
for one single page. Plus, the tlb flush overhead will increase with the
number of CPU cores due to the bad scalability of tlb shootdown in HW,
so arm64 servers should expect much higher overhead.
      
Further perf annotate shows that 95% of the cpu time of
ptep_clear_flush() is actually spent in the final dsb() waiting for the
completion of the tlb flush. This gives us a very good chance to
leverage the existing batched tlb machinery in the kernel. The minimal
modification is that we only send an async tlbi in the first stage and
we only issue the dsb when we have to sync in the second stage, as
sketched below.
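
In terms of the batch hooks, that split looks roughly like this (a
sketch only; the helpers called in the bodies are stand-ins, not the
actual arm64 functions):

        /* Stage 1: per unmapped page. Broadcast the invalidation but do not
         * wait for it; on arm64 this is a tlbi with no trailing barrier. */
        static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                     struct mm_struct *mm,
                                                     unsigned long uaddr)
        {
                issue_tlbi_nosync(mm, uaddr);   /* stand-in for the async tlbi */
        }

        /* Stage 2: once per batch, when reclaim must be sure the flush is done.
         * A single barrier covers every invalidation issued in stage 1. */
        static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
        {
                wait_for_tlbi_completion();     /* stand-in for the final dsb(ish) */
        }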
      
With the above simplest micro benchmark, the elapsed time to finish the
program decreases by around 5%.
      
Typical elapsed time w/o patch:
       ~ # time taskset -c 4 ./a.out
       0.21user 14.34system 0:14.69elapsed
      w/ patch:
       ~ # time taskset -c 4 ./a.out
       0.22user 13.45system 0:13.80elapsed
      
Also tested with the benchmark from this commit on a Kunpeng920 arm64
server, where an improvement of around 12.5% was observed with the
command `time ./swap_bench`.
              w/o             w/
      real    0m13.460s       0m11.771s
      user    0m0.248s        0m0.279s
      sys     0m12.039s       0m11.458s
      
Originally, a 16.99% overhead of ptep_clear_flush() was noticed, which
has been eliminated by this patch:
      
      [root@localhost yang]# perf record -- ./swap_bench && perf report
      [...]
      16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
      
It was tested on 4-, 8- and 128-CPU platforms and shown to be beneficial
on large systems, but it may not bring an improvement on small systems
such as a 4-CPU platform.
      
This patch also improves the performance of page migration. Using
pmbench and migrating its pages between node 0 and node 1 100 times for
1G of memory, this patch decreases the time used by around 20%
(18.338318910 sec before, 13.981866350 sec after) and saves the time
otherwise spent in ptep_clear_flush().
      
      Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
      
      
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Darren Hart <darren@os.amperecomputing.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: lipeifeng <lipeifeng@oppo.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  16. Jun 30, 2022
• context_tracking: Split user tracking Kconfig · 24a9c541
  Frederic Weisbecker authored
      
      Context tracking is going to be used not only to track user transitions
      but also idle/IRQs/NMIs. The user tracking part will then become a
      separate feature. Prepare Kconfig for that.
      
      [ frederic: Apply Max Filippov feedback. ]
      
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Cc: Yu Liao <liaoyu15@huawei.com>
      Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
  23. Nov 01, 2021
• parisc: Move thread_info into task struct · 2214c0e7
  Helge Deller authored
      
      This implements the CONFIG_THREAD_INFO_IN_TASK option.
      
      With this change:
- before, thread_info was part of the stack and located at the beginning of the stack
- now the thread_info struct is moved into the task_struct structure, as sketched after this list
- the stack is allocated and handled like on most other major platforms
- the cpu field of thread_info is dropped and the one in task_struct is used instead
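
For illustration, the generic shape of CONFIG_THREAD_INFO_IN_TASK looks
roughly like this (a simplified sketch, not the parisc-specific code):

        /* Simplified sketch of CONFIG_THREAD_INFO_IN_TASK. */
        struct task_struct {
                /*
                 * With THREAD_INFO_IN_TASK, thread_info must stay the first
                 * member so a task_struct pointer can double as a
                 * thread_info pointer.
                 */
                struct thread_info      thread_info;
                /* ... rest of task_struct ... */
        };

        /* current_thread_info() no longer digs into the kernel stack: */
        #define current_thread_info() ((struct thread_info *)current)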
      
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Sven Schnelle <svens@stackframe.org>