1. 16 Nov, 2017 1 commit
  2. 08 Nov, 2017 1 commit
    • Ricardo Neri's avatar
      x86/umip: Enable User-Mode Instruction Prevention at runtime · aa35f896
      Ricardo Neri authored
      User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a
      bit in %cr4.
      
      It makes sense to enable UMIP at some point while booting, before user
      spaces come up. Like SMAP and SMEP, is not critical to have it enabled
      very early during boot. This is because UMIP is relevant only when there is
      a user space to be protected from. Given these similarities, UMIP can be
      enabled along with SMAP and SMEP.
      
      At the moment, UMIP is disabled by default at build time. It can be enabled
      at build time by selecting CONFIG_X86_INTEL_UMIP. If enabled at build time,
      it can be disabled at run time by adding clearcpuid=514 to the kernel
      parameters.
      Signed-off-by: default avatarRicardo Neri <ricardo.neri-calderon@linux.intel.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: ricardo.neri@intel.com
      Link: http://lkml.kernel.org/r/1509935277-22138-10-git-send-email-ricardo.neri-calderon@linux.intel.comSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      aa35f896
  3. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  4. 20 Oct, 2017 1 commit
    • Andrey Ryabinin's avatar
      x86/kasan: Use the same shadow offset for 4- and 5-level paging · 12a8cc7f
      Andrey Ryabinin authored
      We are going to support boot-time switching between 4- and 5-level
      paging. For KASAN it means we cannot have different KASAN_SHADOW_OFFSET
      for different paging modes: the constant is passed to gcc to generate
      code and cannot be changed at runtime.
      
      This patch changes KASAN code to use 0xdffffc0000000000 as shadow offset
      for both 4- and 5-level paging.
      
      For 5-level paging it means that shadow memory region is not aligned to
      PGD boundary anymore and we have to handle unaligned parts of the region
      properly.
      
      In addition, we have to exclude paravirt code from KASAN instrumentation
      as we now use set_pgd() before KASAN is fully ready.
      
      [kirill.shutemov@linux.intel.com: clenaup, changelog message]
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20170929140821.37654-4-kirill.shutemov@linux.intel.comSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      12a8cc7f
  5. 18 Oct, 2017 1 commit
    • Thomas Gleixner's avatar
      x86/vector/msi: Select CONFIG_GENERIC_IRQ_RESERVATION_MODE · c201c917
      Thomas Gleixner authored
      Select CONFIG_GENERIC_IRQ_RESERVATION_MODE so PCI/MSI domains get the
      MSI_FLAG_MUST_REACTIVATE flag set in pci_msi_create_irq_domain().
      
      Remove the explicit setters of this flag in the apic/msi code as they are
      not longer required.
      
      Fixes: 4900be83 ("x86/vector/msi: Switch to global reservation mode")
      Reported-and-tested-by: default avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Link: https://lkml.kernel.org/r/20171017075600.527569354@linutronix.de
      c201c917
  6. 14 Oct, 2017 1 commit
  7. 28 Sep, 2017 1 commit
  8. 25 Sep, 2017 1 commit
  9. 09 Sep, 2017 2 commits
  10. 07 Sep, 2017 1 commit
    • Rik van Riel's avatar
      x86,mpx: make mpx depend on x86-64 to free up VMA flag · df3735c5
      Rik van Riel authored
      Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      This patch (of 2):
      
      MPX only seems to be available on 64 bit CPUs, starting with Skylake and
      Goldmont.  Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
      order to free up a VMA flag.
      
      Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.comSigned-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Colm MacCártaigh <colm@allcosts.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df3735c5
  11. 31 Aug, 2017 2 commits
    • Robin Murphy's avatar
      libnvdimm, nd_blk: remove mmio_flush_range() · 5deb67f7
      Robin Murphy authored
      mmio_flush_range() suffers from a lack of clearly-defined semantics,
      and is somewhat ambiguous to port to other architectures where the
      scope of the writeback implied by "flush" and ordering might matter,
      but MMIO would tend to imply non-cacheable anyway. Per the rationale
      in 67a3e8fe ("nd_blk: change aperture mapping from WC to WB"), the
      only existing use is actually to invalidate clean cache lines for
      ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
      cleanup of the pmem API, that also now happens to be the exact purpose
      of arch_invalidate_pmem(), which would be a far more well-defined tool
      for the job.
      
      Rather than risk potentially inconsistent implementations of
      mmio_flush_range() for the sake of one callsite, streamline things by
      removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
      definitions up to the libnvdimm level, so they can be shared by NFIT
      as well. This allows NFIT to be enabled for arm64.
      Signed-off-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      5deb67f7
    • Vitaly Kuznetsov's avatar
      x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y) · 9e52fc2b
      Vitaly Kuznetsov authored
      There's a subtle bug in how some of the paravirt guest code handles
      page table freeing on x86:
      
      On x86 software page table walkers depend on the fact that remote TLB flush
      does an IPI: walk is performed lockless but with interrupts disabled and in
      case the page table is freed the freeing CPU will get blocked as remote TLB
      flush is required. On other architectures which don't require an IPI to do
      remote TLB flush we have an RCU-based mechanism (see
      include/asm-generic/tlb.h for more details).
      
      In virtualized environments we may want to override the ->flush_tlb_others
      callback in pv_mmu_ops and use a hypercall asking the hypervisor to do a
      remote TLB flush for us. This breaks the assumption about IPIs. Xen PV has
      been doing this for years and the upcoming remote TLB flush for Hyper-V will
      do it too.
      
      This is not safe, as software page table walkers may step on an already
      freed page.
      
      Fix the bug by enabling the RCU-based page table freeing mechanism,
      CONFIG_HAVE_RCU_TABLE_FREE=y.
      
      Testing with kernbench and mmap/munmap microbenchmarks, and neither showed
      any noticeable performance impact.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarJuergen Gross <jgross@suse.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20170828082251.5562-1-vkuznets@redhat.com
      [ Rewrote/fixed/clarified the changelog. ]
      Signed-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      9e52fc2b
  12. 29 Aug, 2017 1 commit
    • Ingo Molnar's avatar
      locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being · 7b3d61cc
      Ingo Molnar authored
      Mike Galbraith bisected a boot crash back to the following commit:
      
        7a46ec0e ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")
      
      The crash/hang pattern is:
      
       > Symptom is a few splats as below, with box finally hanging.  Network
       > comes up, but neither ssh nor console login is possible.
       >
       >  ------------[ cut here ]------------
       >  WARNING: CPU: 4 PID: 0 at net/netlink/af_netlink.c:374 netlink_sock_destruct+0x82/0xa0
       >  ...
       >  __sk_destruct()
       >  rcu_process_callbacks()
       >  __do_softirq()
       >  irq_exit()
       >  smp_apic_timer_interrupt()
       >  apic_timer_interrupt()
      
      We are at -rc7 already, and the code has grown some dependencies, so
      instead of a plain revert disable the config temporarily, in the hope
      of getting real fixes.
      Reported-by: default avatarMike Galbraith <efault@gmx.de>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/tip-7a46ec0e2f4850407de5e1d19a44edee6efa58ec@git.kernel.orgSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      7b3d61cc
  13. 24 Aug, 2017 1 commit
  14. 18 Aug, 2017 2 commits
    • Nicholas Piggin's avatar
      kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog · 92e5aae4
      Nicholas Piggin authored
      Commit 05a4a952 ("kernel/watchdog: split up config options") lost
      the perf-based hardlockup detector's dependency on PERF_EVENTS, which
      can result in broken builds with some powerpc configurations.
      
      Restore the dependency.  Add it in for x86 too, despite x86 always
      selecting PERF_EVENTS it seems reasonable to make the dependency
      explicit.
      
      Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
      Fixes: 05a4a952 ("kernel/watchdog: split up config options")
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      92e5aae4
    • Thomas Gleixner's avatar
      kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Thomas Gleixner authored
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumess a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: default avatarKan Liang <Kan.liang@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
  15. 17 Aug, 2017 1 commit
    • Kees Cook's avatar
      locking/refcounts, x86/asm: Implement fast refcount overflow protection · 7a46ec0e
      Kees Cook authored
      This implements refcount_t overflow protection on x86 without a noticeable
      performance impact, though without the fuller checking of REFCOUNT_FULL.
      
      This is done by duplicating the existing atomic_t refcount implementation
      but with normally a single instruction added to detect if the refcount
      has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
      the handler saturates the refcount_t to INT_MIN / 2. With this overflow
      protection, the erroneous reference release that would follow a wrap back
      to zero is blocked from happening, avoiding the class of refcount-overflow
      use-after-free vulnerabilities entirely.
      
      Only the overflow case of refcounting can be perfectly protected, since
      it can be detected and stopped before the reference is freed and left to
      be abused by an attacker. There isn't a way to block early decrements,
      and while REFCOUNT_FULL stops increment-from-zero cases (which would
      be the state _after_ an early decrement and stops potential double-free
      conditions), this fast implementation does not, since it would require
      the more expensive cmpxchg loops. Since the overflow case is much more
      common (e.g. missing a "put" during an error path), this protection
      provides real-world protection. For example, the two public refcount
      overflow use-after-free exploits published in 2016 would have been
      rendered unexploitable:
      
        http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
      
        http://cyseclabs.com/page?n=02012016
      
      This implementation does, however, notice an unchecked decrement to zero
      (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
      resulted in a zero). Decrements under zero are noticed (since they will
      have resulted in a negative value), though this only indicates that a
      use-after-free may have already happened. Such notifications are likely
      avoidable by an attacker that has already exploited a use-after-free
      vulnerability, but it's better to have them reported than allow such
      conditions to remain universally silent.
      
      On first overflow detection, the refcount value is reset to INT_MIN / 2
      (which serves as a saturation value) and a report and stack trace are
      produced. When operations detect only negative value results (such as
      changing an already saturated value), saturation still happens but no
      notification is performed (since the value was already saturated).
      
      On the matter of races, since the entire range beyond INT_MAX but before
      0 is negative, every operation at INT_MIN / 2 will trap, leaving no
      overflow-only race condition.
      
      As for performance, this implementation adds a single "js" instruction
      to the regular execution flow of a copy of the standard atomic_t refcount
      operations. (The non-"and_test" refcount_dec() function, which is uncommon
      in regular refcount design patterns, has an additional "jz" instruction
      to detect reaching exactly zero.) Since this is a forward jump, it is by
      default the non-predicted path, which will be reinforced by dynamic branch
      prediction. The result is this protection having virtually no measurable
      change in performance over standard atomic_t operations. The error path,
      located in .text.unlikely, saves the refcount location and then uses UD0
      to fire a refcount exception handler, which resets the refcount, handles
      reporting, and returns to regular execution. This keeps the changes to
      .text size minimal, avoiding return jumps and open-coded calls to the
      error reporting routine.
      
      Example assembly comparison:
      
      refcount_inc() before:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
      
      refcount_inc() after:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
        ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
        ...
        .text.unlikely:
        ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
        ffffffff816c36d7:       0f ff                   (bad)
      
      These are the cycle counts comparing a loop of refcount_inc() from 1
      to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
      unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
      (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
      
        2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
      		    cycles		protections
        atomic_t           82249267387	none
        refcount_t-fast    82211446892	overflow, untested dec-to-zero
        refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero
      
      This code is a modified version of the x86 PAX_REFCOUNT atomic_t
      overflow defense from the last public patch of PaX/grsecurity, based
      on my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code. Thanks
      to PaX Team for various suggestions for improvement for repurposing this
      code to be a refcount-only protection.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Hans Liljestrand <ishkamiel@gmail.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arozansk@redhat.com
      Cc: axboe@kernel.dk
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-arch <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170815161924.GA133115@beastSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      7a46ec0e
  16. 01 Aug, 2017 1 commit
  17. 26 Jul, 2017 2 commits
    • Josh Poimboeuf's avatar
      x86/kconfig: Consolidate unwinders into multiple choice selection · 81d38719
      Josh Poimboeuf authored
      There are three mutually exclusive unwinders.  Make that more obvious by
      combining them into a multiple-choice selection:
      
        CONFIG_FRAME_POINTER_UNWINDER
        CONFIG_ORC_UNWINDER
        CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)
      
      Frame pointers are still the default (for now).
      
      The old CONFIG_FRAME_POINTER option is still used in some
      arch-independent places, so keep it around, but make it
      invisible to the user on x86 - it's now selected by
      CONFIG_FRAME_POINTER_UNWINDER=y.
      Suggested-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@trebleSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      81d38719
    • Josh Poimboeuf's avatar
      x86/unwind: Add the ORC unwinder · ee9f8fce
      Josh Poimboeuf authored
      Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
      It plugs into the existing x86 unwinder framework.
      
      It relies on objtool to generate the needed .orc_unwind and
      .orc_unwind_ip sections.
      
      For more details on why ORC is used instead of DWARF, see
      Documentation/x86/orc-unwinder.txt - but the short version is
      that it's a simplified, fundamentally more robust debugninfo
      data structure, which also allows up to two orders of magnitude
      faster lookups than the DWARF unwinder - which matters to
      profiling workloads like perf.
      
      Thanks to Andy Lutomirski for the performance improvement ideas:
      splitting the ORC unwind table into two parallel arrays and creating a
      fast lookup table to search a subset of the unwind table.
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
      [ Extended the changelog. ]
      Signed-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      ee9f8fce
  18. 21 Jul, 2017 1 commit
  19. 18 Jul, 2017 2 commits
    • Tom Lendacky's avatar
      x86/mm: Extend early_memremap() support with additional attrs · f88a68fa
      Tom Lendacky authored
      Add early_memremap() support to be able to specify encrypted and
      decrypted mappings with and without write-protection. The use of
      write-protection is necessary when encrypting data "in place". The
      write-protect attribute is considered cacheable for loads, but not
      stores. This implies that the hardware will never give the core a
      dirty line with this memtype.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kvm@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.comSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      f88a68fa
    • Tom Lendacky's avatar
      x86/mm: Add Secure Memory Encryption (SME) support · 7744ccdb
      Tom Lendacky authored
      Add support for Secure Memory Encryption (SME). This initial support
      provides a Kconfig entry to build the SME support into the kernel and
      defines the memory encryption mask that will be used in subsequent
      patches to mark pages as encrypted.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kvm@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/a6c34d16caaed3bc3e2d6f0987554275bd291554.1500319216.git.thomas.lendacky@amd.comSigned-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      7744ccdb
  20. 12 Jul, 2017 2 commits
    • Daniel Micay's avatar
      include/linux/string.h: add the option of fortified string.h functions · 6974f0c4
      Daniel Micay authored
      This adds support for compiling with a rough equivalent to the glibc
      _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
      overflow checks for string.h functions when the compiler determines the
      size of the source or destination buffer at compile-time.  Unlike glibc,
      it covers buffer reads in addition to writes.
      
      GNU C __builtin_*_chk intrinsics are avoided because they would force a
      much more complex implementation.  They aren't designed to detect read
      overflows and offer no real benefit when using an implementation based
      on inline checks.  Inline checks don't add up to much code size and
      allow full use of the regular string intrinsics while avoiding the need
      for a bunch of _chk functions and per-arch assembly to avoid wrapper
      overhead.
      
      This detects various overflows at compile-time in various drivers and
      some non-x86 core kernel code.  There will likely be issues caught in
      regular use at runtime too.
      
      Future improvements left out of initial implementation for simplicity,
      as it's all quite optional and can be done incrementally:
      
      * Some of the fortified string functions (strncpy, strcat), don't yet
        place a limit on reads from the source based on __builtin_object_size of
        the source buffer.
      
      * Extending coverage to more string functions like strlcat.
      
      * It should be possible to optionally use __builtin_object_size(x, 1) for
        some functions (C strings) to detect intra-object overflows (like
        glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
        approach to avoid likely compatibility issues.
      
      * The compile-time checks should be made available via a separate config
        option which can be enabled by default (or always enabled) once enough
        time has passed to get the issues it catches fixed.
      
      Kees said:
       "This is great to have. While it was out-of-tree code, it would have
        blocked at least CVE-2016-3858 from being exploitable (improper size
        argument to strlcpy()). I've sent a number of fixes for
        out-of-bounds-reads that this detected upstream already"
      
      [arnd@arndb.de: x86: fix fortified memcpy]
        Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
      [keescook@chromium.org: avoid panic() in favor of BUG()]
        Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
      [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
      Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
      Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.orgSigned-off-by: default avatarDaniel Micay <danielmicay@gmail.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6974f0c4
    • Nicholas Piggin's avatar
      kernel/watchdog: split up config options · 05a4a952
      Nicholas Piggin authored
      Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
      HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
      
      LOCKUP_DETECTOR implies the general boot, sysctl, and programming
      interfaces for the lockup detectors.
      
      An architecture that wants to use a hard lockup detector must define
      HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
      minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
      does not implement the LOCKUP_DETECTOR interfaces.
      
      sparc is unusual in that it has started to implement some of the
      interfaces, but not fully yet.  It should probably be converted to a full
      HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      [npiggin@gmail.com: fix]
        Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
      Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.comSigned-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarDon Zickus <dzickus@redhat.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05a4a952
  21. 06 Jul, 2017 2 commits
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE · e1073d1e
      Aneesh Kumar K.V authored
      This moves the #ifdef in C code to a Kconfig dependency.  Also we move
      the gigantic_page_supported() function to be arch specific.
      
      This allows architectures to conditionally enable runtime allocation of
      gigantic huge page.  Architectures like ppc64 supports different
      gigantic huge page size (16G and 1G) based on the translation mode
      selected.  This provides an opportunity for ppc64 to enable runtime
      allocation only w.r.t 1G hugepage.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1073d1e
    • Huang Ying's avatar
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
  22. 02 Jul, 2017 1 commit
  23. 28 Jun, 2017 1 commit
  24. 22 Jun, 2017 2 commits
  25. 14 Jun, 2017 1 commit
  26. 13 Jun, 2017 1 commit
  27. 09 Jun, 2017 2 commits
    • Dan Williams's avatar
      x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations · 0aed55af
      Dan Williams authored
      The pmem driver has a need to transfer data with a persistent memory
      destination and be able to rely on the fact that the destination writes are not
      cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
      (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
      to ensure data-writes have reached a power-fail-safe zone in the platform. The
      fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
      around and fence previous writes with an "sfence".
      
      Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
      memcpy_flushcache, that guarantee that the destination buffer is not dirty in
      the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
      will be used to replace the "pmem api" (include/linux/pmem.h +
      arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
      and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
      config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
      otherwise.
      
      This is meant to satisfy the concern from Linus that if a driver wants to do
      something beyond the normal nocache semantics it should be something private to
      that driver [1], and Al's concern that anything uaccess related belongs with
      the rest of the uaccess code [2].
      
      The first consumer of this interface is a new 'copy_from_iter' dax operation so
      that pmem can inject cache maintenance operations without imposing this
      overhead on other dax-capable drivers.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      0aed55af
    • Bilal Amarni's avatar
      security/keys: add CONFIG_KEYS_COMPAT to Kconfig · 47b2c3ff
      Bilal Amarni authored
      CONFIG_KEYS_COMPAT is defined in arch-specific Kconfigs and is missing for
      several 64-bit architectures : mips, parisc, tile.
      
      At the moment and for those architectures, calling in 32-bit userspace the
      keyctl syscall would return an ENOSYS error.
      
      This patch moves the CONFIG_KEYS_COMPAT option to security/keys/Kconfig, to
      make sure the compatibility wrapper is registered by default for any 64-bit
      architecture as long as it is configured with CONFIG_COMPAT.
      
      [DH: Modified to remove arm64 compat enablement also as requested by Eric
       Biggers]
      Signed-off-by: default avatarBilal Amarni <bilal.amarni@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      cc: Eric Biggers <ebiggers3@gmail.com>
      Signed-off-by: default avatarJames Morris <james.l.morris@oracle.com>
      47b2c3ff
  28. 05 Jun, 2017 1 commit
    • Andy Lutomirski's avatar
      x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code · ce4a4e56
      Andy Lutomirski authored
      The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
      Aside from that, it's fallen quite a bit behind the SMP code:
      
       - flush_tlb_mm_range() didn't flush individual pages if the range
         was small.
      
       - The lazy TLB code was much weaker.  This usually wouldn't matter,
         but, if a kernel thread flushed its lazy "active_mm" more than
         once (due to reclaim or similar), it wouldn't be unlazied and
         would instead pointlessly flush repeatedly.
      
       - Tracepoints were missing.
      
      Aside from that, simply having the UP code around was a maintanence
      burden, since it means that any change to the TLB flush code had to
      make sure not to break it.
      
      Simplify everything by deleting the UP code.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar's avatarIngo Molnar <mingo@kernel.org>
      ce4a4e56
  29. 24 May, 2017 1 commit
  30. 26 Apr, 2017 2 commits