  1. Jan 08, 2024
    • KVM: fix direction of dependency on MMU notifiers · 3a373e02
      Paolo Bonzini authored
      
      KVM_GENERIC_MEMORY_ATTRIBUTES requires the generic MMU notifier code, because
      it uses kvm_mmu_invalidate_begin/end.  However, it would not work with a bespoke
      implementation of MMU notifiers that does not use KVM_GENERIC_MMU_NOTIFIER,
      because most likely it would not synchronize correctly on invalidation.  So
      the right thing to do is to flag the problematic configuration if the
      architecture does not itself select KVM_GENERIC_MMU_NOTIFIER, rather
      than enabling it blindly.
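
      The resulting dependency direction in virt/kvm/Kconfig looks roughly
      like the following sketch (illustrative, not necessarily the verbatim
      hunk):

        config KVM_GENERIC_MEMORY_ATTRIBUTES
                depends on KVM_GENERIC_MMU_NOTIFIER
                bool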
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: introduce CONFIG_KVM_COMMON · caadf876
      Paolo Bonzini authored
      
      CONFIG_HAVE_KVM is currently used by some architectures to either
      enable the KVM config proper, or to enable host-side code that is
      not part of the KVM module.  However, CONFIG_KVM's "select" statement
      in virt/kvm/Kconfig corresponds to a third meaning, namely to
      enable common Kconfig symbols required by all architectures that support
      KVM.
      
      These three meanings can be replaced respectively by an
      architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
      a new Kconfig symbol that is in turn selected by the
      architecture-specific "config KVM".
      
      Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
      Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
      architecture code, and it brings in all dependencies of common
      KVM code.  In particular, INTERVAL_TREE was missing in loongarch
      and riscv, so that is another thing that is fixed.
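
      As a sketch of the resulting structure (the exact list of selects is
      whatever the commit carries; treat this as illustrative):

        # virt/kvm/Kconfig
        config KVM_COMMON
                bool
                select EVENTFD
                select INTERVAL_TREE

        # arch/<arch>/kvm/Kconfig
        config KVM
                tristate "Kernel-based Virtual Machine (KVM) support"
                select KVM_COMMON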
      
      Fixes: 8132d887 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08)
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/
      
      
      Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Dec 01, 2023
    • Revert "KVM: Prevent module exit until all VMs are freed" · ea61294b
      Sean Christopherson authored
      Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
      was actually due to failure to flush a workqueue, not a lack of module
      refcounting.  Pinning the KVM module until kvm_vm_destroy() completes
      doesn't prevent use-after-free due to the module being unloaded, as
      userspace can invoke delete_module() the instant the last reference
      to KVM is put, i.e.
      can cause all KVM code to be unmapped while KVM is actively executing said
      code.
      
      Generally speaking, the many instances of module_put(THIS_MODULE)
      notwithstanding, outside of a few special paths, a module can never safely
      put the last reference to itself without creating deadlock, i.e. something
      external to the module *must* put the last reference.  In other words,
      having VMs grab a reference to the KVM module is futile, pointless, and as
      evidenced by the now-reverted commit 70375c2d ("Revert "KVM: set owner
      of cpu and vm file operations""), actively dangerous.
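
      A simplified sketch of the reverted, unsafe pattern (illustrative,
      not the exact code):

        /* kvm_destroy_vm() dropping the module reference itself. */
        static void kvm_destroy_vm(struct kvm *kvm)
        {
                /* ... tear down the VM ... */
                module_put(kvm_chardev_ops.owner);
                /*
                 * If that was the last reference, delete_module() can now
                 * unmap kvm.ko's text, yet this function still has to
                 * return through that text: use-after-free.
                 */
        }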
      
      This reverts commit 405294f2 and commit
      5f6de5cb.
      
      Fixes: 405294f2 ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
      Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed")
      Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.com
      
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: Set file_operations.owner appropriately for all such structures · 087e1520
      Sean Christopherson authored
      
      Set .owner for all KVM-owned file types so that the KVM module is pinned
      until any files with callbacks back into KVM are completely freed.  Using
      "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
      while there are active VMs, doesn't provide full protection.
      
      Userspace can invoke delete_module() the instant the last reference to KVM
      is put.  If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
      then it's possible for KVM to be preempted and deleted/unloaded before KVM
      fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
      in, it will jump to a code page that is no longer mapped.
      
      Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
      kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
      THIS_MODULE (which points at kvm.ko).  KVM assumes that if /dev/kvm is
      reachable, e.g. VMs are active, then the vendor module is loaded.
      
      To reduce the probability of forgetting to set .owner entirely, use
      THIS_MODULE for stats files where KVM does not call back into vendor code.
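
      A sketch of the resulting pattern (illustrative, not the full hunk):

        /* fops whose callbacks can reach kvm-intel.ko/kvm-amd.ko pin the
         * vendor module handed to kvm_init(). */
        int kvm_init(unsigned vcpu_size, unsigned vcpu_align,
                     struct module *module)
        {
                /* ... */
                kvm_chardev_ops.owner = module;
                kvm_vm_fops.owner = module;
                kvm_vcpu_fops.owner = module;
                /* ... */
        }

        /* Stats files never call into vendor code, so pinning kvm.ko
         * itself via THIS_MODULE suffices. */
        static const struct file_operations kvm_vcpu_stats_fops = {
                .owner   = THIS_MODULE,
                .read    = kvm_vcpu_stats_read,
                .release = kvm_vcpu_stats_release,
                .llseek  = noop_llseek,
        };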
      
      This reverts commit 70375c2d, and fixes
      several other file types that have been buggy since their introduction.
      
      Fixes: 70375c2d ("Revert "KVM: set owner of cpu and vm file operations"")
      Fixes: 3bcd0662 ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
      Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
      
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: Harden copying of userspace-array against overflow · 1f829359
      Philipp Stanner authored
      
      kvm_main.c utilizes vmemdup_user() and array_size() to copy a userspace
      array. Currently, this does not check for an overflow.
      
      Use the new wrapper vmemdup_array_user() to copy the array more safely.
      
      Note, KVM explicitly checks the number of entries before duplicating the
      array, i.e. adding the overflow check should be a glorified nop.
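
      The change amounts to a substitution along these lines (variable
      names are illustrative):

        /* Before: element count and size multiplied via array_size(). */
        entries = vmemdup_user(argp, array_size(sizeof(*entries), nent));

        /* After: the wrapper bundles the multiplication with an explicit
         * overflow check. */
        entries = vmemdup_array_user(argp, nent, sizeof(*entries));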
      
      Suggested-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Philipp Stanner <pstanner@redhat.com>
      Link: https://lore.kernel.org/r/20231102181526.43279-4-pstanner@redhat.com
      
      
      [sean: call out that KVM pre-checks the number of entries]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  3. Aug 21, 2023
    • kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() · b1e1296d
      David Hildenbrand authored
      KVM is *the* case we know of that really wants to honor NUMA hinting faults.
      As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set
      FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU
      to map them into a secondary MMU, and add a comment why.
      
      Do that unconditionally in hva_to_pfn_slow() when calling
      get_user_pages_unlocked().
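
      Concretely, the change boils down to something like this sketch:

        static int hva_to_pfn_slow(unsigned long addr, bool *async,
                                   bool write_fault, bool interruptible,
                                   bool *writable, kvm_pfn_t *pfn)
        {
                /*
                 * Always honor NUMA hinting faults here: the page is
                 * mapped into a secondary MMU on behalf of a VCPU, so
                 * accesses through that mapping would otherwise never
                 * trigger a hinting fault.
                 */
                unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;

                /* ... remainder unchanged ... */
        }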
      
      kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and
      gfn_to_page_many_atomic() are similarly used to map pages into a
      secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always
      implicitly honor NUMA hinting faults -- as documented for
      FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location
      for now.
      
      Don't set it in check_user_page_hwpoison(), where we really only want to
      check if the mapped page is HW-poisoned.
      
      We won't set it for other KVM users of get_user_pages()/pin_user_pages():
      * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a
        secondary MMU.
      * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace
      * arch/s390/kvm/*: s390x only supports a single NUMA node either way
      * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU.
      
      This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer
      implicitly be set by get_user_pages() and friends.
      
      Link: https://lkml.kernel.org/r/20230803143208.383663-4-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: liubo <liubo254@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Jun 22, 2023
    • KVM: Avoid illegal stage2 mapping on invalid memory slot · 2230f9e1
      Gavin Shan authored
      
      We run into a guest hang in the edk2 firmware when KSM is running on
      the host. The edk2 firmware waits for status 0x80 from QEMU's pflash
      device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
      operation. The status is returned by reading the memory region of
      the pflash device, and the read request should be forwarded to QEMU
      and emulated by it. Unfortunately, when the guest hang occurs, the
      read request is covered by an illegal stage2 mapping, so it completes
      with QEMU bypassed and a wrong status is fetched. The edk2 firmware
      then spins in an infinite loop on the wrong status.
      
      The illegal stage2 mapping is populated by KSM's same-page sharing
      at (C) even though the associated memory slot has already been marked
      invalid at (B), when its deletion was requested. Note that the active
      and inactive memory slots can't be swapped while we're in the middle
      of kvm_mmu_notifier_change_pte(), because kvm->mn_active_invalidate_count
      is elevated and kvm_swap_active_memslots() busy-loops until it reaches
      zero again. The swap from the active to the inactive memory slots is
      also prevented by holding &kvm->srcu in __kvm_handle_hva_range(),
      corresponding to the synchronize_srcu_expedited() in
      kvm_swap_active_memslots().
      
        CPU-A                    CPU-B
        -----                    -----
                                 ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                                 kvm_vm_ioctl_set_memory_region
                                 kvm_set_memory_region
                                 __kvm_set_memory_region
                                 kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                                   kvm_invalidate_memslot
                                     kvm_copy_memslot
                                     kvm_replace_memslot
                                     kvm_swap_active_memslots        (A)
                                     kvm_arch_flush_shadow_memslot   (B)
        same page sharing by KSM
        kvm_mmu_notifier_invalidate_range_start
              :
        kvm_mmu_notifier_change_pte
          kvm_handle_hva_range
          __kvm_handle_hva_range
          kvm_set_spte_gfn            (C)
              :
        kvm_mmu_notifier_invalidate_range_end
      
      Fix the issue by skipping the invalid memory slot at (C), avoiding
      the illegal stage2 mapping, so that the read request for the pflash
      status is forwarded to QEMU and emulated by it. This way, QEMU can
      return the correct pflash status and break the infinite loop in the
      edk2 firmware.
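
      The shape of the fix is roughly the following sketch (helper name as
      in the patch; treat the details as illustrative):

        static bool kvm_change_spte_gfn(struct kvm *kvm,
                                        struct kvm_gfn_range *range)
        {
                /*
                 * Skip the invalid memslot: its deletion is in flight,
                 * so installing a stage2 mapping here would resurrect a
                 * mapping that the arch code already flushed at (B).
                 */
                if (range->slot->flags & KVM_MEMSLOT_INVALID)
                        return false;

                return kvm_set_spte_gfn(kvm, range);
        }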
      
      A git-bisect points to cd4c7183 ("KVM: arm64: Convert to the
      gfn-based MMU notifier callbacks") as the first problematic commit.
      With it, clean_dcache_guest_page() is called after the memory slots
      are iterated in kvm_mmu_notifier_change_pte(); before that commit it
      was called prior to the memslot iteration. The change effectively
      widens the race window between kvm_mmu_notifier_change_pte() and
      memory slot removal, which is why the issue became reproducible in a
      practical test case. The underlying bug, however, has existed since
      commit d5d8184d ("KVM: ARM: Memory virtualization setup").
      
      Cc: stable@vger.kernel.org # v3.9+
      Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup")
      Reported-by: Shuai Hu <hshuai@redhat.com>
      Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
      Message-Id: <20230615054259.14911-1-gshan@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Jun 19, 2023
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
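
      The generic helper is essentially a READ_ONCE() wrapper that an
      architecture may override; its default in include/linux/pgtable.h
      is essentially:

        #ifndef ptep_get
        static inline pte_t ptep_get(pte_t *ptep)
        {
                return READ_ONCE(*ptep);
        }
        #endif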
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
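
      For example, a call site that previously dereferenced the same ptep
      for every predicate now reads it once (illustrative, not a specific
      hunk from the patch):

        /* One READ_ONCE() via ptep_get(), reused for each check. */
        pte_t pte = ptep_get(ptep);

        if (pte_present(pte) && pte_dirty(pte))
                set_page_dirty(page);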
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
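
      A sketch of that fix, assuming the stub keeps this shape:

        static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
                                                  unsigned long addr,
                                                  pte_t *ptep)
        {
        #ifdef CONFIG_MMU
                return ptep_get(ptep);
        #else
                /*
                 * MMU=n elides the ptep helpers; a direct dereference is
                 * safe because no arch virtualizes ptes in this config.
                 */
                return *ptep;
        #endif
        }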
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      
      
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      
      
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>