  1. Oct 22, 2022
  2. Oct 07, 2022
  3. Sep 29, 2022
  4. Sep 26, 2022
  5. Aug 19, 2022
    • KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device() · eceb6e1d
      Li kunyu authored
      
      The variable is initialized, but it is not read until after it has been
      assigned, so the initialization is unnecessary.
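
      As a minimal sketch of the pattern (illustrative only; the surrounding
      lookup is paraphrased from kvm_ioctl_create_device() and may not match
      the tree exactly):

        /* Before: the initializer is dead, its value is never read. */
        const struct kvm_device_ops *ops = NULL;

        /* After: 'ops' is assigned before its first use, so no initializer. */
        const struct kvm_device_ops *ops;

        if (type >= ARRAY_SIZE(kvm_device_ops_table))
                return -ENODEV;
        ops = kvm_device_ops_table[type];
        if (!ops)
                return -ENODEV;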
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Li kunyu <kunyu@nfschina.com>
      Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow() · 28249139
      Li kunyu authored
      
      The variable is initialized, but it is not read until after it has been
      assigned, so the initialization is unnecessary.
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Li kunyu <kunyu@nfschina.com>
      Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd
      Chao Peng authored
      
      The motivation for this renaming is to make these variables and related
      helper functions less bound to mmu_notifier, so that they can also be
      used for page invalidation that is not driven by mmu_notifier.
      mmu_invalidate_* was chosen to better describe the purpose of
      'invalidating' a page that those variables are used for.
      
        - mmu_notifier_seq/range_start/range_end are renamed to
          mmu_invalidate_seq/range_start/range_end.
      
        - mmu_notifier_retry{_hva} helper functions are renamed to
          mmu_invalidate_retry{_hva}.
      
        - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
          avoid confusion with mn_active_invalidate_count.
      
        - While here, also update kvm_inc/dec_notifier_count() to
          kvm_mmu_invalidate_begin/end() to match the change for
          mmu_notifier_count.
      
      No functional change intended.
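
      For context, a sketch of how a typical page-fault path consumes the
      renamed sequence counter and retry helper (hypothetical caller; only the
      renamed identifiers are taken from this patch):

        unsigned long mmu_seq = kvm->mmu_invalidate_seq;
        smp_rmb();

        /* ... resolve hva => pfn outside of mmu_lock (may sleep) ... */

        write_lock(&kvm->mmu_lock);
        if (mmu_invalidate_retry_hva(kvm, mmu_seq, hva)) {
                /* An invalidation raced with the fault; retry. */
                write_unlock(&kvm->mmu_lock);
                goto retry;
        }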
      
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move coalesced MMIO initialization (back) into kvm_create_vm() · c2b82397
      Sean Christopherson authored
      
      Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
      and initializing coalesced MMIO objects is separate from registering any
      associated devices.  Moving coalesced MMIO cleans up the last oddity
      where KVM does VM creation/initialization after kvm_create_vm(), and more
      importantly after kvm_arch_post_init_vm() is called and the VM is added
      to the global vm_list, i.e. after the VM is fully created as far as KVM
      is concerned.
      
      Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
      the original implementation was completely devoid of error handling.
      Commit 6ce5a090 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
      error handling") fixed the various bugs, and in doing so rightly moved the
      call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
      registered the coalesced MMIO device.  Commit 2b3c246a ("KVM: Make
      coalesced mmio use a device per zone") cleaned up that mess by having
      each zone register a separate device, i.e. moved device registration to
      its logical home in kvm_vm_ioctl_register_coalesced_mmio().  As a result,
      kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
      be safely called from kvm_create_vm().
      
      Opportunistically drop the #ifdef; KVM provides stubs for
      kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).
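
      A rough sketch of the resulting call site (illustrative; the error label
      is an assumption, not the literal diff):

        /* In kvm_create_vm(), once the VM is otherwise initialized: */
        r = kvm_coalesced_mmio_init(kvm);
        if (r < 0)
                goto out_err;   /* label name is illustrative */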
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Unconditionally get a ref to /dev/kvm module when creating a VM · 405294f2
      Sean Christopherson authored
      
      Unconditionally get a reference to the /dev/kvm module when creating a VM
      instead of using try_get_module(), which will fail if the module is in
      the process of being forcefully unloaded.  The error handling when
      try_get_module() fails doesn't properly unwind all that has been done,
      e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
      from the global list.  Not removing VMs from the global list tends to be
      fatal, e.g. leads to use-after-free explosions.
      
      The obvious alternative would be to add proper unwinding, but the
      justification for using try_get_module(), "rmmod --wait", is completely
      bogus as support for "rmmod --wait", i.e. delete_module() without
      O_NONBLOCK, was removed by commit 3f2b9c9c ("module: remove rmmod
      --wait option.") nearly a decade ago.
      
      It's still possible for try_get_module() to fail due to the module dying
      (more like being killed), as the module will be tagged MODULE_STATE_GOING
      by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
      with forced unloading is an exercise in futility and gives a false sense
      of security.  Using try_get_module() only prevents acquiring _new_
      references, it doesn't magically put the references held by other VMs,
      and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
      guaranteed to cause spectacular fireworks; the window where KVM will fail
      try_get_module() is tiny compared to the window where KVM is building and
      running the VM with an elevated module refcount.
      
      Addressing KVM's inability to play nice with "rmmod --force" is firmly
      out-of-scope.  Forcefully unloading any module taints the kernel (for obvious
      reasons) _and_ requires the kernel to be built with
      CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
      amusing disclaimer that it's "mainly for kernel developers and desperate
      users".  In other words, KVM is free to scoff at bug reports due to using
      "rmmod --force" while VMs may be running.
      
      Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Properly unwind VM creation if creating debugfs fails · 4ba4f419
      Sean Christopherson authored
      
      Properly unwind VM creation if kvm_create_vm_debugfs() fails.  A recent
      change to invoke kvm_create_vm_debugfs() in kvm_create_vm() was led astray
      by buggy try_get_module() handling added by commit 5f6de5cb ("KVM:
      Prevent module exit until all VMs are freed").  The debugfs error path
      effectively inherits the bad error path of try_module_get(), e.g. KVM
      leaves the to-be-freed VM on vm_list even though KVM appears to do the
      right thing by calling module_put() and falling through.
      
      Opportunistically hoist kvm_create_vm_debugfs() above the call to
      kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
      invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
      for the moment).  x86 is the only non-nop implementation of the post-init
      hook, and it doesn't allocate/initialize any objects that are reachable
      via debugfs code (spawns a kthread worker for the NX huge page mitigation).
      
      Leave the buggy try_get_module() alone for now, it will be fixed in a
      separate commit.
      
      Fixes: b74ed7a6 ("KVM: Actually create debugfs in kvm_create_vm()")
      Reported-by: <syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
      Message-Id: <20220816053937.2477106-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Aug 10, 2022
  7. Jul 29, 2022
  8. Jun 24, 2022
    • KVM: debugfs: expose pid of vcpu threads · e36de87d
      Vineeth Pillai authored
      
      Add a new debugfs file to expose the pid of each vcpu thread. This
      is very helpful for userland tools to get the vcpu pids without
      worrying about the thread naming conventions of the VMM.
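
      A sketch of one plausible implementation (names such as vcpu_get_pid are
      illustrative, not necessarily those used by the patch):

        static int vcpu_get_pid(void *data, u64 *val)
        {
                struct kvm_vcpu *vcpu = data;

                rcu_read_lock();
                *val = pid_nr(rcu_dereference(vcpu->pid));
                rcu_read_unlock();
                return 0;
        }

        DEFINE_SIMPLE_ATTRIBUTE(vcpu_get_pid_fops, vcpu_get_pid, NULL, "%llu\n");

        /* In the per-vCPU debugfs setup: */
        debugfs_create_file("pid", 0444, vcpu->debugfs_dentry, vcpu,
                            &vcpu_get_pid_fops);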
      
      Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
      Message-Id: <20220523190327.2658-1-vineeth@bitbyteword.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Allow for different capacities in kvm_mmu_memory_cache structs · 837f66c7
      David Matlack authored
      
      Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
      declaration time rather than being fixed for all declarations. This will
      be used in a follow-up commit to declare a cache in x86 with a capacity
      of 512+ objects without having to increase the capacity of all caches in
      KVM.
      
      This change requires each cache now specify its capacity at runtime,
      since the cache struct itself no longer has a fixed capacity known at
      compile time. To protect against someone accidentally defining a
      kvm_mmu_memory_cache struct directly (without the extra storage), this
      commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
      
      In order to support different capacities, this commit changes the
      objects pointer array to be dynamically allocated the first time the
      cache is topped-up.
      
      While here, opportunistically clean up the stack-allocated
      kvm_mmu_memory_cache structs in riscv and arm64 to use designated
      initializers.
      
      No functional change intended.
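
      A sketch of the resulting shape of the struct and of a stack-allocated
      user (fields abridged; the designated initializer mirrors the arm64/riscv
      cleanup described above):

        struct kvm_mmu_memory_cache {
                gfp_t gfp_zero;
                struct kmem_cache *kmem_cache;
                int capacity;           /* chosen per declaration */
                int nobjs;
                void **objects;         /* allocated on first topup */
        };

        /* Stack-allocated cache relying on the default capacity: */
        struct kvm_mmu_memory_cache cache = {
                .gfp_zero = __GFP_ZERO,
        };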
      
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. Jun 20, 2022
    • KVM: Do not zero initialize 'pfn' in hva_to_pfn() · 943dfea8
      Sean Christopherson authored
      
      Drop the unnecessary initialization of the local 'pfn' variable in
      hva_to_pfn().  First and foremost, '0' is not an invalid pfn, it's a
      perfectly valid pfn on most architectures.  I.e. if hva_to_pfn() were to
      return an "uninitialized" pfn, it would actually be interpreted as a legal
      pfn by most callers.
      
      Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn()
      explicitly sets pfn to an error value (or returns an error value directly)
      if a helper returns failure, and all helpers set the pfn on success.
      
      The zeroing of 'pfn' was introduced by commit 2fc84311 ("KVM:
      reorganize hva_to_pfn"), probably to avoid "uninitialized variable"
      warnings on statements that return pfn.  However, no compiler seems
      to produce them, making the initialization unnecessary.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690
      Sean Christopherson authored
      
      Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
      to better reflect what KVM is actually checking, and to eliminate extra
      pfn_to_page() lookups.  The kvm_release_pfn_*() and kvm_try_get_pfn()
      helpers in particular benefit from the "refcounted" nomenclature, as it's not
      all that obvious why KVM needs to get/put refcounts for some PG_reserved
      pages (ZERO_PAGE and ZONE_DEVICE).
      
      Add a comment to call out that the list of exceptions to PG_reserved is
      all but guaranteed to be incomplete.  The list has mostly been compiled
      by people throwing noodles at KVM and finding out they stick a little too
      well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
      didn't get freed.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() · 284dc493
      Sean Christopherson authored
      
      Operate on a 'struct page' instead of a pfn when checking if a page is a
      ZONE_DEVICE page, and rename the helper accordingly.  Generally speaking,
      KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
      anything special for ZONE_DEVICE memory.  Rather, KVM wants to treat
      ZONE_DEVICE memory like regular memory, and the need to identify
      ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
      other words, KVM should only ever check for ZONE_DEVICE memory after KVM
      has already verified that there is a struct page associated with the pfn.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page() · b1624f99
      Sean Christopherson authored
      
      Drop helpers to convert a gfn/gpa to a 'struct page' in the context of a
      vCPU.  KVM doesn't require that guests be backed by 'struct page' memory,
      thus any use of helpers that assume 'struct page' is bound to be flawed,
      as was the case for the recently removed last user in x86's nested VMX.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't WARN if kvm_pfn_to_page() encounters a "reserved" pfn · 6573a691
      Sean Christopherson authored
      
      Drop a WARN_ON() if kvm_pfn_to_page() encounters a "reserved" pfn, which
      in this context means a struct page that has PG_reserved but is not a/the
      ZERO_PAGE and is not a ZONE_DEVICE page.  The usage, via gfn_to_page(),
      in x86 is safe as gfn_to_page() is used only to retrieve a page from a
      KVM-controlled memslot, but the usage in PPC and s390 operates on
      arbitrary gfns and thus memslots that can be backed by incompatible
      memory.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Avoid pfn_to_page() and vice versa when releasing pages · 8e1c6914
      Sean Christopherson authored

      Invert the order of KVM's page/pfn release helpers so that the "inner"
      helper operates on a page instead of a pfn.  As pointed out by Linus[*],
      converting between struct page and a pfn isn't necessarily cheap, and
      that's not even counting the overhead of is_error_noslot_pfn() and
      kvm_is_reserved_pfn().  Even if the checks were dirt cheap, there's no
      reason to convert from a page to a pfn and back to a page, just to mark
      the page dirty/accessed or to put a reference to the page.
      
      Opportunistically drop a stale declaration of kvm_set_page_accessed()
      from kvm_host.h (there was no implementation).
      
      No functional change intended.
      
      [*] https://lore.kernel.org/all/CAHk-=wifQimj2d6npq-wCi5onYPjzQg4vyO4tFcPJJZr268cRw@mail.gmail.com
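
      A simplified sketch of the inverted helpers (dirty/accessed bookkeeping
      omitted; kvm_pfn_to_refcounted_page() is the helper introduced earlier in
      this series):

        /* Inner helper: operates directly on the page. */
        void kvm_release_page_clean(struct page *page)
        {
                put_page(page);
        }

        /* Outer wrapper: converts pfn => page once, if a page exists at all. */
        void kvm_release_pfn_clean(kvm_pfn_t pfn)
        {
                struct page *page;

                if (is_error_noslot_pfn(pfn))
                        return;

                page = kvm_pfn_to_refcounted_page(pfn);
                if (page)
                        kvm_release_page_clean(page);
        }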
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't set Accessed/Dirty bits for ZERO_PAGE · a1040b0d
      Sean Christopherson authored
      
      Don't set Accessed/Dirty bits for a struct page with PG_reserved set,
      i.e. don't set A/D bits for the ZERO_PAGE.  The ZERO_PAGE (or pages
      depending on the architecture) should obviously never be written, and
      similarly there's no point in marking it accessed as the page will never
      be swapped out or reclaimed.  The comment in page-flags.h is quite clear
      that PG_reserved pages should be managed only by their owner, and
      strictly following that mandate also simplifies KVM's logic.
      
      Fixes: 7df003c8 ("KVM: fix overflow of zero page refcount with ksm running")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop bogus "pfn != 0" guard from kvm_release_pfn() · 28b85ae0
      Sean Christopherson authored
      
      Remove a check from kvm_release_pfn() to bail if the provided @pfn is
      zero.  Zero is a perfectly valid pfn on most architectures, and should
      not be used to indicate an error or an invalid pfn.  The bogus check was
      added by commit 91724814 ("x86/kvm: Cache gfn to pfn translation"),
      which also did the bad thing of zeroing the pfn and gfn to mark a cache
      invalid.  Thankfully, that bad behavior was axed by commit 357a18ad
      ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache").
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. Jun 15, 2022
  11. Jun 09, 2022
    • KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking · 18869f26
      Maxim Levitsky authored
      
      On SVM/AVIC, if preemption happens right after the call to finish_rcuwait
      but before the call to kvm_arch_vcpu_unblocking, the preemption itself
      will re-enable AVIC, and then kvm_arch_vcpu_unblocking will try to
      re-enable it again, which leads to a warning in __avic_vcpu_load.
      
      The same problem can happen if the vCPU is preempted right after the call
      to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait
      and in this case, we will end up with AVIC enabled during sleep -
      Ooops.
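
      A sketch of the fix described above (paraphrased from the generic
      block/halt path; exact placement may differ):

        preempt_disable();
        kvm_arch_vcpu_blocking(vcpu);
        prepare_to_rcuwait(&vcpu->wait);
        preempt_enable();

        /* ... sleep until a wake-up condition ... */

        preempt_disable();
        finish_rcuwait(&vcpu->wait);
        kvm_arch_vcpu_unblocking(vcpu);
        preempt_enable();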
      
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. Jun 08, 2022
    • KVM: Move kvm_arch_vcpu_precreate() under kvm->lock · 1d5e740d
      Zeng Guang authored
      
      kvm_arch_vcpu_precreate() is meant to prepare arch-specific VM resources
      prior to the actual creation of a vCPU. For example, the x86 platform may
      need to do a per-VM allocation based on max_vcpu_ids at the first vCPU
      creation. That allocation needs concurrency control, as multiple vCPU
      creations could happen simultaneously. From an architectural point of
      view, it's therefore necessary to execute kvm_arch_vcpu_precreate() under
      the protection of kvm->lock.
      
      Currently only arm64, x86 and s390 have non-nop implementations at the
      stage of vCPU pre-creation. Remove the lock acquisition in s390's
      implementation and make sure all architectures can run
      kvm_arch_vcpu_precreate() safely under kvm->lock without recursive
      locking issues.
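
      A sketch of the call site in kvm_vm_ioctl_create_vcpu() after the change
      (paraphrased; surrounding max-vCPU checks omitted):

        mutex_lock(&kvm->lock);

        r = kvm_arch_vcpu_precreate(kvm, id);
        if (r) {
                mutex_unlock(&kvm->lock);
                return r;
        }

        kvm->created_vcpus++;
        mutex_unlock(&kvm->lock);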
      
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154409.11842-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. Jun 07, 2022
    • KVM: Don't null dereference ops->destroy · e8bc2427
      Alexey Kardashevskiy authored
      
      A KVM device cleanup happens in either of two callbacks:
      1) destroy() which is called when the VM is being destroyed;
      2) release() which is called when a device fd is closed.
      
      Most KVM devices use 1) but Book3s's interrupt controller KVM devices
      (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
      the machine execution. The error handling in kvm_ioctl_create_device()
      assumes destroy() is always defined, which leads to a NULL dereference as
      discovered by Syzkaller.
      
      This adds a check for destroy != NULL and adds a missing release().
      
      This is not changing kvm_destroy_devices() as devices with defined
      release() should have been removed from the KVM devices list by then.
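
      A sketch of the fixed error path in kvm_ioctl_create_device()
      (paraphrased; the exact unwind steps are abridged):

        if (ret < 0) {
                kvm_put_kvm_no_destroy(kvm);
                mutex_lock(&kvm->lock);
                list_del(&dev->vm_node);
                if (ops->release)
                        ops->release(dev);
                mutex_unlock(&kvm->lock);
                if (ops->destroy)
                        ops->destroy(dev);
                return ret;
        }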
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. May 25, 2022
    • KVM: Do not pin pages tracked by gfn=>pfn caches · 85165781
      Sean Christopherson authored
      
      Put the reference to any struct page mapped/tracked by a gfn=>pfn cache
      upon inserting the pfn into its associated cache, as opposed to putting
      the reference only when the cache is done using the pfn.  In other words,
      don't pin pages while they're in the cache.  One of the major roles of
      the gfn=>pfn cache is to play nicely with invalidation events, i.e. it
      exists in large part so that KVM doesn't rely on pinning pages.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix multiple races in gfn=>pfn cache refresh · 58cd407c
      Sean Christopherson authored
      
      Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
      between the cache itself, and between the cache and mmu_notifier events.
      
      The existing refresh code attempts to guard against races with the
      mmu_notifier by speculatively marking the cache valid, and then marking
      it invalid if a mmu_notifier invalidation occurs.  That handles the case
      where an invalidation occurs between dropping and re-acquiring gpc->lock,
      but it doesn't handle the scenario where the cache is refreshed after the
      cache was invalidated by the notifier, but before the notifier elevates
      mmu_notifier_count.  The gpc refresh can't use the "retry" helper as its
      invalidation occurs _before_ mmu_notifier_count is elevated and before
      mmu_notifier_range_start is set/updated.
      
        CPU0                                    CPU1
        ----                                    ----
      
        gfn_to_pfn_cache_invalidate_start()
        |
        -> gpc->valid = false;
                                                kvm_gfn_to_pfn_cache_refresh()
                                                |
                                                |-> gpc->valid = true;
      
                                                hva_to_pfn_retry()
                                                |
                                                -> acquire kvm->mmu_lock
                                                   kvm->mmu_notifier_count == 0
                                                   mmu_seq == kvm->mmu_notifier_seq
                                                   drop kvm->mmu_lock
                                                   return pfn 'X'
        acquire kvm->mmu_lock
        kvm_inc_notifier_count()
        drop kvm->mmu_lock()
        kernel frees pfn 'X'
                                                kvm_gfn_to_pfn_cache_check()
                                                |
                                                |-> gpc->valid == true
      
                                                caller accesses freed pfn 'X'
      
      Key off of mn_active_invalidate_count to detect that a pfncache refresh
      needs to wait for an in-progress mmu_notifier invalidation.  While
      mn_active_invalidate_count is not guaranteed to be stable, it is
      guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
      so either the refresh will see an active invalidation and wait, or the
      invalidation will run after the refresh completes.
      
      Speculatively marking the cache valid is itself flawed, as a concurrent
      kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
      values.  The KVM Xen use case explicitly allows/wants multiple users;
      even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
      can read a different vCPU (or vCPUs).  Address this race by invalidating
      the cache prior to dropping gpc->lock (this is made possible by fixing
      the above mmu_notifier race).
      
      Complicating all of this is the fact that both the hva=>pfn resolution
      and mapping of the kernel address can sleep, i.e. must be done outside
      of gpc->lock.
      
      Fix the above races in one fell swoop; trying to fix each individual race
      is largely pointless and essentially impossible to test, e.g. closing one
      hole just shifts the focus to the other hole.
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fully serialize gfn=>pfn cache refresh via mutex · 93984f19
      Sean Christopherson authored
      
      Protect gfn=>pfn cache refresh with a mutex to fully serialize refreshes.
      The refresh logic doesn't protect against
      
      - concurrent unmaps, or refreshes with different GPAs (which may or may not
        happen in practice, for example if a cache is only used under vcpu->mutex;
        but it's allowed in the code)
      
      - a false negative on the memslot generation.  If the first refresh sees
        a stale memslot generation, it will refresh the hva and generation before
        moving on to the hva=>pfn translation.  If it then drops gpc->lock, a
        different user of the cache can come along, acquire gpc->lock, see that
        the memslot generation is fresh, and skip the hva=>pfn update due to the
        userspace address also matching (because it too was updated).
      
      The refresh path can already sleep during hva=>pfn resolution, so wrap
      the refresh with a mutex to ensure that any given refresh runs to
      completion before other callers can start their refresh.
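
      A sketch of the shape of the fix (paraphrased, not the literal diff; the
      field name is an assumption based on the description above):

        /* In struct gfn_to_pfn_cache: */
        struct mutex refresh_lock;      /* serializes refresh (and unmap) */

        /* In kvm_gfn_to_pfn_cache_refresh(): */
        mutex_lock(&gpc->refresh_lock);
        /*
         * ... existing refresh logic, including the sleepable hva => pfn
         * resolution done outside of gpc->lock ...
         */
        mutex_unlock(&gpc->refresh_lock);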
      
      Cc: stable@vger.kernel.org
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Do not incorporate page offset into gfn=>pfn cache user address · 3ba2c95e
      Sean Christopherson authored
      
      Don't adjust the userspace address in the gfn=>pfn cache by the page
      offset from the gpa.  KVM should never use the user address directly, and
      all KVM operations that translate a user address to something else
      require the user address to be page aligned.  Ignoring the offset will
      allow the cache to reuse a gfn=>hva translation in the unlikely event
      that the page offset of the gpa changes, but the gfn does not.  And more
      importantly, not having to (un)adjust the user address will simplify a
      future bug fix.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Put the extra pfn reference when reusing a pfn in the gpc cache · 3dddf65b
      Sean Christopherson authored
      
      Put the struct page reference for the pfn acquired by hva_to_pfn() when the
      old and new pfns for a gfn=>pfn cache match.  The cache already has a
      reference via the old/current pfn, and will only put one reference when
      the cache is done with the pfn.
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc() helper · 345b0fd6
      Sean Christopherson authored
      
      Drop the @gpa param from __release_gpc() and rename the helper to make it
      more obvious that the cache itself is not being released.  The helper
      will be reused by a future commit to release a pfn+khva combination that
      is _never_ associated with the cache, at which point the current name
      would go from slightly misleading to blatantly wrong.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  15. May 20, 2022
    • KVM: Free new dirty bitmap if creating a new memslot fails · c87661f8
      Sean Christopherson authored
      
      Fix a goof in kvm_prepare_memory_region() where KVM fails to free the
      new memslot's dirty bitmap during a CREATE action if
      kvm_arch_prepare_memory_region() fails.  The logic is supposed to detect
      if the bitmap was allocated and thus needs to be freed, versus if the
      bitmap was inherited from the old memslot and thus needs to be kept.  If
      there is no old memslot, then obviously the bitmap can't have been
      inherited.
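
      A sketch of the corrected cleanup in kvm_prepare_memory_region()
      (paraphrased; the key point is that "no old memslot" means the bitmap
      cannot have been inherited):

        r = kvm_arch_prepare_memory_region(kvm, old, new, change);

        /*
         * Free the bitmap on failure if it was allocated above, i.e. if it
         * was not inherited from the old memslot.
         */
        if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
                kvm_destroy_dirty_bitmap(new);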
      
      The bug was exposed by commit 86931ff7 ("KVM: x86/mmu: Do not create
      SPTEs for GFNs that exceed host.MAXPHYADDR"), which made it trivially easy
      for syzkaller to trigger failure during kvm_arch_prepare_memory_region(),
      but the bug can be hit other ways too, e.g. due to -ENOMEM when
      allocating x86's memslot metadata.
      
      The backtrace from kmemleak:
      
        __vmalloc_node_range+0xb40/0xbd0 mm/vmalloc.c:3195
        __vmalloc_node mm/vmalloc.c:3232 [inline]
        __vmalloc+0x49/0x50 mm/vmalloc.c:3246
        __vmalloc_array mm/util.c:671 [inline]
        __vcalloc+0x49/0x70 mm/util.c:694
        kvm_alloc_dirty_bitmap virt/kvm/kvm_main.c:1319
        kvm_prepare_memory_region virt/kvm/kvm_main.c:1551
        kvm_set_memslot+0x1bd/0x690 virt/kvm/kvm_main.c:1782
        __kvm_set_memory_region+0x689/0x750 virt/kvm/kvm_main.c:1949
        kvm_set_memory_region virt/kvm/kvm_main.c:1962
        kvm_vm_ioctl_set_memory_region virt/kvm/kvm_main.c:1974
        kvm_vm_ioctl+0x377/0x13a0 virt/kvm/kvm_main.c:4528
        vfs_ioctl fs/ioctl.c:51
        __do_sys_ioctl fs/ioctl.c:870
        __se_sys_ioctl fs/ioctl.c:856
        __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856
        do_syscall_x64 arch/x86/entry/common.c:50
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      And the relevant sequence of KVM events:
      
        ioctl(3, KVM_CREATE_VM, 0)              = 4
        ioctl(4, KVM_SET_USER_MEMORY_REGION, {slot=0,
                                              flags=KVM_MEM_LOG_DIRTY_PAGES,
                                              guest_phys_addr=0x10000000000000,
                                              memory_size=4096,
                                              userspace_addr=0x20fe8000}
             ) = -1 EINVAL (Invalid argument)
      
      Fixes: 244893fa ("KVM: Dynamically allocate "new" memslots from the get-go")
      Cc: stable@vger.kernel.org
      Reported-by: <syzbot+8606b8a9cc97a63f1c87@syzkaller.appspotmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220518003842.1341782-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>