  1. Dec 27, 2022
  2. Dec 02, 2022
  3. Nov 30, 2022
  4. Nov 23, 2022
  5. Nov 18, 2022
  6. Nov 17, 2022
    • KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL · 9eb8ca04
      David Matlack authored
      
      Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt,
      rather than just sampling the module parameter when the VM is first
      created. This restores the original behavior of kvm.halt_poll_ns for VMs
      that have not opted into KVM_CAP_HALT_POLL.
      
      Notably, this change restores the ability for admins to disable or
      change the maximum halt-polling time system wide for VMs not using
      KVM_CAP_HALT_POLL.
      
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
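
The behavior described in the commit above can be sketched as a small user-space model. This is not the kernel's code: `vm_model`, `halt_poll_ns_param`, and `vm_max_halt_poll_ns()` are hypothetical stand-ins, and the model assumes that a VM which used KVM_CAP_HALT_POLL carries a flag saying so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kvm.halt_poll_ns module parameter (admin-writable at runtime). */
static unsigned int halt_poll_ns_param = 200000;

/* Hypothetical, simplified per-VM state. */
struct vm_model {
	bool override_halt_poll_ns;    /* true once the VM used KVM_CAP_HALT_POLL */
	unsigned int max_halt_poll_ns; /* per-VM limit, used only when overridden */
};

/*
 * Evaluated on every halt, so a change to the module parameter is seen by
 * the next halt of any VM that has not opted into its own limit.
 */
static unsigned int vm_max_halt_poll_ns(const struct vm_model *vm)
{
	assert(vm != NULL);

	if (vm->override_halt_poll_ns)
		return vm->max_halt_poll_ns;

	return halt_poll_ns_param;
}
```

In this model, setting `halt_poll_ns_param` to 0 disables polling for every VM that has not set its own limit, while VMs that have opted in keep their per-VM value.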
    • KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling · 175d5dc7
      David Matlack authored
      
      Avoid re-reading kvm->max_halt_poll_ns multiple times during
      halt-polling except when it is explicitly useful, e.g. to check if the
      max time changed across a halt. kvm->max_halt_poll_ns can be changed at
      any time by userspace via KVM_CAP_HALT_POLL.
      
      This bug is unlikely to cause any serious side-effects. In the worst
      case one halt polls for shorter or longer than it should, and then is
      fixed up on the next halt. Furthermore, this is still possible since
      updates to kvm->max_halt_poll_ns are not synchronized with halts.
      
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Cap vcpu->halt_poll_ns before halting rather than after · 97b6847a
      David Matlack authored
      
      Cap vcpu->halt_poll_ns based on the max halt polling time just before
      halting, rather than after the last halt. This arguably provides better
      accuracy if an admin disables halt polling in between halts, although
      the improvement is nominal.
      
      A side-effect of this change is that grow_halt_poll_ns() no longer needs
      to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future
      commit where the max halt polling time can come from the module parameter
      halt_poll_ns instead.
      
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
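
A minimal user-space sketch of the ordering described in the two commits above: the maximum is sampled once by the caller and passed in as a plain value, the cap is applied just before halting, and the grow step never looks at the maximum. The names `vcpu_model`, `grow_halt_poll_ns_model()`, and `begin_halt_model()` are hypothetical, and the fixed doubling factor is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-vCPU state. */
struct vcpu_model {
	unsigned int halt_poll_ns; /* current polling window */
};

/* Grow the window without consulting any maximum (doubling is an assumption). */
static void grow_halt_poll_ns_model(struct vcpu_model *vcpu, unsigned int grow_start)
{
	unsigned int val;

	assert(vcpu != NULL);
	val = vcpu->halt_poll_ns * 2;
	if (val < grow_start)
		val = grow_start;
	vcpu->halt_poll_ns = val;
}

/*
 * Apply the cap just before halting. max_halt_poll_ns is a local copy that
 * the caller read once, so every decision in this halt uses the same value.
 * Returns the time this halt would poll for.
 */
static unsigned int begin_halt_model(struct vcpu_model *vcpu, unsigned int max_halt_poll_ns)
{
	assert(vcpu != NULL);
	if (vcpu->halt_poll_ns > max_halt_poll_ns)
		vcpu->halt_poll_ns = max_halt_poll_ns;
	return vcpu->halt_poll_ns;
}
```

With this shape, a maximum of 0 passed to `begin_halt_model()` disables polling for that halt regardless of how far the window grew earlier.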
  7. Nov 12, 2022
    • KVM: Push dirty information unconditionally to backup bitmap · c57351a7
      Gavin Shan authored
      
      In mark_page_dirty_in_slot(), we bail out when no running vcpu exists
      and a running vcpu context is strictly required by the architecture. This
      may cause a backwards compatibility issue. Currently, saving vgic/its
      tables is the only known case where no running vcpu context is expected.
      There may be other, unknown cases where no running vcpu context exists;
      those are reported by the warning message, and we bail out without
      pushing the dirty information to the backup bitmap. To handle them, the
      application will enable the backup bitmap. However, the dirty
      information can't be pushed to the backup bitmap, even though the
      application has enabled it, until the unknown cases are added to the
      allowed list of non-running vcpu contexts with extra code changes to the
      host kernel.
      
      In order for a new application, where the backup bitmap has been
      enabled, to work with an unchanged host, we continue to push the dirty
      information to the backup bitmap instead of bailing out early. With the
      added check on 'memslot->dirty_bitmap' in mark_page_dirty_in_slot(), a
      kernel crash is silently avoided under the combined conditions: no running
      vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true',
      and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled
      yet.
      
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
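
The fallback described in the commit above can be illustrated with a user-space model. All names here (`memslot_model`, `ring_model`, `mark_page_dirty_model()`) are hypothetical, and the model assumes that a NULL ring pointer stands for "no running vCPU" and a NULL `dirty_bitmap` stands for "backup bitmap not enabled".

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_PAGES 64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct memslot_model {
	unsigned long *dirty_bitmap; /* NULL when the backup bitmap is not enabled */
};

struct ring_model {
	size_t entries[SLOT_PAGES];
	size_t count;
};

/*
 * Record that page rel_gfn of the slot is dirty. With a running vCPU the
 * entry goes to that vCPU's ring; otherwise it is pushed to the bitmap
 * when one exists. Returns false if the information had nowhere to go.
 */
static bool mark_page_dirty_model(struct memslot_model *slot,
				  struct ring_model *vcpu_ring, size_t rel_gfn)
{
	assert(slot != NULL && rel_gfn < SLOT_PAGES);

	if (vcpu_ring) {
		if (vcpu_ring->count >= SLOT_PAGES)
			return false; /* ring full in this toy model */
		vcpu_ring->entries[vcpu_ring->count++] = rel_gfn;
		return true;
	}

	/* No running vCPU: dereference the bitmap only if it was allocated. */
	if (slot->dirty_bitmap) {
		slot->dirty_bitmap[rel_gfn / BITS_PER_WORD] |= 1UL << (rel_gfn % BITS_PER_WORD);
		return true;
	}

	return false;
}
```

The NULL check on the bitmap is what keeps the "no vCPU, no bitmap" combination from dereferencing a missing allocation.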
  8. Nov 10, 2022
  9. Nov 09, 2022
    • KVM: replace direct irq.h inclusion · d663b8a2
      Paolo Bonzini authored
      
      virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
      directory (i.e. not from arch/*/include) for the sole purpose of retrieving
      irqchip_in_kernel.
      
      Making the function inline in a header that is already included,
      such as asm/kvm_host.h, is not possible because it needs to look at
      struct kvm which is defined after asm/kvm_host.h is included.  So add a
      kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
      only performance critical on arm64 and x86, and the non-inline function
      is enough on all other architectures.
      
      irq.h can then be deleted from all architectures except x86.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Add interruptible flag to __gfn_to_pfn_memslot() · c8b88b33
      Peter Xu authored
      
      Add a new "interruptible" flag showing that the caller is willing to be
      interrupted by signals during the __gfn_to_pfn_memslot() request.  Wire it
      up with a FOLL_INTERRUPTIBLE flag that we've just introduced.
      
      This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
      SIGIPI) even during e.g. handling a userfaultfd page fault.
      
      No functional change intended.
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221011195809.557016-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Add KVM_PFN_ERR_SIGPENDING · fe5ed56c
      Peter Xu authored
      
      Add a new pfn error to show that we've got a pending signal to handle
      during the hva_to_pfn_slow() procedure (i.e. when it returns -EINTR).
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221011195809.557016-3-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
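
The idea in the two commits above, mapping an interrupted slow path to a distinct "signal pending" pfn error, can be sketched in user space as follows. The sentinel encodings and the names `pfn_model_t`, `PFN_ERR_*`, and `pfn_from_slow_path()` are illustrative assumptions and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pfn_model_t;

/* Illustrative sentinel encodings placed in otherwise unused high bits. */
#define PFN_ERR_BASE       (UINT64_C(0x7ff) << 52)
#define PFN_ERR_FAULT      (PFN_ERR_BASE)
#define PFN_ERR_SIGPENDING (PFN_ERR_BASE + 3)

static bool is_sigpending_pfn(pfn_model_t pfn)
{
	return pfn == PFN_ERR_SIGPENDING;
}

/*
 * slow_ret models the slow path's result: 1 means one page was pinned,
 * a negative value is an errno-style failure.
 */
static pfn_model_t pfn_from_slow_path(int slow_ret, pfn_model_t pinned_pfn)
{
	assert(slow_ret <= 1);

	if (slow_ret == 1)
		return pinned_pfn;
	if (slow_ret == -EINTR)
		return PFN_ERR_SIGPENDING; /* caller can go handle the signal */
	return PFN_ERR_FAULT;
}
```

A caller that passed the "interruptible" flag can then test the result with `is_sigpending_pfn()` and return to userspace instead of treating the lookup as a hard fault.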
  10. Oct 31, 2022
  11. Oct 27, 2022
    • KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cache · ecbcf030
      Sean Christopherson authored
      
      Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive.
      Not checking the active flag during refresh is particularly egregious, as
      KVM can end up with a valid, inactive cache, which can lead to a variety
      of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing
      an mmu_notifier invalidation due to the cache not being on the list of
      gfns to invalidate.
      
      Note, "active" needs to be set if and only if the cache is on the list
      of caches, i.e. is reachable via mmu_notifier events.  If a relevant
      mmu_notifier event occurs while the cache is "active" but not on the
      list, KVM will not acquire the cache's lock and so will not serialize
      the mmu_notifier event with active users and/or kvm_gpc_refresh().
      
      A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND
      can be exploited to trigger the bug.
      
      1. Deactivate shinfo cache:
      
      kvm_xen_hvm_set_attr
      case KVM_XEN_ATTR_TYPE_SHARED_INFO
       kvm_gpc_deactivate
        kvm_gpc_unmap
         gpc->valid = false
         gpc->khva = NULL
        gpc->active = false
      
      Result: active = false, valid = false
      
      2. Cause cache refresh:
      
      kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
       kvm_xen_hvm_evtchn_send
        kvm_xen_set_evtchn
         kvm_xen_set_evtchn_fast
          kvm_gpc_check
          return -EWOULDBLOCK because !gpc->valid
         kvm_xen_set_evtchn_fast
          return -EWOULDBLOCK
         kvm_gpc_refresh
          hva_to_pfn_retry
           gpc->valid = true
           gpc->khva = not NULL
      
      Result: active = false, valid = true
      
      3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl
      KVM_XEN_ATTR_TYPE_SHARED_INFO:
      
      kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
       kvm_xen_hvm_evtchn_send
        kvm_xen_set_evtchn
         kvm_xen_set_evtchn_fast
          read_lock gpc->lock
                                                kvm_xen_hvm_set_attr case
                                                KVM_XEN_ATTR_TYPE_SHARED_INFO
                                                 mutex_lock kvm->lock
                                                 kvm_xen_shared_info_init
                                                  kvm_gpc_activate
                                                   gpc->khva = NULL
          kvm_gpc_check
           [ Check passes because gpc->valid is
             still true, even though gpc->khva
             is already NULL. ]
          shinfo = gpc->khva
          pending_bits = shinfo->evtchn_pending
          CRASH: test_and_set_bit(..., pending_bits)
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Reported-by: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221013211234.1318131-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
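
The rule the commit above enforces, that an inactive cache can be neither consumed nor refreshed, can be modeled in a few lines of user-space C. `gpc_model` and its helpers are hypothetical, locking is omitted, and the -EINVAL return for an inactive refresh is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cache state mirroring the flags in the trace above. */
struct gpc_model {
	bool active; /* cache is reachable by invalidation events */
	bool valid;  /* mapping is believed to be current */
	void *khva;  /* host mapping, NULL when unmapped */
};

static bool gpc_check_model(const struct gpc_model *gpc)
{
	assert(gpc != NULL);
	if (!gpc->active)
		return false; /* never hand out a mapping from an inactive cache */
	return gpc->valid && gpc->khva != NULL;
}

static int gpc_refresh_model(struct gpc_model *gpc, void *new_mapping)
{
	assert(gpc != NULL);
	if (!gpc->active)
		return -EINVAL; /* refuse to create a "valid but inactive" cache */

	gpc->khva = new_mapping;
	gpc->valid = new_mapping != NULL;
	return gpc->valid ? 0 : -EFAULT;
}

static void gpc_deactivate_model(struct gpc_model *gpc)
{
	assert(gpc != NULL);
	gpc->valid = false;
	gpc->khva = NULL;
	gpc->active = false;
}
```

With the check in the refresh path, step 2 of the trace above can no longer leave the cache in the "active = false, valid = true" state.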
    • KVM: Initialize gfn_to_pfn_cache locks in dedicated helper · 52491a38
      Michal Luczaj authored
      
      Move the gfn_to_pfn_cache lock initialization to another helper and
      call the new helper during VM/vCPU creation.  There are race
      conditions possible due to kvm_gfn_to_pfn_cache_init()'s
      ability to re-initialize the cache's locks.
      
      For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and
      kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.
      
                      (thread 1)                |           (thread 2)
                                                |
       kvm_xen_set_evtchn_fast                  |
        read_lock_irqsave(&gpc->lock, ...)      |
                                                | kvm_gfn_to_pfn_cache_init
                                                |  rwlock_init(&gpc->lock)
        read_unlock_irqrestore(&gpc->lock, ...) |
      
      Rename "cache_init" and "cache_destroy" to activate+deactivate to
      avoid implying that the cache really is destroyed/freed.
      
      Note, there are more races in the newly named kvm_gpc_activate() that will
      be addressed separately.
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      [sean: call out that this is a bug fix]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221013211234.1318131-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
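
A toy model of the split described in the commit above: the lock is set up exactly once by an init helper, and activation never touches it, so a reader that already holds the lock is not disturbed. The reader counter stands in for a real rwlock, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reader counter standing in for a real rwlock. */
struct lock_model {
	int readers;
};

struct gpc_lock_model {
	struct lock_model lock;
	bool active;
};

/* One-time setup, intended to run at VM/vCPU creation. */
static void gpc_lock_init(struct gpc_lock_model *gpc)
{
	assert(gpc != NULL);
	gpc->lock.readers = 0;
	gpc->active = false;
}

/* Activation deliberately leaves the lock alone. */
static void gpc_lock_activate(struct gpc_lock_model *gpc)
{
	assert(gpc != NULL);
	gpc->active = true;
}

static void model_read_lock(struct lock_model *lock)
{
	lock->readers++;
}

static void model_read_unlock(struct lock_model *lock)
{
	assert(lock->readers > 0); /* would trip if the lock had been re-initialized */
	lock->readers--;
}
```

If activation reset `readers` the way a re-run of the init helper would, the unlock in the two-thread diagram above would operate on corrupted lock state.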
    • KVM: debugfs: Return retval of simple_attr_open() if it fails · 180418e2
      Hou Wenlong authored
      
      Although simple_attr_open() fails only with -ENOMEM in the current code
      base, it would be nicer to return the retval of simple_attr_open()
      directly in kvm_debugfs_open().
      
      No functional change intended.
      
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
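
The pattern from the commit above, propagating the callee's error code instead of hard-coding one, looks roughly like this in a user-space model. `attr_open_model()` is a hypothetical stand-in for simple_attr_open(), and the reference counting is simplified to a plain integer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct vm_ref_model {
	int refcount;
};

/* Hypothetical stand-in for simple_attr_open(): returns whatever it is told to. */
static int attr_open_model(int result)
{
	return result;
}

/*
 * Take a reference for the open file, then return the callee's own error
 * code on failure (dropping the reference) rather than assuming -ENOMEM.
 */
static int debugfs_open_model(struct vm_ref_model *vm, int open_result)
{
	int ret;

	assert(vm != NULL);
	vm->refcount++;

	ret = attr_open_model(open_result);
	if (ret)
		vm->refcount--;

	return ret;
}
```

If the callee ever grows a second failure mode, the caller reports it unchanged without needing another patch.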
  12. Oct 22, 2022
  13. Oct 07, 2022
  14. Sep 29, 2022
  15. Sep 26, 2022
  16. Aug 19, 2022
    • KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device() · eceb6e1d
      Li kunyu authored
      
      The variable is initialized but it is only used after its assignment.
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Li kunyu <kunyu@nfschina.com>
      Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow() · 28249139
      Li kunyu authored
      
      The variable is initialized but it is only used after its assignment.
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Li kunyu <kunyu@nfschina.com>
      Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd
      Chao Peng authored
      
      The motivation for this renaming is to make these variables and related
      helper functions less mmu_notifier bound, so that they can also be used
      for non-mmu_notifier based page invalidation. mmu_invalidate_* was chosen
      to better describe the purpose of 'invalidating' a page that those
      variables are used for.
      
        - mmu_notifier_seq/range_start/range_end are renamed to
          mmu_invalidate_seq/range_start/range_end.
      
        - mmu_notifier_retry{_hva} helper functions are renamed to
          mmu_invalidate_retry{_hva}.
      
        - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
          avoid confusion with mn_active_invalidate_count.
      
        - While here, also update kvm_inc/dec_notifier_count() to
          kvm_mmu_invalidate_begin/end() to match the change for
          mmu_notifier_count.
      
      No functional change intended.
      
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move coalesced MMIO initialization (back) into kvm_create_vm() · c2b82397
      Sean Christopherson authored
      
      Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
      and initializing coalesced MMIO objects is separate from registering any
      associated devices.  Moving coalesced MMIO cleans up the last oddity
      where KVM does VM creation/initialization after kvm_create_vm(), and more
      importantly after kvm_arch_post_init_vm() is called and the VM is added
      to the global vm_list, i.e. after the VM is fully created as far as KVM
      is concerned.
      
      Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
      the original implementation was completely devoid of error handling.
      Commit 6ce5a090 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
      error handling") fixed the various bugs, and in doing so rightly moved the
      call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
      registered the coalesced MMIO device.  Commit 2b3c246a ("KVM: Make
      coalesced mmio use a device per zone") cleaned up that mess by having
      each zone register a separate device, i.e. moved device registration to
      its logical home in kvm_vm_ioctl_register_coalesced_mmio().  As a result,
      kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
      be safely called from kvm_create_vm().
      
      Opportunistically drop the #ifdef, as KVM provides stubs for
      kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Unconditionally get a ref to /dev/kvm module when creating a VM · 405294f2
      Sean Christopherson authored
      
      Unconditionally get a reference to the /dev/kvm module when creating a VM
      instead of using try_module_get(), which will fail if the module is in
      the process of being forcefully unloaded.  The error handling when
      try_module_get() fails doesn't properly unwind all that has been done,
      e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
      from the global list.  Not removing VMs from the global list tends to be
      fatal, e.g. leads to use-after-free explosions.
      
      The obvious alternative would be to add proper unwinding, but the
      justification for using try_module_get(), "rmmod --wait", is completely
      bogus as support for "rmmod --wait", i.e. delete_module() without
      O_NONBLOCK, was removed by commit 3f2b9c9c ("module: remove rmmod
      --wait option.") nearly a decade ago.
      
      It's still possible for try_module_get() to fail due to the module dying
      (more like being killed), as the module will be tagged MODULE_STATE_GOING
      by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
      with forced unloading is an exercise in futility and gives a false sense
      of security.  Using try_module_get() only prevents acquiring _new_
      references, it doesn't magically put the references held by other VMs,
      and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
      guaranteed to cause spectacular fireworks; the window where KVM will fail
      try_module_get() is tiny compared to the window where KVM is building and
      running the VM with an elevated module refcount.
      
      Addressing KVM's inability to play nice with "rmmod --force" is firmly
      out-of-scope.  Forcefully unloading any module taints the kernel (for
      obvious reasons) _and_ requires the kernel to be built with
      CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
      amusing disclaimer that it's "mainly for kernel developers and desperate
      users".  In other words, KVM is free to scoff at bug reports due to using
      "rmmod --force" while VMs may be running.
      
      Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Properly unwind VM creation if creating debugfs fails · 4ba4f419
      Sean Christopherson authored
      
      Properly unwind VM creation if kvm_create_vm_debugfs() fails.  A recent
      change to invoke kvm_create_vm_debugfs() in kvm_create_vm() was led astray
      by buggy try_module_get() handling added by commit 5f6de5cb ("KVM:
      Prevent module exit until all VMs are freed").  The debugfs error path
      effectively inherits the bad error path of try_module_get(), e.g. KVM
      leaves the to-be-freed VM on vm_list even though KVM appears to do the
      right thing by calling module_put() and falling through.
      
      Opportunistically hoist kvm_create_vm_debugfs() above the call to
      kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
      invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
      for the moment).  x86 is the only non-nop implementation of the post-init
      hook, and it doesn't allocate/initialize any objects that are reachable
      via debugfs code (spawns a kthread worker for the NX huge page mitigation).
      
      Leave the buggy try_module_get() alone for now, it will be fixed in a
      separate commit.
      
      Fixes: b74ed7a6 ("KVM: Actually create debugfs in kvm_create_vm()")
      Reported-by: <syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
      Message-Id: <20220816053937.2477106-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
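
The unwinding discipline discussed in the two commits above can be sketched with goto labels in a user-space model. The step names, their order, and the error codes are assumptions chosen for illustration; the point is only that each failure undoes exactly what was done so far, the VM becomes globally visible last, and the module reference is taken by a step that cannot fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum fail_point { FAIL_NONE, FAIL_DEBUGFS, FAIL_POST_INIT };

/* Hypothetical record of which creation steps are currently in effect. */
struct vm_create_model {
	bool notifier_registered;
	bool debugfs_created;
	bool on_vm_list;
};

static int module_refcount_model;

static int create_vm_model(struct vm_create_model *vm, enum fail_point fail)
{
	int r;

	assert(vm != NULL);
	vm->notifier_registered = false;
	vm->debugfs_created = false;
	vm->on_vm_list = false;

	vm->notifier_registered = true;

	if (fail == FAIL_DEBUGFS) {
		r = -ENOMEM;
		goto out_err_no_debugfs;
	}
	vm->debugfs_created = true;

	if (fail == FAIL_POST_INIT) {
		r = -EINVAL;
		goto out_err;
	}

	/* Only a fully initialized VM becomes globally visible. */
	vm->on_vm_list = true;
	module_refcount_model++; /* unconditional: this step has no failure path */
	return 0;

out_err:
	vm->debugfs_created = false;
out_err_no_debugfs:
	vm->notifier_registered = false;
	return r;
}
```

Because the list insertion and the reference grab come after every fallible step in this model, no error path has to remove the VM from the list or drop a module reference.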