  1. Jun 09, 2022
    • Maxim Levitsky's avatar
      KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking · 18869f26
      Maxim Levitsky authored
      
On SVM/AVIC, if preemption happens right after the call to finish_rcuwait
but before the call to kvm_arch_vcpu_unblocking, the preemption path itself
will re-enable AVIC, and then kvm_arch_vcpu_unblocking will try to re-enable
it again, which leads to a warning in __avic_vcpu_load.

The same problem can happen if the vCPU is preempted right after the call
to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait; in
that case we end up with AVIC enabled while the vCPU sleeps - oops.
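
The shape of the fix described above is to make the arch blocking/unblocking
hooks and the rcuwait transitions run in one non-preemptible section.  A
minimal sketch of that approach (simplified, not the verbatim diff):

  preempt_disable();
  kvm_arch_vcpu_blocking(vcpu);
  prepare_to_rcuwait(wait);
  preempt_enable();

  /* ... vCPU sleeps here ... */

  preempt_disable();
  finish_rcuwait(wait);
  kvm_arch_vcpu_unblocking(vcpu);
  preempt_enable();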
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      18869f26
  2. Jun 07, 2022
    • Alexey Kardashevskiy's avatar
      KVM: Don't null dereference ops->destroy · e8bc2427
      Alexey Kardashevskiy authored
      
      A KVM device cleanup happens in either of two callbacks:
      1) destroy() which is called when the VM is being destroyed;
      2) release() which is called when a device fd is closed.
      
Most KVM devices use 1) but Book3s's interrupt controller KVM devices
(XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
machine execution. The error handling in kvm_ioctl_create_device()
assumes destroy() is always defined, which leads to a NULL dereference
as discovered by syzkaller.

This adds a check for destroy != NULL and adds a missing release().
      
      This is not changing kvm_destroy_devices() as devices with defined
      release() should have been removed from the KVM devices list by then.
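
A sketch of the error-path check described above, in kvm_ioctl_create_device()
(simplified, not the verbatim diff):

  if (ret < 0) {
          kvm_put_kvm_no_destroy(kvm);
          mutex_lock(&kvm->lock);
          list_del(&dev->vm_node);
          if (ops->release)
                  ops->release(dev);
          mutex_unlock(&kvm->lock);
          if (ops->destroy)
                  ops->destroy(dev);
          return ret;
  }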
      
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8bc2427
  3. May 20, 2022
    • Sean Christopherson's avatar
      KVM: Free new dirty bitmap if creating a new memslot fails · c87661f8
      Sean Christopherson authored
      
      Fix a goof in kvm_prepare_memory_region() where KVM fails to free the
      new memslot's dirty bitmap during a CREATE action if
      kvm_arch_prepare_memory_region() fails.  The logic is supposed to detect
      if the bitmap was allocated and thus needs to be freed, versus if the
      bitmap was inherited from the old memslot and thus needs to be kept.  If
      there is no old memslot, then obviously the bitmap can't have been
inherited.
      
      The bug was exposed by commit 86931ff7 ("KVM: x86/mmu: Do not create
SPTEs for GFNs that exceed host.MAXPHYADDR"), which made it trivially easy
      for syzkaller to trigger failure during kvm_arch_prepare_memory_region(),
      but the bug can be hit other ways too, e.g. due to -ENOMEM when
      allocating x86's memslot metadata.
      
      The backtrace from kmemleak:
      
        __vmalloc_node_range+0xb40/0xbd0 mm/vmalloc.c:3195
        __vmalloc_node mm/vmalloc.c:3232 [inline]
        __vmalloc+0x49/0x50 mm/vmalloc.c:3246
        __vmalloc_array mm/util.c:671 [inline]
        __vcalloc+0x49/0x70 mm/util.c:694
        kvm_alloc_dirty_bitmap virt/kvm/kvm_main.c:1319
        kvm_prepare_memory_region virt/kvm/kvm_main.c:1551
        kvm_set_memslot+0x1bd/0x690 virt/kvm/kvm_main.c:1782
        __kvm_set_memory_region+0x689/0x750 virt/kvm/kvm_main.c:1949
        kvm_set_memory_region virt/kvm/kvm_main.c:1962
        kvm_vm_ioctl_set_memory_region virt/kvm/kvm_main.c:1974
        kvm_vm_ioctl+0x377/0x13a0 virt/kvm/kvm_main.c:4528
        vfs_ioctl fs/ioctl.c:51
        __do_sys_ioctl fs/ioctl.c:870
        __se_sys_ioctl fs/ioctl.c:856
        __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856
        do_syscall_x64 arch/x86/entry/common.c:50
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      And the relevant sequence of KVM events:
      
        ioctl(3, KVM_CREATE_VM, 0)              = 4
        ioctl(4, KVM_SET_USER_MEMORY_REGION, {slot=0,
                                              flags=KVM_MEM_LOG_DIRTY_PAGES,
                                              guest_phys_addr=0x10000000000000,
                                              memory_size=4096,
                                              userspace_addr=0x20fe8000}
             ) = -1 EINVAL (Invalid argument)
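
A sketch of the failure-path cleanup described above (simplified); the
condition mirrors the "no old memslot means the bitmap cannot have been
inherited" reasoning:

  r = kvm_arch_prepare_memory_region(kvm, old, new, change);

  /* Free the bitmap only if it was allocated here, not inherited. */
  if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
          kvm_destroy_dirty_bitmap(new);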
      
      Fixes: 244893fa ("KVM: Dynamically allocate "new" memslots from the get-go")
      Cc: stable@vger.kernel.org
Reported-by: <syzbot+8606b8a9cc97a63f1c87@syzkaller.appspotmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220518003842.1341782-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c87661f8
    • Wanpeng Li's avatar
      KVM: eventfd: Fix false positive RCU usage warning · e332b55f
      Wanpeng Li authored
      
      The splat below can be seen when running kvm-unit-test:
      
           =============================
           WARNING: suspicious RCU usage
           5.18.0-rc7 #5 Tainted: G          IOE
           -----------------------------
           /home/kernel/linux/arch/x86/kvm/../../../virt/kvm/eventfd.c:80 RCU-list traversed in non-reader section!!
      
           other info that might help us debug this:
      
           rcu_scheduler_active = 2, debug_locks = 1
           4 locks held by qemu-system-x86/35124:
            #0: ffff9725391d80b8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x710 [kvm]
            #1: ffffbd25cfb2a0b8 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0xdeb/0x1900 [kvm]
            #2: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: kvm_hv_notify_acked_sint+0x79/0x1e0 [kvm]
            #3: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: irqfd_resampler_ack+0x5/0x110 [kvm]
      
           stack backtrace:
           CPU: 2 PID: 35124 Comm: qemu-system-x86 Tainted: G          IOE     5.18.0-rc7 #5
           Call Trace:
            <TASK>
            dump_stack_lvl+0x6c/0x9b
            irqfd_resampler_ack+0xfd/0x110 [kvm]
            kvm_notify_acked_gsi+0x32/0x90 [kvm]
            kvm_hv_notify_acked_sint+0xc5/0x1e0 [kvm]
            kvm_hv_set_msr_common+0xec1/0x1160 [kvm]
            kvm_set_msr_common+0x7c3/0xf60 [kvm]
            vmx_set_msr+0x394/0x1240 [kvm_intel]
            kvm_set_msr_ignored_check+0x86/0x200 [kvm]
            kvm_emulate_wrmsr+0x4f/0x1f0 [kvm]
            vmx_handle_exit+0x6fb/0x7e0 [kvm_intel]
            vcpu_enter_guest+0xe5a/0x1900 [kvm]
            kvm_arch_vcpu_ioctl_run+0x16e/0xac0 [kvm]
            kvm_vcpu_ioctl+0x279/0x710 [kvm]
            __x64_sys_ioctl+0x83/0xb0
            do_syscall_64+0x3b/0x90
            entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      resampler-list is protected by irq_srcu (see kvm_irqfd_assign), so fix
      the false positive by using list_for_each_entry_srcu().
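
Roughly, the traversal in irqfd_resampler_ack() becomes the following
(a sketch; field names are taken from my reading of the irqfd code):

  list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
                           srcu_read_lock_held(&kvm->irq_srcu))
          eventfd_signal(irqfd->resamplefd, 1);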
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1652950153-12489-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e332b55f
  4. May 17, 2022
  5. May 13, 2022
  6. May 02, 2022
  7. Apr 29, 2022
    • Paolo Bonzini's avatar
      KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini authored
      
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
- the member is not aligned to 64 bits, so the definition of the
  uAPI struct differs between 32-bit and 64-bit userspace.  This is a
  problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
  usage of flags was only introduced in 5.18.
      
Since padding has to be introduced anyway, place a new field there
that tells whether the flags field is valid.  In fact, to allow further
extensibility, change flags to an array of 16 values, and store how many
of the values are valid.  The availability of the new ndata field
is tied to a system capability; all architectures are changed to
fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
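
The resulting uAPI layout is roughly the following (a sketch; the
authoritative definition lives in include/uapi/linux/kvm.h):

  /* KVM_EXIT_SYSTEM_EVENT */
  struct {
          __u32 type;
          __u32 ndata;
          union {
  #ifndef __KERNEL__
                  __u64 flags;    /* kept only so old userspace still builds */
  #endif
                  __u64 data[16];
          };
  } system_event;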
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d495f942
  8. Apr 21, 2022
    • Mingwei Zhang's avatar
      KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang authored
      
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7).  Cachelines generated by confidential VM guests
have to be explicitly flushed on the host side.  If a memory page
containing dirty confidential cachelines is released by the VM and
reallocated to another user, the cachelines may corrupt the new user at
a later time.

KVM takes a shortcut by assuming all confidential memory remains pinned
until the end of the VM's lifetime, and therefore does not flush caches
at mmu_notifier invalidation events.  Because of this incorrect assumption
and the lack of cache flushing, malicious userspace can crash the host
kernel by creating a malicious VM that continuously allocates and releases
unpinned confidential memory pages while the VM is running.

Add cache flush operations to the mmu_notifier operations to ensure that
any physical memory leaving the guest VM gets flushed.  In particular,
hook the mmu_notifier_invalidate_range_start and mmu_notifier_release
events and flush the cache accordingly.  The flush runs after releasing
the mmu lock to avoid contention with other vCPUs.
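
A sketch of the flush hook described above (hook and helper names reflect
my reading of the series and are simplified):

  /* Invoked from the mmu_notifier paths when guest memory is reclaimed. */
  void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
  {
          /* Flush potentially dirty encrypted cachelines before the pages
           * can be reused by another owner. */
          if (sev_guest(kvm))
                  wbinvd_on_all_cpus();
  }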
      
      Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      683412cc
    • Tom Rix's avatar
      KVM: SPDX style and spelling fixes · a413a625
      Tom Rix authored
      
SPDX comments use /* */ style comments in headers and
// style comments in .c files.  Also fix two spelling mistakes.
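
For reference, the two styles are (license string shown is illustrative):

  /* SPDX-License-Identifier: GPL-2.0-only */    (header files)
  // SPDX-License-Identifier: GPL-2.0-only       (.c files)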
      
Signed-off-by: Tom Rix <trix@redhat.com>
Message-Id: <20220410153840.55506-1-trix@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a413a625
    • Sean Christopherson's avatar
      KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref · 5c697c36
      Sean Christopherson authored
      
Initialize debugfs_dentry to its semi-magical -ENOENT value when the VM
is created.  KVM's teardown when VM creation fails is kludgy and calls
kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
attempted kvm_create_vm_debugfs().  Because debugfs_dentry is zero
initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__dentry_path+0x7b/0x130
        Call Trace:
         <TASK>
         dentry_path_raw+0x42/0x70
         kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
         kvm_put_kvm+0x63/0x2b0 [kvm]
         kvm_dev_ioctl+0x43a/0x920 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x31/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        Modules linked in: kvm_intel kvm irqbypass
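
A sketch of the early initialization described above (simplified):

  /* Early in kvm_create_vm(), before any failure path can run: */
  kvm->debugfs_dentry = ERR_PTR(-ENOENT);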
      
      Fixes: a44a4cc1 ("KVM: Don't create VM debugfs files outside of the VM directory")
      Cc: stable@vger.kernel.org
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oupton@google.com>
Reported-by: <syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20220415004622.2207751-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c697c36
  9. Apr 07, 2022
    • Oliver Upton's avatar
      KVM: Don't create VM debugfs files outside of the VM directory · a44a4cc1
      Oliver Upton authored
      
      Unfortunately, there is no guarantee that KVM was able to instantiate a
      debugfs directory for a particular VM. To that end, KVM shouldn't even
      attempt to create new debugfs files in this case. If the specified
      parent dentry is NULL, debugfs_create_file() will instantiate files at
      the root of debugfs.
      
For arm64, it is possible to create the vgic-state file outside of a
VM directory, and the file is not cleaned up when the VM is destroyed.
Nonetheless, the corresponding struct kvm is freed when the VM is
destroyed.
      
      Nip the problem in the bud for all possible errant debugfs file
      creations by initializing kvm->debugfs_dentry to -ENOENT. In so doing,
      debugfs_create_file() will fail instead of creating the file in the root
      directory.
      
      Cc: stable@kernel.org
      Fixes: 929f45e3 ("kvm: no need to check return value of debugfs_create functions")
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220406235615.1447180-2-oupton@google.com
      a44a4cc1
  10. Apr 06, 2022
    • Paolo Bonzini's avatar
      KVM: avoid NULL pointer dereference in kvm_dirty_ring_push · 5593473a
      Paolo Bonzini authored
      
      kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing
      ring->dirty_gfns and setting it to NULL.  Afterwards, it calls
      kvm_arch_vcpu_destroy().
      
However, if closing the file descriptor races with KVM_RUN in such a way
that vcpu->arch.st.preempted == 0, the following call stack leads to a
NULL pointer dereference in kvm_dirty_ring_push():
      
       mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171
       kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline]
       kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618
       vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211
       vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985
       kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219
       kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
      
      The fix is to release the dirty page ring after kvm_arch_vcpu_destroy
      has run.
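
A sketch of the reordering described above (simplified):

  /* Destroy the vCPU first; kvm_arch_vcpu_destroy() may still mark pages
   * dirty, e.g. via kvm_arch_vcpu_put(), so the ring must outlive it. */
  kvm_arch_vcpu_destroy(vcpu);
  kvm_dirty_ring_free(&vcpu->dirty_ring);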
      
Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5593473a
  11. Apr 02, 2022
    • David Woodhouse's avatar
      KVM: Remove dirty handling from gfn_to_pfn_cache completely · cf1d88b3
      David Woodhouse authored
      
      It isn't OK to cache the dirty status of a page in internal structures
      for an indefinite period of time.
      
      Any time a vCPU exits the run loop to userspace might be its last; the
      VMM might do its final check of the dirty log, flush the last remaining
      dirty pages to the destination and complete a live migration. If we
      have internal 'dirty' state which doesn't get flushed until the vCPU
      is finally destroyed on the source after migration is complete, then
      we have lost data because that will escape the final copy.
      
      This problem already exists with the use of kvm_vcpu_unmap() to mark
      pages dirty in e.g. VMX nesting.
      
      Note that the actual Linux MM already considers the page to be dirty
      since we have a writeable mapping of it. This is just about the KVM
      dirty logging.
      
      For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to
      track which gfn_to_pfn_caches have been used and explicitly mark the
      corresponding pages dirty before returning to userspace. But we would
      have needed external tracking of that anyway, rather than walking the
      full list of GPCs to find those belonging to this vCPU which are dirty.
      
      So let's rely *solely* on that external tracking, and keep it simple
      rather than laying a tempting trap for callers to fall into.
      
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cf1d88b3
    • Sean Christopherson's avatar
      KVM: Use enum to track if cached PFN will be used in guest and/or host · d0d96121
      Sean Christopherson authored
      
      Replace the guest_uses_pa and kernel_map booleans in the PFN cache code
      with a unified enum/bitmask. Using explicit names makes it easier to
      review and audit call sites.
      
Opportunistically add a WARN to prevent passing garbage; instantiating a
cache without declaring its usage is either buggy or pointless.
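
A sketch of the enum/bitmask and sanity check described above (names
follow the description and are illustrative):

  enum pfn_cache_usage {
          KVM_GUEST_USES_PFN = BIT(0),
          KVM_HOST_USES_PFN  = BIT(1),
          KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN,
  };

  /* Instantiating a cache without declaring a usage is buggy or pointless. */
  WARN_ON_ONCE(!usage || (usage & KVM_GUEST_AND_HOST_USE_PFN) != usage);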
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d0d96121
    • Sean Christopherson's avatar
      KVM: Don't actually set a request when evicting vCPUs for GFN cache invd · df06dae3
      Sean Christopherson authored
      
      Don't actually set a request bit in vcpu->requests when making a request
      purely to force a vCPU to exit the guest.  Logging a request but not
      actually consuming it would cause the vCPU to get stuck in an infinite
      loop during KVM_RUN because KVM would see the pending request and bail
      from VM-Enter to service the request.
      
      Note, it's currently impossible for KVM to set KVM_REQ_GPC_INVALIDATE as
      nothing in KVM is wired up to set guest_uses_pa=true.  But, it'd be all
      too easy for arch code to introduce use of kvm_gfn_to_pfn_cache_init()
      without implementing handling of the request, especially since getting
      test coverage of MMU notifier interaction with specific KVM features
      usually requires a directed test.
      
      Opportunistically rename gfn_to_pfn_cache_invalidate_start()'s wake_vcpus
      to evict_vcpus.  The purpose of the request is to get vCPUs out of guest
      mode, it's supposed to _avoid_ waking vCPUs that are blocking.
      
Opportunistically rename KVM_REQ_GPC_INVALIDATE to be more specific as to
what it wants to accomplish, and to genericize the name so that it can be
used for similar but unrelated scenarios, should they arise in the future.
      Add a comment and documentation to explain why the "no action" request
      exists.
      
      Add compile-time assertions to help detect improper usage.  Use the inner
      assertless helper in the one s390 path that makes requests without a
      hardcoded request.
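
The "no action" pattern described above boils down to something like this
(a sketch; the flag and helper names are taken from the description's
intent and may not match the final code exactly):

  /* Kick the vCPU out of guest mode, but don't log a request bit that
   * nothing will ever consume. */
  if (!(req & KVM_REQUEST_NO_ACTION))
          __kvm_make_request(req, vcpu);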
      
      Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220223165302.3205276-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df06dae3
    • David Woodhouse's avatar
      KVM: avoid double put_page with gfn-to-pfn cache · 79593c08
      David Woodhouse authored
      
      If the cache's user host virtual address becomes invalid, there
      is still a path from kvm_gfn_to_pfn_cache_refresh() where __release_gpc()
      could release the pfn but the gpc->pfn field has not been overwritten
      with an error value.  If this happens, kvm_gfn_to_pfn_cache_unmap will
      call put_page again on the same page.
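
A sketch of the fix's shape as described above (field names assumed from
the gfn_to_pfn_cache structure; simplified):

  /* After releasing the mapping for an invalid hva, poison the cached
   * fields so a later kvm_gfn_to_pfn_cache_unmap() cannot put_page() the
   * same page again. */
  gpc->valid = false;
  gpc->pfn = KVM_PFN_ERR_FAULT;
  gpc->khva = NULL;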
      
      Cc: stable@vger.kernel.org
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      79593c08
  12. Mar 29, 2022
    • David Matlack's avatar
      Revert "KVM: set owner of cpu and vm file operations" · 70375c2d
      David Matlack authored
      
      This reverts commit 3d3aab1b.
      
Now that the KVM module's lifetime is tied to kvm.users_count, there is
no need to also tie its lifetime to the lifetime of the VM and vCPU
file descriptors.
      
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220303183328.1499189-3-dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70375c2d
    • David Matlack's avatar
      KVM: Prevent module exit until all VMs are freed · 5f6de5cb
      David Matlack authored
      
Tie the lifetime of the KVM module to the lifetime of each VM via
kvm.users_count. This way anything that grabs a reference to the VM via
kvm_get_kvm() cannot accidentally outlive the KVM module.
      
      Prior to this commit, the lifetime of the KVM module was tied to the
      lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU
      file descriptors by their respective file_operations "owner" field.
      This approach is insufficient because references grabbed via
      kvm_get_kvm() do not prevent closing any of the aforementioned file
      descriptors.
      
      This fixes a long standing theoretical bug in KVM that at least affects
      async page faults. kvm_setup_async_pf() grabs a reference via
      kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing
      prevents the VM file descriptor from being closed and the KVM module
      from being unloaded before this callback runs.
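
A sketch of tying the module's lifetime to each VM, as described above
(simplified):

  /* kvm_create_vm(): pin the module for as long as the VM exists. */
  if (!try_module_get(kvm_chardev_ops.owner))
          goto out_err;   /* module is on its way out */

  /* kvm_destroy_vm(): drop the reference taken at creation. */
  module_put(kvm_chardev_ops.owner);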
      
      Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out")
      Fixes: 3d3aab1b ("KVM: set owner of cpu and vm file operations")
      Cc: stable@vger.kernel.org
Suggested-by: Ben Gardon <bgardon@google.com>
[ Based on a patch from Ben implemented for Google's kernel. ]
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220303183328.1499189-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f6de5cb
  13. Mar 11, 2022
  14. Mar 08, 2022
  15. Mar 01, 2022
    • Sean Christopherson's avatar
      KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users · 2f6f66cc
      Sean Christopherson authored
      
      Remove the generic kvm_reload_remote_mmus() and open code its
      functionality into the two x86 callers.  x86 is (obviously) the only
      architecture that uses the hook, and is also the only architecture that
uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name.  That
will change in a future patch: when zapping a single shadow page, x86
doesn't actually _need_ to reload all vCPUs' MMUs, only the MMUs whose
root is being zapped need to be reloaded.
      
      s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
      
      Drop the generic code in anticipation of implementing s390 and x86 arch
      specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
      
      Opportunistically reword the x86 TDP MMU comment to avoid making
      references to functions (and requests!) when possible, and to remove the
      rather ambiguous "this".
      
      No functional change intended.
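
The open coding described above amounts to roughly this at each x86 call
site (a sketch):

  /* was: kvm_reload_remote_mmus(kvm); */
  kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);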
      
      Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f6f66cc
  16. Feb 25, 2022
    • Vipin Sharma's avatar
      KVM: Move VM's worker kthreads back to the original cgroup before exiting. · e45cce30
      Vipin Sharma authored
      
VM worker kthreads can linger in the VM process's cgroup for some time
after KVM terminates the VM process.

KVM terminates the worker kthreads by calling kthread_stop(), which waits
on the 'exited' completion, triggered by exit_mm(), via mm_release(), in
do_exit() during the kthread's exit.  However, these kthreads are
removed from the cgroup by cgroup_exit(), which happens after
exit_mm().  Therefore, a VM process can terminate in between the
exit_mm() and cgroup_exit() calls, leaving only worker kthreads in the
cgroup.
      
      Moving worker kthreads back to the original cgroup (kthreadd_task's
      cgroup) makes sure that the cgroup is empty as soon as the main VM
      process is terminated.
      
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220222054848.563321-1-vipinsh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e45cce30
  17. Feb 17, 2022
    • Wanpeng Li's avatar
      KVM: Fix lockdep false negative during host resume · 4cb9a998
      Wanpeng Li authored
      
I saw the splat below after the host suspended and resumed.
      
         WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
         CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
         RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
         Call Trace:
          <TASK>
          syscore_resume+0x90/0x340
          suspend_devices_and_enter+0xaee/0xe90
          pm_suspend.cold+0x36b/0x3c2
          state_store+0x82/0xf0
          kernfs_fop_write_iter+0x1b6/0x260
          new_sync_write+0x258/0x370
          vfs_write+0x33f/0x510
          ksys_write+0xc9/0x160
          do_syscall_64+0x3b/0xc0
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
lockdep_is_held() can return -1 when lockdep is disabled, which triggers
this warning.  Use lockdep_assert_not_held() instead, which detects
incorrect calls while holding a lock and also avoids false negatives
when lockdep is disabled.
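
A sketch of the change (the lock name is assumed from the warning's
location in kvm_resume()):

  /* was: WARN_ON(lockdep_is_held(&kvm_count_lock)); */
  lockdep_assert_not_held(&kvm_count_lock);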
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4cb9a998
  18. Feb 10, 2022
  19. Jan 28, 2022
    • Hou Wenlong's avatar
      KVM: eventfd: Fix false positive RCU usage warning · 6a0c6170
      Hou Wenlong authored
      
      Fix the following false positive warning:
       =============================
       WARNING: suspicious RCU usage
       5.16.0-rc4+ #57 Not tainted
       -----------------------------
       arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by fc_vcpu 0/330:
        #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
        #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
        #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]
      
       stack backtrace:
       CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x44/0x57
        kvm_notify_acked_gsi+0x6b/0x70 [kvm]
        kvm_notify_acked_irq+0x8d/0x180 [kvm]
        kvm_ioapic_update_eoi+0x92/0x240 [kvm]
        kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
        handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
        vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
        vcpu_enter_guest+0x66e/0x1860 [kvm]
        kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
        kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
        __x64_sys_ioctl+0x89/0xc0
        do_syscall_64+0x3a/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
      kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
      kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
      a false positive warning. So use hlist_for_each_entry_srcu() instead of
      hlist_for_each_entry_rcu().
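
Roughly, the traversal in kvm_notify_acked_gsi() becomes (a sketch):

  hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list, link,
                            srcu_read_lock_held(&kvm->irq_srcu))
          if (kian->gsi == gsi)
                  kian->irq_acked(kian);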
      
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6a0c6170
  20. Jan 26, 2022
  21. Jan 24, 2022
  22. Jan 19, 2022
    • Sean Christopherson's avatar
      KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx · 12a8eee5
      Sean Christopherson authored
      
      Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
      rename the list and all associated variables to clarify that it tracks
the set of vCPUs that need to be poked on a posted interrupt to the wakeup
vector.  The list is not used to track _all_ vCPUs that are blocking, and
the term "blocked" can be misleading as it may refer to a blocking
condition in the host or the guest, whereas the PI wakeup case is
specifically for the vCPUs that are actively blocking from within the
      guest.
      
      No functional change intended.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12a8eee5
    • Sean Christopherson's avatar
      KVM: Drop unused kvm_vcpu.pre_pcpu field · e6eec09b
      Sean Christopherson authored
      
      Remove kvm_vcpu.pre_pcpu as it no longer has any users.  No functional
      change intended.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6eec09b
    • Christian Borntraeger's avatar
      KVM: avoid warning on s390 in mark_page_dirty · e09fccb5
      Christian Borntraeger authored
      
      Avoid warnings on s390 like
      [ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G            E     5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1
      [ 1801.980938] Workqueue: events irqfd_inject [kvm]
      [...]
      [ 1801.981057] Call Trace:
      [ 1801.981060]  [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm]
      [ 1801.981083]  [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm]
      [ 1801.981104]  [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm]
      [ 1801.981124]  [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm]
      [ 1801.981144]  [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm]
      [ 1801.981164]  [<0000000175e56906>] process_one_work+0x1fe/0x470
      [ 1801.981173]  [<0000000175e570a4>] worker_thread+0x64/0x498
      [ 1801.981176]  [<0000000175e5ef2c>] kthread+0x10c/0x110
      [ 1801.981180]  [<0000000175de73c8>] __ret_from_fork+0x40/0x58
      [ 1801.981185]  [<000000017698440a>] ret_from_fork+0xa/0x40
      
      when writing to a guest from an irqfd worker as long as we do not have
      the dirty ring.
      
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reluctantly-acked-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220113122924.740496-1-borntraeger@linux.ibm.com>
Fixes: 2efd61a6 ("KVM: Warn if mark_page_dirty() is called without an active vCPU")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e09fccb5
  23. Jan 07, 2022
    • David Woodhouse's avatar
      KVM: Reinstate gfn_to_pfn_cache with invalidation support · 982ed0de
      David Woodhouse authored
      
      This can be used in two modes. There is an atomic mode where the cached
      mapping is accessed while holding the rwlock, and a mode where the
      physical address is used by a vCPU in guest mode.
      
      For the latter case, an invalidation will wake the vCPU with the new
      KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
      caches it still needs to access before entering guest mode again.
      
      Only one vCPU can be targeted by the wake requests; it's simple enough
      to make it wake all vCPUs or even a mask but I don't see a use case for
      that additional complexity right now.
      
      Invalidation happens from the invalidate_range_start MMU notifier, which
      needs to be able to sleep in order to wake the vCPU and wait for it.
      
      This means that revalidation potentially needs to "wait" for the MMU
      operation to complete and the invalidate_range_end notifier to be
      invoked. Like the vCPU when it takes a page fault in that period, we
      just spin — fixing that in a future patch by implementing an actual
      *wait* may be another part of shaving this particularly hirsute yak.
      
      As noted in the comments in the function itself, the only case where
      the invalidate_range_start notifier is expected to be called *without*
      being able to sleep is when the OOM reaper is killing the process. In
      that case, we expect the vCPU threads already to have exited, and thus
      there will be nothing to wake, and no reason to wait. So we clear the
      KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
      if there actually *was* anything to wake up.
      
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      982ed0de
    • David Woodhouse's avatar
      KVM: Warn if mark_page_dirty() is called without an active vCPU · 2efd61a6
      David Woodhouse authored
      The various kvm_write_guest() and mark_page_dirty() functions must only
      ever be called in the context of an active vCPU, because if dirty ring
      tracking is enabled it may simply oops when kvm_get_running_vcpu()
      returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it.
      
      This oops was reported by "butt3rflyh4ck" <butterflyhuangxx@gmail.com> in
      https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/
      
      
      
      That actual bug will be fixed under separate cover but this warning
      should help to prevent new ones from being added.
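
A sketch of the guard described above (simplified; the actual check is
compiled in only when dirty-ring support is configured):

  struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

  if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
          return;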
      
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2efd61a6
  24. Dec 09, 2021