- Jun 09, 2022
-
-
Maxim Levitsky authored
On SVM, if preemption happens right after the call to finish_rcuwait but before call to kvm_arch_vcpu_unblocking on SVM/AVIC, it itself will re-enable AVIC, and then we will try to re-enable it again in kvm_arch_vcpu_unblocking which will lead to a warning in __avic_vcpu_load. The same problem can happen if the vCPU is preempted right after the call to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait and in this case, we will end up with AVIC enabled during sleep - Ooops. Signed-off-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jun 07, 2022
-
-
Alexey Kardashevskiy authored
A KVM device cleanup happens in either of two callbacks: 1) destroy() which is called when the VM is being destroyed; 2) release() which is called when a device fd is closed. Most KVM devices use 1) but Book3s's interrupt controller KVM devices (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during the machine execution. The error handling in kvm_ioctl_create_device() assumes destroy() is always defined which leads to NULL dereference as discovered by Syzkaller. This adds a checks for destroy!=NULL and adds a missing release(). This is not changing kvm_destroy_devices() as devices with defined release() should have been removed from the KVM devices list by then. Suggested-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- May 20, 2022
-
-
Sean Christopherson authored
Fix a goof in kvm_prepare_memory_region() where KVM fails to free the new memslot's dirty bitmap during a CREATE action if kvm_arch_prepare_memory_region() fails. The logic is supposed to detect if the bitmap was allocated and thus needs to be freed, versus if the bitmap was inherited from the old memslot and thus needs to be kept. If there is no old memslot, then obviously the bitmap can't have been inherited The bug was exposed by commit 86931ff7 ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR"), which made it trivally easy for syzkaller to trigger failure during kvm_arch_prepare_memory_region(), but the bug can be hit other ways too, e.g. due to -ENOMEM when allocating x86's memslot metadata. The backtrace from kmemleak: __vmalloc_node_range+0xb40/0xbd0 mm/vmalloc.c:3195 __vmalloc_node mm/vmalloc.c:3232 [inline] __vmalloc+0x49/0x50 mm/vmalloc.c:3246 __vmalloc_array mm/util.c:671 [inline] __vcalloc+0x49/0x70 mm/util.c:694 kvm_alloc_dirty_bitmap virt/kvm/kvm_main.c:1319 kvm_prepare_memory_region virt/kvm/kvm_main.c:1551 kvm_set_memslot+0x1bd/0x690 virt/kvm/kvm_main.c:1782 __kvm_set_memory_region+0x689/0x750 virt/kvm/kvm_main.c:1949 kvm_set_memory_region virt/kvm/kvm_main.c:1962 kvm_vm_ioctl_set_memory_region virt/kvm/kvm_main.c:1974 kvm_vm_ioctl+0x377/0x13a0 virt/kvm/kvm_main.c:4528 vfs_ioctl fs/ioctl.c:51 __do_sys_ioctl fs/ioctl.c:870 __se_sys_ioctl fs/ioctl.c:856 __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae And the relevant sequence of KVM events: ioctl(3, KVM_CREATE_VM, 0) = 4 ioctl(4, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=KVM_MEM_LOG_DIRTY_PAGES, guest_phys_addr=0x10000000000000, memory_size=4096, userspace_addr=0x20fe8000} ) = -1 EINVAL (Invalid argument) Fixes: 244893fa ("KVM: Dynamically allocate "new" memslots from the get-go") Cc: stable@vger.kernel.org Reported-by:
<syzbot+8606b8a9cc97a63f1c87@syzkaller.appspotmail.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220518003842.1341782-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
The splat below can be seen when running kvm-unit-test: ============================= WARNING: suspicious RCU usage 5.18.0-rc7 #5 Tainted: G IOE ----------------------------- /home/kernel/linux/arch/x86/kvm/../../../virt/kvm/eventfd.c:80 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by qemu-system-x86/35124: #0: ffff9725391d80b8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x710 [kvm] #1: ffffbd25cfb2a0b8 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0xdeb/0x1900 [kvm] #2: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: kvm_hv_notify_acked_sint+0x79/0x1e0 [kvm] #3: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: irqfd_resampler_ack+0x5/0x110 [kvm] stack backtrace: CPU: 2 PID: 35124 Comm: qemu-system-x86 Tainted: G IOE 5.18.0-rc7 #5 Call Trace: <TASK> dump_stack_lvl+0x6c/0x9b irqfd_resampler_ack+0xfd/0x110 [kvm] kvm_notify_acked_gsi+0x32/0x90 [kvm] kvm_hv_notify_acked_sint+0xc5/0x1e0 [kvm] kvm_hv_set_msr_common+0xec1/0x1160 [kvm] kvm_set_msr_common+0x7c3/0xf60 [kvm] vmx_set_msr+0x394/0x1240 [kvm_intel] kvm_set_msr_ignored_check+0x86/0x200 [kvm] kvm_emulate_wrmsr+0x4f/0x1f0 [kvm] vmx_handle_exit+0x6fb/0x7e0 [kvm_intel] vcpu_enter_guest+0xe5a/0x1900 [kvm] kvm_arch_vcpu_ioctl_run+0x16e/0xac0 [kvm] kvm_vcpu_ioctl+0x279/0x710 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae resampler-list is protected by irq_srcu (see kvm_irqfd_assign), so fix the false positive by using list_for_each_entry_srcu(). Signed-off-by:
Wanpeng Li <wanpengli@tencent.com> Message-Id: <1652950153-12489-1-git-send-email-wanpengli@tencent.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- May 17, 2022
-
-
Wan Jiabing authored
Fix following coccicheck warning: ./virt/kvm/vfio.c:258:1-7: preceding lock on line 236 If kvm_vfio_file_iommu_group() failed, code would goto err_fdput with mutex_lock acquired and then return ret. It might cause potential deadlock. Move mutex_unlock bellow err_fdput tag to fix it. Fixes: d55d9e7a ("kvm/vfio: Store the struct file in the kvm_vfio_group") Signed-off-by:
Wan Jiabing <wanjiabing@vivo.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220517023441.4258-1-wanjiabing@vivo.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
- May 13, 2022
-
-
Jason Gunthorpe authored
None of the VFIO APIs take in the vfio_group anymore, so we can remove it completely. This has a subtle side effect on the enforced coherency tracking. The vfio_group_get_external_user() was holding on to the container_users which would prevent the iommu_domain and thus the enforced coherency value from changing while the group is registered with kvm. It changes the security proof slightly into 'user must hold a group FD that has a device that cannot enforce DMA coherence'. As opening the group FD, not attaching the container, is the privileged operation this doesn't change the security properties much. On the flip side it paves the way to changing the iommu_domain/container attached to a group at runtime which is something that will be required to support nested translation. Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de>i> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/7-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
Just change the argument from struct vfio_group to struct file *. Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/6-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
Instead of a general extension check change the function into a limited test if the iommu_domain has enforced coherency, which is the only thing kvm needs to query. Make the new op self contained by properly refcounting the container before touching it. Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
vfio_group_fops_open() ensures there is only ever one struct file open for any struct vfio_group at any time: /* Do we need multiple instances of the group open? Seems not. */ opened = atomic_cmpxchg(&group->opened, 0, 1); if (opened) { vfio_group_put(group); return -EBUSY; Therefor the struct file * can be used directly to search the list of VFIO groups that KVM keeps instead of using the vfio_external_group_match_file() callback to try to figure out if the passed in FD matches the list or not. Delete vfio_external_group_match_file(). Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Yi Liu <yi.l.liu@intel.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
The only caller wants to get a pointer to the struct iommu_group associated with the VFIO group file. Instead of returning the group ID then searching sysfs for that string to get the struct iommu_group just directly return the iommu_group pointer already held by the vfio_group struct. It already has a safe lifetime due to the struct file kref, the vfio_group and thus the iommu_group cannot be destroyed while the group file is open. Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Yi Liu <yi.l.liu@intel.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
Following patches will change the APIs to use the struct file as the handle instead of the vfio_group, so hang on to a reference to it with the same duration of as the vfio_group. Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Yi Liu <yi.l.liu@intel.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
To make it easier to read and change in following patches. Reviewed-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Cornelia Huck <cohuck@redhat.com> Reviewed-by:
Yi Liu <yi.l.liu@intel.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
- May 02, 2022
-
-
Sean Christopherson authored
For TDX guests, the maximum number of vcpus needs to be specified when the TDX guest VM is initialized (creating the TDX data corresponding to TDX guest) before creating vcpu. It needs to record the maximum number of vcpus on VM creation (KVM_CREATE_VM) and return error if the number of vcpus exceeds it Because there is already max_vcpu member in arm64 struct kvm_arch, move it to common struct kvm and initialize it to KVM_MAX_VCPUS before kvm_arch_init_vm() instead of adding it to x86 struct kvm_arch. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <e53234cdee6a92357d06c80c03d77c19cdefb804.1646422845.git.isaku.yamahata@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Apr 29, 2022
-
-
Paolo Bonzini authored
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags member that at the time was unused. Unfortunately this extensibility mechanism has several issues: - x86 is not writing the member, so it would not be possible to use it on x86 except for new events - the member is not aligned to 64 bits, so the definition of the uAPI struct is incorrect for 32- on 64-bit userspace. This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately usage of flags was only introduced in 5.18. Since padding has to be introduced, place a new field in there that tells if the flags field is valid. To allow further extensibility, in fact, change flags to an array of 16 values, and store how many of the values are valid. The availability of the new ndata field is tied to a system capability; all architectures are changed to fill in the field. To avoid breaking compilation of userspace that was using the flags field, provide a userspace-only union to overlap flags with data[0]. The new field is placed at the same offset for both 32- and 64-bit userspace. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Reported-by:
kernel test robot <lkp@intel.com> Message-Id: <20220422103013.34832-1-pbonzini@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Apr 21, 2022
-
-
Mingwei Zhang authored
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by:
Sean Christpherson <seanjc@google.com> Reported-by:
Mingwei Zhang <mizhang@google.com> Signed-off-by:
Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Tom Rix authored
SPDX comments use use /* */ style comments in headers anad // style comments in .c files. Also fix two spelling mistakes. Signed-off-by:
Tom Rix <trix@redhat.com> Message-Id: <20220410153840.55506-1-trix@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Initialize debugfs_entry to its semi-magical -ENOENT value when the VM is created. KVM's teardown when VM creation fails is kludgy and calls kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer. BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__dentry_path+0x7b/0x130 Call Trace: <TASK> dentry_path_raw+0x42/0x70 kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm] kvm_put_kvm+0x63/0x2b0 [kvm] kvm_dev_ioctl+0x43a/0x920 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x31/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Modules linked in: kvm_intel kvm irqbypass Fixes: a44a4cc1 ("KVM: Don't create VM debugfs files outside of the VM directory") Cc: stable@vger.kernel.org Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@google.com> Reported-by:
<syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Oliver Upton <oupton@google.com> Message-Id: <20220415004622.2207751-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Apr 07, 2022
-
-
Oliver Upton authored
Unfortunately, there is no guarantee that KVM was able to instantiate a debugfs directory for a particular VM. To that end, KVM shouldn't even attempt to create new debugfs files in this case. If the specified parent dentry is NULL, debugfs_create_file() will instantiate files at the root of debugfs. For arm64, it is possible to create the vgic-state file outside of a VM directory, the file is not cleaned up when a VM is destroyed. Nonetheless, the corresponding struct kvm is freed when the VM is destroyed. Nip the problem in the bud for all possible errant debugfs file creations by initializing kvm->debugfs_dentry to -ENOENT. In so doing, debugfs_create_file() will fail instead of creating the file in the root directory. Cc: stable@kernel.org Fixes: 929f45e3 ("kvm: no need to check return value of debugfs_create functions") Signed-off-by:
Oliver Upton <oupton@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220406235615.1447180-2-oupton@google.com
-
- Apr 06, 2022
-
-
Paolo Bonzini authored
kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing ring->dirty_gfns and setting it to NULL. Afterwards, it calls kvm_arch_vcpu_destroy(). However, if closing the file descriptor races with KVM_RUN in such away that vcpu->arch.st.preempted == 0, the following call stack leads to a NULL pointer dereference in kvm_dirty_run_push(): mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171 kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline] kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618 vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211 vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985 kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] The fix is to release the dirty page ring after kvm_arch_vcpu_destroy has run. Reported-by:
Qiuhao Li <qiuhao@sysec.org> Reported-by:
Gaoning Pan <pgn@zju.edu.cn> Reported-by:
Yongkang Jia <kangel@zju.edu.cn> Cc: stable@vger.kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Apr 02, 2022
-
-
David Woodhouse authored
It isn't OK to cache the dirty status of a page in internal structures for an indefinite period of time. Any time a vCPU exits the run loop to userspace might be its last; the VMM might do its final check of the dirty log, flush the last remaining dirty pages to the destination and complete a live migration. If we have internal 'dirty' state which doesn't get flushed until the vCPU is finally destroyed on the source after migration is complete, then we have lost data because that will escape the final copy. This problem already exists with the use of kvm_vcpu_unmap() to mark pages dirty in e.g. VMX nesting. Note that the actual Linux MM already considers the page to be dirty since we have a writeable mapping of it. This is just about the KVM dirty logging. For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to track which gfn_to_pfn_caches have been used and explicitly mark the corresponding pages dirty before returning to userspace. But we would have needed external tracking of that anyway, rather than walking the full list of GPCs to find those belonging to this vCPU which are dirty. So let's rely *solely* on that external tracking, and keep it simple rather than laying a tempting trap for callers to fall into. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace the guest_uses_pa and kernel_map booleans in the PFN cache code with a unified enum/bitmask. Using explicit names makes it easier to review and audit call sites. Opportunistically add a WARN to prevent passing garbage; instantating a cache without declaring its usage is either buggy or pointless. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't actually set a request bit in vcpu->requests when making a request purely to force a vCPU to exit the guest. Logging a request but not actually consuming it would cause the vCPU to get stuck in an infinite loop during KVM_RUN because KVM would see the pending request and bail from VM-Enter to service the request. Note, it's currently impossible for KVM to set KVM_REQ_GPC_INVALIDATE as nothing in KVM is wired up to set guest_uses_pa=true. But, it'd be all too easy for arch code to introduce use of kvm_gfn_to_pfn_cache_init() without implementing handling of the request, especially since getting test coverage of MMU notifier interaction with specific KVM features usually requires a directed test. Opportunistically rename gfn_to_pfn_cache_invalidate_start()'s wake_vcpus to evict_vcpus. The purpose of the request is to get vCPUs out of guest mode, it's supposed to _avoid_ waking vCPUs that are blocking. Opportunistically rename KVM_REQ_GPC_INVALIDATE to be more specific as to what it wants to accomplish, and to genericize the name so that it can used for similar but unrelated scenarios, should they arise in the future. Add a comment and documentation to explain why the "no action" request exists. Add compile-time assertions to help detect improper usage. Use the inner assertless helper in the one s390 path that makes requests without a hardcoded request. Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220223165302.3205276-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
If the cache's user host virtual address becomes invalid, there is still a path from kvm_gfn_to_pfn_cache_refresh() where __release_gpc() could release the pfn but the gpc->pfn field has not been overwritten with an error value. If this happens, kvm_gfn_to_pfn_cache_unmap will call put_page again on the same page. Cc: stable@vger.kernel.org Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Signed-off-by:
David Woodhouse <dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 29, 2022
-
-
David Matlack authored
This reverts commit 3d3aab1b. Now that the KVM module's lifetime is tied to kvm.users_count, there is no need to also tie it's lifetime to the lifetime of the VM and vCPU file descriptors. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-3-dmatlack@google.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Tie the lifetime the KVM module to the lifetime of each VM via kvm.users_count. This way anything that grabs a reference to the VM via kvm_get_kvm() cannot accidentally outlive the KVM module. Prior to this commit, the lifetime of the KVM module was tied to the lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU file descriptors by their respective file_operations "owner" field. This approach is insufficient because references grabbed via kvm_get_kvm() do not prevent closing any of the aforementioned file descriptors. This fixes a long standing theoretical bug in KVM that at least affects async page faults. kvm_setup_async_pf() grabs a reference via kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing prevents the VM file descriptor from being closed and the KVM module from being unloaded before this callback runs. Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out") Fixes: 3d3aab1b ("KVM: set owner of cpu and vm file operations") Cc: stable@vger.kernel.org Suggested-by:
Ben Gardon <bgardon@google.com> [ Based on a patch from Ben implemented for Google's kernel. ] Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-2-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 11, 2022
-
-
Guo Ren authored
Current riscv doesn't support the 32bit KVM API. Let's make it clear by not selecting KVM_COMPAT. Signed-off-by:
Guo Ren <guoren@linux.alibaba.com> Signed-off-by:
Guo Ren <guoren@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Anup Patel <anup@brainfault.org> Reviewed-by:
Anup Patel <anup@brainfault.org> Signed-off-by:
Anup Patel <anup@brainfault.org>
-
- Mar 08, 2022
-
-
Paolo Bonzini authored
Allocations whose size is related to the memslot size can be arbitrarily large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy" sizes that fit in 32 bits. Cc: stable@vger.kernel.org Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls") Reviewed-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 01, 2022
-
-
Sean Christopherson authored
Remove the generic kvm_reload_remote_mmus() and open code its functionality into the two x86 callers. x86 is (obviously) the only architecture that uses the hook, and is also the only architecture that uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That will change in a future patch, as x86's usage when zapping a single shadow page x86 doesn't actually _need_ to reload all vCPUs' MMUs, only MMUs whose root is being zapped actually need to be reloaded. s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose. Drop the generic code in anticipation of implementing s390 and x86 arch specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely. Opportunistically reword the x86 TDP MMU comment to avoid making references to functions (and requests!) when possible, and to remove the rather ambiguous "this". No functional change intended. Cc: Ben Gardon <bgardon@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 25, 2022
-
-
Vipin Sharma authored
VM worker kthreads can linger in the VM process's cgroup for sometime after KVM terminates the VM process. KVM terminates the worker kthreads by calling kthread_stop() which waits on the 'exited' completion, triggered by exit_mm(), via mm_release(), in do_exit() during the kthread's exit. However, these kthreads are removed from the cgroup using the cgroup_exit() which happens after the exit_mm(). Therefore, A VM process can terminate in between the exit_mm() and cgroup_exit() calls, leaving only worker kthreads in the cgroup. Moving worker kthreads back to the original cgroup (kthreadd_task's cgroup) makes sure that the cgroup is empty as soon as the main VM process is terminated. Signed-off-by:
Vipin Sharma <vipinsh@google.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220222054848.563321-1-vipinsh@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 17, 2022
-
-
Wanpeng Li authored
I saw the below splatting after the host suspended and resumed. WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm] CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G W IOE 5.17.0-rc3+ #4 RIP: 0010:kvm_resume+0x2c/0x30 [kvm] Call Trace: <TASK> syscore_resume+0x90/0x340 suspend_devices_and_enter+0xaee/0xe90 pm_suspend.cold+0x36b/0x3c2 state_store+0x82/0xf0 kernfs_fop_write_iter+0x1b6/0x260 new_sync_write+0x258/0x370 vfs_write+0x33f/0x510 ksys_write+0xc9/0x160 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae lockdep_is_held() can return -1 when lockdep is disabled which triggers this warning. Let's use lockdep_assert_not_held() which can detect incorrect calls while holding a lock and it also avoids false negatives when lockdep is disabled. Signed-off-by:
Wanpeng Li <wanpengli@tencent.com> Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 10, 2022
-
-
Jinrong Liang authored
The "struct kvm *kvm" parameter of kvm_make_vcpu_request() is not used, so remove it. No functional change intended. Signed-off-by:
Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-19-cloudliang@tencent.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 28, 2022
-
-
Hou Wenlong authored
Fix the following false positive warning: ============================= WARNING: suspicious RCU usage 5.16.0-rc4+ #57 Not tainted ----------------------------- arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by fc_vcpu 0/330: #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm] #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm] #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm] stack backtrace: CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x44/0x57 kvm_notify_acked_gsi+0x6b/0x70 [kvm] kvm_notify_acked_irq+0x8d/0x180 [kvm] kvm_ioapic_update_eoi+0x92/0x240 [kvm] kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm] handle_apic_eoi_induced+0x3d/0x60 [kvm_intel] vmx_handle_exit+0x19c/0x6a0 [kvm_intel] vcpu_enter_guest+0x66e/0x1860 [kvm] kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm] kvm_vcpu_ioctl+0x38a/0x6f0 [kvm] __x64_sys_ioctl+0x89/0xc0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu), kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact, kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it a false positive warning. So use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu(). Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 26, 2022
-
-
Sean Christopherson authored
Revert a completely broken check on an "invalid" RIP in SVM's workaround for the DecodeAssists SMAP errata. kvm_vcpu_gfn_to_memslot() obviously expects a gfn, i.e. operates in the guest physical address space, whereas RIP is a virtual (not even linear) address. The "fix" worked for the problematic KVM selftest because the test identity mapped RIP. Fully revert the hack instead of trying to translate RIP to a GPA, as the non-SEV case is now handled earlier, and KVM cannot access guest page tables to translate RIP. This reverts commit e72436bc. Fixes: e72436bc ("KVM: SVM: avoid infinite loop on NPF from bad address") Reported-by:
Liam Merwick <liam.merwick@oracle.com> Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Liam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 24, 2022
-
-
Xianting Tian authored
The async parameter of hva_to_pfn_remapped() is not used, so remove it. Signed-off-by:
Xianting Tian <xianting.tian@linux.alibaba.com> Message-Id: <20220124020456.156386-1-xianting.tian@linux.alibaba.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 19, 2022
-
-
Sean Christopherson authored
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and rename the list and all associated variables to clarify that it tracks the set of vCPU that need to be poked on a posted interrupt to the wakeup vector. The list is not used to track _all_ vCPUs that are blocking, and the term "blocked" can be misleading as it may refer to a blocking condition in the host or the guest, where as the PI wakeup case is specifically for the vCPUs that are actively blocking from within the guest. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-7-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-6-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Christian Borntraeger authored
Avoid warnings on s390 like [ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G E 5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1 [ 1801.980938] Workqueue: events irqfd_inject [kvm] [...] [ 1801.981057] Call Trace: [ 1801.981060] [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm] [ 1801.981083] [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm] [ 1801.981104] [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm] [ 1801.981124] [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm] [ 1801.981144] [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm] [ 1801.981164] [<0000000175e56906>] process_one_work+0x1fe/0x470 [ 1801.981173] [<0000000175e570a4>] worker_thread+0x64/0x498 [ 1801.981176] [<0000000175e5ef2c>] kthread+0x10c/0x110 [ 1801.981180] [<0000000175de73c8>] __ret_from_fork+0x40/0x58 [ 1801.981185] [<000000017698440a>] ret_from_fork+0xa/0x40 when writing to a guest from an irqfd worker as long as we do not have the dirty ring. Signed-off-by:
Christian Borntraeger <borntraeger@linux.ibm.com> Reluctantly-acked-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220113122924.740496-1-borntraeger@linux.ibm.com> Fixes: 2efd61a6 ("KVM: Warn if mark_page_dirty() is called without an active vCPU") Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 07, 2022
-
-
David Woodhouse authored
This can be used in two modes. There is an atomic mode where the cached mapping is accessed while holding the rwlock, and a mode where the physical address is used by a vCPU in guest mode. For the latter case, an invalidation will wake the vCPU with the new KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any caches it still needs to access before entering guest mode again. Only one vCPU can be targeted by the wake requests; it's simple enough to make it wake all vCPUs or even a mask but I don't see a use case for that additional complexity right now. Invalidation happens from the invalidate_range_start MMU notifier, which needs to be able to sleep in order to wake the vCPU and wait for it. This means that revalidation potentially needs to "wait" for the MMU operation to complete and the invalidate_range_end notifier to be invoked. Like the vCPU when it takes a page fault in that period, we just spin — fixing that in a future patch by implementing an actual *wait* may be another part of shaving this particularly hirsute yak. As noted in the comments in the function itself, the only case where the invalidate_range_start notifier is expected to be called *without* being able to sleep is when the OOM reaper is killing the process. In that case, we expect the vCPU threads already to have exited, and thus there will be nothing to wake, and no reason to wait. So we clear the KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly if there actually *was* anything to wake up. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
The various kvm_write_guest() and mark_page_dirty() functions must only ever be called in the context of an active vCPU, because if dirty ring tracking is enabled it may simply oops when kvm_get_running_vcpu() returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it. This oops was reported by "butt3rflyh4ck" <butterflyhuangxx@gmail.com> in https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/ That actual bug will be fixed under separate cover but this warning should help to prevent new ones from being added. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 09, 2021
-
-
David Woodhouse authored
Splitting kvm_main.c out into smaller and better-organized files is slightly non-trivial when it involves editing a bunch of per-arch KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Acked-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211121125451.9489-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-