- Dec 27, 2022
-
-
Lai Jiangshan authored
No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK(). They used to be in virt/kvm/pfncache.c:

    KVM_MMU_READ_LOCK(kvm);
    retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
    KVM_MMU_READ_UNLOCK(kvm);

However, since 58cd407c ("KVM: Fix multiple races in gfn=>pfn cache refresh", 2022-05-25) the code relies only on the MMU notifier's invalidation count and sequence number. Signed-off-by:
Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 02, 2022
-
-
Sean Christopherson authored
Remove a comment about KVM_REQ_UNHALT being set by kvm_vcpu_check_block() that was missed when KVM_REQ_UNHALT was dropped. Fixes: c59fb127 ("KVM: remove KVM_REQ_UNHALT") Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221201220433.31366-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 30, 2022
-
-
Sean Christopherson authored
When refreshing a gfn=>pfn cache, skip straight to unlocking if the cache is already valid, instead of stuffing the "old" variables to turn the unmapping outro into a nop. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Sean Christopherson authored
Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
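A minimal sketch of the resulting API shape; the exact prototypes are an assumption based on the description:

    /* Exported helpers no longer take a @gpa; they operate on gpc->gpa. */
    bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len);
    int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len);

    /* Only activation may (re)assign the cache's GPA. */
    int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len);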
-
Sean Christopherson authored
Don't partially reinitialize a gfn=>pfn cache when activating the cache, and instead assert that the cache is not valid during activation. Bug the VM if the assertion fails, as use-after-free and/or data corruption is all but guaranteed if KVM ends up with a valid-but-inactive cache. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
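A minimal sketch of the described assertion, assuming KVM_BUG_ON() is the mechanism used to bug the VM (the error code is an assumption):

    /* Activating an already-valid cache implies a lurking use-after-free;
     * bug the VM rather than risk data corruption. */
    if (KVM_BUG_ON(gpc->valid, gpc->kvm))
        return -EIO;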
-
Sean Christopherson authored
Drop kvm_gpc_unmap() as it has no users and unclear requirements. The API was added as part of the original gfn_to_pfn_cache support, but its sole usage[*] was never merged. Fold the guts of kvm_gpc_unmap() into the deactivate path and drop the API. Omit acquiring refresh_lock, as concurrent calls to kvm_gpc_deactivate() are not allowed (this is not enforced, e.g. via lockdep, due to it being called during vCPU destruction). If/when temporary unmapping makes a comeback, the desirable behavior is likely to restrict temporary unmapping to vCPU-exclusive mappings and require the vcpu->mutex be held to serialize unmap. Use of the refresh_lock to protect unmapping was somewhat speculatively added by commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") to guard against concurrent unmaps, but the primary use case of the temporary unmap, nested virtualization[*], doesn't actually need or want concurrent unmaps. [*] https://lore.kernel.org/all/20211210163625.2886-7-dwmw2@infradead.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache. No functional change intended. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make hva_to_pfn_retry() use kvm instance cached in gfn_to_pfn_cache. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
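The pattern shared by this and the two preceding commits, sketched under the assumption that the cache already stores its owning VM (the call-site shape is illustrative):

    /* Before: the VM had to be passed alongside the cache. */
    if (!kvm_gpc_check(kvm, gpc, gpc->gpa, PAGE_SIZE))
        return -EWOULDBLOCK;

    /* After: helpers use the kvm instance cached in the gpc itself. */
    if (!kvm_gpc_check(gpc, gpc->gpa, PAGE_SIZE))
        return -EWOULDBLOCK;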
-
Michal Luczaj authored
Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by:
Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
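A minimal sketch of the reshuffled initializer, with the immutable properties stored once at init time (the exact signature is assumed from the description):

    void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm,
                      struct kvm_vcpu *vcpu, enum pfn_cache_usage usage)
    {
        rwlock_init(&gpc->lock);
        mutex_init(&gpc->refresh_lock);

        gpc->kvm = kvm;     /* immutable after init */
        gpc->vcpu = vcpu;   /* immutable after init */
        gpc->usage = usage; /* immutable after init */
    }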
-
Michal Luczaj authored
Remove the unused @kvm argument from gpc_unmap_khva(). Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Michal Luczaj authored
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 23, 2022
-
-
David Woodhouse authored
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95e ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Reviewed-by:
Paul Durrant <paul@xen.org> Reviewed-by:
Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
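A minimal sketch of the fix, assuming the refresh path keeps the existing mapping when the pfn is unchanged (variable names are illustrative):

    /* Same page as before, so the mapping can be kept, but the
     * page-offset component of ->khva must still be recomputed. */
    unsigned long page_offset = gpc->gpa & ~PAGE_MASK;
    void *old_khva = (void *)((unsigned long)gpc->khva & PAGE_MASK);

    gpc->khva = old_khva + page_offset;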
-
- Nov 18, 2022
-
-
Paolo Bonzini authored
Since gfn_to_memslot() is relatively expensive, it helps to skip it if the memslot cannot possibly have dirty logging enabled. In order to do this, add to struct kvm a counter of the number of memslots with dirty logging enabled. While the correct value can only be read with slots_lock taken, the NX recovery thread is content with using an approximate value. Therefore, the counter is an atomic_t. Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/ by David Matlack. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
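A minimal sketch of the counter described above; the field name is an assumption:

    /* In struct kvm (updated under slots_lock by the memslot code): */
    atomic_t nr_memslots_dirty_logging;

    /* The NX recovery thread only needs an approximation, so a plain
     * atomic_read() without slots_lock is fine: */
    if (!atomic_read(&kvm->nr_memslots_dirty_logging))
        return false; /* no memslot can have dirty logging enabled */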
-
- Nov 17, 2022
-
-
David Matlack authored
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt, rather than just sampling the module parameter when the VM is first created. This restores the original behavior of kvm.halt_poll_ns for VMs that have not opted into KVM_CAP_HALT_POLL. Notably, this change restores the ability for admins to disable or change the maximum halt-polling time system-wide for VMs not using KVM_CAP_HALT_POLL. Reported-by:
Christian Borntraeger <borntraeger@de.ibm.com> Fixes: acd05785 ("kvm: add capability for halt polling") Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-4-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
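A minimal sketch of the per-halt selection this enables; the override flag set by KVM_CAP_HALT_POLL is an assumed field name:

    /* Evaluated on every halt, not once at VM creation: */
    unsigned int max;

    if (kvm->override_halt_poll_ns)
        max = kvm->max_halt_poll_ns;    /* per-VM value from the cap */
    else
        max = READ_ONCE(halt_poll_ns);  /* global module parameter */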
-
David Matlack authored
Avoid re-reading kvm->max_halt_poll_ns multiple times during halt-polling except when it is explicitly useful, e.g. to check if the max time changed across a halt. kvm->max_halt_poll_ns can be changed at any time by userspace via KVM_CAP_HALT_POLL. This bug is unlikely to cause any serious side-effects. In the worst case one halt polls for shorter or longer than it should, and then is fixed up on the next halt. Furthermore, this is still possible since reads of kvm->max_halt_poll_ns are not synchronized with halts. Fixes: acd05785 ("kvm: add capability for halt polling") Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-3-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Cap vcpu->halt_poll_ns based on the max halt polling time just before halting, rather than after the last halt. This arguably provides better accuracy if an admin disables halt polling in between halts, although the improvement is nominal. A side-effect of this change is that grow_halt_poll_ns() no longer needs to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future commit where the max halt polling time can come from the module parameter halt_poll_ns instead. Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-2-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 12, 2022
-
-
Gavin Shan authored
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists and a running vcpu context is strictly required by the architecture. This may cause a backwards compatibility issue. Currently, saving vgic/its tables is the only known case where no running vcpu context is expected. There may be other, unknown cases where no running vcpu context exists; those are reported by the warning message, and we bail out without pushing the dirty information to the backup bitmap. For those cases, the application would enable the backup bitmap. However, the dirty information still can't be pushed to the backup bitmap, even though the backup bitmap has been enabled by the application, until the unknown cases are added to the allowed list of non-running vcpu contexts with extra code changes to the host kernel. In order to let a new application, where the backup bitmap has been enabled, work with an unchanged host, we continue to push the dirty information to the backup bitmap instead of bailing out early. With the added check on 'memslot->dirty_bitmap' in mark_page_dirty_in_slot(), the kernel crash is silently avoided under the combined conditions: no running vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true', and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled yet. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Gavin Shan <gshan@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
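A minimal sketch of the reworked flow in mark_page_dirty_in_slot(); the structure is inferred from the description:

    if (!vcpu) {
        /* Unknown/disallowed contexts still warn and bail. */
        if (WARN_ON_ONCE(!kvm_arch_allow_write_without_running_vcpu(kvm)))
            return;
        /* If the backup bitmap was never enabled, silently drop the
         * write instead of crashing on a NULL bitmap. */
        if (memslot->dirty_bitmap)
            set_bit_le(rel_gfn, memslot->dirty_bitmap);
        return;
    }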
-
- Nov 10, 2022
-
-
Gavin Shan authored
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is enabled. It's conflicting with that ring-based dirty page tracking always requires a running VCPU context. Introduce a new flavor of dirty ring that requires the use of both VCPU dirty rings and a dirty bitmap. The expectation is that for non-VCPU sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to the dirty bitmap. Userspace should scan the dirty bitmap before migrating the VM to the target. Use an additional capability to advertise this behavior. The newly added capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Suggested-by:
Marc Zyngier <maz@kernel.org> Suggested-by:
Peter Xu <peterx@redhat.com> Co-developed-by:
Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Gavin Shan <gshan@redhat.com> Acked-by:
Peter Xu <peterx@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
-
Gavin Shan authored
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check at each guest entry to see if the dirty ring is soft full or not. The VCPU is stopped from running if its dirty ring has become soft full. A similar check will be needed when the feature is going to be supported on ARM64. As Marc Zyngier suggested, a new event will avoid the pointless overhead of checking the size of the dirty ring ('vcpu->kvm->dirty_ring_size') at each guest entry. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment of the various event definitions in kvm_host.h is changed to tab characters by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org Suggested-by:
Marc Zyngier <maz@kernel.org> Signed-off-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
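A sketch of the helper as described, using the names given in the text (the exact body is an approximation):

    bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu)
    {
        /* The request is re-raised while the ring stays soft full, so
         * the vCPU cannot enter the guest until userspace resets it. */
        if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu) &&
            kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) {
            kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
            return true;
        }

        return false;
    }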
-
- Nov 09, 2022
-
-
Paolo Bonzini authored
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
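A minimal sketch of the non-inline hook for the architectures where irqchip_in_kernel() isn't performance critical:

    /* Defined in arch code, where struct kvm is fully visible. */
    bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
    {
        return irqchip_in_kernel(kvm);
    }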
-
Peter Xu authored
Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling an userfaultfd page fault. No functional change intended. Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-4-peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
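A sketch of the plumbing; the parameter position in __gfn_to_pfn_memslot() is an assumption:

    kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot,
                                   gfn_t gfn, bool atomic, bool interruptible,
                                   bool *async, bool write_fault,
                                   bool *writable, hva_t *hva);

    /* Deep in the slow path, the flag becomes FOLL_INTERRUPTIBLE: */
    if (interruptible)
        flags |= FOLL_INTERRUPTIBLE;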
-
Peter Xu authored
Add a new pfn error to show that we've got a pending signal to handle during the hva_to_pfn_slow() procedure (i.e. an -EINTR retval). Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-3-peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
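A minimal sketch; the exact error encoding is an assumption:

    #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)

    /* In hva_to_pfn(): translate the slow path's -EINTR retval. */
    if (npages == -EINTR)
        return KVM_PFN_ERR_SIGPENDING;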
-
- Oct 31, 2022
-
-
Gavin Shan authored
There are two capabilities related to ring-based dirty page tracking: KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are supported by x86. However, arm64 supports only KVM_CAP_DIRTY_LOG_RING_ACQ_REL, and only when the feature is supported on arm64. Userspace doesn't have to enable the advertised capability, meaning KVM_CAP_DIRTY_LOG_RING can currently be wrongly enabled on arm64 by userspace. Fix it by double-checking whether the capability has been advertised prior to enabling it; enabling the capability is rejected if it hasn't been advertised. Fixes: 17601bfe ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option") Reported-by:
Sean Christopherson <seanjc@google.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
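A minimal sketch of the double check, reusing the extension query before the capability is enabled (helper names assumed from context):

    case KVM_CAP_DIRTY_LOG_RING:
    case KVM_CAP_DIRTY_LOG_RING_ACQ_REL:
        if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap))
            return -EINVAL; /* not advertised on this architecture */
        return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);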
-
- Oct 27, 2022
-
-
Sean Christopherson authored
Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive. Not checking the active flag during refresh is particularly egregious, as KVM can end up with a valid, inactive cache, which can lead to a variety of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing an mmu_notifier invalidation due to the cache not being on the list of gfns to invalidate. Note, "active" needs to be set if and only if the cache is on the list of caches, i.e. is reachable via mmu_notifier events. If a relevant mmu_notifier event occurs while the cache is "active" but not on the list, KVM will not acquire the cache's lock and so will not serialize the mmu_notifier event with active users and/or kvm_gpc_refresh(). A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND can be exploited to trigger the bug.

1. Deactivate shinfo cache:

    kvm_xen_hvm_set_attr
    case KVM_XEN_ATTR_TYPE_SHARED_INFO
     kvm_gpc_deactivate
      kvm_gpc_unmap
       gpc->valid = false
       gpc->khva = NULL
      gpc->active = false

    Result: active = false, valid = false

2. Cause cache refresh:

    kvm_arch_vm_ioctl
    case KVM_XEN_HVM_EVTCHN_SEND
     kvm_xen_hvm_evtchn_send
      kvm_xen_set_evtchn
       kvm_xen_set_evtchn_fast
        kvm_gpc_check
        return -EWOULDBLOCK because !gpc->valid
       kvm_xen_set_evtchn_fast
        return -EWOULDBLOCK
      kvm_gpc_refresh
       hva_to_pfn_retry
        gpc->valid = true
        gpc->khva = not NULL

    Result: active = false, valid = true

3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl KVM_XEN_ATTR_TYPE_SHARED_INFO:

    kvm_arch_vm_ioctl
    case KVM_XEN_HVM_EVTCHN_SEND
     kvm_xen_hvm_evtchn_send
      kvm_xen_set_evtchn
       kvm_xen_set_evtchn_fast
        read_lock gpc->lock
                                    kvm_xen_hvm_set_attr
                                    case KVM_XEN_ATTR_TYPE_SHARED_INFO
                                     mutex_lock kvm->lock
                                     kvm_xen_shared_info_init
                                      kvm_gpc_activate
                                       gpc->khva = NULL
        kvm_gpc_check
        [ Check passes because gpc->valid is still true,
          even though gpc->khva is already NULL. ]
        shinfo = gpc->khva
        pending_bits = shinfo->evtchn_pending
        CRASH: test_and_set_bit(..., pending_bits)

Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Reported-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
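A minimal sketch of the new rejection, with placement inferred from the description:

    /* In kvm_gpc_check(), under gpc->lock: */
    if (!gpc->active)
        return false;

    /* In kvm_gpc_refresh(), before any work is done: */
    if (!gpc->active)
        return -EINVAL;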
-
Michal Luczaj authored
Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.

    (thread 1)                                |  (thread 2)
                                              |
    kvm_xen_set_evtchn_fast                   |
     read_lock_irqsave(&gpc->lock, ...)       |
                                              | kvm_gfn_to_pfn_cache_init
                                              |  rwlock_init(&gpc->lock)
     read_unlock_irqrestore(&gpc->lock, ...)  |

Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there are more races in the newly named kvm_gpc_activate() that will be addressed separately. Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> [sean: call out that this is a bug fix] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
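A minimal sketch of the split between one-time lock initialization and (re)activation:

    /* Called exactly once, at VM/vCPU creation; never re-run, so a
     * concurrent reader can no longer observe a re-initialized lock. */
    void kvm_gpc_init(struct gfn_to_pfn_cache *gpc)
    {
        rwlock_init(&gpc->lock);
        mutex_init(&gpc->refresh_lock);
    }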
-
Hou Wenlong authored
Although simple_attr_open() fails only with -ENOMEM in the current code base, it is nicer to return the retval of simple_attr_open() directly in kvm_debugfs_open(). No functional change intended. Signed-off-by:
Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com> Cc: stable@vger.kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
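A minimal sketch of the change; the surrounding reference handling is an assumption:

    ret = simple_attr_open(inode, file, get, set, fmt);
    if (ret)
        kvm_put_kvm(stat_data->kvm); /* undo the earlier reference */

    return ret; /* forward the retval rather than hardcoding -ENOMEM */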
-
- Oct 22, 2022
-
-
Alexander Graf authored
We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by:
Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
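A minimal sketch of the boilerplate, assuming a weak default hook (the name follows the description of an arch-overridable compat vm ioctl):

    /* Architectures that need a compat vm ioctl override this. */
    long __weak kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
                                         unsigned long arg)
    {
        return -ENOTTY;
    }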
-
- Oct 07, 2022
-
-
Jason Gunthorpe authored
SPAPR exists completely outside the normal iommu driver framework, the groups it creates are fake and are only created to enable VFIO's uAPI. Thus, it does not need to follow the iommu core rule that the iommu_group will only be touched while a driver is attached. Carry a group reference into KVM and have KVM directly manage the lifetime of this object independently of VFIO. This means KVM no longer relies on the vfio group file being valid to maintain the group reference. Tested-by:
Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
Jason Gunthorpe authored
This replaces uses of vfio_file_iommu_group() which were only detecting if the file is a VFIO file, with no interest in the actual group. The only remaining user of vfio_file_iommu_group() is in KVM for the SPAPR stuff. It passes the iommu_group into the arch code through kvm for some reason. Tested-by:
Matthew Rosato <mjrosato@linux.ibm.com> Tested-by:
Christian Borntraeger <borntraeger@de.ibm.com> Tested-by:
Eric Farman <farman@linux.ibm.com> Signed-off-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
-
- Sep 29, 2022
-
-
Marc Zyngier authored
In order to differentiate between architectures that require no extra synchronisation when accessing the dirty ring and those that do, add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identifies the latter sort. TSO architectures can obviously advertise both, while relaxed architectures must only advertise the ACQ_REL version. This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING being only indirectly selected by two top-level config symbols:

    - HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86)
    - HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64)

Suggested-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
-
Marc Zyngier authored
The current implementation of the dirty ring has an implicit requirement that stores to the dirty ring from userspace must:

    - be ordered with one another
    - be visible from another CPU executing a ring reset

While these implicit requirements work well for x86 (and any other TSO-like architecture), they do not work for more relaxed architectures such as arm64, where stores to different addresses can be freely reordered, and loads from these addresses may not observe writes from another CPU unless the required barriers (or acquire/release semantics) are used. In order to start fixing this, upgrade the ring reset accesses:

    - the kvm_dirty_gfn_harvested() helper now uses acquire semantics so it is ordered after all previous writes, including the one from userspace
    - the kvm_dirty_gfn_set_invalid() helper now uses release semantics so that the next_slot and next_offset reads don't drift past the entry invalidation

This is only a partial fix, as the userspace side also needs upgrading. Signed-off-by:
Marc Zyngier <maz@kernel.org> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-2-maz@kernel.org
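A sketch of the upgraded accessors as described (bodies approximated):

    static bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
    {
        /* Acquire: ordered after all previous writes, including the
         * RESET flag written by userspace. */
        return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
    }

    static void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
    {
        /* Release: the next_slot/next_offset reads can't drift past
         * the entry invalidation. */
        smp_store_release(&gfn->flags, 0);
    }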
-
- Sep 26, 2022
-
-
Paolo Bonzini authored
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Acked-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Miaohe Lin authored
When alloc_cpumask_var_node() fails for a certain cpu, there might be some allocated cpumasks for the percpu cpu_kick_mask. We should free these cpumasks, or a memory leak will occur. Fixes: baff59cc ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()") Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
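A minimal sketch of the unwind on allocation failure (label name illustrative):

    for_each_possible_cpu(cpu) {
        if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu),
                                    GFP_KERNEL, cpu_to_node(cpu)))
            goto err_cpumask; /* free what was already allocated */
    }
    return 0;

    err_cpumask:
    for_each_possible_cpu(cpu)
        free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
    return -ENOMEM;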
-
- Aug 19, 2022
-
-
Li kunyu authored
The variable is initialized at declaration, but it is only used after being assigned a new value, so the initialization is unnecessary. Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Li kunyu <kunyu@nfschina.com> Message-Id: <20220819021535.483702-1-kunyu@nfschina.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Li kunyu authored
The variable is initialized at declaration, but it is only used after being assigned a new value, so the initialization is unnecessary. Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Li kunyu <kunyu@nfschina.com> Message-Id: <20220819022804.483914-1-kunyu@nfschina.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Chao Peng authored
The motivation for this renaming is to make these variables and related helper functions less mmu_notifier-bound, so that they can also be used for non-mmu_notifier-based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for.

    - mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end.
    - The mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}.
    - mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count.
    - While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count.

No functional change intended. Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating and initializing coalesced MMIO objects is separate from registering any associated devices. Moving coalesced MMIO cleans up the last oddity where KVM does VM creation/initialization after kvm_create_vm(), and more importantly after kvm_arch_post_init_vm() is called and the VM is added to the global vm_list, i.e. after the VM is fully created as far as KVM is concerned. Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but the original implementation was completely devoid of error handling. Commit 6ce5a090 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling") fixed the various bugs, and in doing so rightly moved the call to after kvm_create_vm() because kvm_coalesced_mmio_init() also registered the coalesced MMIO device. Commit 2b3c246a ("KVM: Make coalesced mmio use a device per zone") cleaned up that mess by having each zone register a separate device, i.e. moved device registration to its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result, kvm_coalesced_mmio_init() is now a "pure" initialization helper and can be safely called from kvm_create_vm(). Opportunistically drop the #ifdef; KVM provides stubs for kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390). Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220816053937.2477106-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Unconditionally get a reference to the /dev/kvm module when creating a VM instead of using try_get_module(), which will fail if the module is in the process of being forcefully unloaded. The error handling when try_get_module() fails doesn't properly unwind all that has been done, e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM from the global list. Not removing VMs from the global list tends to be fatal, e.g. leads to use-after-free explosions. The obvious alternative would be to add proper unwinding, but the justification for using try_get_module(), "rmmod --wait", is completely bogus as support for "rmmod --wait", i.e. delete_module() without O_NONBLOCK, was removed by commit 3f2b9c9c ("module: remove rmmod --wait option.") nearly a decade ago. It's still possible for try_get_module() to fail due to the module dying (more like being killed), as the module will be tagged MODULE_STATE_GOING by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice with forced unloading is an exercise in futility and gives a false sense of security. Using try_get_module() only prevents acquiring _new_ references, it doesn't magically put the references held by other VMs, and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but guaranteed to cause spectacular fireworks; the window where KVM will fail try_get_module() is tiny compared to the window where KVM is building and running the VM with an elevated module refcount. Addressing KVM's inability to play nice with "rmmod --force" is firmly out-of-scope. Forcefully unloading any module taints the kernel (for obvious reasons) _and_ requires the kernel to be built with CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the amusing disclaimer that it's "mainly for kernel developers and desperate users". In other words, KVM is free to scoff at bug reports due to using "rmmod --force" while VMs may be running. Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220816053937.2477106-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
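A minimal sketch of the swap described above:

    /* Before: could fail while the module is being force-unloaded.
     *     if (!try_module_get(kvm_chardev_ops.owner))
     *         return ERR_PTR(-ENODEV);
     * After: an unconditional reference cannot fail, so no unwind
     * path is needed; the matching put stays in VM destruction. */
    __module_get(kvm_chardev_ops.owner);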
-
Sean Christopherson authored
Properly unwind VM creation if kvm_create_vm_debugfs() fails. A recent change to invoke kvm_create_vm_debugfs() in kvm_create_vm() was led astray by buggy try_get_module() handling added by commit 5f6de5cb ("KVM: Prevent module exit until all VMs are freed"). The debugfs error path effectively inherits the bad error path of try_module_get(), e.g. KVM leaves the to-be-freed VM on vm_list even though KVM appears to do the right thing by calling module_put() and falling through. Opportunistically hoist kvm_create_vm_debugfs() above the call to kvm_arch_post_init_vm() so that the "post-init" arch hook is actually invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init() for the moment). x86 is the only non-nop implementation of the post-init hook, and it doesn't allocate/initialize any objects that are reachable via debugfs code (it spawns a kthread worker for the NX huge page mitigation). Leave the buggy try_get_module() alone for now, it will be fixed in a separate commit. Fixes: b74ed7a6 ("KVM: Actually create debugfs in kvm_create_vm()") Reported-by:
<syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Oliver Upton <oliver.upton@linux.dev> Message-Id: <20220816053937.2477106-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-