- Nov 26, 2021
-
-
Paolo Bonzini authored
This is not an unrecoverable situation. Users of kvm_read_guest_offset_cached and kvm_write_guest_offset_cached must expect the read/write to fail, and therefore it is possible to just return early with an error value. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 18, 2021
-
-
Sean Christopherson authored
Reject userspace memslots whose size exceeds the storage capacity of an "unsigned long". KVM's uAPI takes the size as u64 to support large slots on 64-bit hosts, but does not account for the size being truncated on 32-bit hosts in various flows. The access_ok() check on the userspace virtual address in particular casts the size to "unsigned long" and will check the wrong number of bytes. KVM doesn't actually support slots whose size doesn't fit in an "unsigned long", e.g. KVM's internal kvm_memory_slot.npages is an "unsigned long", not a "u64", and misc arch specific code follows that behavior. Fixes: fa3d315a ("KVM: Validate userspace_addr of memslot when registered") Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <20211104002531.1176691-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When modifying memslots, snapshot the "old" memslot and copy it to the "new" memslot's arch data after (re)acquiring slots_arch_lock. x86 can change a memslot's arch data while memslot updates are in-progress so long as it holds slots_arch_lock, thus snapshotting a memslot without holding the lock can result in the consumption of stale data. Fixes: b10a038e ("KVM: mmu: Add slots_arch_lock for memslot arch fields") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211104002531.1176691-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
In commit 7e2175eb ("KVM: x86: Fix recording of guest steal time / preempted status") I removed the only user of these functions because it was basically impossible to use them safely. There are two stages to the GFN->PFN mapping; first through the KVM memslots to a userspace HVA and then through the page tables to translate that HVA to an underlying PFN. Invalidations of the former were being handled correctly, but no attempt was made to use the MMU notifiers to invalidate the cache when the HVA->GFN mapping changed. As a prelude to reinventing the gfn_to_pfn_cache with more usable semantics, rip it out entirely and untangle the implementation of the unsafe kvm_vcpu_map()/kvm_vcpu_unmap() functions from it. All current users of kvm_vcpu_map() also look broken right now, and will be dealt with separately. They broadly fall into two classes: * Those which map, access the data and immediately unmap. This is mostly gratuitous and could just as well use the existing user HVA, and could probably benefit from a gfn_to_hva_cache as they do so. * Those which keep the mapping around for a longer time, perhaps even using the PFN directly from the guest. These will need to be converted to the new gfn_to_pfn_cache and then kvm_vcpu_map() can be removed too. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-8-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 11, 2021
-
-
Paolo Bonzini authored
Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases where it is by design that the VM cannot be operated upon. In this case any KVM_BUG_ON should still warn, so introduce a new flag kvm->vm_dead that is separate from kvm->vm_bugged. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Sep 30, 2021
-
-
Longpeng(Mike) authored
All of the irqfds would to be updated when update the irq routing, it's too expensive if there're too many irqfds. However we can reduce the cost by avoid some unnecessary updates. For irqs of MSI type on X86, the update can be saved if the msi values are not change. The vfio migration could receives benefit from this optimi- zaiton. The test VM has 128 vcpus and 8 VF (with 65 vectors enabled), so the VM has more than 520 irqfds. We mesure the cost of the vfio_msix_enable (in QEMU, it would set routing for each irqfd) for each VF, and we can see the total cost can be significantly reduced. Origin Apply this Patch 1st 8 4 2nd 15 5 3rd 22 6 4th 24 6 5th 36 7 6th 44 7 7th 51 8 8th 58 8 Total 258ms 51ms We're also tring to optimize the QEMU part [1], but it's still worth to optimize the KVM to gain more benefits. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html Signed-off-by:
Longpeng(Mike) <longpeng2@huawei.com> Message-Id: <20210827080003.2689-1-longpeng2@huawei.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Juergen Gross authored
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the number of allowed vcpu-ids. This has already led to confusion, so rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more clear Suggested-by:
Eduardo Habkost <ehabkost@redhat.com> Signed-off-by:
Juergen Gross <jgross@suse.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
kvm_make_vcpus_request_mask() already disables preemption so just like kvm_make_all_cpus_request_except() it can be switched to using pre-allocated per-cpu cpumasks. This allows for improvements for both users of the function: in Hyper-V emulation code 'tlb_flush' can now be dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask() gets rid of dynamic allocation. cpumask_available() checks in kvm_make_vcpu_request() and kvm_kick_many_cpus() can now be dropped as they checks for an impossible condition: kvm_init() makes sure per-cpu masks are allocated. Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-9-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Allocating cpumask dynamically in zalloc_cpumask_var() is not ideal. Allocation is somewhat slow and can (in theory and when CPUMASK_OFFSTACK) fail. kvm_make_all_cpus_request_except() already disables preemption so we can use pre-allocated per-cpu cpumasks instead. Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-8-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for 'except' parameter so it can just be dropped. No functional change intended ©. Suggested-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-6-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Iterating over set bits in 'vcpu_bitmap' should be faster than going through all vCPUs, especially when just a few bits are set. Drop kvm_make_vcpus_request_mask() call from kvm_make_all_cpus_request_except() to avoid handling the special case when 'vcpu_bitmap' is NULL, move the code to kvm_make_all_cpus_request_except() itself. Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-5-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Yang Li authored
Use vma_pages function on vma object instead of explicit computation. Fix the following coccicheck warning: ./virt/kvm/kvm_main.c:3526:29-35: WARNING: Consider using vma_pages helper on vma Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Signed-off-by:
Yang Li <yang.lee@linux.alibaba.com> Message-Id: <1632900526-119643-1-git-send-email-yang.lee@linux.alibaba.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Sep 23, 2021
-
-
Lai Jiangshan authored
There is no user of tlbs_dirty. Signed-off-by:
Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Sep 22, 2021
-
-
Sean Christopherson authored
Check for a NULL cpumask_var_t when kicking multiple vCPUs via cpumask_available(), which performs a !NULL check if and only if cpumasks are configured to be allocated off-stack. This is a meaningless optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more importantly helps document that the NULL check is necessary even though all callers pass in a local variable. No functional change intended. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu is read exactly once, and by ensuring the vCPU is booted from guest mode if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if kvm_request_needs_ipi() returns true. Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or kvm_request_needs_ipi()) means the target vCPU could get migrated (change vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and reading vcpu->mode. If that happens, the kick/IPI will be sent to the old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs. Although failing to kick the vCPU is not exactly ideal, practically speaking it cannot cause a functional issue unless there is also a bug in the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s behavior. The purpose of sending an IPI is purely to get a vCPU into the host (or out of reading SPTEs) so that the vCPU can recognize a change in state, e.g. a KVM_REQ_* request. If vCPU's handling of the state change is required for correctness, KVM must ensure either the vCPU sees the change before entering the guest, or that the sender sees the vCPU as running in guest mode. All architectures handle this by (a) sending the request before calling kvm_vcpu_kick() and (b) checking for requests _after_ setting vcpu->mode. x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to ensure it kicks and waits for vCPUs that started reading SPTEs _before_ MMU changes were finalized, but any vCPU that starts reading after MMU changes were finalized will see the new state and can continue on uninterrupted. For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g. x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied upon for functional correctness, e.g. in the dirty log case, userspace cannot assume it has a 100% complete log if vCPUs are still running. All that said, eliminate the benign race since the cost of doing so is an "extra" atomic cmpxchg() in the case where the target vCPU is loaded by the current pCPU or is not loaded at all. I.e. the kick will be skipped due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as opposed to the kick being skipped because of the cpu checks. Keep the "cpu != me" checks even though they appear useless/impossible at first glance. x86 processes guest IPI writes in a fast path that runs in IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or READING_SHADOW_PAGE_TABLES is perfectly reasonable. Note, a race with the cpu_online() check in kvm_vcpu_kick() likely persists, e.g. the vCPU could exit guest mode and get offlined between the cpu_online() check and the sending of smp_send_reschedule(). But, the online check appears to exist only to avoid a WARN in x86's native_smp_send_reschedule() that fires if the target CPU is not online. The reschedule WARN exists because CPU offlining takes the CPU out of the scheduling pool, i.e. the WARN is intended to detect the case where the kernel attempts to schedule a task on an offline CPU. The actual sending of the IPI is a non-issue as at worst it will simpy be dropped on the floor. In other words, KVM's usurping of the reschedule IPI could theoretically trigger a WARN if the stars align, but there will be no loss of functionality. [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a Cc: Venkatesh Srinivas <venkateshs@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: 97222cc8 ("KVM: Emulate local APIC in kernel") Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sergey Senozhatsky authored
grow_halt_poll_ns() ignores values between 0 and halt_poll_ns_grow_start (10000 by default). However, when we shrink halt_poll_ns we may fall way below halt_poll_ns_grow_start and endup with halt_poll_ns values that don't make a lot of sense: like 1 or 9, or 19. VCPU1 trace (halt_poll_ns_shrink equals 2): VCPU1 grow 10000 VCPU1 shrink 5000 VCPU1 shrink 2500 VCPU1 shrink 1250 VCPU1 shrink 625 VCPU1 shrink 312 VCPU1 shrink 156 VCPU1 shrink 78 VCPU1 shrink 39 VCPU1 shrink 19 VCPU1 shrink 9 VCPU1 shrink 4 Mirror what grow_halt_poll_ns() does and set halt_poll_ns to 0 as soon as new shrink-ed halt_poll_ns value falls below halt_poll_ns_grow_start. Signed-off-by:
Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210902031100.252080-1-senozhatsky@chromium.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Sep 06, 2021
-
-
Peter Xu authored
Drop the unused function as reported by test bot. Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Peter Xu <peterx@redhat.com> Message-Id: <20210901230904.15164-1-peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Add a new stat that counts the number of times a remote TLB flush is requested, regardless of whether it kicks vCPUs out of guest mode. This allows us to look at how often flushes are initiated. Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based TLB flush implementation, so apply it there too. Original-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210817002639.3856694-1-jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't export KVM's MMU notifier count helpers, under no circumstance should any downstream module, including x86's vendor code, have a legitimate reason to piggyback KVM's MMU notifier logic. E.g in the x86 case, only KVM's MMU should be elevating the notifier count, and that code is always built into the core kvm.ko module. Fixes: edb298c6 ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210902175951.1387989-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 20, 2021
-
-
Jing Zhang authored
Add three log histogram stats to record the distribution of time spent on successful polling, failed polling and VCPU wait. halt_poll_success_hist: Distribution of spent time for a successful poll. halt_poll_fail_hist: Distribution of spent time for a failed poll. halt_wait_hist: Distribution of time a VCPU has spent on waiting. Signed-off-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-6-jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Add simple stats halt_wait_ns to record the time a VCPU has spent on waiting for all architectures (not just powerpc). Signed-off-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-5-jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
This together with previous patch, ensures that kvm_zap_gfn_range doesn't race with page fault running on another vcpu, and will make this page fault code retry instead. This is based on a patch suggested by Sean Christopherson: https://lkml.org/lkml/2021/7/22/1025 Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 13, 2021
-
-
Peter Xu authored
Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory besides the stats fields. The new interface kvm_arch_create_vm_debugfs() is defined but not yet used. It's called after kvm->debugfs_dentry is created, so it can be referenced directly in kvm_arch_create_vm_debugfs(). Arch should define their own versions when they want to create extra debugfs nodes. Signed-off-by:
Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-2-peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
These stores are copied and pasted from the "if" statements above. They are dead and while they are not really a bug, they can be confusing to anyone reading the code as well. Remove them. Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 06, 2021
-
-
David Matlack authored
The memslot for a given gfn is looked up multiple times during page fault handling. Avoid binary searching for it multiple times by caching the most recently used slot. There is an existing VM-wide last_used_slot but that does not work well for cases where vCPUs are accessing memory in different slots (see performance data below). Another benefit of caching the most recently use slot (versus looking up the slot once and passing around a pointer) is speeding up memslot lookups *across* faults and during spte prefetching. To measure the performance of this change I ran dirty_log_perf_test with 64 vCPUs and 64 memslots and measured "Populate memory time" and "Iteration 2 dirty memory time". Tests were ran with eptad=N to force dirty logging to use fast_page_fault so its performance could be measured. Config | Metric | Before | After ---------- | ----------------------------- | ------ | ------ tdp_mmu=Y | Populate memory time | 6.76s | 5.47s tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s | 0.31s tdp_mmu=N | Populate memory time | 20.4s | 18.7s tdp_mmu=N | Iteration 2 dirty memory time | 2.65s | 0.30s The "Iteration 2 dirty memory time" results are especially compelling because they are equivalent to running the same test with a single memslot. In other words, fast_page_fault performance no longer scales with the number of memslots. Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-4-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
lru_slot is used to keep track of the index of the most-recently used memslot. The correct acronym would be "mru" but that is not a common acronym. So call it last_used_slot which is a bit more obvious. Suggested-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-2-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 04, 2021
-
-
Paolo Bonzini authored
KVM creates a debugfs directory for each VM in order to store statistics about the virtual machine. The directory name is built from the process pid and a VM fd. While generally unique, it is possible to keep a file descriptor alive in a way that causes duplicate directories, which manifests as these messages: [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present! Even though this should not happen in practice, it is more or less expected in the case of KVM for testcases that call KVM_CREATE_VM and close the resulting file descriptor repeatedly and in parallel. When this happens, debugfs_create_dir() returns an error but kvm_create_vm_debugfs() goes on to allocate stat data structs which are later leaked. The slow memory leak was spotted by syzkaller, where it caused OOM reports. Since the issue only affects debugfs, do a lookup before calling debugfs_create_dir, so that the message is downgraded and rate-limited. While at it, ensure kvm->debugfs_dentry is NULL rather than an error if it is not created. This fixes kvm_destroy_vm_debugfs, which was not checking IS_ERR_OR_NULL correctly. Cc: stable@vger.kernel.org Fixes: 536a6f88 ("KVM: Create debugfs dir and stat files for each VM") Reported-by:
Alexey Kardashevskiy <aik@ozlabs.ru> Suggested-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 03, 2021
-
-
Paolo Bonzini authored
Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. This is possible now that memslot updates are blocked from range_start() to range_end(); that ensures that lock elision happens in both or none, and therefore that mmu_notifier_count updates (which must occur while holding mmu_lock for write) are always paired across start->end. Based on patches originally written by Ben Gardon. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. Because mmu_notifier_count must be modified while holding mmu_lock for write, and must always be paired across start->end to stay balanced, lock elision must happen in both or none. Therefore, in preparation for this change, this patch prevents memslot updates across range_start() and range_end(). Note, technically flag-only memslot updates could be allowed in parallel, but stalling a memslot update for a relatively short amount of time is not a scalability issue, and this is all more than complex enough. A long note on the locking: a previous version of the patch used an rwsem to block the memslot update while the MMU notifier run, but this resulted in the following deadlock involving the pseudo-lock tagged as "mmu_notifier_invalidate_range_start". ====================================================== WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE ------------------------------------------------------ qemu-system-x86/3069 is trying to acquire lock: ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] which lock already depends on the new lock. This corresponds to the following MMU notifier logic: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end sits waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers), so open-code the wait using a readers count and a spinlock. This also allows handling blockable and non-blockable critical section in the same way. Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Aug 02, 2021
-
-
Peter Xu authored
Introduce this safe version of kvm_get_kvm() so that it can be called even during vm destruction. Use it in kvm_debugfs_open() and remove the verbose comment. Prepare to be used elsewhere. Signed-off-by:
Peter Xu <peterx@redhat.com> Message-Id: <20210625153214.43106-3-peterx@redhat.com> [Preserve the comment in kvm_debugfs_open. - Paolo] Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Marc Zyngier authored
Nobody is using kvm_get_pfn() anymore. Get rid of it. Acked-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-7-maz@kernel.org
-
Marc Zyngier authored
Now that arm64 has stopped using kvm_is_transparent_hugepage(), we can remove it, as well as PageTransCompoundMap() which was only used by the former. Acked-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-5-maz@kernel.org
-
- Jul 27, 2021
-
-
Paolo Bonzini authored
The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer, therefore it needs a compat ioctl implementation. Otherwise, 32-bit userspace fails to invoke it on 64-bit kernels; for x86 it might work fine by chance if the padding is zero, but not on big-endian architectures. Reported-by: Thomas Sattler Cc: stable@vger.kernel.org Fixes: 2a31b9db ("kvm: introduce manual dirty log reprotect") Reviewed-by:
Peter Xu <peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Li RongQing authored
SMT siblings share caches and other hardware, and busy halt polling will degrade its sibling performance if its sibling is working Sean Christopherson suggested as below: "Rather than disallowing halt-polling entirely, on x86 it should be sufficient to simply have the hardware thread yield to its sibling(s) via PAUSE. It probably won't get back all performance, but I would expect it to be close. This compiles on all KVM architectures, and AFAICT the intended usage of cpu_relax() is identical for all architectures." Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Li RongQing <lirongqing@baidu.com> Message-Id: <20210727111247.55510-1-lirongqing@baidu.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jul 15, 2021
-
-
Pavel Skripkin authored
In commit bc9e9e67 ("KVM: debugfs: Reuse binary stats descriptors") loop for filling debugfs_stat_data was copy-pasted 2 times, but in the second loop pointers are saved over pointers allocated in the first loop. All this causes is a memory leak, fix it. Fixes: bc9e9e67 ("KVM: debugfs: Reuse binary stats descriptors") Signed-off-by:
Pavel Skripkin <paskripkin@gmail.com> Reviewed-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210701195500.27097-1-paskripkin@gmail.com> Reviewed-by:
Jing Zhang <jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jul 14, 2021
-
-
Kefeng Wang authored
BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183 Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269 CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132 show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x164 lib/dump_stack.c:118 print_address_description+0x78/0x5c8 mm/kasan/report.c:385 __kasan_report mm/kasan/report.c:545 [inline] kasan_report+0x148/0x1e4 mm/kasan/report.c:562 check_memory_region_inline mm/kasan/generic.c:183 [inline] __asan_load8+0xb4/0xbc mm/kasan/generic.c:252 kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183 kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common arch/arm64/kernel/syscall.c:158 [inline] do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670 Allocated by task 4269: stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121 kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461 kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475 kmem_cache_alloc_trace include/linux/slab.h:450 [inline] kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:664 [inline] kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146 kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common arch/arm64/kernel/syscall.c:158 [inline] do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670 Freed by task 4269: stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121 kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x38/0x6c mm/kasan/common.c:56 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355 __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422 kasan_slab_free+0x10/0x1c mm/kasan/common.c:431 slab_free_hook mm/slub.c:1544 [inline] slab_free_freelist_hook mm/slub.c:1577 [inline] slab_free mm/slub.c:3142 [inline] kfree+0x104/0x38c mm/slub.c:4124 coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102 kvm_iodevice_destructor include/kvm/iodev.h:61 [inline] kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374 kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186 kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common arch/arm64/kernel/syscall.c:158 [inline] do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670 If kvm_io_bus_unregister_dev() return -ENOMEM, we already call kvm_iodevice_destructor() inside this function to delete 'struct kvm_coalesced_mmio_dev *dev' from list and free the dev, but kvm_iodevice_destructor() is called again, it will lead the above issue. Let's check the the return value of kvm_io_bus_unregister_dev(), only call kvm_iodevice_destructor() if the return value is 0. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Message-Id: <20210626070304.143456-1-wangkefeng.wang@huawei.com> Cc: stable@vger.kernel.org Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20) Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jun 29, 2021
-
-
Liam Howlett authored
vma_lookup() finds the vma of a specific address with a cleaner interface and is more readable. Link: https://lkml.kernel.org/r/20210521174745.2219620-11-Liam.Howlett@Oracle.com Signed-off-by:
Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by:
Laurent Dufour <ldufour@linux.ibm.com> Acked-by:
David Hildenbrand <david@redhat.com> Acked-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jun 24, 2021
-
-
Jing Zhang authored
To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces. Signed-off-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-8-jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Add a VCPU ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures. Reviewed-by:
David Matlack <dmatlack@google.com> Reviewed-by:
Ricardo Koller <ricarkol@google.com> Reviewed-by:
Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by:
Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-5-jingzhangos@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-