- Dec 28, 2022
-
-
Paolo Bonzini authored
While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running, if that happened it could cause a deadlock. This is due to kvm_xen_eventfd_reset() doing a synchronize_srcu() inside a kvm->lock critical section. To avoid this, first collect all the evtchnfd objects in an array and free all of them once the kvm->lock critical section is over and th SRCU grace period has expired. Reported-by:
Michal Luczaj <mhal@rbox.co> Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 27, 2022
-
-
Masami Hiramatsu (Google) authored
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after function return, kprobe jump optimization always fails on the functions with such INT3 inside the function body. (It already checks the INT3 padding between functions, but not inside the function) To avoid this issue, as same as kprobes, check whether the INT3 comes from kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09a ("x86: Add straight-line-speculation mitigation") Suggested-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051929.1374301.7419382929328081706.stgit@devnote3
-
Masami Hiramatsu (Google) authored
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after RET instruction, kprobes always failes to check the probed instruction boundary by decoding the function body if the probed address is after such sequence. (Note that some conditional code blocks will be placed after function return, if compiler decides it is not on the hot path.) This is because kprobes expects kgdb puts the INT3 as a software breakpoint and it will replace the original instruction. But these INT3 are not such purpose, it doesn't need to recover the original instruction. To avoid this issue, kprobes checks whether the INT3 is owned by kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09a ("x86: Add straight-line-speculation mitigation") Suggested-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3
-
Arnd Bergmann authored
The addition of callthunks_translate_call_dest means that skip_addr() and patch_dest() can no longer be discarded as part of the __init section freeing: WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> skip_addr (section: .init.text) WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> patch_dest (section: .init.text) WARNING: modpost: vmlinux.o: section mismatch in reference: is_callthunk.cold (section: .text.unlikely) -> skip_addr (section: .init.text) ERROR: modpost: Section mismatches detected. Set CONFIG_SECTION_MISMATCH_WARN_ONLY=y to allow them. Fixes: b2e9dfe5 ("x86/bpf: Emit call depth accounting if required") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221215164334.968863-1-arnd@kernel.org
-
Colin Ian King authored
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit arithmetic and then passed as a 64 bit function argument. In the case where i is 32 or more this can lead to an overflow. Avoid this by shifting using the BIT_ULL macro instead. Fixes: 471af006 ("perf/x86/amd: Constrain Large Increment per Cycle events") Signed-off-by:
Colin Ian King <colin.i.king@gmail.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Ian Rogers <irogers@google.com> Acked-by:
Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20221202135149.1797974-1-colin.i.king@gmail.com
-
David Woodhouse authored
These are (uint64_t)-1 magic values are a userspace ABI, allowing the shared info pages and other enlightenments to be disabled. This isn't a Xen ABI because Xen doesn't let the guest turn these off except with the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is expected to handle soft reset, and tear down the kernel parts of the enlightenments accordingly. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Michal Luczaj authored
Port number is validated in kvm_xen_setattr_evtchn(). Remove superfluous checks in kvm_xen_eventfd_assign() and kvm_xen_eventfd_update(). Signed-off-by:
Michal Luczaj <mhal@rbox.co> Message-Id: <20221222203021.1944101-3-mhal@rbox.co> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The evtchnfd structure itself must be protected by either kvm->lock or SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and does not need kvm->lock, and is called in SRCU critical section from the kvm_x86_handle_exit function. It is also important to use rcu_read_{lock,unlock}() in kvm_xen_hcall_evtchn_send(), because idr_remove() will *not* use synchronize_srcu() to wait for readers to complete. Remove a superfluous if (kvm) check before calling synchronize_srcu() in kvm_xen_eventfd_deassign() where kvm has been dereferenced already. Co-developed-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
In particular, we shouldn't assume that being contiguous in guest virtual address space means being contiguous in guest *physical* address space. In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop the srcu_read_lock() that was around them. All call sites are reached from kvm_xen_hypercall() which is called from the handle_exit function with the read lock already held. 53639526 ("KVM: x86/xen: handle PV timers oneshot mode") 1a65105a ("KVM: x86/xen: handle PV spinlocks slowpath") Fixes: 2fd6df2f ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Michal Luczaj authored
Release page irrespectively of kvm_vcpu_write_guest() return value. Suggested-by:
Paul Durrant <paul@xen.org> Fixes: 23200b7a ("KVM: x86/xen: intercept xen hypercalls if enabled") Signed-off-by:
Michal Luczaj <mhal@rbox.co> Message-Id: <20221220151454.712165-1-mhal@rbox.co> Reviewed-by:
Paul Durrant <paul@xen.org> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-1-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
"be split be split" -> "be split" Signed-off-by:
Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120505.9175-1-jiangshanlai@gmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 25, 2022
-
-
Steven Rostedt (Google) authored
Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); | - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); | kmem_cache_free(slab, ptr); | kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Dec 23, 2022
-
-
Sean Christopherson authored
Don't install a leaf TDP MMU SPTE if the parent page's level doesn't match the target level of the fault, and instead have the vCPU retry the faulting instruction after warning. Continuing on is completely unnecessary as the absolute worst case scenario of retrying is DoSing the vCPU, whereas continuing on all but guarantees bigger explosions, e.g. ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__handle_changed_spte.cold+0x95/0x9c RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246 RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0 R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3 R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_mmu_map+0x3b0/0x510 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Modules linked in: kvm_intel ---[ end trace 0000000000000000 ]--- Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-5-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock when adding a new shadow page in the TDP MMU. To ensure the NX reclaim kthread can't see a not-yet-linked shadow page, the page fault path links the new page table prior to adding the page to possible_nx_huge_pages. If the page is zapped by different task, e.g. because dirty logging is disabled, between linking the page and adding it to the list, KVM can end up triggering use-after-free by adding the zapped SP to the aforementioned list, as the zapped SP's memory is scheduled for removal via RCU callback. The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y, i.e. the below splat is just one possible signature. ------------[ cut here ]------------ list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38). WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0 Modules linked in: kvm_intel CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #71 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__list_add_valid+0x79/0xa0 RSP: 0018:ffffc900006efb68 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8 RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08 R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930 R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90 FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0 Call Trace: <TASK> track_possible_nx_huge_page+0x53/0x80 kvm_tdp_mmu_map+0x242/0x2c0 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 61f94478 ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE") Reported-by:
Greg Thelen <gthelen@google.com> Analyzed-by:
David Matlack <dmatlack@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Map the leaf SPTE when handling a TDP MMU page fault if and only if the target level is reached. A recent commit reworked the retry logic and incorrectly assumed that walking SPTEs would never "fail", as the loop either bails (retries) or installs parent SPs. However, the iterator itself will bail early if it detects a frozen (REMOVED) SPTE when stepping down. The TDP iterator also rereads the current SPTE before stepping down specifically to avoid walking into a part of the tree that is being removed, which means it's possible to terminate the loop without the guts of the loop observing the frozen SPTE, e.g. if a different task zaps a parent SPTE between the initial read and try_step_down()'s refresh. Mapping a leaf SPTE at the wrong level results in all kinds of badness as page table walkers interpret the SPTE as a page table, not a leaf, and walk into the weeds. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510 Modules linked in: kvm_intel CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x481/0x510 RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48 R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0 R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Invalid SPTE change: cannot replace a present leaf SPTE with another present leaf SPTE mapping a different PFN! as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2 ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__handle_changed_spte.cold+0x95/0x9c RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246 RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0 R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3 R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_mmu_map+0x3b0/0x510 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Modules linked in: kvm_intel ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Hoist the is_removed_spte() check above the "level == goal_level" check when walking SPTEs during a TDP MMU page fault to avoid attempting to map a leaf entry if said entry is frozen by a different task/vCPU. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0 Modules linked in: kvm_intel CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0 RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246 RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005 RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000 RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005 R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000 FS: 00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Robert Hoo <robert.hu@linux.intel.com> Message-Id: <20221213033030.83345-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When stuffing the allowed secondary execution controls for nested VMX in response to CPUID updates, don't set the allowed-1 bit for a feature that isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config. WARN if KVM attempts to manipulate a feature that isn't supported. All features that are currently stuffed are always advertised to L1 for nested VMX if they are supported in KVM's base configuration, and no additional features should ever be added to the CPUID-induced stuffing (updating VMX MSRs in response to CPUID updates is a long-standing KVM flaw that is slowly being fixed). Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221213062306.667649-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the feature is supported in hardware and enabled in KVM's base, non-nested configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported. This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and obviously allows L1 to enable the feature for L2. KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when updating secondary controls in response to KVM_SET_CPUID(2), but (a) that depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction that the guest value must be a strict subset of the supported host value. Although no past commit explicitly enabled nested support for WAITPKG, doing so is safe and functionally correct from an architectural perspective as no additional KVM support is needed to virtualize TPAUSE, UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards VM-Exits to L1 as necessary (commit bf653b78, "KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit"). Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in hardware, i.e. always runs both L1 and L2 with the host's power management settings for TPAUSE and UMWAIT. See commit bf09fb6c ("KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL") for more details. Fixes: e69e72fa ("KVM: x86: Add support for user wait instructions") Cc: stable@vger.kernel.org Reported-by:
Aaron Lewis <aaronlewis@google.com> Reported-by:
Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Jim Mattson <jmattson@google.com> Message-Id: <20221213062306.667649-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Explicitly drop the result of kvm_vcpu_write_guest() when writing the "launch state" as part of VMCLEAR emulation, and add a comment to call out that KVM's behavior is architecturally valid. Intel's pseudocode effectively says that VMCLEAR is a nop if the target VMCS address isn't in memory, e.g. if the address points at MMIO. Add a FIXME to call out that suppressing failures on __copy_to_user() is wrong, as memory (a memslot) does exist in that case. Punt the issue to the future as open coding kvm_vcpu_write_guest() just to make sure the guest dies with -EFAULT isn't worth the extra complexity. The flaw will need to be addressed if KVM ever does something intelligent on uaccess failures, e.g. to support post-copy demand paging, but in that case KVM will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open code a core KVM helper. No functional change intended. Reported-by:
coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527765 ("Error handling issues") Fixes: 587d7e72 ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down") Cc: Jim Mattson <jmattson@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221220154224.526568-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a sanity check in kvm_handle_memory_failure() to assert that a valid x86_exception structure is provided if the memory "failure" wants to propagate a fault into the guest. If a memory failure happens during a direct guest physical memory access, e.g. for nested VMX, KVM hardcodes the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer (because the exception struct would just be filled with garbage). Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221220153427.514032-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Peng Hao authored
kvm_apic_hw_enabled() only needs to return bool, there is no place to use the return value of MSR_IA32_APICBASE_ENABLE. Signed-off-by:
Peng Hao <flyingpeng@tencent.com> Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM halves' and increase 'data_offset' otherwise. Coverity discovered, that in one case both variables are incremented unconditionally. This doesn't seem to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves' data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also takes into account 'hc->fast' but is still worth fixing. To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to 'struct kvm_hv_hcall' as a union and use at call sites. This allows to remove explicit 'data_offset'/'consumed_xmm_halves' parameters from kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries() helpers. Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and is not zeroed, consumers are supposed to initialize the appropriate field if needed. Reported-by:
coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527764 ("Uninitialized variables") Fixes: 26097086 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Signed-off-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221208102700.959630-1-vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Adamos Ttofari authored
When scanning userspace I/OAPIC entries, intercept EOI for level-triggered IRQs if the current vCPU has a pending and/or in-service IRQ for the vector in its local API, even if the vCPU doesn't match the new entry's destination. This fixes a race between userspace I/OAPIC reconfiguration and IRQ delivery that results in the vector's bit being left set in the remote IRR due to the eventual EOI not being forwarded to the userspace I/OAPIC. Commit 0fc5a36d ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") fixed the in-kernel IOAPIC, but not the userspace IOAPIC configuration, which has a similar race. Fixes: 0fc5a36d ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") Signed-off-by:
Adamos Ttofari <attofari@amazon.de> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221208094415.12723-1-attofari@amazon.de> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
The current vPMU can reuse the same pmc->perf_event for the same hardware event via pmc_pause/resume_counter(), but this optimization does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1, in_tx_cp=1"), where event->attr.sample_period is legally zero at creation, thus making the perf call to perf_event_period() meaningless (no need to adjust sample period in this case), and instead causing such reusable perf_events to be repeatedly released and created. Avoid releasing zero sample_period events by checking is_sampling_event() to follow the previously enable/disable optimization. Signed-off-by:
Like Xu <likexu@tencent.com> Message-Id: <20221207071506.15733-2-likexu@tencent.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 21, 2022
-
-
Linus Torvalds authored
The m68 hand-written assembler version of strcmp() has always been broken: it returns the difference between the first non-matching byte done as a 8-bit subtraction. That is _almost_ right, but is broken for the overflow case. The strcmp() function should indeed return the sign of the difference between the first byte that differs, but the subtraction needs to be done in a wider type than 'char'. Otherwise the ordering isn't actually stable. This went unnoticed for basically forever, because nobody ever cares about non-US-ASCII orderings in the kernel (in fact, most users only care about "exact match or not"), so overflows don't really happen in practice, even if it was very very wrong. But that mostly unnoticeable bug becomes very noticeable by the recent change to make 'char' be unsigned in the kernel across all architectures (commit 3bc753c0: "kbuild: treat char as always unsigned"). Because the code not only did the subtraction in the wrong type width, it also used 'char' to then make the compiler expand the result from an 8-bit difference to the 'int' return value. So now with an unsigned char that incorrect arithmetic width was then not even sign-expanded, and always returned just a positive integer. We could re-instate the old broken code by just turning the 'char' into 'signed char' as has been done elsewhere where people depended on the signedness of 'char', but since the whole function was broken to begin with, and we have a non-broken default fallback implementation, let's just remove this broken function entirely. Reported-by:
Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/lkml/20221221145332.GA2399037@roeck-us.net/ Cc: Jason Donenfeld <Jason@zx2c4.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Florian Fainelli authored
We currently have a DTC warning with the current DTS due to the lack of a suitable #address-cells and #size-cells property: DTC arch/mips/boot/dts/brcm/bcm63268-comtrend-vr-3032u.dtb arch/mips/boot/dts/brcm/bcm63268.dtsi:115.5-22: Warning (reg_format): /ubus/timer-mfd@10000080/timer@0:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 1) arch/mips/boot/dts/brcm/bcm63268.dtsi:120.5-22: Warning (reg_format): /ubus/timer-mfd@10000080/watchdog@1c:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 1) arch/mips/boot/dts/brcm/bcm63268.dtsi:111.4-35: Warning (ranges_format): /ubus/timer-mfd@10000080:ranges: "ranges" property has invalid length (12 bytes) (parent #address-cells == 1, child #address-cells == 2, #size-cells == 1) Fixes: d3db4b96 ("mips: dts: bcm63268: add TWD block timer") Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by:
Thomas Bogendoerfer <tsbogend@alpha.franken.de>
-
Sergio Paracuellos authored
Commit 38a8553b ("clk: ralink: make system controller node a reset provider") make system controller a reset provider for mt7621 ralink SoCs. Ralink init code also tries to start previous common reset controller which at the end tries to find device tree node 'ralink,rt2880-reset'. mt7621 device tree file is not using at all this node anymore. Hence avoid to init this common reset controller for mt7621 ralink SoCs to avoid 'Failed to find reset controller node' boot error trace error. Fixes: 64b2d6ff ("staging: mt7621-dts: align resets with binding documentation") Signed-off-by:
Sergio Paracuellos <sergio.paracuellos@gmail.com> Signed-off-by:
Thomas Bogendoerfer <tsbogend@alpha.franken.de>
-
- Dec 20, 2022
-
-
Jason A. Donenfeld authored
Convert the final two users of prandom_u32_max() that slipped in during 6.2-rc1 to use get_random_u32_below(). Then, with no more users left, we can finally remove the deprecated function. Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
Jason A. Donenfeld authored
The <asm/archrandom.h> header is a random.c private detail, not something to be called by other code. As such, don't make it automatically available by way of random.h. Cc: Michael Ellerman <mpe@ellerman.id.au> Acked-by:
Heiko Carstens <hca@linux.ibm.com> Reviewed-by:
Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
- Dec 19, 2022
-
-
Arnd Bergmann authored
lld cannot build for ARMv4/v4T targets because it inserts 'blx' instructions that are unsupported there: ld.lld: warning: lld uses blx instruction, no object with architecture supporting feature detected Add a Kconfig time dependency to prevent those targets from being selected in randconfig builds. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Reviewed-by:
Nathan Chancellor <nathan@kernel.org> Link: https://github.com/llvm/llvm-project/issues/50764 Link: https://github.com/ClangBuiltLinux/linux/issues/964 Link: https://lore.kernel.org/r/20221215162635.3750763-1-arnd@kernel.org Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Arnd Bergmann authored
The integrated assembler in clang does not understand the xscale specific mra/mar instructions: arch/arm/mach-pxa/pxa27x.c:136:15: error: unsupported architectural extension: xscale asm volatile(".arch_extension xscale\n\t" arch/arm/mach-pxa/pxa27x.c:136:40: error: invalid instruction, did you mean: mcr, mla, mrc, mrs, msr? mra r2, r3, acc0 Since these are coprocessor features, the same can be expressed using mrrc/mcrr, so use that for builds with IAS. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20221215162529.3659187-1-arnd@kernel.org Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Krzysztof Kozlowski authored
By coding style, unit address should not start with 0x. Signed-off-by:
Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by:
Viresh Kumar <viresh.kumar@linaro.org> Link: https://lore.kernel.org/r/20221210113347.63939-1-krzysztof.kozlowski@linaro.org Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Helge Deller authored
Some (mostly 64-bit machines) machines allow to run MPE/iX and report the MPE model string via firmware call. Enhance the pdc_model_sysmodel() function to report that model string. Note that some 32-bit machines like the B160L wrongly report success for the firmware call, so include a check to prevent showing wrong info. Signed-off-by:
Helge Deller <deller@gmx.de>
-
- Dec 18, 2022
-
-
Helge Deller authored
Fix those make warnings: arch/parisc/kernel/vdso32/Makefile:30: FORCE prerequisite is missing arch/parisc/kernel/vdso64/Makefile:30: FORCE prerequisite is missing Add the missing FORCE prerequisites for all build targets identified by "make help". Fixes: e1f86d7b ("kbuild: warn if FORCE is missing for if_changed(_dep,_rule) and filechk") Signed-off-by:
Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # 5.18+
-
Helge Deller authored
Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
No need to have specific locking for console I/O since the PDC functions provide an own locking. Signed-off-by:
Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # 6.1+
-
Helge Deller authored
The kgdb console is already implemented and registered in pdc_cons.c, so the duplicate code can be dropped. Signed-off-by:
Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # 6.1+
-
Helge Deller authored
Utilize pdc_lock spinlock to protect parallel modifications of the iodc_dbuf[] buffer, check length to prevent buffer overflow of iodc_dbuf[], drop the iodc_retbuf[] buffer and fix some wrong indentings. Signed-off-by:
Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> # 6.0+
-
- Dec 17, 2022
-
-
Helge Deller authored
PMD_SHIFT isn't defined if CONFIG_PGTABLE_LEVELS == 3, and as such the kernel test robot found this warning: In file included from include/linux/pgtable.h:6, from arch/parisc/kernel/head.S:23: arch/parisc/include/asm/pgtable.h:169:32: warning: "PMD_SHIFT" is not defined, evaluates to 0 [-Wundef] 169 | #if (KERNEL_INITIAL_ORDER) >= (PMD_SHIFT) Avoid the warning by using PLD_SHIFT and BITS_PER_PTE. Signed-off-by:
Helge Deller <deller@gmx.de> Reported-by:
kernel test robot <lkp@intel.com> Cc: <stable@vger.kernel.org> # 6.0+
-
Helge Deller authored
Adjust some MADV_XXX constants to be in sync what their values are on all other platforms. There is currently no reason to have an own numbering on parisc, but it requires workarounds in many userspace sources (e.g. glibc, qemu, ...) - which are often forgotten and thus introduce bugs and different behaviour on parisc. A wrapper avoids an ABI breakage for existing userspace applications by translating any old values to the new ones, so this change allows us to move over all programs to the new ABI over time. Signed-off-by:
Helge Deller <deller@gmx.de>
-