  1. Sep 27, 2024
    • [tree-wide] finally take no_llseek out · cb787f4a
      Al Viro authored
      
      no_llseek had been defined to NULL two years ago, in commit 868941b1
      ("fs: remove no_llseek")
      
      To quote that commit,
      
        At -rc1 we'll need do a mechanical removal of no_llseek -
      
        git grep -l -w no_llseek | grep -v porting.rst | while read i; do
      	sed -i '/\<no_llseek\>/d' $i
        done
      
        would do it.
      
      Unfortunately, that hadn't been done.  Linus, could you do that now, so
      that we could finally put that thing to rest? All instances are of the
      form
      	.llseek = no_llseek,
      so it's obviously safe.
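
      For illustration, a typical (hypothetical) instance looks like this; the
      sed above simply deletes the .llseek line, and leaving .llseek unset is
      now equivalent because no_llseek is NULL:

        static const struct file_operations example_fops = {
        	.owner	= THIS_MODULE,
        	.read	= example_read,
        	.llseek	= no_llseek,	/* removed by the mechanical sed above */
        };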
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb787f4a
  2. Sep 17, 2024
    • KVM: use follow_pfnmap API · 5731aacd
      Peter Xu authored
       Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest of
       the work is already handled on the other side (host_pfn_mapping_level()).
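
       Roughly, the consumer side follows the start/end pattern of the new API
       (sketch only; argument and field names as introduced earlier in this
       series):

         struct follow_pfnmap_args args = { .vma = vma, .address = addr };

         ret = follow_pfnmap_start(&args);
         if (ret)
         	return ret;
         /* args.pfn / args.writable now describe the (possibly huge) PFN mapping */
         follow_pfnmap_end(&args);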
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com
      
      
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5731aacd
  3. Sep 04, 2024
    • KVM: Add arch hooks for enabling/disabling virtualization · b67107a2
      Sean Christopherson authored
      
       Add arch hooks that are invoked when KVM enables/disables virtualization.
      x86 will use the hooks to register an "emergency disable" callback, which
      is essentially an x86-specific shutdown notifier that is used when the
      kernel is doing an emergency reboot/shutdown/kexec.
      
      Add comments for the declarations to help arch code understand exactly
      when the callbacks are invoked.  Alternatively, the APIs themselves could
      communicate most of the same info, but kvm_arch_pre_enable_virtualization()
      and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make
      it a bit less obvious that they are intended to be implemented as a pair.
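
       A rough sketch of the resulting interface (see the comments on the actual
       declarations for the authoritative documentation of when the hooks run):

         /* Called as part of enabling/disabling virtualization across all CPUs. */
         void kvm_arch_enable_virtualization(void);
         void kvm_arch_disable_virtualization(void);

         /* Default no-op implementations; x86 overrides these to register and
          * unregister its emergency-disable callback. */
         __weak void kvm_arch_enable_virtualization(void) { }
         __weak void kvm_arch_disable_virtualization(void) { }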
      
       Reviewed-by: Chao Gao <chao.gao@intel.com>
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Acked-by: Kai Huang <kai.huang@intel.com>
       Tested-by: Farrah Chen <farrah.chen@intel.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240830043600.127750-9-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b67107a2
    • KVM: Add a module param to allow enabling virtualization when KVM is loaded · b4886fab
      Sean Christopherson authored
      
      Add an on-by-default module param, enable_virt_at_load, to let userspace
      force virtualization to be enabled in hardware when KVM is initialized,
      i.e. just before /dev/kvm is exposed to userspace.  Enabling virtualization
      during KVM initialization allows userspace to avoid the additional latency
      when creating/destroying the first/last VM (or more specifically, on the
      0=>1 and 1=>0 edges of creation/destruction).
      
       Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency
       could be non-trivial as the cpuhp bringup/teardown is serialized across
       CPUs, e.g. the latency could be problematic for use cases that need to spin
       up VMs quickly.
      
      Prior to commit 10474ae8 ("KVM: Activate Virtualization On Demand"),
      KVM _unconditionally_ enabled virtualization during load, i.e. there's no
      fundamental reason KVM needs to dynamically toggle virtualization.  These
      days, the only known argument for not enabling virtualization is to allow
      KVM to be autoloaded without blocking other out-of-tree hypervisors, and
      such use cases can simply change the module param, e.g. via command line.
      
      Note, the aforementioned commit also mentioned that enabling SVM (AMD's
      virtualization extensions) can result in "using invalid TLB entries".
      It's not clear whether the changelog was referring to a KVM bug, a CPU
      bug, or something else entirely.  Regardless, leaving virtualization off
      by default is not a robust "fix", as any protection provided is lost the
      instant userspace creates the first VM.
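
       The knob itself is tiny (sketch; the init/exit plumbing that consumes it
       is elided):

         static bool enable_virt_at_load = true;
         module_param(enable_virt_at_load, bool, 0444);

       Users of out-of-tree hypervisors can then boot with
       kvm.enable_virt_at_load=0 (or pass enable_virt_at_load=0 to modprobe) to
       retain the on-demand behavior.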
      
       Reviewed-by: Chao Gao <chao.gao@intel.com>
       Acked-by: Kai Huang <kai.huang@intel.com>
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Tested-by: Farrah Chen <farrah.chen@intel.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240830043600.127750-8-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b4886fab
    • KVM: Rename arch hooks related to per-CPU virtualization enabling · 071f24ad
      Sean Christopherson authored
      
      Rename the per-CPU hooks used to enable virtualization in hardware to
      align with the KVM-wide helpers in kvm_main.c, and to better capture that
      the callbacks are invoked on every online CPU.
      
      No functional change intended.
      
       Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Message-ID: <20240830043600.127750-5-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      071f24ad
    • KVM: Rename symbols related to enabling virtualization hardware · 70c01943
      Sean Christopherson authored
      
      Rename the various functions (and a variable) that enable virtualization
      to prepare for upcoming changes, and to clean up artifacts of KVM's
      previous behavior, which required manually juggling locks around
      kvm_usage_count.
      
      Drop the "nolock" qualifier from per-CPU functions now that there are no
      "nolock" implementations of the "all" variants, i.e. now that calling a
      non-nolock function from a nolock function isn't confusing (unlike this
      sentence).
      
      Drop "all" from the outer helpers as they no longer manually iterate
      over all CPUs, and because it might not be obvious what "all" refers to.
      
      In lieu of the above qualifiers, append "_cpu" to the end of the functions
      that are per-CPU helpers for the outer APIs.
      
      Opportunistically prepend "kvm" to all functions to help make it clear
      that they are KVM helpers, but mostly because there's no reason not to.
      
      Lastly, use "virtualization" instead of "hardware", because while the
      functions do enable virtualization in hardware, there are a _lot_ of
      things that KVM enables in hardware.
      
      Defer renaming the arch hooks to future patches, purely to reduce the
      amount of churn in a single commit.
      
       Reviewed-by: Chao Gao <chao.gao@intel.com>
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Acked-by: Kai Huang <kai.huang@intel.com>
       Tested-by: Farrah Chen <farrah.chen@intel.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240830043600.127750-4-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70c01943
    • KVM: Register cpuhp and syscore callbacks when enabling hardware · 9a798b13
      Sean Christopherson authored
      
      Register KVM's cpuhp and syscore callback when enabling virtualization
      in hardware instead of registering the callbacks during initialization,
      and let the CPU up/down framework invoke the inner enable/disable
      functions.  Registering the callbacks during initialization makes things
      more complex than they need to be, as KVM needs to be very careful about
       handling races between CPUs being onlined/offlined and hardware
      being enabled/disabled.
      
      Intel TDX support will require KVM to enable virtualization during KVM
      initialization, i.e. will add another wrinkle to things, at which point
      sorting out the potential races with kvm_usage_count would become even
      more complex.
      
      Note, using the cpuhp framework has a subtle behavioral change: enabling
      will be done serially across all CPUs, whereas KVM currently sends an IPI
      to all CPUs in parallel.  While serializing virtualization enabling could
      create undesirable latency, the issue is limited to the 0=>1 transition of
      VM creation.  And even that can be mitigated, e.g. by letting userspace
      force virtualization to be enabled when KVM is initialized.
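
       In rough terms, the 0=>1 path now does the registration itself (sketch;
       error handling and the kvm_usage_count bookkeeping are elided, and the
       function names are those used after the renames later in this series):

         static int kvm_enable_virtualization(void)
         {
         	int r;

         	/* cpuhp runs the per-CPU enable on every online CPU, serially */
         	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
         			      kvm_online_cpu, kvm_offline_cpu);
         	if (r)
         		return r;

         	register_syscore_ops(&kvm_syscore_ops);
         	return 0;
         }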
      
      Cc: Chao Gao <chao.gao@intel.com>
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Acked-by: Kai Huang <kai.huang@intel.com>
       Tested-by: Farrah Chen <farrah.chen@intel.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240830043600.127750-3-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9a798b13
    • KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock · 44d17459
      Sean Christopherson authored
      
      Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
      on x86 due to a chain of locks and SRCU synchronizations.  Translating the
      below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
      CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
      fairness of r/w semaphores).
      
          CPU0                     CPU1                     CPU2
      1   lock(&kvm->slots_lock);
      2                                                     lock(&vcpu->mutex);
      3                                                     lock(&kvm->srcu);
      4                            lock(cpu_hotplug_lock);
      5                            lock(kvm_lock);
      6                            lock(&kvm->slots_lock);
      7                                                     lock(cpu_hotplug_lock);
      8   sync(&kvm->srcu);
      
      Note, there are likely more potential deadlocks in KVM x86, e.g. the same
      pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
      __kvmclock_cpufreq_notifier():
      
        cpuhp_cpufreq_online()
        |
        -> cpufreq_online()
           |
           -> cpufreq_gov_performance_limits()
              |
              -> __cpufreq_driver_target()
                 |
                 -> __target_index()
                    |
                    -> cpufreq_freq_transition_begin()
                       |
                       -> cpufreq_notify_transition()
                          |
                          -> ... __kvmclock_cpufreq_notifier()
      
      But, actually triggering such deadlocks is beyond rare due to the
      combination of dependencies and timings involved.  E.g. the cpufreq
      notifier is only used on older CPUs without a constant TSC, mucking with
      the NX hugepage mitigation while VMs are running is very uncommon, and
      doing so while also onlining/offlining a CPU (necessary to generate
      contention on cpu_hotplug_lock) would be even more unusual.
      
      The most robust solution to the general cpu_hotplug_lock issue is likely
      to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
       notifier doesn't need to take kvm_lock.  For now, settle for fixing the most
      blatant deadlock, as switching to an RCU-protected list is a much more
      involved change, but add a comment in locking.rst to call out that care
       needs to be taken when holding kvm_lock and walking vm_list.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
        ------------------------------------------------------
        tee/35048 is trying to acquire lock:
        ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]
      
        but task is already holding lock:
        ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
      
        which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
        -> #3 (kvm_lock){+.+.}-{3:3}:
               __mutex_lock+0x6a/0xb40
               mutex_lock_nested+0x1f/0x30
               kvm_dev_ioctl+0x4fb/0xe50 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #2 (cpu_hotplug_lock){++++}-{0:0}:
               cpus_read_lock+0x2e/0xb0
               static_key_slow_inc+0x16/0x30
               kvm_lapic_set_base+0x6a/0x1c0 [kvm]
               kvm_set_apic_base+0x8f/0xe0 [kvm]
               kvm_set_msr_common+0x9ae/0xf80 [kvm]
               vmx_set_msr+0xa54/0xbe0 [kvm_intel]
               __kvm_set_msr+0xb6/0x1a0 [kvm]
               kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
               kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #1 (&kvm->srcu){.+.+}-{0:0}:
               __synchronize_srcu+0x44/0x1a0
               synchronize_srcu_expedited+0x21/0x30
               kvm_swap_active_memslots+0x110/0x1c0 [kvm]
               kvm_set_memslot+0x360/0x620 [kvm]
               __kvm_set_memory_region+0x27b/0x300 [kvm]
               kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
               kvm_vm_ioctl+0x295/0x650 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
               __lock_acquire+0x15ef/0x2e30
               lock_acquire+0xe0/0x260
               __mutex_lock+0x6a/0xb40
               mutex_lock_nested+0x1f/0x30
               set_nx_huge_pages+0x179/0x1e0 [kvm]
               param_attr_store+0x93/0x100
               module_attr_store+0x22/0x40
               sysfs_kf_write+0x81/0xb0
               kernfs_fop_write_iter+0x133/0x1d0
               vfs_write+0x28d/0x380
               ksys_write+0x70/0xe0
               __x64_sys_write+0x1f/0x30
               x64_sys_call+0x281b/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
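
       The core of the fix is small (sketch): guard kvm_usage_count with its own
       mutex so that the enable/disable paths no longer take kvm_lock at all,
       which breaks the dependency chain shown above.  Error handling and the
       helper doing the real work are elided/hypothetical:

         static DEFINE_MUTEX(kvm_usage_lock);

         static int kvm_enable_virtualization(void)
         {
         	guard(mutex)(&kvm_usage_lock);

         	if (kvm_usage_count++)
         		return 0;

         	/* first user: actually enable virtualization on all online CPUs */
         	return __kvm_enable_virtualization();	/* illustrative name only */
         }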
      
      Cc: Chao Gao <chao.gao@intel.com>
      Fixes: 0bf50497 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
      Cc: stable@vger.kernel.org
       Reviewed-by: Kai Huang <kai.huang@intel.com>
       Acked-by: Kai Huang <kai.huang@intel.com>
       Tested-by: Farrah Chen <farrah.chen@intel.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240830043600.127750-2-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44d17459
  4. Aug 23, 2024
    • KVM: Fix coalesced_mmio_has_room() to avoid premature userspace exit · 92f6d413
      Ilias Stamatis authored
      
      The following calculation used in coalesced_mmio_has_room() to check
      whether the ring buffer is full is wrong and results in premature exits if
      the start of the valid entries is in the first half of the ring buffer.
      
        avail = (ring->first - last - 1) % KVM_COALESCED_MMIO_MAX;
        if (avail == 0)
      	  /* full */
      
      Because negative values are handled using two's complement, and KVM
      computes the result as an unsigned value, the above will get a false
      positive if "first < last" and the ring is half-full.
      
      The above might have worked as expected in python for example:
        >>> (-86) % 170
        84
      
      However it doesn't work the same way in C.
      
        printf("avail: %d\n", (-86) % 170);
        printf("avail: %u\n", (-86) % 170);
        printf("avail: %u\n", (-86u) % 170u);
      
      Using gcc-11 these print:
      
        avail: -86
        avail: 4294967210
        avail: 0
      
      For illustration purposes, given a 4-bit integer and a ring size of 0xA
       (unsigned), 0xA == 0b1010 == -6, and thus (-6u % 0xA) == 0.
      
       Fix the calculation and allow all but one entry in the buffer to be
      used as originally intended.
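
       The restored check just compares the next write slot against ring->first
       (sketch; one slot is deliberately left unused so that first == last means
       "empty"):

         static bool coalesced_mmio_has_room(struct kvm_coalesced_mmio_dev *dev, u32 last)
         {
         	struct kvm_coalesced_mmio_ring *ring = dev->kvm->coalesced_mmio_ring;

         	/* full if advancing "last" would catch up with the reader */
         	return (last + 1) % KVM_COALESCED_MMIO_MAX != READ_ONCE(ring->first);
         }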
      
      Note, KVM's behavior is self-healing to some extent, as KVM will allow the
      entire buffer to be used if ring->first is beyond the halfway point.  In
      other words, in the unlikely scenario that a use case benefits from being
      able to coalesce more than 86 entries at once, KVM will still provide such
      behavior, sometimes.
      
      Note #2, the % operator in C is not the modulo operator but the remainder
      operator. Modulo and remainder operators differ with respect to negative
      values.  But, the relevant values in KVM are all unsigned, so it's a moot
      point in this case anyway.
      
      Note #3, this is almost a pure revert of the buggy commit, plus a
       READ_ONCE() to provide additional safety.  The buggy commit justified the
      change with "it paves the way for making this function lockless", but it's
      not at all clear what was intended, nor is there any evidence that the
      buggy code was somehow safer.  (a) the fields in question were already
      accessed locklessly, from the perspective that they could be modified by
      userspace at any time, and (b) the lock guarding the ring itself was
      changed, but never dropped, i.e. whatever lockless scheme (SRCU?) was
      planned never landed.
      
      Fixes: 105f8d40 ("KVM: Calculate available entries in coalesced mmio ring")
       Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
       Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240718193543.624039-2-ilstam@amazon.com
      
      
      [sean: rework changelog to clarify behavior, call out weirdness of buggy commit]
       Signed-off-by: Sean Christopherson <seanjc@google.com>
      92f6d413
  5. Aug 14, 2024
    • KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) · 66155de9
      Sean Christopherson authored
      
      Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
      directly emulate instructions for ES/SNP, and instead the guest must
      explicitly request emulation.  Unless the guest explicitly requests
      emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
      SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
      
      But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
      because except for ES/SNP, doing so requires setting reserved bits in the
      SPTE, i.e. the SPTE can't be readable while also generating a #VC on
      writes.  Because KVM never creates MMIO SPTEs and jumps directly to
      emulation, the guest never gets a #VC.  And since KVM simply resumes the
      guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
      into an infinite #NPF loop if the vCPU attempts to write read-only memory.
      
      Disallow read-only memory for all VMs with protected state, i.e. for
      upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
      to support read-only memory, as TDX uses EPT Violation #VE to reflect the
      fault into the guest, e.g. KVM could configure read-only SPTEs with RX
      protections and SUPPRESS_VE=0.  But there is no strong use case for
      supporting read-only memslots on TDX, e.g. the main historical usage is
      to emulate option ROMs, but TDX disallows executing from shared memory.
      And if someone comes along with a legitimate, strong use case, the
      restriction can always be lifted for TDX.
      
      Don't bother trying to retroactively apply the restriction to SEV-ES
      VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
       possibly work for SEV-ES, i.e. disallowing such memslots really just
      means reporting an error to userspace instead of silently hanging vCPUs.
      Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
      isn't worth the marginal benefit it would provide userspace.
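
       The enforcement boils down to not advertising KVM_MEM_READONLY as a valid
       flag for such VMs, roughly (sketch; the predicate name is illustrative and
       its implementation is per-arch, false on x86 for VMs with protected state):

         /* in check_memory_region_flags(), sketch */
         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

         if (kvm_arch_has_readonly_mem(kvm))
         	valid_flags |= KVM_MEM_READONLY;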
      
      Fixes: 26c44aa9 ("KVM: SEV: define VM types for SEV and SEV-ES")
      Fixes: 1dfe571c ("KVM: SEV: Add initial SEV-SNP support")
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Ackerly Tng <ackerleytng@google.com>
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Message-ID: <20240809190319.1710470-2-seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66155de9
  6. Aug 13, 2024
    • KVM: eventfd: Use synchronize_srcu_expedited() on shutdown · c9b35a6f
      Li RongQing authored
      
       When hot-unplugging a device with many queues, the guest sees large CPU
       jitter and the unplug itself is very slow.
       
       It turns out that synchronize_srcu() in irqfd_shutdown() causes both the
       guest jitter and the unplugging latency, so replace it with
       synchronize_srcu_expedited() to accelerate the unplugging and reduce the
       guest OS jitter; this accelerates the VM reboot too.
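
       The change itself is a one-line substitution per call site, roughly:

         -	synchronize_srcu(&kvm->irq_srcu);
         +	synchronize_srcu_expedited(&kvm->irq_srcu);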
      
       Signed-off-by: Li RongQing <lirongqing@baidu.com>
       Message-ID: <20240711121130.38917-1-lirongqing@baidu.com>
       [Call it just once in irqfd_resampler_shutdown. - Paolo]
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c9b35a6f
    • introduce fd_file(), convert all accessors to it. · 1da91ea8
      Al Viro authored
      
      	For any changes of struct fd representation we need to
      turn existing accesses to fields into calls of wrappers.
      Accesses to struct fd::flags are very few (3 in linux/file.h,
      1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
      explicit initializers).
      	Those can be dealt with in the commit converting to
      new layout; accesses to struct fd::file are too many for that.
      	This commit converts (almost) all of f.file to
      fd_file(f).  It's not entirely mechanical ('file' is used as
      a member name more than just in struct fd) and it does not
      even attempt to distinguish the uses in pointer context from
      those in boolean context; the latter will be eventually turned
      into a separate helper (fd_empty()).
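
       	For reference, the accessor is trivial today (sketch); the point is
       that a future change of the struct fd layout only has to touch this one
       definition:

         /* include/linux/file.h (sketch) */
         #define fd_file(f)	((f).file)

         /* typical call-site conversion: f.file -> fd_file(f) */
         struct fd f = fdget(fd);
         if (!fd_file(f))
         	return -EBADF;
         ret = vfs_fsync(fd_file(f), 0);
         fdput(f);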
      
      	NOTE: mass conversion to fd_empty(), tempting as it
       might be, is a bad idea; better do that piecewise in the commits
      that convert from fdget...() to CLASS(...).
      
      [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
      caught by git; fs/stat.c one got caught by git grep]
      [fs/xattr.c conflict]
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1da91ea8
  7. Jul 26, 2024
    • KVM: guest_memfd: abstract how prepared folios are recorded · 66a644c0
      Paolo Bonzini authored
      
      Right now, large folios are not supported in guest_memfd, and therefore the order
      used by kvm_gmem_populate() is always 0.  In this scenario, using the up-to-date
      bit to track prepared-ness is nice and easy because we have one bit available
      per page.
      
      In the future, however, we might have large pages that are partially populated;
      for example, in the case of SEV-SNP, if a large page has both shared and private
      areas inside, it is necessary to populate it at a granularity that is smaller
      than that of the guest_memfd's backing store.  In that case we will have
      to track preparedness at a 4K level, probably as a bitmap.
      
       In preparation for that, stop using folio_test_uptodate() and
       folio_mark_uptodate() explicitly.  Return the state of the page directly from
      __kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages
      with N=*max_order.  The function to mark a range as prepared for now
      takes just a folio, but is expected to take also an index and order
      (or something like that) when large pages are introduced.
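
       Concretely, callers stop poking at the folio flag themselves and instead
       get/record the state through guest_memfd helpers, along these lines
       (sketch; parameter order is illustrative):

         static struct folio *__kvm_gmem_get_pfn(struct file *file,
         					struct kvm_memory_slot *slot,
         					gfn_t gfn, kvm_pfn_t *pfn,
         					bool *is_prepared, int *max_order);

         static void kvm_gmem_mark_prepared(struct folio *folio)
         {
         	folio_mark_uptodate(folio);	/* today: one bit per order-0 folio */
         }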
      
      Thanks to Michael Roth for pointing out the issue with large pages.
      
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66a644c0
    • KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns · e4ee5447
      Paolo Bonzini authored
      
      This check is currently performed by sev_gmem_post_populate(), but it
      applies to all callers of kvm_gmem_populate(): the point of the function
      is that the memory is being encrypted and some work has to be done
      on all the gfns in order to encrypt them.
      
      Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
      to invoking the callback, and stop the operation if a shared page
      is encountered.  Because CONFIG_KVM_PRIVATE_MEM in principle does
      not require attributes, this makes kvm_gmem_populate() depend on
      CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).
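
       Roughly, kvm_gmem_populate() now bails out before invoking the callback if
       any page in the range is not private (sketch):

         if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
         				     KVM_MEMORY_ATTRIBUTE_PRIVATE,
         				     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
         	ret = -EINVAL;
         	break;
         }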
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e4ee5447
    • KVM: extend kvm_range_has_memory_attributes() to check subset of attributes · 4b5f6712
      Paolo Bonzini authored
      
      While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
      KVM code such as kvm_mem_is_private() is written to expect their existence.
      Allow using kvm_range_has_memory_attributes() as a multi-page version of
      kvm_mem_is_private(), without it breaking later when more attributes are
      introduced.
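
       The extended helper takes a mask of attributes to test in addition to the
       required values, roughly:

         bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
         				     unsigned long mask, unsigned long attrs);

         /* multi-page analogue of kvm_mem_is_private(): */
         kvm_range_has_memory_attributes(kvm, start, end,
         				KVM_MEMORY_ATTRIBUTE_PRIVATE,
         				KVM_MEMORY_ATTRIBUTE_PRIVATE);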
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b5f6712
    • KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes() · e300614f
      Paolo Bonzini authored
      
      Use a guard to simplify early returns, and add two more easy
      shortcuts.  If the requested attributes are invalid, the attributes
      xarray will never show them as set.  And if testing a single page,
      kvm_get_memory_attributes() is more efficient.
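
       The added shortcuts amount to (sketch):

         /* attributes that are not supported can never be set in the xarray */
         if (attrs & ~kvm_supported_mem_attributes(kvm))
         	return false;

         /* single page: a point lookup beats walking the xarray */
         if (end == start + 1)
         	return (kvm_get_memory_attributes(kvm, start) & attrs) == attrs;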
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e300614f
    • KVM: guest_memfd: move check for already-populated page to common code · de802524
      Paolo Bonzini authored
      
      Do not allow populating the same page twice with startup data.  In the
      case of SEV-SNP, for example, the firmware does not allow it anyway,
      since the launch-update operation is only possible on pages that are
      still shared in the RMP.
      
      Even if it worked, kvm_gmem_populate()'s callback is meant to have side
      effects such as updating launch measurements, and updating the same
      page twice is unlikely to have the desired results.
      
      Races between calls to the ioctl are not possible because
      kvm_gmem_populate() holds slots_lock and the VM should not be running.
      But again, even if this worked on other confidential computing technology,
      it doesn't matter to guest_memfd.c whether this is something fishy
      such as missing synchronization in userspace, or rather something
      intentional.  One of the racers wins, and the page is initialized by
      either kvm_gmem_prepare_folio() or kvm_gmem_populate().
      
       Out of paranoia, adjust sev_gmem_post_populate() anyway to use
      the same errno that kvm_gmem_populate() is using.
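
       The common-code check is simply (sketch, in kvm_gmem_populate()'s loop):

         if (folio_test_uptodate(folio)) {
         	folio_put(folio);
         	ret = -EEXIST;
         	break;
         }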
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de802524
    • KVM: remove kvm_arch_gmem_prepare_needed() · 7239ed74
      Paolo Bonzini authored
      
      It is enough to return 0 if a guest need not do any preparation.
      This is in fact how sev_gmem_prepare() works for non-SNP guests,
      and it extends naturally to Intel hosts: the x86 callback for
      gmem_prepare is optional and returns 0 if not defined.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7239ed74
    • KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm · 6dd761d9
      Paolo Bonzini authored
      
      This is now possible because preparation is done by kvm_gmem_get_pfn()
      instead of fallocate().  In practice this is not a limitation, because
      even though guest_memfd can be bound to multiple struct kvm, for
      hardware implementations of confidential computing only one guest
      (identified by an ASID on SEV-SNP, or an HKID on TDX) will be able
      to access it.
      
      In the case of intra-host migration (not implemented yet for SEV-SNP,
      but we can use SEV-ES as an idea of how it will work), the new struct
      kvm inherits the same ASID and preparation need not be repeated.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6dd761d9
    • KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest · b8552431
      Paolo Bonzini authored
      
      Initializing the contents of the folio on fallocate() is unnecessarily
      restrictive.  It means that the page is registered with the firmware and
      then it cannot be touched anymore.  In particular, this loses the
      possibility of using fallocate() to pre-allocate the page for SEV-SNP
      guests, because kvm_arch_gmem_prepare() then fails.
      
      It's only when the guest actually accesses the page (and therefore
      kvm_gmem_get_pfn() is called) that the page must be cleared from any
      stale host data and registered with the firmware.  The up-to-date flag
      is clear if this has to be done (i.e. it is the first access and
      kvm_gmem_populate() has not been called).
      
      All in all, there are enough differences between kvm_gmem_get_pfn() and
      kvm_gmem_populate(), that it's better to separate the two flows completely.
       Extract the bulk of kvm_gmem_get_folio(), which takes a folio and ends up
      setting its up-to-date flag, to a new function kvm_gmem_prepare_folio();
      these are now done only by the non-__-prefixed kvm_gmem_get_pfn().
      As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument.
      
      One difference is that fallocate(PUNCH_HOLE) can now race with a
       page fault.  Potentially this causes a page to be prepared and left in the
      filemap even after fallocate(PUNCH_HOLE).  This is harmless, as it can be
      fixed by another hole punching operation, and can be avoided by clearing
      the private-page attribute prior to invoking fallocate(PUNCH_HOLE).
      This way, the page fault will cause an exit to user space.
      
      The previous semantics, where fallocate() could be used to prepare
      the pages in advance of running the guest, can be accessed with
      KVM_PRE_FAULT_MEMORY.
      
      For now, accessing a page in one VM will attempt to call
      kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd.
      Cleaning this up is left to a separate patch.
      
       Suggested-by: Sean Christopherson <seanjc@google.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b8552431
    • KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfn · 78c42933
      Paolo Bonzini authored
      
      Allow testing the up-to-date flag in the caller without taking the
      lock again.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      78c42933
    • KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_* · 564429a6
      Paolo Bonzini authored
      
      Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
      the arch-independent step to clear out contents left in the page by the
      host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
      For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      564429a6
    • KVM: guest_memfd: do not go through struct page · 7fbdda31
      Paolo Bonzini authored
      
      We have a perfectly usable folio, use it to retrieve the pfn and order.
      All that's needed is a version of folio_file_page that returns a pfn.
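
       Such a helper is a one-liner (sketch):

         static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
         {
         	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
         }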
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7fbdda31
    • KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation · d04c77d2
      Paolo Bonzini authored
      
       The up-to-date flag as it is now is not too useful; it tells guest_memfd not
      to overwrite the contents of a folio, but it doesn't say that the page
      is ready to be mapped into the guest.  For encrypted guests, mapping
      a private page requires that the "preparation" phase has succeeded,
      and at the same time the same page cannot be prepared twice.
      
      So, ensure that folio_mark_uptodate() is only called on a prepared page.  If
      kvm_gmem_prepare_folio() or the post_populate callback fail, the folio
      will not be marked up-to-date; it's not a problem to call clear_highpage()
      again on such a page prior to the next preparation attempt.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d04c77d2
    • KVM: guest_memfd: return folio from __kvm_gmem_get_pfn() · d0d87226
      Paolo Bonzini authored
      
      Right now this is simply more consistent and avoids use of pfn_to_page()
      and put_page().  It will be put to more use in upcoming patches, to
      ensure that the up-to-date flag is set at the very end of both the
      kvm_gmem_get_pfn() and kvm_gmem_populate() flows.
      
       Reviewed-by: Michael Roth <michael.roth@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d0d87226