Regression - Fence fallback timer expired on ring - Virtualized environment

You can use amdgpu.runpm=0 to disable runtime power management when you load the driver. It sounds like either runtime power management or MSIs are not working in the guest VM. Runtime power management relies on the OS putting the device into D3 at runtime. If the guest is not doing that, the driver portion of that won't work either.

Thanks a lot! :)

It seems that setting both amdgpu.runpm=0 and pci=nomsi fixes the issue (to be confirmed; that will take a few days, but from my first test it seems to do the trick).
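To confirm which of the two workarounds a given boot actually used, the kernel command line can be checked from inside the guest. The following is a minimal sketch, not part of the driver or of Qubes; the function name `active_workarounds` is made up for illustration, and the simple whitespace split assumes no quoted parameter values containing spaces.

```python
def active_workarounds(cmdline: str) -> dict:
    """Report whether amdgpu.runpm=0 and pci=nomsi appear on a kernel command line."""
    runpm_disabled = False
    msi_disabled = False
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name == "amdgpu.runpm":
            # The last occurrence wins in this sketch.
            runpm_disabled = (value == "0")
        elif name == "pci" and "nomsi" in value.split(","):
            # pci= accepts a comma-separated option list, e.g. pci=nomsi,noaer
            msi_disabled = True
    return {"runpm_disabled": runpm_disabled, "msi_disabled": msi_disabled}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a running guest, `active_workarounds(read_cmdline())` shows which of the two switches are in effect, which helps when testing them one at a time.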

So the problem (if confirmed) is probably on the Xen side. I do not really know what "pci=nomsi" does. Do you know what it means that those two parameters have to be set?

Does just setting pci=nomsi alone fix it or do you need both parameters?

MSIs (https://en.wikipedia.org/wiki/Message_Signaled_Interrupts) are a way of delivering interrupts on modern devices. It would appear xen does not handle them correctly.

For runtime power management, the OS generally puts the device into D3 state when it's idle. The D state is controlled via PCI config space which the hypervisor generally owns. When the guest requests a change to PCI config space it's generally either proxied to the hypervisor or dropped. I'm not sure how xen handles this. If it's dropped, it won't work.
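To see which D state the guest believes the device is in, the Power Management capability can be decoded from a config space dump. This is a rough sketch under some assumptions: the bytes come from something like /sys/bus/pci/devices/<BDF>/config read as root (unprivileged reads only return the first 64 bytes), and what the guest reads there is whatever the hypervisor or device model chooses to present, not necessarily the real hardware state. The helper names are made up for illustration.

```python
import struct

PCI_STATUS = 0x06            # status register offset
PCI_STATUS_CAP_LIST = 0x10   # bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_PM = 0x01         # Power Management capability ID
D_STATES = ("D0", "D1", "D2", "D3hot")


def find_capability(config: bytes, cap_id: int):
    """Walk the capability list and return the offset of cap_id, or None."""
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guard against a looping list
    while pos >= 0x40 and pos + 1 < len(config) and pos not in seen:
        seen.add(pos)
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def power_state(config: bytes):
    """Return the D state from PMCSR bits 1:0, or None if there is no PM capability."""
    pos = find_capability(config, PCI_CAP_ID_PM)
    if pos is None or pos + 6 > len(config):
        return None
    (pmcsr,) = struct.unpack_from("<H", config, pos + 4)
    return D_STATES[pmcsr & 0x3]
```

If the guest asks for D3 and this still reports D0 afterwards, that would be consistent with the config space write being dropped.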

Setting pci=nomsi alone fixes it; there is no need for amdgpu.runpm=0.

And thanks for the explanation.

So I guess it means that MSIs are dropped, ignored, or not handled correctly. But theoretically that should not be the case. (I opened a ticket in the QubesOS issue tracker: https://github.com/QubesOS/qubes-issues/issues/7971)

Yeah, it should be possible to use MSIs in VMs.

Your description in https://github.com/QubesOS/qubes-issues/issues/7971 is not quite correct. MSIs have nothing to do with power management and amdgpu has been using MSIs since the driver was created. MSIs are used for interrupts.

See amdgpu_irq_init() in amdgpu_irq.c for how the driver interacts with the pci subsystem for allocating interrupts. The driver does use MSI-X if it's available.

Thanks, I will check that code and report the details to the Qubes project. Since it was working before Linux kernel 5.7, does this mean that only some MSI calls are not working, and that the unsupported MSI calls were introduced in 5.7 and 5.10?

At some point the driver switched from using MSIs only to using MSI-X if available. That might have been in that timeframe.

The MSI/MSI-X situation in Qubes is currently weird. MSI should work just fine. But MSI-X does not work yet, and is hidden from the PCIe caps by the device model. So, if the driver is looking for MSI, it's fine. If it expects MSI-X, it won't find it. But then it may fall back either to MSI (which generally works fine) or to INTx (which we have found broken in several drivers, as that path is rarely tested for MSI-supporting devices).

Proper MSI-X support is not far away, but we still have some work to do in that area.
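One way to check from inside the guest whether MSI-X really is hidden is to list the capability IDs the device model exposes. A minimal sketch follows, assuming a config space dump read as root from /sys/bus/pci/devices/<BDF>/config; the capability IDs 0x05 (MSI) and 0x11 (MSI-X) are the standard PCI values, while the function names are made up for illustration.

```python
PCI_CAP_ID_MSI = 0x05
PCI_CAP_ID_MSIX = 0x11


def capability_ids(config: bytes) -> list:
    """Return the capability IDs visible in a PCI config space dump, in list order."""
    ids = []
    # Offset 0x06 is the low byte of the status register; bit 4 = capability list.
    if len(config) < 0x40 or not config[0x06] & 0x10:
        return ids
    pos = config[0x34] & 0xFC  # first capability pointer
    visited = set()            # guard against a looping list
    while pos >= 0x40 and pos + 1 < len(config) and pos not in visited:
        visited.add(pos)
        ids.append(config[pos])
        pos = config[pos + 1] & 0xFC
    return ids


def interrupt_caps(config: bytes) -> dict:
    """Report which message-signaled interrupt capabilities the guest can see."""
    ids = capability_ids(config)
    return {"msi": PCI_CAP_ID_MSI in ids, "msix": PCI_CAP_ID_MSIX in ids}
```

In the configuration described above, one would expect `msi` to be true and `msix` to be false for the passed-through GPU.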

Is there a way to tell the driver to use MSI in absence of MSI-X? pci=nomsi is a global switch that disables both...

Shouldn't pci_alloc_irq_vectors() fall back to MSIs if MSI-X is not available? I suppose the driver could check if we are in a Xen VM and not set the PCI_IRQ_MSIX flag, but that seems like a hack. The PCI core should handle this transparently to the driver.
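The fallback order in question can be illustrated with a small model. This is not kernel code, only a sketch of the expected behaviour: MSI-X is tried first, then MSI, then legacy INTx, and only for the types the driver requested. The names `IrqType` and `choose_irq_type` are made up, and `usable` lumps together "capability visible" and "enabling it succeeds", which the real PCI core handles separately.

```python
from enum import Flag, auto


class IrqType(Flag):
    INTX = auto()   # legacy pin-based interrupt
    MSI = auto()
    MSIX = auto()


def choose_irq_type(requested: IrqType, usable: IrqType, min_vecs: int = 1):
    """Pick the interrupt type a driver would end up with, or None on failure."""
    # Prefer MSI-X, then MSI, among the types the driver asked for.
    for kind in (IrqType.MSIX, IrqType.MSI):
        if kind in requested and kind in usable:
            return kind
    # INTx provides a single vector, so it only qualifies when one is enough.
    if IrqType.INTX in requested and IrqType.INTX in usable and min_vecs == 1:
        return IrqType.INTX
    return None
```

With `requested = IrqType.MSIX | IrqType.MSI` and MSI-X missing from `usable`, the model returns `IrqType.MSI`, which is the transparent fallback one would hope for.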

I see that currently both the PCI_IRQ_MSIX and PCI_IRQ_MSI flags are set, so theoretically it should be okay. Skipping PCI_IRQ_MSIX on Xen in the upstream driver indeed sounds bad (the issue with MSI-X is specific to our configuration, namely using qemu in a stubdomain; it should work just fine in a default Xen setup). I may include such a patch in our kernel until proper MSI-X support is done, though. But I'd need to double-check whether that really helps first.

Thanks for the help!

closed

It's strange, but I have exactly the same error on the host OS, without virtual machines.
