Machine check exception on device reset while in DC5/DC6
I have a somewhat non-standard GPU passthrough setup where the GPU is handled exclusively by a Windows 10 VM (qemu/KVM).
If the graphics device is reset (PCIe function level reset, FLR), I sometimes see a machine check exception on the host. This happens (almost) always if the FLR occurs while the guest driver has put the card into one of the DC states (i.e. (DC_STATE_EN & 0x03) != 0). The device reset itself is part of the cleanup sequence in the vfio driver when qemu is killed or crashes. In other words, if qemu crashes while the card is in DC5/DC6, the host suffers a machine check and must be rebooted.
I've written a device specific reset function in drivers/pci/quirks.c that disables DC states by clearing bits 0x03 in DC_STATE_EN before the PCIe FLR. This resolves the issue for me. I can provide a patch if desired.
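To illustrate the ordering only, here is a minimal Python model of the workaround. It is not the kernel code: FakeDevice, its methods and reset_with_dc_disabled are hypothetical names, and the register is simulated in memory. The only detail taken from the description above is the mask 0x03 in DC_STATE_EN.

```python
DC_STATE_MASK = 0x03  # low two bits of DC_STATE_EN, as described above


class FakeDevice:
    """Hypothetical stand-in for the GPU; a real quirk would use MMIO and a real FLR."""

    def __init__(self, dc_state_en):
        self.dc_state_en = dc_state_en
        self.log = []  # records the order of operations

    def read_dc_state_en(self):
        return self.dc_state_en

    def write_dc_state_en(self, value):
        self.dc_state_en = value
        self.log.append(f"write DC_STATE_EN={value:#x}")

    def flr(self):
        self.log.append("FLR")


def reset_with_dc_disabled(dev):
    """Clear the DC state enable bits (if any are set), then issue the FLR."""
    value = dev.read_dc_state_en()
    if value & DC_STATE_MASK:
        # Read-modify-write so that unrelated bits are preserved.
        dev.write_dc_state_en(value & ~DC_STATE_MASK)
    dev.flr()


if __name__ == "__main__":
    dev = FakeDevice(dc_state_en=0x02)
    reset_with_dc_disabled(dev)
    print(dev.log)
```

The point of the sketch is that the DC bits are cleared strictly before the FLR, and that the write is skipped when no DC state is enabled.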
Affected device IDs include 8086:3ea0 (WHL) and 8086:5917 (KBL). SKL does not seem to have this issue. However, this may just be because the guest driver does not enable DC states on SKL.
The machine check is reported in bank 4 with MC4_STATUS=0xba00000011000402, which seems to indicate a "PCU internal error".
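For reference, a small Python sketch that splits the status value above into its architectural fields. The bit positions follow the generic IA32_MCi_STATUS layout as I understand it from the Intel SDM; the function name decode_mci_status is my own, and the sketch does not interpret the model-specific code itself.

```python
# Decode the architectural fields of an IA32_MCi_STATUS value.
MC4_STATUS = 0xBA00000011000402

# Single-bit flags in the upper part of the register (name -> bit position).
FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds valid data
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Return a dict with the flag bits and the two 16-bit error code fields."""
    fields = {name: (status >> bit) & 1 for name, bit in FLAGS.items()}
    fields["MCACOD"] = status & 0xFFFF         # architectural MCA error code
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(MC4_STATUS).items():
        print(f"{name:6} = {value:#x}")
```

For the value above this yields VAL=1, OVER=0, UC=1, EN=1, MISCV=1, ADDRV=0, PCC=1, MCACOD=0x0402 and MSCOD=0x1100, i.e. an uncorrected error with processor context corrupt and no valid address.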
Any insights on the issue would be appreciated.
Regards, Christian