ASUS TUF A16 - RX 7600S falls off the bus after 8bd82363e2ee
On my ASUS TUF A16 laptop, the dGPU (an RX 7600S) will occasionally fall off the bus - that is, the pcieport driver complains about a broken device/link:
kernel: pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
kernel: pcieport 0000:00:01.1: retraining failed
kernel: pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
kernel: pcieport 0000:00:01.1: retraining failed
kernel: pcieport 0000:01:00.0: not ready 1023ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 2047ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 4095ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 8191ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 16383ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 32767ms after resume; waiting
kernel: pcieport 0000:01:00.0: not ready 65535ms after resume; giving up
kernel: pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
kernel: pcieport 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: pcieport 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: amdgpu 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: [drm:gmc_v11_0_flush_gpu_tlb [amdgpu]] *ERROR* Timeout waiting for sem acquire in VM flush!
kernel: amdgpu 0000:03:00.0: amdgpu: Timeout waiting for VM flush ACK!
kernel: [drm:gmc_v11_0_flush_gpu_tlb [amdgpu]] *ERROR* Timeout waiting for sem acquire in VM flush!
kernel: amdgpu 0000:03:00.0: amdgpu: Timeout waiting for VM flush ACK!
This is followed by amdgpu trying to gracefully shut down the device, which triggers a slew of WARNs from various cleanup functions that assume there is still a GPU to clean up.
Bisecting identified 8bd82363e2ee ("drm/amdgpu: revert "take runtime pm reference when we attach a buffer" v2") as the first bad commit.
The commit itself looks sensible/correct - I suspect it merely exposes some underlying problem with GPU/PCI power management on this platform, though I don't know how to debug this or which information would be needed.
It's also curious that the "non-functional downstream link" message comes from a PCI bridge device as opposed to the actual GPU (see the lspci output for the involved devices):
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h-19h PCIe GPP Bridge (prog-if 00 [Normal decode])
        Subsystem: ASUSTeK Computer Inc. Device 1513
        Flags: bus master, fast devsel, latency 0, IRQ 33, IOMMU group 1
        Bus: primary=00, secondary=01, subordinate=03, sec-latency=0
        I/O behind bridge: f000-ffff [size=4K] [16-bit]
        Memory behind bridge: fcb00000-fcdfffff [size=3M] [32-bit]
        Prefetchable memory behind bridge: 7c00000000-7e0fffffff [size=8448M] [32-bit]
        Capabilities: <access denied>
        Kernel driver in use: pcieport
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev 12) (prog-if 00 [Normal decode])
        Physical Slot: 0
        Flags: bus master, fast devsel, latency 0, IRQ 39, IOMMU group 13
        Memory at fcd00000 (32-bit, non-prefetchable) [size=16K]
        Bus: primary=01, secondary=02, subordinate=03, sec-latency=0
        I/O behind bridge: f000-ffff [size=4K] [16-bit]
        Memory behind bridge: fcb00000-fccfffff [size=2M] [32-bit]
        Prefetchable memory behind bridge: 7c00000000-7e0fffffff [size=8448M] [32-bit]
        Capabilities: <access denied>
        Kernel driver in use: pcieport
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch (rev 12) (prog-if 00 [Normal decode])
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
        Flags: bus master, fast devsel, latency 0, IRQ 40, IOMMU group 14
        Bus: primary=02, secondary=03, subordinate=03, sec-latency=0
        I/O behind bridge: f000-ffff [size=4K] [16-bit]
        Memory behind bridge: fcb00000-fccfffff [size=2M] [32-bit]
        Prefetchable memory behind bridge: 7c00000000-7e0fffffff [size=8448M] [32-bit]
        Capabilities: <access denied>
        Kernel driver in use: pcieport
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev c3) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Device 231d
        Flags: bus master, fast devsel, latency 0, IRQ 68, IOMMU group 15
        Memory at 7c00000000 (64-bit, prefetchable) [size=8G]
        Memory at 7e00000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=256]
        Memory at fcb00000 (32-bit, non-prefetchable) [size=1M]
        Expansion ROM at fcc00000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
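Since the suspicion is a power management problem somewhere along this chain, one way to gather more data would be to snapshot the runtime PM state of the four devices listed above, both while things work and after the GPU has dropped off. The following is only a sketch and not part of the original report. It assumes the usual PCI sysfs attributes (power/control, power/runtime_status, power_state and the current_link_* files) exist on the running kernel; any that are missing or unreadable are reported as "n/a". The function names are made up for illustration.

```python
from pathlib import Path

# Addresses taken from the lspci output: root port, switch upstream
# port, switch downstream port, and the GPU itself.
DEVICES = ["0000:00:01.1", "0000:01:00.0", "0000:02:00.0", "0000:03:00.0"]

# Assumed sysfs attributes; availability depends on kernel version
# and device type, so every read is allowed to fail.
ATTRS = [
    "power/control",
    "power/runtime_status",
    "power_state",
    "current_link_speed",
    "current_link_width",
]


def read_attr(dev_dir, attr):
    """Return the stripped attribute value, or "n/a" if it cannot be read."""
    try:
        return (dev_dir / attr).read_text().strip()
    except OSError:
        return "n/a"


def snapshot(devices=DEVICES, sysfs_root="/sys/bus/pci/devices"):
    """Collect the attributes above for each device into a nested dict."""
    root = Path(sysfs_root)
    return {
        bdf: {attr: read_attr(root / bdf, attr) for attr in ATTRS}
        for bdf in devices
    }


if __name__ == "__main__":
    for bdf, attrs in snapshot().items():
        print(bdf, " ".join(f"{name}={value}" for name, value in attrs.items()))
```

Comparing a snapshot taken while the dGPU is idle with one taken after the "retraining failed" messages should at least show which devices in the chain were runtime suspended, and what link speed the root port reports, at each point.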
Let me know if there is any other helpful info I could provide.