RX 580 / Vega56 as eGPU amdgpu: gpu post error!
Submitted by Robert Strube
Assigned to Default DRI bug account
Link to original bug (#108521)
Description
Hello everyone,
I've been attempting to get my RX 580 working correctly as an eGPU using the Akitio Node eGPU enclosure (over Thunderbolt 3).
I've confirmed that both the Akitio Node and my laptop's Thunderbolt 3 controller are running the most up-to-date firmware. I've also been able to successfully authorize the Thunderbolt eGPU enclosure, and I can see the RX 580 in lspci, see below:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:13.0 Non-VGA unclassified device: Intel Corporation 100 Series/C230 Series Chipset Family Integrated Sensor Hub (rev 31)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation QM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 [Radeon RX Vega M GL] (rev c0)
02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
04:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
06:00.0 System peripheral: Intel Corporation JHL6540 Thunderbolt 3 NHI (C step) [Alpine Ridge 4C 2016] (rev 02)
07:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
08:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
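For anyone comparing their own setup against this listing, the GPU entries can be picked out of plain `lspci` output with a short script. This is only a minimal sketch, not part of the original report: the function name `find_gpus` and the set of class names treated as GPUs are assumptions for illustration.

```python
import re

# Matches the default `lspci` line layout: "<bus>:<dev>.<fn> <class>: <name>"
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?P<cls>[^:]+): (?P<name>.*)$"
)

# Class names treated as GPUs here (an assumption; extend if needed).
GPU_CLASSES = {"VGA compatible controller", "Display controller", "3D controller"}


def find_gpus(lspci_text):
    """Return (address, class, name) tuples for GPU-like devices."""
    gpus = []
    for line in lspci_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("cls") in GPU_CLASSES:
            gpus.append((m.group("addr"), m.group("cls"), m.group("name")))
    return gpus
```

Run against the listing above, this should report three devices: the Intel iGPU at 00:02.0, the Vega M at 01:00.0, and the RX 580 at 09:00.0 behind the Thunderbolt bridges.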
Looking at just the RX 580 in more detail using lspci -v we have:
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Ellesmere [Radeon RX 470/480/570/570X/580/580X]
Flags: fast devsel, IRQ 18
Memory at 2fb0000000 (64-bit, prefetchable) [size=256M]
Memory at 2fc0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 2000 [size=256]
Memory at bc000000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at bc040000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [200] #15
Capabilities: [270] #19
Capabilities: [2b0] Address Translation Service (ATS)
Capabilities: [2c0] Page Request Interface (PRI)
Capabilities: [2d0] Process Address Space ID (PASID)
Capabilities: [320] Latency Tolerance Reporting
Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
Capabilities: [370] L1 PM Substates
Kernel modules: amdgpu
When looking at dmesg I see the following (I've removed non-relevant lines):
[ 8.534250] amdgpu 0000:09:00.0: enabling device (0006 -> 0007)
[ 8.534756] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1682:0xC580 0xE7).
[ 8.537567] [drm] register mmio base: 0xBC000000
[ 8.537568] [drm] register mmio size: 262144
[ 8.537598] [drm] add ip block number 0 <vi_common>
[ 8.537599] [drm] add ip block number 1 <gmc_v8_0>
[ 8.537599] [drm] add ip block number 2 <tonga_ih>
[ 8.537599] [drm] add ip block number 3 <powerplay>
[ 8.537600] [drm] add ip block number 4 <dm>
[ 8.537600] [drm] add ip block number 5 <gfx_v8_0>
[ 8.537601] [drm] add ip block number 6 <sdma_v3_0>
[ 8.537602] [drm] add ip block number 7 <uvd_v6_0>
[ 8.537602] [drm] add ip block number 8 <vce_v3_0>
[ 8.537608] kfd kfd: skipped device 1002:67df, PCI rejects atomics
[ 8.537630] [drm] UVD is enabled in VM mode
[ 8.537630] [drm] UVD ENC is enabled in VM mode
[ 8.537636] [drm] VCE enabled in VM mode
[ 8.614467] ATOM BIOS: 401815-171128-QS1
[ 8.614512] [drm] GPU posting now...
[ 13.621276] [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 5secs aborting
[ 13.621310] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing E650 (len 187, WS 0, PS 4) @ 0xE6FA
[ 13.621341] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C53A (len 193, WS 4, PS 4) @ 0xC569
[ 13.621359] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC47C
[ 13.621361] amdgpu 0000:09:00.0: gpu post error!
[ 13.621363] amdgpu 0000:09:00.0: Fatal error during GPU init
[ 13.621370] [drm] amdgpu: finishing device.
[ 13.621792] amdgpu: probe of 0000:09:00.0 failed with error -22
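A trimmed log like the one above can be approximated with a small filter over the full dmesg text. The following is a minimal sketch, not the exact procedure used for this report: the keyword list and the function names `filter_gpu_lines` and `stuck_tables` are assumptions chosen for illustration.

```python
import re

# Substrings that mark GPU-driver-related kernel messages
# (an assumption; adjust for other drivers).
KEYWORDS = ("amdgpu", "[drm", "atombios", "ATOM BIOS", "kfd")

# Matches lines such as:
#   ERROR atombios stuck executing E650 (len 187, WS 0, PS 4) @ 0xE6FA
STUCK_RE = re.compile(
    r"atombios stuck executing ([0-9A-Fa-f]+) "
    r"\(len (\d+), WS (\d+), PS (\d+)\) @ 0x([0-9A-Fa-f]+)"
)


def filter_gpu_lines(dmesg_text):
    """Keep only the lines that mention one of the keywords."""
    return [
        line for line in dmesg_text.splitlines()
        if any(key in line for key in KEYWORDS)
    ]


def stuck_tables(dmesg_text):
    """Return (table, length, offset) for each 'atombios stuck' message."""
    return [
        (m.group(1), int(m.group(2)), m.group(5))
        for m in STUCK_RE.finditer(dmesg_text)
    ]
```

Feeding the saved output of `dmesg` to `stuck_tables` lists the AtomBIOS command tables that were executing when the 5 second timeout hit, which in the log above are E650, C53A, and C410.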
Here are my system details:
System: Dell XPS 15 2 in 1 (Kaby Lake G)
Kernel: 4.19
Mesa: 18.2.2
Xorg: 1.20.1
Built in GPUs: Intel iGPU, Vega M
eGPU: RX 580
I'm not sure if I'm having problems because my laptop also contains a Vega M, which also uses the amdgpu driver. Perhaps there's a problem if there are multiple GPUs using amdgpu? One thing to point out is that the Vega M has worked flawlessly since Kernel 4.18.x.
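One way to see how the two AMD GPUs differ after boot is to check which kernel driver, if any, is bound to each PCI device through the `driver` symlink in sysfs. This is a minimal sketch under assumptions: the function name `bound_driver` is hypothetical, and the full addresses 0000:01:00.0 and 0000:09:00.0 are taken from the lspci and dmesg output above.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    sysfs exposes a `driver` symlink in the device directory only while
    a driver is bound; the link target's basename is the driver name.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))
```

On this system, `bound_driver("0000:01:00.0")` would be expected to return `"amdgpu"` for the Vega M, while `bound_driver("0000:09:00.0")` would presumably return `None` for the RX 580, since its probe failed.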
I did run across several other users posting about this same problem when attempting to run AMD GPUs as eGPUs. Here's a post where a user is reporting the same issue:
https://egpu.io/forums/thunderbolt-linux-setup/egpus-under-linux-an-advanced-guide/#post-33304
And here's another post:
https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210
I'm comfortable applying and testing kernel patches, so please feel free to ask me to test any fixes. I'm currently running 4.19, but could also patch a 4.18.x kernel.
Thanks!