New DG2/A380 is frequently wedged with GuC fw load failed (-ETIMEDOUT), excessive init time or fw not ready, unexpected reset
Problem
The system frequently boots with a non-functional, wedged GPU after a GuC initialization failed -ETIMEDOUT error, which occurs when the firmware load takes longer than 1000ms. Often it boots successfully but logs a GUC: excessive init time warning, with load times around or just under the 1000ms limit (averaging 700-900ms). Sometimes it boots with FW not ready and unexpected reset messages. Sometimes it boots clean with no issues in the kernel log. A reboot from any of these states may or may not fix the issue, and I cannot find a pattern that predicts which state the card will boot in.
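To make the four boot states above easier to count across many reboots, here is a minimal sketch of a classifier. It only looks for the message strings quoted in this report; the function names and state labels are my own, not anything from the driver. A wedged boot also logs the FW not ready messages, so the checks run from most to least severe.

```python
from collections import Counter


def classify_boot(dmesg_text: str) -> str:
    """Classify one boot's i915 kernel log into one of the four observed states.

    The substrings are copied from the logs in this report; the labels
    ('wedged', 'excessive_init', 'fw_not_ready', 'clean') are hypothetical.
    """
    if "declaring it wedged" in dmesg_text or "GuC initialization failed" in dmesg_text:
        return "wedged"
    if "excessive init time" in dmesg_text:
        return "excessive_init"
    if "FW not ready" in dmesg_text or "unexpected reset" in dmesg_text:
        return "fw_not_ready"
    return "clean"


def tally(boot_logs) -> Counter:
    """Count how many boots fell into each state."""
    return Counter(classify_boot(log) for log in boot_logs)
```

Each log could be captured with something like `journalctl -k -b -1 | grep i915` (this needs a persistent journal), saved to a file per boot, and passed to `tally` as a list of strings.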
According to this message by @johnharr on the kernel mailing list (https://patchwork.kernel.org/project/intel-gfx/patch/20240102222202.310495-1-John.C.Harrison@Intel.com/):
"Note that the only reason an end user should hit the timeout is in case of extreme thermal throttling."
However, I find this is not the case here. I see the message frequently, and it happens at boot, when the system is cool and not throttling at all. The card sits in an open space with a cool ambient temperature and is not physically hot.
I have tried two different kernels on Ubuntu Server: 6.5 (22.04.4) and 6.8 (24.04).
In which kernel version was the 1000ms timeout introduced, and why was 1000ms chosen? I often see an init time of 700-1100ms, so it seems random whether I get a successful boot with a usable card or not.
Since this is not expected to be an issue that an end user should see, what is the likely cause of this problem, and what is the recommended solution to achieve a stable system?
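For anyone who wants to compare load times against the limit, the sketch below pulls the millisecond values out of the two GuC message formats shown in my logs. The regular expressions are written against those exact lines, and `GUC_LOAD_LIMIT_MS` is simply the 1000ms figure discussed in this report, not a value read from the kernel source.

```python
import re

# Limit as discussed in this report; an assumption, not taken from kernel source.
GUC_LOAD_LIMIT_MS = 1000

# Patterns written against the two message formats seen in this report.
_TIME_PATTERNS = (
    re.compile(r"GUC: load failed: .*?\btime = (\d+)ms"),
    re.compile(r"GUC: excessive init time: (\d+)ms"),
)


def guc_init_times_ms(dmesg_text: str) -> list:
    """Return every GuC load/init time (in ms) reported in the given log text."""
    times = []
    for line in dmesg_text.splitlines():
        for pattern in _TIME_PATTERNS:
            match = pattern.search(line)
            if match:
                times.append(int(match.group(1)))
                break
    return times


def headroom_ms(time_ms: int) -> int:
    """Margin left before the limit; zero or negative means the load timed out."""
    return GUC_LOAD_LIMIT_MS - time_ms
```

Run over the logs in this report, this picks up the 1001ms failure and the 892ms warning and skips the follow-up lines that carry no time value.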
System Information
- Steps to Reproduce: Restart system, observe kernel boot messages.
- Frequency: The issue happens often but intermittently; I cannot see any pattern to predict when it will happen.
- Display connector: DP
- Desktop environment: N/A
- Card: https://www.asrock.com/Graphics-Card/Intel/Intel%20Arc%20A380%20Challenger%20ITX%206GB%20OC/
System 1: Tested on Ubuntu Server 22.04.4 with Hardware Enablement Stack and firmware-linux driver, Kernel 6.5
- uname -srvmo: Linux 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- lspci -vnn -d :*:0300:
03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:56a5] (rev 05) (prog-if 00 [VGA controller])
Subsystem: ASRock Incorporation Device [1849:6004]
Flags: bus master, fast devsel, latency 0, IRQ 159
Memory at 82000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at 83000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Capabilities: [d0] Power Management version 3
Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
Capabilities: [420] Physical Resizable BAR
Capabilities: [400] Latency Tolerance Reporting
Kernel driver in use: i915
Kernel modules: i915
System 2: Tested on Ubuntu Server 24.04 with Kernel 6.8 and default drivers.
- uname -srvmo: Linux 6.8.0-31-generic #31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024 x86_64 GNU/Linux
- lspci -vnn -d :*:0300:
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05) (prog-if 00 [VGA controller])
Subsystem: ASRock Incorporation DG2 [Arc A380] [1849:6004]
Flags: bus master, fast devsel, latency 0, IRQ 168, IOMMU group 14
Memory at 82000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at 83000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Capabilities: [d0] Power Management version 3
Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
Capabilities: [420] Physical Resizable BAR
Capabilities: [400] Latency Tolerance Reporting
Kernel driver in use: i915
Kernel modules: i915, xe
An example of a wedged boot (dmesg | grep i915):
[ 1.919628] i915 0000:03:00.0: vgaarb: deactivate vga console
[ 1.919672] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[ 1.919675] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[ 1.933363] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1.936221] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 1.982883] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.5.1
[ 1.982887] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 3.018230] i915 0000:03:00.0: [drm] GT0: GUC: load failed: status = 0x80000534, time = 1001ms, freq = 2400MHz, ret = -110
[ 3.018261] i915 0000:03:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x1A, UKernel = 0x05, MIA = 0x00, Auth = 0x02
[ 3.018278] i915 0000:03:00.0: [drm] GT0: GUC: still extracting hwconfig table.
[ 3.018755] i915 0000:03:00.0: [drm] *ERROR* GT0: GuC initialization failed -ETIMEDOUT
[ 3.018765] i915 0000:03:00.0: [drm] *ERROR* GT0: Enabling uc failed (-5)
[ 3.018773] i915 0000:03:00.0: [drm] *ERROR* GT0: Failed to initialize GPU, declaring it wedged!
[ 3.025801] i915 0000:03:00.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_set_wedged_on_init+0x34/0x50 [i915]
[ 3.084559] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[ 3.126572] fbcon: i915drmfb (fb0) is primary device
[ 3.228997] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[ 5.080356] mei_gsc i915.mei-gscfi.768: cl:host=01 me=32 fw disconnect request received
[ 5.080383] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: cannot connect
[ 5.083341] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[ 5.083404] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 5.083475] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[ 5.083499] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 5.167953] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 5.469923] i915 0000:03:00.0: [drm] *ERROR* failed to load huc via gsc -8
[ 5.469940] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: failed to bind 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915]): -8
[ 5.470322] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: adev bind failed: -8
[ 5.470776] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Master comp add failed -8
[ 5.470780] mei_pxp: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 failed with error -8
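As a cross-check on the two load failed lines in the log above, the sketch below splits the raw status word into the same fields the driver prints. The bit positions are an assumption on my part: I chose them because they reproduce the decoded line (Reset = 0, BootROM = 0x1A, UKernel = 0x05, MIA = 0x00, Auth = 0x02) from status = 0x80000534. I make no claim about what the individual field values mean.

```python
def decode_guc_status(status: int) -> dict:
    """Split a GuC status word into the fields shown in the 'load failed' log line.

    Bit layout is assumed, chosen to match the decoded line in this report:
      bit 0       Reset
      bits 1-7    BootROM
      bits 8-15   UKernel
      bits 16-18  MIA
      bits 30-31  Auth
    """
    return {
        "Reset": status & 0x1,
        "BootROM": (status >> 1) & 0x7F,
        "UKernel": (status >> 8) & 0xFF,
        "MIA": (status >> 16) & 0x07,
        "Auth": (status >> 30) & 0x03,
    }


def format_guc_status(status: int) -> str:
    """Render the decoded fields in the same style as the kernel log line."""
    f = decode_guc_status(status)
    return (
        f"Reset = {f['Reset']}, BootROM = 0x{f['BootROM']:02X}, "
        f"UKernel = 0x{f['UKernel']:02X}, MIA = 0x{f['MIA']:02X}, "
        f"Auth = 0x{f['Auth']:02X}"
    )
```

The same function can be applied to the status value from an excessive init time line (for example 0x8002F034) to compare where a slow but successful load ended up against a failed one.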
An example of a boot with excessive init time (dmesg | grep i915):
[ 13.312665] i915 0000:03:00.0: [drm] VT-d active for gfx access
[ 13.330537] i915 0000:03:00.0: vgaarb: deactivate vga console
[ 13.330582] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[ 13.330583] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[ 13.344404] i915 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 13.348207] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 13.394758] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[ 13.394761] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 14.321366] i915 0000:03:00.0: [drm] GT0: GUC: excessive init time: 892ms! [status = 0x8002F034, count = 0, ret = 0]
[ 14.321389] i915 0000:03:00.0: [drm] GT0: GUC: excessive init time: [freq = 2400MHz, before = 2400MHz, perf_limit_reasons = 0x01000100]
[ 14.341899] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[ 14.341905] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[ 14.344718] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[ 14.442339] [drm] Initialized i915 1.6.0 20230929 for 0000:03:00.0 on minor 1
[ 14.450335] i915 display info: display version: 13
[ 14.450336] i915 display info: cursor_needs_physical: no
[ 14.450336] i915 display info: has_cdclk_crawl: no
[ 14.450337] i915 display info: has_cdclk_squash: yes
[ 14.450337] i915 display info: has_ddi: yes
[ 14.450338] i915 display info: has_dp_mst: yes
[ 14.450338] i915 display info: has_dsb: yes
[ 14.450339] i915 display info: has_fpga_dbg: yes
[ 14.450339] i915 display info: has_gmch: no
[ 14.450339] i915 display info: has_hotplug: yes
[ 14.450340] i915 display info: has_hti: no
[ 14.450340] i915 display info: has_ipc: yes
[ 14.450341] i915 display info: has_overlay: no
[ 14.450341] i915 display info: has_psr: yes
[ 14.450341] i915 display info: has_psr_hw_tracking: no
[ 14.450342] i915 display info: overlay_needs_physical: no
[ 14.450342] i915 display info: supports_tv: no
[ 14.450343] i915 display info: has_hdcp: yes
[ 14.450343] i915 display info: has_dmc: yes
[ 14.450344] i915 display info: has_dsc: yes
[ 14.483517] fbcon: i915drmfb (fb0) is primary device
[ 14.483475] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 14.583993] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[ 15.188442] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[ 15.188461] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
An example of a boot with FW not ready and unexpected reset messages (dmesg | grep i915):
[ 1.480970] i915 0000:03:00.0: vgaarb: deactivate vga console
[ 1.480987] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[ 1.480989] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[ 1.494549] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1.497397] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 1.518545] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.5.1
[ 1.518550] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 1.529631] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[ 1.529635] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[ 1.529862] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[ 1.542912] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[ 1.575201] fbcon: i915drmfb (fb0) is primary device
[ 1.658867] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[ 3.536252] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: cannot connect
[ 3.538311] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[ 3.538333] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 3.539064] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[ 3.539086] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[ 3.589155] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 3.937916] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[ 3.937932] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
An example of a normal boot:
[ 22.699877] i915 0000:03:00.0: [drm] VT-d active for gfx access
[ 22.721692] i915 0000:03:00.0: vgaarb: deactivate vga console
[ 22.721727] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[ 22.721729] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[ 22.735412] i915 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 22.740200] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[ 22.761524] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[ 22.761528] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[ 22.769980] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[ 22.769981] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[ 22.770210] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[ 22.791410] [drm] Initialized i915 1.6.0 20230929 for 0000:03:00.0 on minor 1
[ 22.792086] i915 display info: display version: 13
[ 22.792088] i915 display info: cursor_needs_physical: no
[ 22.792088] i915 display info: has_cdclk_crawl: no
[ 22.792089] i915 display info: has_cdclk_squash: yes
[ 22.792090] i915 display info: has_ddi: yes
[ 22.792090] i915 display info: has_dp_mst: yes
[ 22.792090] i915 display info: has_dsb: yes
[ 22.792091] i915 display info: has_fpga_dbg: yes
[ 22.792091] i915 display info: has_gmch: no
[ 22.792092] i915 display info: has_hotplug: yes
[ 22.792092] i915 display info: has_hti: no
[ 22.792093] i915 display info: has_ipc: yes
[ 22.792093] i915 display info: has_overlay: no
[ 22.792093] i915 display info: has_psr: yes
[ 22.792094] i915 display info: has_psr_hw_tracking: no
[ 22.792094] i915 display info: overlay_needs_physical: no
[ 22.792095] i915 display info: supports_tv: no
[ 22.792095] i915 display info: has_hdcp: yes
[ 22.792096] i915 display info: has_dmc: yes
[ 22.792096] i915 display info: has_dsc: yes
[ 22.826958] fbcon: i915drmfb (fb0) is primary device
[ 22.826924] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[ 22.925002] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[ 23.506553] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[ 23.506569] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
Any advice would be greatly appreciated; this is a new system, and basic stability is required. Is this normal operating behavior for this card? Is this a software, configuration, driver, or hardware problem?
Thank you.