Workaround (on certain hardware) to GPU Reset with amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout

added Phoenix label

Thanks for the advice. Currently trying this for #3515; I'll report back if this seems to resolve it.

EDIT: yep, this downgrade seems to do the trick; everything has been smooth sailing since I properly installed the older firmware. Thanks OP!

mentioned in issue #3515

I've tried this on my Thinkpad T14 ... and so far it's looking good. I've managed to suspend/resume the laptop a bunch of times and actually use a screenlock for the first time in weeks. No crashes yet. Will need to tweak some more, because wifi did crash after a few minutes with this old firmware :)

I guess that copying only /lib/firmware/amd* could do the trick

edit: btw I spent the day on 2e7754508 without the two reverts and still no crashes, continuing my journey through more recent commits

Yeah, I copied the amdgpu directory to another location, checked out 'main', did a firmware install of that, and then rsynced the saved amdgpu directory to /lib/firmware/amdgpu. So far so good: working wifi + gpu.

This is more than likely a regression starting with the "5.7 branch" firmware updates committed by @agd5f on September 28th 2023. I am basing this hypothesis on my previous issue-free experience with the 20230919.git3672ccab-0ubuntu2.10 Ubuntu mantic package (September 18th 2023), which does not contain these commits. That said, the regression could have been introduced by anything up to 20240318.git3b128b60-0ubuntu2.1 (March 12th 2024). Most firmware was updated twice after the initial 5.7 branch updates.

The total commits to bisect are as follows:

3 days now on b205802296 without any crash, and that commit is more recent (2024-04-12). I don't know whether the older firmware fixed some kind of non-volatile register state, or whether this commit simply predates the regression.

Just for info I'm on Linux 6.10

Either the issue I am experiencing is different or it's been fixed in later versions (doubt it), because the version I use is older and crashes.

After this commit my laptop failed to hibernate; I have to manually install the old firmware.

#3047 (comment 2444004)

I can consistently reproduce a ring gfx timeout on my 6800HS (Radeon 680M) by starting a new game in the Xenoblade Chronicles 3 DLC using the Switch emulator Ryujinx. I would like to find an easier way to reproduce it (this is quite niche), but unlike the random occurrences, this one always crashes my system, and always at the same moment.

Using the old firmware as in OP still gives me the same crash.

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.1.0 timeout, signaled seq=43136, emitted seq=43139
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1030 thread kwin_wayla:cs0 pid 1075
amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
amdgpu 0000:04:00.0: amdgpu: MODE2 reset
amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
[drm] VRAM is lost due to GPU reset!
amdgpu 0000:04:00.0: amdgpu: PSP is resuming...
amdgpu 0000:04:00.0: amdgpu: reserve 0xa00000 from 0xf41e000000 for PSP TMR
amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[drm] DMUB hardware initialized: version=0x04000044
[drm] kiq ring mec 2 pipe 1 q 0
[drm] JPEG decode initialized successfully.
amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
amdgpu 0000:04:00.0: amdgpu: GPU reset(4) succeeded!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
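For anyone comparing logs across reports, here is a minimal sketch that pulls the ring timeout events out of dmesg output like the above. The regular expressions are assumptions based only on the lines quoted in this thread, and the function name `summarize` is made up for illustration; other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the log lines quoted in this thread (an assumption,
# not a stable kernel interface).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
    r"|(?P<soft>but soft recovered))"
)
PROCESS_RE = re.compile(r"Process information: process (?P<proc>\S+) pid (?P<pid>\d+)")

SAMPLE = """\
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.1.0 timeout, signaled seq=43136, emitted seq=43139
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1030 thread kwin_wayla:cs0 pid 1075
amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
"""


def summarize(log_text):
    """Return one dict per ring timeout found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            pending = None
            if m.group("signaled") is not None:
                # Number of submitted jobs that had not completed yet.
                pending = int(m.group("emitted")) - int(m.group("signaled"))
            events.append({
                "ring": m.group("ring"),
                "soft_recovered": m.group("soft") is not None,
                "pending": pending,
                "process": None,
            })
            continue
        m = PROCESS_RE.search(line)
        if m and events:
            # The process line follows the timeout it belongs to.
            events[-1]["process"] = m.group("proc")
    return events


EVENTS = summarize(SAMPLE)
```

Feeding it the output of `dmesg` (or `journalctl -k`) gives a quick list of which ring timed out, whether it was soft recovered, and which process was blamed.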

Sadly, I am not knowledgeable about any of this, but I would like to help if possible. How can I be sure I am using the older firmware in the first place?

If you consistently get a hang with a particular application, it's most likely a mesa issue related to that particular app. I'd suggest opening a mesa ticket.

Thank you for your answer and sorry for the confusion. I couldn't tell the difference between the hang I experience in this particular case and the random one that happens occasionally so I thought they were the same issue.

This workaround works for me too with 7840HS (780M, no dGPU). I'm almost certain that this is caused by Critical Thermal Fault changes in kernel 5.8.

Before applying the old firmware, I've also tried manually downclocking to around 2300MHz and it was working too:

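# Run as root. Switch to manual performance-level control first.
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
# "s 1 <MHz>" sets the maximum sclk; "c" commits the change.
echo "s 1 2300" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage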

Not sure if amdgpu.ppfeaturemask=0xffffffff in the boot parameters is required.
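The same downclock steps can be sketched as a small script, which makes it easier to try different clock caps. This is only an illustration under the assumption that the GPU is `card0` and that the sysfs files behave as in the commands above; `downclock_plan` and `apply_plan` are made-up helper names, and applying the plan needs root.

```python
from pathlib import Path

# Assumption: the iGPU is card0, as in the commands above.
DEVICE = Path("/sys/class/drm/card0/device")


def downclock_plan(max_sclk_mhz, device=DEVICE):
    """Return the (file, value) writes that cap the maximum sclk."""
    return [
        (device / "power_dpm_force_performance_level", "manual"),
        (device / "pp_od_clk_voltage", f"s 1 {max_sclk_mhz}"),
        (device / "pp_od_clk_voltage", "c"),  # commit the new limit
    ]


def apply_plan(plan):
    """Write each value in order. Must run as root on a real system."""
    for path, value in plan:
        path.write_text(value + "\n")


PLAN = downclock_plan(2300)
```

Calling `apply_plan(downclock_plan(2300))` as root would perform the same three writes; building the plan first lets you print and check it before touching sysfs.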

I've also monitored the behavior in Windows and in Debian with old/new firmware.

So when I run some benchmark on Windows or with the old firmware the following happens:

GPU temps quickly rise to about 95 C and above
GPU freq goes down to around 2500 MHz
My laptop (LENOVO IdeaPad Pro 5 14APH8) doesn't seem to care too much about iGPU thermals because the fan is definitely not at 100% even though the GPU is clearly overheating and throttling.
Everything seems to be working fine except the high temps

When I do the same with the new firmware:

GPU temps start reaching 93-94 C
GUI crash happens around this point, usually difficult to recover and I have to restart
There is no mention of thermals in the logs, only ring gfx timeout
The frequency seems to stay at 2700 before that, but I'm not really sure because everything crashes at this point

So basically, instead of thermal throttling and slightly reducing the GPU frequency like the old firmware does, the new one triggers this ring gfx timeout.
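To capture the temperature and clock right up to the crash, a small logger along these lines may help. It is a sketch that assumes the amdgpu hwmon directory for `card0` exposes `temp1_input` (millidegrees Celsius) and `freq1_input` (sclk in Hz); check which files exist on your system first. The helper names are made up for illustration.

```python
from pathlib import Path
import time

# Assumption: the iGPU is card0 and exposes a hwmon directory with
# temp1_input and freq1_input.
DEVICE = Path("/sys/class/drm/card0/device")


def parse_sample(temp_raw, freq_raw):
    """Convert raw hwmon strings to (degrees C, MHz)."""
    return int(temp_raw) / 1000.0, int(freq_raw) / 1_000_000.0


def read_sample(device=DEVICE):
    """Read one temperature/clock sample from the first hwmon directory."""
    hwmon = next((device / "hwmon").glob("hwmon*"))
    return parse_sample(
        (hwmon / "temp1_input").read_text(),
        (hwmon / "freq1_input").read_text(),
    )


def watch(samples, interval_s=1.0, device=DEVICE):
    """Print samples, flushing each line so the last one survives a hang."""
    for _ in range(samples):
        temp_c, sclk_mhz = read_sample(device)
        print(f"{temp_c:5.1f} C  {sclk_mhz:6.0f} MHz", flush=True)
        time.sleep(interval_s)
```

Running `watch(600)` with the output redirected to a file on disk, while the benchmark runs, should show whether the clock drops before the timeout or stays at its maximum.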

Lots of people are reporting similar problems with dGPUs. With a dGPU you can at least control the fan curve, but that is impossible for iGPUs. Also, most overclocking (or in this case underclocking) options are not available for iGPUs.

I'm using Unigine Valley or random WebGL stress tests from a Google search for stressing the GPU. The WebGL seems to generate more heat and it crashes faster. For example - https://mprep.info/gpu/

Additionally, at one point (a few months ago, unrelated to the testing above) something happened with the fan management and the fan got stuck at 100% all the time. During that period everything worked properly without having to resort to workarounds. Sadly I had to restart at some point, and the constant fan noise was also quite annoying. So I think it's a thermal throttling issue, exacerbated by Lenovo's thermal management.

mentioned in issue #2068

Does this patch help? agd5f/linux@c50fe289

I tested kernel 6.11.0-0.rc6.20240904gt88fac175.350.vanilla.fc40, which I believe should include this patch. However, it hard-crashed right after login, similar to version 6.10.6. Using version 6.9.12 helps, as the crashes are at least recoverable.

Edit: Is there any kernel version deemed safe to rule out a hardware defect?

Version 6.9.12 makes my situation better. Version 6.10.6 crashed every 5 minutes.

@agd5f which firmware version should that patch (kernel 6.11.0-0.rc6) be used with?

Edit: Fedora updated linux-firmware to linux-firmware-20240909-1.fc40. In combination with kernel 6.11.0-0.rc6.20240904gt88fac175.350.vanilla.fc40 it seems promising to be stable!

@matietjen can you try 6.9.4? That one appears stable on my system for now.

I cannot access 6.9.4 any more. 6.9.12 is the oldest version available to me.

6.9.12 runs stable most of the time: 3 crashes in the last 3 weeks.

Actually, it seems that the latest linux-firmware version from git works without patches. I was experimenting with clock speeds and had only tried the commit from the issue, not the latest one.

I'm on kernel 6.9.10+bpo-amd64 (Debian 12). It crashes with the default firmware, but seems stable with the latest.

I'm attaching /sys/kernel/debug/dri/0/amdgpu_firmware_info for both

amdgpu_firmware_info_crash

amdgpu_firmware_info_working
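For comparing two such dumps, a minimal sketch like the following lists which firmware components differ. It assumes each line of `amdgpu_firmware_info` looks like `NAME feature version: N, firmware version: 0x...`; lines that do not match (such as a VBIOS line) are skipped. The sample values below are made up for illustration and are not taken from the attached files.

```python
import re

# Assumed line format of /sys/kernel/debug/dri/0/amdgpu_firmware_info.
FW_RE = re.compile(
    r"^(?P<name>.+?) feature version: \d+,.*firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map component name to its firmware version string."""
    versions = {}
    for line in text.splitlines():
        m = FW_RE.match(line.strip())
        if m:
            versions[m.group("name")] = m.group("fw").lower()
    return versions


def changed_components(old_text, new_text):
    """Return {name: (old_version, new_version)} for components that differ."""
    old, new = parse_firmware_info(old_text), parse_firmware_info(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical sample values, for illustration only.
OLD = (
    "ME feature version: 35, firmware version: 0x00000010\n"
    "PFP feature version: 35, firmware version: 0x00000020\n"
)
NEW = (
    "ME feature version: 35, firmware version: 0x00000011\n"
    "PFP feature version: 35, firmware version: 0x00000020\n"
)
DIFF = changed_components(OLD, NEW)
```

Reading the two attached files with `open(...).read()` and passing them to `changed_components` would narrow down which firmware blobs actually changed between the crashing and the working setup. The same dump is also a way to check which firmware is really loaded after a downgrade.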

I'm speculating here, but I don't think #3131 is relevant. The frequencies in pp_od_clk_voltage seem fine for the 780M:

OD_SCLK:
0:        800Mhz
1:       2700Mhz
OD_RANGE:
SCLK:     800Mhz       2700Mhz
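The check described above (the OD_SCLK limits sitting inside OD_RANGE) can be sketched as a small parser. It only assumes the layout shown in the output above; the function names are made up for illustration.

```python
import re

SAMPLE = """\
OD_SCLK:
0:        800Mhz
1:       2700Mhz
OD_RANGE:
SCLK:     800Mhz       2700Mhz
"""


def parse_od_table(text):
    """Parse pp_od_clk_voltage output into {section: {key: [MHz, ...]}}."""
    table = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, rest = line.partition(":")
        values = [int(v) for v in re.findall(r"(\d+)\s*[Mm][Hh]z", rest)]
        if not values:
            # A line with no clock values starts a new section, e.g. "OD_SCLK:".
            section = key
            table[section] = {}
        elif section is not None:
            table[section][key] = values
    return table


def sclk_within_range(table):
    """True if every OD_SCLK entry lies inside the OD_RANGE SCLK bounds."""
    low, high = table["OD_RANGE"]["SCLK"]
    return all(low <= v[0] <= high for v in table["OD_SCLK"].values())


TABLE = parse_od_table(SAMPLE)
```

For the values above, `sclk_within_range(TABLE)` confirms that both the 800 MHz and 2700 MHz limits fall inside the advertised range.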

dGPUs like the ones mentioned in #3131 should never thermal throttle, so even if they crash instead of throttling, it's more important to fix the frequency and avoid reaching that point. For this laptop, however, it seems to me that throttling is expected (or Lenovo didn't properly tune the fan curve), and the problem is that with the default firmware it crashes instead of slowing down a bit like it does now. In Windows it runs properly at 2700MHz for about half a minute before it reaches a high temperature and throttles, but it never crashes.

It reaches 102 C, I'm not sure how healthy this is, but seems to behave the same way in Windows

I can consistently produce this crash on my system, with either the RX 6800 or the RX 580, by running Java Minecraft with a 512x texture pack. It crashes less than 2 minutes into the game. This firmware did not fix my issue, but manually setting the GPU power profile to the highest does. I can now play the game consistently, and it no longer crashes with Stable Diffusion either, which was also a cause of frequent but irregular crashes. The log messages about the timeout are the same.
