Workaround (on certain hardware) to GPU Reset with amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Running the 6.10 kernel for 4 days now without a single crash (previously it happened several times a day). The kwin compositor sometimes stutters, especially with transparent window animations, but that's still better than a whole system crash.
The base commit 2e7754508 is 1 year old, so I'll try to find some time to pinpoint exactly which commit introduced the issue.
I've tried this on my Thinkpad T14 ... and so far it's looking good.
I've managed to suspend/resume the laptop a bunch of times and actually use a screenlock for the first time in weeks. No crashes yet.
Will need to tweak some more, because wifi did crash after a few minutes with this old firmware :)
Yeah, I've copied the amdgpu directory to another location, then did a checkout of 'main', did a firmware install of that, and then rsynced the amdgpu directory to /lib/firmware/amdgpu
So far so good, working wifi + gpu
More than likely a regression starting in the "5.7 branch" firmware updates committed by @agd5f on September 28th 2023. I am basing this hypothesis off of my previous experience without issues on the 20230919.git3672ccab-0ubuntu2.10 ubuntu mantic package (September 18th 2023) which does not contain these commits. Although it could be anything up to 20240318.git3b128b60-0ubuntu2.1 (March 12th 2024). Most firmware was updated twice after the initial 5.7 branch updates.
3 days now on b205802296 without any crash, which is more recent (2024-04-12). I don't know whether some kind of non-volatile register could have been fixed by the older firmware, or whether this commit simply predates the regression.
I can consistently reproduce a ring gfx timeout on my 6800HS (Radeon 680M) by starting a new game on Xenoblade Chronicles 3's DLC using the Switch emulator Ryujinx. I would like to find an easier way to reproduce it (this is quite niche), but unlike the random occurrences, this one always crashes my system, and at the same moment every time.
Using the old firmware as in OP still gives me the same crash.
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.1.0 timeout, signaled seq=43136, emitted seq=43139
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1030 thread kwin_wayla:cs0 pid 1075
amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
amdgpu 0000:04:00.0: amdgpu: MODE2 reset
amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
[drm] VRAM is lost due to GPU reset!
amdgpu 0000:04:00.0: amdgpu: PSP is resuming...
amdgpu 0000:04:00.0: amdgpu: reserve 0xa00000 from 0xf41e000000 for PSP TMR
amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[drm] DMUB hardware initialized: version=0x04000044
[drm] kiq ring mec 2 pipe 1 q 0
[drm] JPEG decode initialized successfully.
amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
amdgpu 0000:04:00.0: amdgpu: GPU reset(4) succeeded!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
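For anyone comparing logs between firmware versions, here is a rough Python sketch that pulls the ring timeout events out of kernel log text like the above. The function name `find_ring_timeouts` and the regular expression are my own; they are written against the message wording quoted in this comment only, so other kernel versions may phrase these messages differently.

```python
import re

# Pattern written against the messages quoted in this thread; other kernel
# versions may word these messages differently (assumption).
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout"
    r"(?:, signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+))?"
    r"(?P<soft>, but soft recovered)?"
)


def find_ring_timeouts(log_text):
    """Return one dict per 'ring ... timeout' message found in log_text."""
    events = []
    for m in TIMEOUT_RE.finditer(log_text):
        events.append({
            "ring": m.group("ring"),
            # True when the kernel reported "but soft recovered" (no full reset).
            "soft_recovered": m.group("soft") is not None,
            # Sequence numbers are only present on the hard-timeout message.
            "signaled": int(m.group("signaled")) if m.group("signaled") else None,
            "emitted": int(m.group("emitted")) if m.group("emitted") else None,
        })
    return events
```

Feeding it the text of `dmesg` (or a saved journal) gives a list that makes it easy to count soft recoveries versus timeouts that led to a full reset.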
Sadly, I am not knowledgeable about any of this, but I would like to help if possible. How can I be sure I am using the older firmware in the first place?
If you consistently get a hang with a particular application, it's most likely a mesa issue related to that particular app. I'd suggest opening a mesa ticket.
Thank you for your answer and sorry for the confusion. I couldn't tell the difference between the hang I experience in this particular case and the random one that happens occasionally so I thought they were the same issue.
Not sure if amdgpu.ppfeaturemask=0xffffffff in boot parameters is required
I've also monitored the behavior in Windows and in Debian with old/new firmware.
So when I run some benchmark on Windows or with the old firmware the following happens:
GPU temps quickly rise to about 95 C and above
GPU freq goes down to around 2500 MHz
My laptop (LENOVO IdeaPad Pro 5 14APH8) doesn't seem to care too much about iGPU thermals because the fan is definitely not at 100% even though the GPU is clearly overheating and throttling.
Everything seems to be working fine except the high temps
When I do the same with the new firmware:
GPU temps start reaching 93-94 C
GUI crash happens around this point, usually difficult to recover and I have to restart
There is no mention of thermals in the logs, only ring gfx timeout
The frequency seems to stay at 2700 before that, but I'm not really sure because everything crashes at this point
So basically instead of thermal throttling and slightly reducing GPU frequency like the old firmware does, the new one triggers this ring gfx timeout
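To make the comparison above easier to repeat, here is a small Python sketch that reads GPU temperature and graphics clock from the amdgpu hwmon directory. It assumes the usual hwmon layout (`temp1_input` in millidegrees Celsius, `freq1_input` in Hz, and a `name` file containing `amdgpu`); the function names are mine, and sensor numbering may differ on other hardware.

```python
from pathlib import Path


def find_amdgpu_hwmon(root="/sys/class/hwmon"):
    """Return the first hwmon directory whose 'name' file says 'amdgpu', or None."""
    for d in sorted(Path(root).glob("hwmon*")):
        name_file = d / "name"
        if name_file.exists() and name_file.read_text().strip() == "amdgpu":
            return d
    return None


def read_gpu_sensors(hwmon_dir):
    """Return (temperature in deg C, graphics clock in MHz) from a hwmon directory.

    Assumes temp1_input is in millidegrees C and freq1_input is in Hz.
    """
    hwmon = Path(hwmon_dir)
    temp_c = int((hwmon / "temp1_input").read_text()) / 1000
    freq_mhz = int((hwmon / "freq1_input").read_text()) / 1_000_000
    return temp_c, freq_mhz
```

Calling `read_gpu_sensors(find_amdgpu_hwmon())` once a second while the stress test runs should show whether the clock drops before the crash or stays pinned until the timeout.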
Lots of people are reporting similar problems with dGPUs, however you can control the fan curve in that case, but it's impossible for iGPUs. Also, most overclocking (or in this case underclocking) options are not available for iGPUs.
I'm using Unigine Valley or random WebGL stress tests from a Google search for stressing the GPU. The WebGL seems to generate more heat and it crashes faster. For example - https://mprep.info/gpu/
Additionally: a few months ago (not related to the testing above) something happened with the fan management and the fan got stuck at 100% all the time. While that lasted, everything worked properly without having to resort to workarounds. Sadly I had to restart at some point, and it was also quite annoying. So I think it's a thermal throttling issue, exacerbated by Lenovo's thermal management.
I tested kernel 6.11.0-0.rc6.20240904gt88fac175.350.vanilla.fc40, which I believe should include this patch. However, it hard-crashed right after login, similar to version 6.10.6. Using version 6.9.12 helps, as the crashes are at least recoverable.
Edit: Is there any kernel version deemed safe to rule out a hardware defect?
@agd5f which firmware version should that patch (kernel 6.11.0-0.rc6) be used with?
Edit: Fedora updated linux-firmware to linux-firmware-20240909-1.fc40. In combination with kernel 6.11.0-0.rc6.20240904gt88fac175.350.vanilla.fc40 it seems promising to be stable!
Actually, it seems that the latest linux-firmware version from git works without patches. I was experimenting with clock speeds and only tried the commit from the issue, not the latest
I'm on kernel 6.9.10+bpo-amd64 (Debian 12) and it crashes with the default firmware, but seems stable with the latest
I'm attaching /sys/kernel/debug/dri/0/amdgpu_firmware_info for both
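In case it helps others compare their own dumps (and check which firmware is actually loaded), here is a rough Python sketch that parses two copies of that file and lists the entries that differ. It assumes each line looks like `<NAME> feature version: <n>, ... firmware version: 0x<hex>`, which is what I see on my system; the function names are mine, and lines in another layout (for example the VBIOS line) are simply skipped.

```python
import re

# Assumed line layout: "<NAME> feature version: <n>, ... firmware version: 0x<hex>"
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: \d+,.*firmware version: (?P<ver>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map firmware block name -> firmware version (as int) for each matching line."""
    versions = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            versions[m.group("name")] = int(m.group("ver"), 16)
    return versions


def diff_firmware(old, new):
    """Return {name: (old_version, new_version)} for entries that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}
```

Reading the file needs root and a mounted debugfs. Save one copy per firmware set, then run `diff_firmware(parse_firmware_info(a), parse_firmware_info(b))` on the two texts.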
For dGPUs like the ones mentioned in #3131 - they should never thermal throttle, so even if they crash instead of throttling, it's more important to fix the frequency and avoid reaching that point. However, for this laptop it seems to me that throttling is expected (or Lenovo didn't properly tune the fan curve) and the problem is that it crashes instead of slowing down a bit like it does now. In Windows it seems to be working at 2700MHz properly for half a minute before it reaches high temperature and throttles, but never crashes.
It reaches 102 C. I'm not sure how healthy this is, but it seems to behave the same way in Windows
I can consistently produce this crash on my system running either the RX 6800 or the RX 580 by running Java Minecraft with a 512x texture pack. It crashes less than 2 minutes into the game. This firmware did not fix my issue, but manually setting the GPU power profile to the highest does. I can consistently play the game, and it no longer crashes on Stable Diffusion either, which was also a cause of frequent but irregular crashes. The log messages are the same about the timeout
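The comment above doesn't say how the power profile was set. One common way is the amdgpu sysfs file `power_dpm_force_performance_level`, so here is a Python sketch under that assumption. The `card0` path, the function name, and the restricted set of levels are my choices for illustration; the card index can differ per system, and writing the file needs root.

```python
from pathlib import Path

# Subset of the levels the amdgpu driver accepts; kept small on purpose.
VALID_LEVELS = {"auto", "low", "high"}


def set_performance_level(level, device_dir="/sys/class/drm/card0/device"):
    """Write `level` to power_dpm_force_performance_level; return the previous value.

    device_dir defaults to card0, which is an assumption about the system.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    path = Path(device_dir) / "power_dpm_force_performance_level"
    previous = path.read_text().strip()
    path.write_text(level)
    return previous
```

Calling `set_performance_level("high")` pins the clocks at their highest state; `set_performance_level("auto")` goes back to the default. The setting does not survive a reboot.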