Diagnosing issues with Radeon VII

L.S.S. said:

A little update:

I was able to trigger a system lockup/hard reset even at a low clock speed (around 1000MHz) last night. It happened when I reapplied the clock settings after the card had entered safe clock again, most likely because I was also doing something else on the system (which caused the card to enter safe clock in the first place).

I still couldn't find a way to increase the log verbosity, but I've tried the following:

1. Disabling CUs to match a Vega 56 (amdgpu.disable_cu=0.0.14,0.0.15,1.0.14,1.0.15,2.0.14,2.0.15,3.0.14,3.0.15, found in the Phoronix post[1]): This had no effect, since I don't yet know which CUs I should actually disable, or whether the problem can be worked around this way at all.

2. Disabling DC (amdgpu.dc=0): The system won't boot at all.

3. Setting amdgpu.vm_update_mode=3: The system freezes a short while after startup, but it doesn't hard reset. I can just press the reset button and the card still works after reboot, without having to do a power cycle.
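For anyone repeating these tests, here is a minimal sketch of how the parameter strings above could be generated and checked. It assumes each amdgpu.disable_cu entry is a shader engine, shader array, CU triple (se.sh.cu), which is how the amdgpu module parameter documentation describes it. The function names and the sample command line are hypothetical and only for illustration.

```python
from typing import Dict, Iterable, Tuple


def build_disable_cu(cus: Iterable[Tuple[int, int, int]]) -> str:
    """Format (shader engine, shader array, CU) triples as an amdgpu.disable_cu option."""
    return "amdgpu.disable_cu=" + ",".join(f"{se}.{sh}.{cu}" for se, sh, cu in cus)


def amdgpu_params(cmdline: str) -> Dict[str, str]:
    """Extract amdgpu.* options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key[len("amdgpu."):]] = value
    return params


# The list tried in the post: CUs 14 and 15 of shader array 0 on each of 4 shader engines.
tried = [(se, 0, cu) for se in range(4) for cu in (14, 15)]
print(build_disable_cu(tried))

# Hypothetical command line; on a real system, read the text of /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 amdgpu.dc=0 amdgpu.vm_update_mode=3"
print(amdgpu_params(sample))
```

Checking /proc/cmdline after a reboot is a quick way to confirm that the bootloader actually passed the option being tested.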

The system is currently on the latest stable kernel (5.3), but the problem has existed for quite a while (on all kernel versions I have used, so it must be due to the hardware itself).

NOTE: Separately, the card seems to cause intermittent stutters on the system, possibly related to some corrected AER errors showing in the system log, but that problem went away after I set the PCIe slot the card is in to Gen2 speed.

[1]. https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1049483-amd-devs-error-ring-gfx-timeout?p=1050232#post1050232
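To verify that a Gen2 setting like the one in the note above really took effect, the negotiated link speed can be read from sysfs. The sketch below assumes the kernel exposes the current_link_speed attribute under /sys/bus/pci/devices/ and that its text begins with a rate in GT/s. The helper names are hypothetical, and the PCI address has to be looked up on the actual system (for example with lspci).

```python
import re
from pathlib import Path
from typing import Optional

# Per-lane transfer rate in GT/s for each PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}


def pcie_generation(link_speed: str) -> Optional[int]:
    """Map a link speed string such as '5 GT/s' to a PCIe generation number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", link_speed)
    if not match:
        return None
    return GEN_BY_RATE.get(float(match.group(1)))


def current_generation(pci_address: str) -> Optional[int]:
    """Read the negotiated link speed of one PCI device, or None if unavailable."""
    path = Path("/sys/bus/pci/devices") / pci_address / "current_link_speed"
    try:
        return pcie_generation(path.read_text())
    except OSError:
        return None


print(pcie_generation("5 GT/s"))         # Gen2 rate
print(pcie_generation("8.0 GT/s PCIe"))  # Gen3 rate, newer kernel string style
```

The card's power management may lower the link speed while idle, so the value is probably most meaningful when read under load.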

I'm the one who filed this issue on Bugzilla and I'm considering closing it, as the issue is no longer valid. The card is currently being used in a spare Windows-based system for mining, and it doesn't exhibit the behavior on Windows, even when mining.

Additionally, I can confirm that the card gradually underclocks itself to 700/350 when it reaches critical temperature (and reverts to the normal clocks when the temperature goes down). This happened a few times on the Windows system because the liquid cooler's reservoir was not full, so the air inside caused the water pump to suddenly stop working. Refilling the cooler fixed the issue.

This issue hasn't had any activity since 2020-10-06. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.

closed

Submitted by L.S.S.
