kernel 5.16.10 brings back issue-1709: mclk at max

Probably caused by commit 6e7545ddb13416fd200e0b91c0acfd0404e2e27b, which in essence reverts the workaround from #1709 (closed) by re-enabling MPC_SPLIT_AVOID_MULT_DISP.

In addition to the memory clock maxing out, I have noticed 5.16.10 also brings back the bug that prevents loading a graphical environment, which seems to stem from #1709 (closed). 5.16.9 works fine.

GPU: Radeon RX 5500 XT

Error logs (same as from #1709 (closed)):

amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).
amdgpu 0000:03:00.0: amdgpu: Failed to power gate JPEG!
[drm:jpeg_v2_0_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000002E SMN_C2PMSG_82:0x00000000
amdgpu 0000:03:00.0: amdgpu: Failed to power gate VCN!
[drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000002E SMN_C2PMSG_82:0x00000000
amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000002E SMN_C2PMSG_82:0x00000000
amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000002E SMN_C2PMSG_82:0x00000000
amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000002E SMN_C2PMSG_82:0x00000000

Note: recent firmware (linux-firmware-c53073d4e1485ac9f7cb065db466793c495aead7) does not help.

Reverting 6e7545ddb13416fd200e0b91c0acfd0404e2e27b fixes the stuck memory clock for me (once again) on 5.15.29 (and newer) with Navi 10 (dcn20). I don't have any flickering or other issues driving three 60 Hz DP MST monitors with MPC_SPLIT_DYNAMIC, so unconditionally fixing the clocks to max (causing the GPU to draw 30 watts and constantly spin up its fans) with MPC_SPLIT_AVOID_MULT_DISP does not seem like a proper solution to me.

Unfortunately, with that patch reverted, a number of people have reported hangs on driver load with multiple displays attached. We have not been able to reproduce the issue internally, so it has been slow to debug.

Hi @agd5f, has there been any progress on debugging this issue? The issue still persists and hangs my system on driver load after testing kernel 5.18.1.

Hi, is there a workaround for this, or is building a custom kernel the only solution for now?

Well, I haven't tried reverting the commit lately, but I have started powering off my 2nd monitor (no LED active) and only switching it on when I need it. The crux is that it has to be in the powered-off state when booting.

With this scheme I get ~15 W idle with one monitor vs. 36 W idle with two monitors. I have to manually (as in physically) hit the power button on my 2nd monitor.

I did the same yesterday, mostly to test which monitor causes the issue. I found that my high-refresh-rate monitor can run at up to 144 Hz at ~9 W idle. If I set it above 144 Hz, the GPU starts using ~30 W. Both monitors at 60 Hz will use 30 W+. So it's definitely caused by two monitors being active at the same time. I don't have to turn the monitor off with the physical button; I just turn it off in the Gnome settings. Reverting the commit didn't fix it, but I learned a lot about building a custom kernel, so I count that as a positive.

P.S. I tested the issue on both Linux and Windows, and it causes exactly the same power usage and high mclk.

Hmm, I actually just patched my Linux kernel (5.17.5) and it didn't fix it for me, so it must be something else for my GPU / Monitors (AMD 5700XT and two monitors running at 60hz).

GFX Clocks and Power:
        875 MHz (MCLK)
        6 MHz (SCLK)
        1300 MHz (PSTATE_SCLK)
        625 MHz (PSTATE_MCLK)
        725 mV (VDDGFX)
        30.0 W (average GPU)

GPU Temperature: 56 C
GPU Load: 0 %
MEM Load: 0 %

I bet it has something to do with the vertical blanking interval, but I don't think there is a way to change it on Gnome Wayland... oh well. I'm kind of tired of my GPU using so much power at idle, but I don't see any easy solution, and this issue has persisted for at least a year now.

I'm successfully using the following patch on 5.15.35 (will try on a newer version later) with a 5700 XT and three 60 Hz monitors (DisplayPort MST) on KDE Plasma:

--- b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
+++ a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
@@ -1069,7 +1069,7 @@
 		.timing_trace = false,
 		.clock_trace = true,
 		.disable_pplib_clock_request = true,
+		.pipe_split_policy = MPC_SPLIT_DYNAMIC,
-		.pipe_split_policy = MPC_SPLIT_AVOID_MULT_DISP,
 		.force_single_disp_pipe_split = false,
 		.disable_dcc = DCC_ENABLE,
 		.vsr_support = true,

This drops the memory clock (and power consumption) when idle:

$ cat /sys/class/drm/card0/device/pp_dpm_mclk 
0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 875Mhz

amdgpu-pci-2f00
Adapter: PCI adapter
vddgfx:      750.00 mV 
fan1:           0 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +45.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +45.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +46.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
slowPPT:      11.00 W  (cap = 220.00 W)

But based on your results it might be a different issue, as without this patch I've only ever seen the memory clock at max (875 MHz), while you do get values lower than that (but still higher than they should be). My three monitors are identical with presumably identical timings, so I guess that's why this fix works in my case.

Just as a follow-up question, do you still see a difference with/without this patch, i.e. without it always max MCLK, but with it sometimes lower than max MCLK? If the patch works you should still see lower MCLK states and power consumption over time even if the monitor timings wouldn't allow dropping MCLK all the way to state 0.
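To check this over time rather than from a single reading, one could sample pp_dpm_mclk periodically and count how often each level is active. Below is a minimal Python sketch, assuming the file format shown earlier in this thread (one level per line, the active one marked with `*`). The function names, the `card0` default path and the sampling interval are illustrative choices, not part of any amdgpu tooling.

```python
import re
import time
from collections import Counter

# Matches lines such as "0: 100Mhz *" (the trailing "*" marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_pp_dpm_mclk(text):
    """Return (levels, active): levels maps index -> MHz, active is an index or None."""
    levels = {}
    active = None
    for line in text.splitlines():
        m = LEVEL_RE.match(line)
        if not m:
            continue  # ignore anything that is not a level line
        idx = int(m.group(1))
        levels[idx] = int(m.group(2))
        if m.group(3):
            active = idx
    return levels, active


def tally_active_levels(path="/sys/class/drm/card0/device/pp_dpm_mclk",
                        samples=60, interval=1.0):
    """Sample the file `samples` times and count which level was active each time."""
    counts = Counter()
    for _ in range(samples):
        with open(path) as f:
            _, active = parse_pp_dpm_mclk(f.read())
        counts[active] += 1
        time.sleep(interval)
    return counts
```

Calling `tally_active_levels()` on an idle desktop and printing `counts.most_common()` would show whether any level other than the highest is ever selected. The card index may differ on systems with more than one GPU.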

Hi Dennis, no, the patch is not working for me at all. It's always stuck at level 3, with or without the patch.

edit: see the comment below, it actually works if I put both monitors at 1080p resolution.

I am also running kernel 5.17.5 successfully with that patch on my RX 5500 XT with dual displays at 1920x1080 60hz and 1680x1050 60hz.

The patch fixes the hanging on driver load and allows starting an Xorg graphical session. Memory clock goes down at idle too.

# head -n 11 /sys/kernel/debug/dri/0/amdgpu_pm_info

GFX Clocks and Power:
        100 MHz (MCLK)
        0 MHz (SCLK)
        1300 MHz (PSTATE_SCLK)
        625 MHz (PSTATE_MCLK)
        6 mV (VDDGFX)
        4.0 W (average GPU)

GPU Temperature: 34 C
GPU Load: 0 %
MEM Load: 3 %
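For comparing readings like the ones posted in this thread, the "GFX Clocks and Power" lines can be pulled into a dictionary. This is only a sketch, assuming the `<value> <unit> (<label>)` layout shown above; `parse_pm_info` is a hypothetical helper name, and reading the debugfs file itself requires root.

```python
import re

# Matches lines such as "        100 MHz (MCLK)" or "        4.0 W (average GPU)".
ENTRY_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\S+)\s+\((.+)\)\s*$")


def parse_pm_info(text):
    """Map each label, e.g. 'MCLK', to a (value, unit) tuple."""
    readings = {}
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            readings[m.group(3)] = (float(m.group(1)), m.group(2))
    return readings
```

Lines without a parenthesised label (such as "GPU Load: 0 %") are skipped, so the result contains only the clock, voltage and power entries.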

If it helps, here's the mode/timing information for my monitors (which are 16:10) from xrandr --verbose:

  1920x1200 (0x6b7) 154.000MHz +HSync -VSync *current +preferred
        h: width  1920 start 1968 end 2000 total 2080 skew    0 clock  74.04KHz
        v: height 1200 start 1203 end 1209 total 1235           clock  59.95Hz

I'm seeing success with MPC_SPLIT_DYNAMIC with three of these monitors over DisplayPort MST.

Mine is:

  2560x1440 (0x3f5) 311.750MHz -HSync +VSync *current +preferred
        h: width  2560 start 2752 end 3024 total 3488 skew    0 clock  89.38KHz
        v: height 1440 start 1443 end 1448 total 1493           clock  59.86Hz

  1920x1080 (0x41) 173.000MHz -HSync +VSync *current +preferred
        h: width  1920 start 2048 end 2248 total 2576 skew    0 clock  67.16KHz
        v: height 1080 start 1083 end 1088 total 1120           clock  59.96Hz

Maybe it has to do with the different resolutions or timings but it's a bit strange that it can drive one monitor 2560x1440@144hz with 8W but needs 30W for two 60hz monitors.
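Since the vertical blanking interval was suspected earlier in the thread, here is a small Python sketch that derives it from the xrandr numbers posted above. It is plain arithmetic on those timings (line time = horizontal total / pixel clock, blanking lines = vertical total minus visible height) and makes no claim about how the driver decides whether the memory clock may drop. The `Mode` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    pixel_clock_mhz: float  # from the "154.000MHz" style field
    htotal: int             # "total" on the h: line
    vactive: int            # "height" on the v: line
    vtotal: int             # "total" on the v: line

    @property
    def line_time_us(self):
        # MHz is pixels per microsecond, so htotal / MHz is microseconds per line.
        return self.htotal / self.pixel_clock_mhz

    @property
    def vblank_lines(self):
        return self.vtotal - self.vactive

    @property
    def vblank_us(self):
        return self.vblank_lines * self.line_time_us

    @property
    def refresh_hz(self):
        return self.pixel_clock_mhz * 1e6 / (self.htotal * self.vtotal)


# Values copied from the xrandr --verbose output quoted in this thread.
MODES = [
    Mode("1920x1200", 154.000, 2080, 1200, 1235),
    Mode("2560x1440", 311.750, 3488, 1440, 1493),
    Mode("1920x1080", 173.000, 2576, 1080, 1120),
]

for mode in MODES:
    print(f"{mode.name}: {mode.vblank_lines} blanking lines, "
          f"{mode.vblank_us:.1f} us vblank, {mode.refresh_hz:.2f} Hz")
```

With these inputs the blanking intervals come out at roughly 473 µs, 593 µs and 596 µs. The 2560x1440 and 1920x1080 modes also run at slightly different refresh rates (about 59.86 Hz vs. 59.96 Hz), so their blanking windows drift relative to each other, whereas three identical monitors share the same timing.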

I changed both monitors to 1920x1080 @ 60 Hz (with the patch applied) and it works. I'll reboot to see if there is any difference without the patch.

GFX Clocks and Power:
	100 MHz (MCLK)
	6 MHz (SCLK)
	1300 MHz (PSTATE_SCLK)
	625 MHz (PSTATE_MCLK)
	725 mV (VDDGFX)
	8.0 W (average GPU)

edit: without the patch my clocks are still stuck, so the patch works, but only for 1080p@60hz. I can't use my monitor at that resolution so I guess I will have to find another solution or keep using my GPU at level 3.

The issue still persists on kernel 5.19.1 and hangs the system when starting the display server.

mentioned in issue #1403

mentioned in issue #1969

mentioned in issue #3363
