Upgrading to kernel 6.4.7 causes drops and fluctuation in GPU coreclock.
Hardware
RX6800 (OC: 950~960mV, 2490 core, 2250 memory)
MB: GB DS3H B550M (Above 4G - ON, Re-BAR - ON, latest BIOS)
Software
OS: Arch Linux
Kernels: Linux-Zen, Linux-TKG, both at 6.4.7 (and 6.5-RC-3)
DE: KDE Wayland
Corectrl for GPU monitoring
Mesa-git (compiled against llvm-minimal-git in a clean chroot as per the AUR instructions) with 9b008673 reverted as per mesa/mesa#9443
Card is set to the Compute GPU Power Profile. With both of those kernels at 6.4.7 (and 6.5-RC-3) the coreclock fluctuates between 2300MHz and ~2430MHz (not even hitting 2490MHz). Before being told it's an OC issue, I'd like to point out two things:
Same behavior with a lowered coreclock, i.e. the coreclock still fluctuates.
Downgrading both kernels to 6.4.6 eliminates the problem and the card is rock solid, staying at a constant 2490MHz on the core. Previous kernels didn't have the issue either.
As of 08.20.23:
A commit introduced in 6.4.7 causes this behavior (as stated in the comments below). It can be reverted. From my testing, performance seems to be the same whether I use a vanilla kernel (which reports a fluctuating coreclock) or a custom-compiled kernel with the commit reverted.
I'd like to stress that the card is overclocked to achieve a constant 2490MHz, not to fluctuate between 2300~2430MHz (never even reaching 2490MHz), which is what happens on kernels since 6.4.7.
As stated above, setting lower custom clocks to test whether the coreclock can be kept constant on kernels since 6.4.7 also fails; the coreclock fluctuates even when set to 2200MHz on the GPU Compute preset.
So while performance seems the same, proper coreclock monitoring is very important for overclocking and thermal management in general. As stated below, in my experience AMD cards start to fluctuate in coreclock when the VRMs heat up, so a correctly reported coreclock is vital when undervolting these cards.
Given that, the most important question is:
In which of the two instances does the driver report the coreclock properly?
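(For completeness, a minimal way to cross-check which clock the driver itself is actually selecting, independent of CoreCtrl's sensor readout, is something like the following; this assumes the GPU is card0 and the standard amdgpu sysfs layout, so adjust the path if the card enumerates differently.)

    # Print the sclk DPM table once per second; the '*' marks the level the driver currently selects.
    watch -n 1 cat /sys/class/drm/card0/device/pp_dpm_sclk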
d3vilguard changed title from "Droping GPU coreclock since kernel 6.4.7" to "Droping GPU coreclock since kernel 6.4.7 and 6.5-RC-3"
It's probably caused by a4eb11824170 ("drm/amdgpu/pm: make gfxclock consistent for sienna cichlid"), which was backported to stable as d28f75c986de ("drm/amdgpu/pm: make gfxclock consistent for sienna cichlid").
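If you want to confirm the suspect change is actually in the tree you're building, something along these lines (just a sketch, assuming a checkout of the stable branch with its release tags) will list it:

    # Commits touching the sienna cichlid PM code between 6.4.6 and 6.4.7
    git log --oneline v6.4.6..v6.4.7 -- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c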
Figured that. I can't quite work out how git revert works and how to use it with the PKGBUILD of kernel-tkg (easier for me to compile). I know this isn't the place to ask such questions, but I need help with the revert as I can't seem to do it myself. After that I will be able to compile 6.4.7 and 6.5-rc and report whether the issue persists. Thanks!
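In case it helps anyone else stuck at the same point, the rough shape of turning the revert into a patch file is something like this (a sketch, assuming a checkout of the 6.4.y stable sources; d28f75c986de is the stable backport hash mentioned above, and the target path is the linux-tkg userpatches directory):

    git revert --no-commit d28f75c986de                                  # apply the inverse of the commit to the index and working tree
    git diff HEAD > ~/linux-tkg/linux64-tkg-userpatches/patch.mypatch    # export the staged revert as a patch file
    git reset --hard                                                     # drop the staged revert from the checkout again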
Compiling 6.4.7. Will report back tomorrow on whether the coreclock fluctuates. Will also run benchmarks on 6.4.7 with my OC settings to see if there is a difference in score, and report back on that as well.
-> Applying your own linux-6.4 patch /home/georgi/linux-tkg/linux64-tkg-userpatches/patch.mypatch
patching file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
Hunk #1 FAILED at 1927.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c.rej
==> ERROR: A failure occurred in prepare(). Aborting...
sienna_cichlid_ppt.c.rej
--- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1927,12 +1927,16 @@ static int sienna_cichlid_read_sensor(struct smu_context *smu,
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_MCLK:
-		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_UCLK, (uint32_t *)data);
+		ret = sienna_cichlid_get_smu_metrics_data(smu,
+							  METRICS_CURR_UCLK,
+							  (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_SCLK:
-		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_GFXCLK, (uint32_t *)data);
+		ret = sienna_cichlid_get_smu_metrics_data(smu,
+							  METRICS_AVERAGE_GFXCLK,
+							  (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
-> Applying your own linux-6.4 patch /home/georgi/linux-tkg/linux64-tkg-userpatches/patch.mypatch
patching file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
Hunk #1 FAILED at 1927.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c.rej
==> ERROR: A failure occurred in prepare().
sienna_cichlid_ppt.c.rej
--- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1927,16 +1927,12 @@ static int sienna_cichlid_read_sensor(struct smu_context *smu,
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_MCLK:
-		ret = sienna_cichlid_get_smu_metrics_data(smu,
-							  METRICS_CURR_UCLK,
-							  (uint32_t *)data);
+		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_UCLK, (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_SCLK:
-		ret = sienna_cichlid_get_smu_metrics_data(smu,
-							  METRICS_AVERAGE_GFXCLK,
-							  (uint32_t *)data);
+		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_GFXCLK, (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
Are you sure you're applying this on a 6.4.7 tree that doesn't have any other amd driver patches applied? FTR, the respective chunk in a vanilla 6.4.7 should look like this:
	case AMDGPU_PP_SENSOR_GFX_MCLK:
		ret = sienna_cichlid_get_smu_metrics_data(smu,
							  METRICS_CURR_UCLK,
							  (uint32_t *)data);
		*(uint32_t *)data *= 100;
		*size = 4;
		break;
	case AMDGPU_PP_SENSOR_GFX_SCLK:
		ret = sienna_cichlid_get_smu_metrics_data(smu,
							  METRICS_AVERAGE_GFXCLK,
							  (uint32_t *)data);
		*(uint32_t *)data *= 100;
		*size = 4;
		break;
Ran the whole sienna_cichlid_ppt.c from the GitHub link you sent me against the one I have locally in a code comparison tool. Both are fully identical.
It could be that I'm compiling a custom kernel, although it doesn't have GPU patches. It doesn't matter; like I've said, I've manually edited sienna_cichlid_ppt.c and compiled.
Here comes the interesting bit. Tested 6.4.7-vanilla and 6.4.7-your-patch. Results:
Vanilla:
Unigine Superposition DXVK - 9150 points
Forza Horizon 5 Benchmark - 101 FPS
Patched:
Unigine Superposition DXVK - 9150 points
Forza Horizon 5 Benchmark - 101 FPS
While testing with vanilla 6.4.7 the coreclock fluctuates (~1700MHz to 2430MHz).
The patched kernel is stable and constant at 2490MHz (my OC).
Here comes my question: since both are getting the same result, I reckon the same GPU work is being done. Which of the two reports the clock of the card correctly? As I've said in my OP, the card is set to the Compute Power Profile, so under load I expect it to stay at max coreclock.
EDIT 1: I see lower VRAM usage, and judging by temperature I'd say the coreclock is indeed lower. Will try to compare frametimes. In DOOM Eternal I'm getting slightly lower FPS than expected from 6.4.6 and previous kernels. Investigating further.
EDIT 2: Just compiled 6.4.8 with the patch and tested it against unpatched 6.4.8. Way more stutters with the unpatched kernel compared to the patched one in the Forza Horizon 5 Benchmark.
I'm currently on 6.4.11. Tested both patched and unpatched with Unigine Superposition over DXVK. They seem to score the same. I'd like to ask my question again:
In which of the two instances does the driver report the coreclock properly?
I've had a few AMD cards on Linux (RX580, 6600XT, 6800), and while overclocking them I've noticed that the power delivery on the cards gets hot (and as the VRMs heat up, the coreclock starts fluctuating and dropping). So achieving a constant clock was a mix of monitoring the coreclock and undervolting. Afterwards, coreclock monitoring was important during stress testing to see whether the clock drops and whether further undervolting, or dialing the overclock back a bit, was needed. With the commit introduced in 6.4.7 that just isn't possible; a rough example of the monitoring loop I mean is sketched below.
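For illustration, the kind of loop I mean is something like this (a sketch, assuming card0 and the standard amdgpu hwmon layout; temp2_input is the junction sensor on my card, check temp2_label to be sure):

    # Log the currently selected sclk level and the junction temperature once per second during a stress test.
    HWMON=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon*)
    while true; do
        grep -F '*' /sys/class/drm/card0/device/pp_dpm_sclk
        awk '{ printf "junction: %.1f C\n", $1 / 1000 }' "$HWMON"/temp2_input
        sleep 1
    done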
With the patch that reverts the commit applied to kernel 6.4.11, my overclock gives a constant 2490MHz on the core. With vanilla 6.4.11 the coreclock sits between 2300~2430MHz (it doesn't even hit 2490MHz). Although benchmark results seem to be the same (though I'd say there might be a few more microstutters), it's very important for us to know exactly which configuration of the driver provides the correct monitoring values!
So it seems you guys don't really want to bother with resolving this. No worries, I will ask again: in which state does the driver report the correct coreclock, pre-6.4.7 or with the commit in 6.4.7?
Thanks, I guess we can close the issue then. cat /sys/class/drm/card0/device/pp_dpm_sclk seems to report the current clock values, and indeed, with the commit, AMDGPU_PP_SENSOR_GFX_SCLK provides average clock data. Correct monitoring seems to be more of a CoreCtrl issue, so I will open a bug there. Thanks for all the help!
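For anyone landing here later, a quick side-by-side of the two readouts looks roughly like this (a sketch, assuming card0 and the standard amdgpu hwmon layout; as far as I can tell freq1_input is the hwmon sensor served through AMDGPU_PP_SENSOR_GFX_SCLK and is reported in hertz):

    HWMON=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon*)
    grep -F '*' /sys/class/drm/card0/device/pp_dpm_sclk                       # DPM level the card is actually running
    awk '{ printf "sensor: %d MHz\n", $1 / 1000000 }' "$HWMON"/freq1_input    # averaged value since the 6.4.7 commit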