Upgrading to kernel 6.4.7 causes drops and fluctuation in GPU coreclock.
Hardware
RX6800 (OC: 950~960mV, 2490 core, 2250 memory)
MB: GB DS3H B550M (Above 4G - ON, Re-BAR - ON, latest BIOS)
Software
OS: Arch Linux
Kernels: Linux-Zen, Linux-TKG, both at 6.4.7 (and 6.5-RC-3)
DE: KDE Wayland
Corectrl for GPU monitoring
Mesa-git (compiled against llvm-minimal-git in a clean chroot as per the AUR instructions) with 9b008673 reverted as per mesa/mesa#9443
Card is set to the Compute GPU Power Profile. With both of those kernels at 6.4.7 (and 6.5-RC-3) the coreclock fluctuates between 2300MHz and ~2430MHz (not even hitting 2490MHz). Before being told it's an OC issue, I'd like to point out two things:
Same behavior with a lowered coreclock, i.e. the coreclock still fluctuates.
Downgrading both kernels to 6.4.6 eliminates the problem and the card is rock solid, staying at a constant 2490MHz on the core. Previous kernels didn't have the issue either.
As of 08.20.23:
A commit introduced in 6.4.7 causes this behavior (as stated in the comments below). It can be reverted. From my testing, performance seems to be the same whether I use a vanilla kernel (which reports a fluctuating coreclock) or a custom-compiled kernel with the commit reverted.
I'd like to stress that the card is overclocked to achieve a constant 2490MHz, not to fluctuate between 2300~2430MHz (never even reaching 2490MHz), which is what happens on kernels since 6.4.7.
As stated above, setting lower custom clocks to test whether the coreclock can be kept constant on kernels since 6.4.7 also fails; the coreclock fluctuates even when set to 2200MHz on the GPU Compute preset.
So while performance seems the same, proper coreclock monitoring is very important for overclocking and thermal management in general. As stated below, in my experience AMD cards start to fluctuate in coreclock when the VRMs heat up, so a correctly reported coreclock is vital when undervolting these cards.
Given that, the most important question is:
In which of the two instances does the driver report the coreclock properly?
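(For completeness, a minimal way to cross-check which clock the driver itself is actually selecting, independent of CoreCtrl's sensor readout, is something like the following; this assumes the GPU is card0 and the standard amdgpu sysfs layout, so adjust the path if the card enumerates differently.)

    # Print the sclk DPM table once per second; the '*' marks the level the driver currently selects.
    watch -n 1 cat /sys/class/drm/card0/device/pp_dpm_sclk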
d3vilguard changed title from "Droping GPU coreclock since kernel 6.4.7" to "Droping GPU coreclock since kernel 6.4.7 and 6.5-RC-3"
It's probably caused by a4eb11824170 ("drm/amdgpu/pm: make gfxclock consistent for sienna cichlid"), which was backported to stable as d28f75c986de ("drm/amdgpu/pm: make gfxclock consistent for sienna cichlid").
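If you want to confirm the suspect change is actually in the tree you're building, something along these lines (just a sketch, assuming a checkout of the stable branch with its release tags) will list it:

    # Commits touching the sienna cichlid PM code between 6.4.6 and 6.4.7
    git log --oneline v6.4.6..v6.4.7 -- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c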
Figured that. I can't quite work out how git revert works and how to use it with the PKGBUILD of kernel-tkg (easier for me to compile). I know this isn't the place to ask such questions, but I need help with the revert as I can't seem to do it myself. After that I will be able to compile 6.4.7 and 6.5-rc and report whether the issue persists. Thanks!
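In case it helps anyone else stuck at the same point, the rough shape of turning the revert into a patch file is something like this (a sketch, assuming a checkout of the 6.4.y stable sources; d28f75c986de is the stable backport hash mentioned above, and the target path is the linux-tkg userpatches directory):

    git revert --no-commit d28f75c986de                                  # apply the inverse of the commit to the index and working tree
    git diff HEAD > ~/linux-tkg/linux64-tkg-userpatches/patch.mypatch    # export the staged revert as a patch file
    git reset --hard                                                     # drop the staged revert from the checkout again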
Compiling 6.4.7. Will report back tomorrow on whether the coreclock fluctuates. Will also run benchmarks on 6.4.7 with my OC settings to see if there is a difference in score, and report back on that as well.
-> Applying your own linux-6.4 patch /home/georgi/linux-tkg/linux64-tkg-userpatches/patch.mypatch
patching file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
Hunk #1 FAILED at 1927.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c.rej
==> ERROR: A failure occurred in prepare(). Aborting...
sienna_cichlid_ppt.c.rej
--- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1927,12 +1927,16 @@ static int sienna_cichlid_read_sensor(struct smu_context *smu,
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_MCLK:
-		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_UCLK, (uint32_t *)data);
+		ret = sienna_cichlid_get_smu_metrics_data(smu,
+							  METRICS_CURR_UCLK,
+							  (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_SCLK:
-		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_GFXCLK, (uint32_t *)data);
+		ret = sienna_cichlid_get_smu_metrics_data(smu,
+							  METRICS_AVERAGE_GFXCLK,
+							  (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
-> Applying your own linux-6.4 patch /home/georgi/linux-tkg/linux64-tkg-userpatches/patch.mypatch
patching file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
Hunk #1 FAILED at 1927.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c.rej
==> ERROR: A failure occurred in prepare().
sienna_cichlid_ppt.c.rej
--- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1927,16 +1927,12 @@ static int sienna_cichlid_read_sensor(struct smu_context *smu,
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_MCLK:
-		ret = sienna_cichlid_get_smu_metrics_data(smu,
-							  METRICS_CURR_UCLK,
-							  (uint32_t *)data);
+		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_UCLK, (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
 	case AMDGPU_PP_SENSOR_GFX_SCLK:
-		ret = sienna_cichlid_get_smu_metrics_data(smu,
-							  METRICS_AVERAGE_GFXCLK,
-							  (uint32_t *)data);
+		ret = sienna_cichlid_get_current_clk_freq_by_table(smu, SMU_GFXCLK, (uint32_t *)data);
 		*(uint32_t *)data *= 100;
 		*size = 4;
 		break;
Are you sure you're applying this on a 6.4.7 tree that doesn't have any other amd driver patches applied? FTR, the respective chunk in a vanilla 6.4.7 should look like this:
	case AMDGPU_PP_SENSOR_GFX_MCLK:
		ret = sienna_cichlid_get_smu_metrics_data(smu,
							  METRICS_CURR_UCLK,
							  (uint32_t *)data);
		*(uint32_t *)data *= 100;
		*size = 4;
		break;
	case AMDGPU_PP_SENSOR_GFX_SCLK:
		ret = sienna_cichlid_get_smu_metrics_data(smu,
							  METRICS_AVERAGE_GFXCLK,
							  (uint32_t *)data);
		*(uint32_t *)data *= 100;
		*size = 4;
		break;
Ran the whole sienna_cichlid_ppt.c from the GitHub link you sent me against the one I have locally in a code comparison tool. Both are fully identical.
It could be that I'm compiling a custom kernel, although it doesn't have GPU patches. It doesn't matter; like I've said, I've manually edited sienna_cichlid_ppt.c and compiled.
Here comes the interesting bit. Tested 6.4.7-vanilla and 6.4.7-your-patch. Results:
Vanilla:
Unigine Superposition DXVK - 9150 points
Forza Horizon 5 Benchmark - 101 FPS
Patched:
Unigine Superposition DXVK - 9150 points
Forza Horizon 5 Benchmark - 101 FPS
While testing with vanilla 6.4.7 the coreclock fluctuates (~1700MHz to 2430MHz).
The patched kernel is stable and constant at 2490MHz (my OC).
Here comes my question: since both are getting the same result, I reckon the same GPU work is being done. Which of the two reports the clock of the card correctly? As I've said in my OP, the card is set to the Compute Power Profile, so under load I expect it to stay at max coreclock.
EDIT 1: I see lower VRAM usage, and judging by temperature I'd say the coreclock is indeed lower. Will try to compare frametimes. In DOOM Eternal I'm getting slightly lower FPS than expected from 6.4.6 and previous kernels. Investigating further.
EDIT 2: Just compiled 6.4.8 with the patch and tested it against unpatched 6.4.8. Way more stutters with the unpatched kernel compared to the patched one in the Forza Horizon 5 Benchmark.
I'm currently on 6.4.11. Tested both patched and unpatched with Unigine Superposition over DXVK. They seem to score the same. I'd like to ask my question again:
In which of the two instances does the driver report the coreclock properly?
I've had a few AMD cards on Linux (RX580, 6600XT, 6800), and while overclocking them I've noticed that the power delivery on the cards gets hot (and as the VRMs heat up, the coreclock starts fluctuating and dropping). So achieving a constant clock was a mix of monitoring the coreclock and undervolting. Afterwards, coreclock monitoring was important during stress testing to see whether the clock drops and whether further undervolting, or dialing the overclock back a bit, was needed. With the commit introduced in 6.4.7 that just isn't possible; a rough example of the monitoring loop I mean is sketched below.
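For illustration, the kind of loop I mean is something like this (a sketch, assuming card0 and the standard amdgpu hwmon layout; temp2_input is the junction sensor on my card, check temp2_label to be sure):

    # Log the currently selected sclk level and the junction temperature once per second during a stress test.
    HWMON=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon*)
    while true; do
        grep -F '*' /sys/class/drm/card0/device/pp_dpm_sclk
        awk '{ printf "junction: %.1f C\n", $1 / 1000 }' "$HWMON"/temp2_input
        sleep 1
    done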
With the patch that reverts the commit applied to kernel 6.4.11, my overclock gives a constant 2490MHz on the core. With vanilla 6.4.11 the coreclock sits between 2300~2430MHz (it doesn't even hit 2490MHz). Although benchmark results seem to be the same (though I'd say there might be a few more microstutters), it's very important for us to know exactly which configuration of the driver provides the correct monitoring values!
So it seems you guys don't really want to bother with resolving this. No worries, I will ask again: in which state does the driver report the correct coreclock, pre-6.4.7 or with the commit in 6.4.7?
Thanks, I guess we can close the issue then. cat /sys/class/drm/card0/device/pp_dpm_sclk seems to report the current clock values, and indeed, with the commit, AMDGPU_PP_SENSOR_GFX_SCLK provides average clock data. Correct monitoring seems to be more of a CoreCtrl issue, so I will open a bug there. Thanks for all the help!
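For anyone landing here later, a quick side-by-side of the two readouts looks roughly like this (a sketch, assuming card0 and the standard amdgpu hwmon layout; as far as I can tell freq1_input is the hwmon sensor served through AMDGPU_PP_SENSOR_GFX_SCLK and is reported in hertz):

    HWMON=$(ls -d /sys/class/drm/card0/device/hwmon/hwmon*)
    grep -F '*' /sys/class/drm/card0/device/pp_dpm_sclk                       # DPM level the card is actually running
    awk '{ printf "sensor: %d MHz\n", $1 / 1000000 }' "$HWMON"/freq1_input    # averaged value since the 6.4.7 commit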