I used to be able to set the power limit to 115W via apps like corectrl, and with that I was achieving huge efficiency (less than 10% lower performance for about 90W less power drawn). Now the new minimum is 190W. This is not an app issue. I read online that it may be due to kernel v6.7+, so I tried v6.6.10: the cap_min range was lower again and I was able to get my 115W. I have already reported this to regressions@lists.linux.dev, but I also wanted to report it here as it may have to do with AMD's driver module.
Here is an example of my cap set to 115W:
https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
commit 1958946858a62b6b5392ed075aa219d199bcae39
Author: Ma Jun <Jun.Ma2@amd.com>
Date:   Thu Oct 12 09:33:45 2023 +0800

    drm/amd/pm: Support for getting power1_cap_min value

    Support for getting power1_cap_min value on smu13 and smu11.
    For other Asics, we still use 0 as the default value.

    Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
    Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The reason you're getting a min cap this high is (probably) because your GPU AIB/vendor programmed it this way. While my 6600 XT reports 122W as the minimum, other 6600 XTs (and certainly other cards in general) might report different values.
You can try reverting that commit and compiling a custom kernel.
This is still a bad idea IMO and I would like it to be reverted. I was getting only about 7% less performance at 115W, which is ca. 90W less than the default! If only I could at least edit the min_cap option via /sys, but it's read-only.
Hard-locking to the vendor's "idea of min cap" without any possibility to override it is IMO wrong. We are not talking about default or reference values, nor about upper (max_cap) values, where I could understand the desire to protect the HW. I bet Windows would allow me to under-power via some Afterburner app, yet Linux is locking this range.
Devs, please revert this or give us an option to override min_cap, and consider not closing this issue before then. Thank you.
Many vendors simply do not know what they're doing. Please make this right; the ability to limit the power to what the user wants should be a must in 2024.
I have an ITX RX 6600XT, which I bought primarily due to its small size and the ability to limit the power below 100W. I am running it completely fanless, and it used to work great in Linux. Not anymore.
If vendors do not care about the world, some of us still do, and many of us run Linux because we do. Please let us do the right thing.
I bet Windows would allow me to under-power via some Afterburner app
I don't believe that's accurate. Windows (in both Adrenalin and Afterburner) allows me a power limit adjustment between -6% and +20%. This matches the new Linux 6.7 behavior on my 6600XT:
power1_cap_default = 145W
power1_cap_min = 136W (around 6% less than power1_cap_default)
power1_cap_max = 174W (exactly 20% more than power1_cap_default)
To be clear, I would love an option to go lower and have used power1_cap of 75W for a long time before 6.7.
But if we're comparing to Windows, we are apparently on par with the Windows behavior, not worse than Windows.
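The percentages quoted above can be double-checked with a small sketch. The helper name `cap_offset_percent` is made up for illustration; the wattages are the ones reported in this comment.

```c
/*
 * Hypothetical helper (not part of any driver or tool): offset of a power
 * cap relative to the default cap, in percent. Negative means below default.
 */
double cap_offset_percent(double cap_w, double default_w)
{
    return (cap_w - default_w) / default_w * 100.0;
}
```

Calling it with the values above, `cap_offset_percent(136.0, 145.0)` comes out at roughly -6.2 and `cap_offset_percent(174.0, 145.0)` at 20, which matches the -6% / +20% slider range reported for Windows.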
Regressions being introduced as features are never fun. This "feature" renders the power cap basically useless. On my system (RX6650 XT) I now can't set the power cap any lower than 154 W (default is 164 W), while before I could effectively cap it all the way down to 40 W under load in some games. I used to have it set at 120 W to keep my videocard cool, prolong its life, and reduce unnecessary power draw. The difference between 164 W and 120 W was zero to a few percent FPS, but a substantial power saving.
This "support" for power1_cap_min is anti-user, anti-powersaving, and anti-longevity if there's no way to override it. It's fine to have a power1_cap_min value, but writing a lower value to power1_cap should still be possible.
I assume we all use Arch here; if that's the case, there is a very easy way to compile the kernel with patches, as described here.
I can try to create some kind of patch that overrides the limit or ignores it but I have never written a single line of kernel code before, so this might take some time.
Thank you fililip. I think it would be better to simply revert the commit you highlighted above (btw, where did you find it?), since from the description the only thing it does is set the power1_cap_min value, precisely THE variable we don't want set.
Well, perhaps you're right, I'll probably start there. (What I wanted to do was to add some kind of parameter, like amdgpu.ignore_min_pcap that you could optionally set to 1 to preserve the newly added functionality.)
If the setting can be controlled, such as with a param like the amdgpu.ignore_min_pcap you mentioned, that is also OK, as long as it can be set early enough (before corectrl autostarts from the desktop, for example). In fact, looking at the link you just posted (where their issue was "setting a power limit that's too low for the current power state will actually disable it"), it may even be a better solution.
So any of the above-mentioned solutions that works is fine; I trust your judgement to do things right. And of course, thank you.
After you compile & install the patched kernel, you need to set the amdgpu.ignore_min_pcap=1 kernel boot parameter in GRUB/rEFInd/whichever boot manager you use.
fililip, you are The Dude, thank you so much!
Now, what does it take for this patch to be accepted and merged into the kernel/amdgpu, so that kernel recompilation won't be necessary? Will this be part of the next kernel/amdgpu version, or do they still need to learn about this patch and accept it?
I think I'd have to submit it to the Linux kernel mailing list, which I am kinda scared of.
It could be better to submit that patch to Arch Linux maintainers; they could include it in their kernel builds.
Thank you so much @fililip, this issue has been a dangerous problem for me for several months. The regression caused several overheating freezes, green screens, and shutdowns. When I built the system last fall, I spent many hours testing it. I found an optimal power limit setting that was surprisingly far below the default. This resulted in a system that was able to control heat in a semi-enclosed cabinet and provide satisfactory performance.
I have finished testing the patch on an RX 7800XT running Arch Linux CachyOS RC 6.8.0-rc4-1. After building the kernel, setting the kernel parameter, and rebooting, I was able to set the power limit to 160 watts. The GPU junction temps have gone from 90C down to 70C under load. I did not get the watt meter out to verify, but cabinet temp and fan speed have noticeably improved.
The reason I tested 6.8.0 instead of 6.7.5 is I am unable to reboot with kernels < 6.8.0 because of another amd issue.
I can test linux-mainline, Xanmod, LQX, and Arch default if anyone is interested.
Thanks for testing! Patched linux-zen also works great for me (I sometimes use Waydroid).
I emailed Jan Alexander Steffens from Arch's maintainer team about the patch but still got no response. Maybe it's not a bad idea to just send it to amd-gfx, though I am worried it might not get accepted (since it's more a hack than an actual kernel patch). We'd have to ask the driver maintainers about it.
It honestly feels weird reading this discussion when the arguments against re-enabling don't apply to the default state, when overdrive is disabled.
At the same time, I'm inclined to not bring it as an argument because it could result in power1_cap_min being enforced at vendor level in this state as well.
I just don't get why it's a bad idea to allow setting a too low minimum power cap, that can damage the hardware according to discussion, when the overdrive is disabled. The user might feel any setting is safe when od is disabled.
And then, while having the overdrive enabled and accepting risks of hardware damage through voltage increase, the minimum power cap is now disabled even though we're already in a risky environment.
In my opinion we should simply have multiple levels of control, such as no control (default), only undervolt and clock control and power limits within the bounding box, and finally, almost complete control.
This way we'd allow undervolting and safer settings without the risk of typos or forgetting the minus sign and the likes, and also allow the possibility of full control without requiring a custom kernel.
As it is right now, the available options aren't intuitive and they're risky in either mode, while also not giving the user enough freedom to enter a "do as I say" mode.
Thank you for the patch @fililip. I've been using it for the past 2 weeks and happily setting power1_cap of 75W on 6600XT (together with a -70mV undervolt).
Also, without seeing the code itself, only the diff, I think we can also replace:

-    if ((limit > smu->max_power_limit) || (limit < smu->min_power_limit)) {
+    if (amdgpu_ignore_min_pcap) {
+        if ((limit > smu->max_power_limit)) {
+            dev_err(smu->adev->dev,
+                "New power limit (%d) is over the max allowed %d\n",
+                limit, smu->max_power_limit);
+            return -EINVAL;
+        }
+    } else if ((limit > smu->max_power_limit) || (limit < smu->min_power_limit)) {
         dev_err(smu->adev->dev,
             "New power limit (%d) is out of range [%d,%d]\n",
             limit, smu->min_power_limit, smu->max_power_limit);

with:

     if ((limit > smu->max_power_limit) ||
         (!amdgpu_ignore_min_pcap && (limit < smu->min_power_limit))) {
         dev_err(smu->adev->dev,
             "New power limit (%d) is out of range [%d,%d]\n",
             limit, smu->min_power_limit, smu->max_power_limit);
In this case the user knows they explicitly set the flag at boot, so seeing min_cap at 0 in the error log should make it clear that this is not an actual out-of-range problem at the low end, only at the max.
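The single-condition variant proposed above can be modeled in user space as a rough sketch. This is not kernel code: `struct pcap_limits` and `pcap_check_limit` are hypothetical stand-ins for the smu fields and the driver function, and `fprintf` replaces `dev_err`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the two smu fields used by the check. */
struct pcap_limits {
    unsigned int min_power_limit;
    unsigned int max_power_limit;
};

/*
 * User-space model of the single-condition check: the upper bound is always
 * enforced, the lower bound only when ignore_min is false.
 * Returns 0 if the limit is accepted, -EINVAL otherwise.
 */
int pcap_check_limit(const struct pcap_limits *l, unsigned int limit,
                     bool ignore_min)
{
    if (limit > l->max_power_limit ||
        (!ignore_min && limit < l->min_power_limit)) {
        fprintf(stderr, "New power limit (%u) is out of range [%u,%u]\n",
                limit, l->min_power_limit, l->max_power_limit);
        return -EINVAL;
    }
    return 0;
}
```

With the 6600XT limits quoted earlier (min 136, max 174), a request for 75 is rejected when `ignore_min` is false and accepted when it is true, while anything above 174 is rejected either way.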
You'd have to change smu->min_power_limit to 0 in that case, I think, but the reason I did it this way is to keep the old message before that commit and also the new one.
Unless you mean it as a kernel warning, then instead of dev_err you could use dev_warn (only with the module parameter set to 1), as recommended by amdgpu code:
/*
 * DO NOT use these for err/warn/info/debug messages.
 * Use dev_err, dev_warn, dev_info and dev_dbg instead.
 * They are more MGPU friendly.
 */
You are right, I forgot about smu->min_power_limit. There is also this code:

+    if (amdgpu_ignore_min_pcap)
+        *limit = 0;
+    else
+        *limit = smu->min_power_limit;

So it may be better to set smu->min_power_limit = 0, if only for safety. It shouldn't remain uninitialized in case it is referenced somewhere. Then any code like the branch above can be reduced to a simple *limit = smu->min_power_limit, and amdgpu_ignore_min_pcap is no longer needed after an assignment somewhere at the beginning, such as:
smu->min_power_limit = amdgpu_ignore_min_pcap ? 0 : whatever_default_smuxx;
This would keep the original error logs unchanged as well.
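A minimal sketch of that alternative, assuming the effective minimum is decided once at initialization. The struct and function names here are hypothetical and only model the idea in user space; they are not the driver's actual types.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the smu limit fields. */
struct smu_limits_sketch {
    unsigned int min_power_limit;
    unsigned int max_power_limit;
};

/*
 * Decide the effective minimum once, at init time. vendor_min and vendor_max
 * stand for whatever the firmware reports; ignore_min_pcap models the
 * proposed module parameter.
 */
void smu_limits_init(struct smu_limits_sketch *s, unsigned int vendor_min,
                     unsigned int vendor_max, bool ignore_min_pcap)
{
    s->min_power_limit = ignore_min_pcap ? 0 : vendor_min;
    s->max_power_limit = vendor_max;
}

/* The range check needs no knowledge of the flag any more. */
bool smu_limit_in_range(const struct smu_limits_sketch *s, unsigned int limit)
{
    return limit >= s->min_power_limit && limit <= s->max_power_limit;
}
```

The trade-off is that the flag is consulted in one place only, at the cost of the reported minimum becoming 0 everywhere the field is read.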
@majun258, do you mean setting the mask in conjunction with the patch fililip posted, or does it work without it?
Also, I thought setting bits to 1 actually enables features, as per the wiki (see boot parameters):
https://wiki.archlinux.org/title/AMDGPU
whereas your mask sets it to 0?
BTW, thanks for the very important info.
But disabling OD on bit 14 means you cannot edit pp_od_clk_voltage, correct? This would disallow undervolting in exchange for setting a lower power limit, where we could actually do both before the changes.
What @majun258 means is that with the 14th bit of ppfeaturemask at 0, power1_cap_min is also at 0! This is super counterintuitive, but I just had it confirmed.
default amdgpu.ppfeaturemask, 14th bit @ 0, aka AMD OverDrive disabled:
cat /sys/module/amdgpu/parameters/ppfeaturemask
0xfff7bfff
cat /sys/class/drm/card1/device/pp_od_clk_voltage
cat: /sys/class/drm/card1/device/pp_od_clk_voltage: No such file or directory
cat /sys/class/drm/card1/device/hwmon/hwmon3/power1_cap
130000000
cat /sys/class/drm/card1/device/hwmon/hwmon3/power1_cap_max
130000000
cat /sys/class/drm/card1/device/hwmon/hwmon3/power1_cap_min
0
echo 95000000 > /sys/class/drm/card1/device/hwmon/hwmon3/power1_cap
echo $?
0
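The hwmon power1_cap files hold values in microwatts, which is why a 95 W cap is written as 95000000 in the session above. A small conversion sketch follows; the helper names are illustrative and not part of any existing tool.

```c
#include <stdlib.h>

/* Convert whole watts to the microwatt value written to power1_cap. */
unsigned long long watts_to_microwatts(unsigned int watts)
{
    return (unsigned long long)watts * 1000000ULL;
}

/* Parse text as read from a power1_cap* file (e.g. "130000000\n") into watts. */
double sysfs_power_to_watts(const char *text)
{
    return (double)strtoull(text, NULL, 10) / 1e6;
}
```

So the 130000000 reported for power1_cap and power1_cap_max corresponds to 130 W.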
amdgpu.ppfeaturemask=0xfff7ffff, 14th bit @ 1, aka AMD Overdrive enabled, pp_od_clk_voltage API available:
Now, this does not make any sense. If one disables "OverDrive", an arbitrary minimum power limit can be set. If one enables "OverDrive", the minimum power limit is fixed at some vendor-set value. So, goodbye to any reasonable undervolt + power limit scenario. This is just wrong.
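The two ppfeaturemask values in this comment differ only in bit 14 (0x4000). A short sketch of testing and setting that bit, with an illustrative macro name rather than the driver's own definition:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 14 of amdgpu.ppfeaturemask, the OverDrive bit discussed in this thread.
 * The macro name is made up for this sketch. */
#define OD_BIT_SKETCH (UINT32_C(1) << 14) /* 0x4000 */

/* True if the OverDrive bit is set in the given mask. */
bool overdrive_enabled(uint32_t ppfeaturemask)
{
    return (ppfeaturemask & OD_BIT_SKETCH) != 0;
}

/* Returns the mask with the OverDrive bit set, leaving other bits untouched. */
uint32_t with_overdrive(uint32_t ppfeaturemask)
{
    return ppfeaturemask | OD_BIT_SKETCH;
}
```

Applied to the default mask 0xfff7bfff, `overdrive_enabled` is false, and `with_overdrive` yields 0xfff7ffff, the value used for the OverDrive-enabled case.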
No, we need both, just like before.
I also have an undervolt offset of -0.1V. That alone helps recover a lot of the performance lost to the power cap. Without the undervolt, if I remember correctly, I was getting about 15-20% less performance with the power limit alone, when set to 115W.
With both, I narrowed it to only ~7.3% less, perfectly stable.
This is my simplified version of the patch, based on fililip's one. It applies min_cap=0 straight away, without the need for any boot option to be set. Tested on the v6.7.9 kernel.
amdgpu-power.patch