Linux 5.10 regression - manual fan control broken, causing critical temperatures on MSI RX 5700 XT Gaming
su
echo 1 > /sys/class/drm/card0/device/hwmon/hwmon3/pwm1_enable # set fan control to manual
echo 70 > /sys/class/drm/card0/device/hwmon/hwmon3/pwm1 # set fan speed to a PWM value of 70 (range 0-255)
Results:
Linux 5.9.14: Fans correctly spin up
Linux 5.10: Fans don't start spinning. The pwm1 value is not applied, and the card's default fan curve is not applied either. -> The GPU gets very hot under load and can probably overheat. Writing the values to sysfs from a terminal produces no error output.
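To compare the two kernels, it can help to read the relevant attributes back after writing them. The following is a minimal sketch, not part of the original report: the helper names `find_hwmon_dir` and `dump_fan_state` are made up for illustration, and it assumes the card exposes `pwm1_enable`, `pwm1` and `fan1_input` in its hwmon directory. It also looks the directory up instead of hardcoding `hwmon3`, since that index can differ between systems.

```shell
# Print the first hwmon directory under $1 that has a pwm1 attribute.
# $1: base directory, e.g. /sys/class/drm/card0/device/hwmon
# Prints nothing if no such directory exists.
find_hwmon_dir() (
    base=$1
    for d in "$base"/hwmon*; do
        if [ -e "$d/pwm1" ]; then
            printf '%s\n' "$d"
            break
        fi
    done
)

# Print the fan-related attributes of one hwmon directory as name=value.
# $1: hwmon directory as returned by find_hwmon_dir
dump_fan_state() (
    dir=$1
    for attr in pwm1_enable pwm1 fan1_input; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            printf '%s=<unreadable>\n' "$attr"
        fi
    done
)

# Example (run as root after the two echo commands above):
#   dump_fan_state "$(find_hwmon_dir /sys/class/drm/card0/device/hwmon)"
```

Running this once on 5.9 and once on 5.10 gives a like-for-like snapshot to attach to a report.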
Although this should be fixed, you could try UPP as a workaround (set FanAcousticLimitRPM to normal target RPM and FanThrottlingRPM to RPM expected under heavy load). It requires a suspend to ram before the settings are applied (on my system at least) but it should be more reliable otherwise and you have more settings available than is exposed by the AMD drivers.
Thanks, I once used UPP/SPPT on Polaris. It didn't work without side effects there; hopefully it's better with Navi. Still, I'd like to stick with the regular manual fan control, as it does exactly what I want it to do, unlike driver-controlled fan adjustments (with those you can't entirely get rid of RPM fluctuations since Vega, though it's acceptable with my undervolting and thermal paste replacement).
It might take some time until I get to bisecting. If someone suggested specific commits to revert, I could do that sooner.
If you manually set a low fan speed, the chip will get hot under load unless you switch back to automatic or manually change the fan speed. Does the fan not spin at all when you manually set it?
Yes, that's the issue: the fans don't start spinning at all with manual control. With 5.10, no pwm1 value makes the fans spin up; they are always off. With Linux 5.9, a value of 70, for example, makes them spin at something like 900rpm.
I also checked whether auto control is unaffected by reverting this commit, which is the case. So, would it be possible to revert it upstream, or might there be implications which I'm not aware of? :)
I'm not able to achieve anything useful with the fan1* entries in my sysfs (with the aforementioned commit not reverted, for testing purposes). However, I haven't found docs on how to use them. Anything specific I should try?
I'd like to keep the functionality to adjust fans via PWM. This script relies on it and has worked flawlessly for me and others so far: https://github.com/grmat/amdgpu-fancontrol
See:
https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface
I'm just wondering if maybe your board doesn't support the rpm interface for some reason. If some of the rpm parameters are not populated in the vbios, using the rpm interface won't work properly. We should probably fallback to some default values in that case. That may be the root cause of the problems you are seeing.
Thanks for the link. I can set fan1_enable to 1 and write to fan1_target to change the rpm (annoyingly, the speed keeps going up and down all the time though). Trying to set fan1_max/min/input returns a "no permission" error. I have no other cards to compare against. But with the MSI Gaming, it seems to be just terrible compared to simple PWM changes or tweaked fan values via SPPT.
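For anyone else trying the RPM interface, here is a minimal sketch of the sequence described above. The function name `set_fan_rpm` is hypothetical. It assumes that `fan1_enable` and `fan1_target` are writable while `fan1_min`, `fan1_max` and `fan1_input` are read-only, which is how the amdgpu hwmon documentation appears to describe them and would explain the permission error; treat that as an assumption to verify on your own card.

```shell
# Request a fan speed in RPM through the fan1_* attributes.
# $1: hwmon directory, $2: desired speed in RPM
# The request is clamped to the range the driver reports in
# fan1_min/fan1_max; the value actually written is printed.
set_fan_rpm() (
    dir=$1
    rpm=$2
    min=$(cat "$dir/fan1_min")
    max=$(cat "$dir/fan1_max")
    if [ "$rpm" -lt "$min" ]; then rpm=$min; fi
    if [ "$rpm" -gt "$max" ]; then rpm=$max; fi
    echo 1 > "$dir/fan1_enable"       # switch to RPM-based manual control
    echo "$rpm" > "$dir/fan1_target"  # desired speed in RPM
    echo "$rpm"
)

# Example (as root; the path is the one used earlier in this thread):
#   set_fan_rpm /sys/class/drm/card0/device/hwmon/hwmon3 900
```

If `fan1_min` or `fan1_max` read back as 0 on a board, clamping like this would not be meaningful, which ties in with the unpopulated vbios values mentioned above.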
Running kernel 5.10.3 with a Radeon Pro W5500. I experience the same issue and can confirm reverting commit 8d6e65adc25e23fabbc5293b6cd320195c708dca fixes it.
just confirming 5.10.4 and 5.11rc1 both resulted in loss of pwm control, it will start working after I echo different pwm settings for a while and then stop working ...
It felt like I get some control, if the fans kick on "auto" ... then if I switch to manual while they are running then I'll have control for some time ... and then it will stop working again ...
After reverting 8d6e65adc25e23fabbc5293b6cd320195c708dca 5.11rc1 works exactly as 5.9 did before ... system is Ryzen 7 3800XT with 5700xt (still no fan rpm readout as that works once a week for few minutes ...sometimes)
Anecdotal addition here, but it's interesting. I experience the exact same issues but I have noticed that if I let the temp get over 60c, through radeon-profile's interface, I can often times regain control of the fan. However, upon resume or reboot, that control is lost and I can reach critical temps again until the card reaches 60+ and I repeat the process.
When reaching 60+ and setting the fan to auto, it will also somehow enable fan RPM reporting. It's the only time I've seen the fan's actual speed since installing the card.
I have experienced similar issues with 5.10.4 regarding fan control with my Sapphire 5700XT as well.
Fan control was not completely broken, but whenever I enabled manual fan control and wrote a value to pwm1, pwm1 would oscillate around the value I wrote. If I wrote 150, pwm1 would oscillate between 114 and 220, causing the fans to constantly slow down to around 1110 RPM and rise to 2000 RPM.
In addition, using the tool PowerUPP (a frontend for https://github.com/sibradzic/upp) to reduce the max Gfx clock breaks auto fan control as well. Reducing the max Gfx clock causes the fans to get stuck at ~1500RPM constantly (they keep spinning even though the GPU temp is around 48C). If I revert the max Gfx clock change, the fans spin down normally and auto fan control starts working again.
Reverting 8d6e65adc25e23fabbc5293b6cd320195c708dca fixes both issues but RPM readout doesn't work properly. Somehow fan RPM readout works fine if you switch to manual control once the fans start spinning in auto control.
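The pwm1 oscillation described above can be quantified by polling the attribute and recording the lowest and highest value seen. This is a minimal sketch with a hypothetical helper name, `sample_range`; the sample count and delay are arbitrary choices, not values taken from this thread.

```shell
# Read an attribute repeatedly and print "min max" of the values seen.
# $1: file to read (e.g. .../hwmon3/pwm1 or .../hwmon3/fan1_input)
# $2: number of samples
# $3: delay between samples in seconds
sample_range() (
    file=$1
    n=$2
    delay=$3
    min=
    max=
    i=0
    while [ "$i" -lt "$n" ]; do
        v=$(cat "$file")
        if [ -z "$min" ] || [ "$v" -lt "$min" ]; then min=$v; fi
        if [ -z "$max" ] || [ "$v" -gt "$max" ]; then max=$v; fi
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$n" ]; then sleep "$delay"; fi
    done
    echo "$min $max"
)

# Example: 30 samples, one per second, after writing 150 to pwm1:
#   sample_range /sys/class/drm/card0/device/hwmon/hwmon3/pwm1 30 1
```

A wide gap between the two numbers after writing a fixed value would match the behavior reported here; identical numbers would indicate a stable reading.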
I don't think this is a regression in your case per se. We just changed how we report the manual pwm setting. We used to report the value specified by the user for pwm, whereas now we query the hardware for the current fan value. That way the value is consistent regardless of whether you are changing/querying the value via the fan interface or the pwm interface.
The idea does make sense, but there has to be something else about 8d6e65adc25e23fabbc5293b6cd320195c708dca that changes how fan speed is set. I just updated from 5.9 to 5.10, noticed that fan control wasn't working properly anymore and had to build a custom kernel with that specific commit reverted to get it to work. That loss of functionality is a regression, to be fair.
I'm not that knowledgeable about the amdgpu driver but 8d6e65adc25e23fabbc5293b6cd320195c708dca seems to completely remove a function called smu_v11_0_set_fan_speed_percent from drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c . Could that somehow break fan control via pwm1?
Just tested fan control using fan1_target on 5.10.10 with your patches applied. The behavior is the same as with pwm1. Fan RPM is always at zero and fan speed oscillated audibly.
There are two ways to program the fan controller: one using a percentage and one using rpms. We originally exposed both interfaces (one as pwm1 and the other as fan1). This led to problems with consistency because updating one interface would not be reflected in the other. So what the patch set that 8d6e65adc25e23fabbc5293b6cd320195c708dca was part of did was use the rpm interface for both pwm1 and fan1. That way no matter which hwmon interface you use, you have a consistent view of the actual fan speed. We just convert between rpms and percentages when returning the values to hwmon. The SMU uses the rpm interface internally when automatic fan control is enabled so it should work fine.
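To make the conversion concrete, here is a rough sketch of what translating between a pwm1 value, a percentage and an RPM figure could look like. This is not the driver's code: it assumes plain linear scaling with integer arithmetic, and the maximum of 3300 RPM in the example is a made-up number, not one read from any card in this thread.

```shell
# pwm value (0-255) -> percentage (0-100), integer division
pwm_to_percent() { echo $(( $1 * 100 / 255 )); }

# percentage -> RPM, given an assumed maximum fan speed in RPM ($2)
percent_to_rpm() { echo $(( $1 * $2 / 100 )); }

# RPM -> pwm value (0-255), given the same assumed maximum ($2)
rpm_to_pwm() { echo $(( $1 * 255 / $2 )); }

# Example with a hypothetical 3300 RPM maximum:
#   pwm_to_percent 70       -> 27
#   percent_to_rpm 27 3300  -> 891
#   rpm_to_pwm 891 3300     -> 68
```

Even in this simplified form, a round trip does not return the value that was written (70 goes in, 68 comes out), so a pwm1 readback that differs slightly from the written value is not by itself a sign of a fault. It does not explain fans that stay off or swing between 114 and 220.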
Interesting, I just tried to set the fan speed using fan1_target (on 5.10.4 with 8d6e65adc25e23fabbc5293b6cd320195c708dca reverted) and it doesn't work. Fans start to turn at ~122RPM regardless of what value I write to fan1_target but fan control via pwm1 works normally.
Could the cause of this issue be a problem with the RPM interface that was exposed by the removal of the pwm interface? Did you experience anything similar while testing the patchset?
It's odd, the RPM interface doesn't seem to work with your patches on 5.10.10. Fan RPM is always at 0 in manual mode, and although I can set the speed using fan1_target, it oscillates (the fans spin at maximum speed briefly when I first write to fan1_target). I'm not a kernel programmer, but the RPM interface doesn't seem to work. I can try this using an unpatched, stock kernel too if it would help.
Fan speed readout works correctly with my MSI 5700 XT Gaming X (before and after 8d6e65adc25e23fabbc5293b6cd320195c708dca) as long as fans are in auto mode. It does not work when setting PWM control to manual, it then always reads out 0rpm. However, before the aforementioned commit (or with it reverted), the fans' actual behavior in manual mode seems to be 100% reliable. So I don't mind the broken speed report much.
Of course it would be nice if we had correctly working fan speed readout also with manual PWM control. I just think to get manual PWM control working again at all without patched kernel is more important for the time being.
I had that behavior in 5.9 too, fan readout worked in auto mode and I had to turn on manual control after the fans start spinning in auto mode to keep fan readout. I guess 8d6e65adc25e23fabbc5293b6cd320195c708dca fixes the RPM readout issue while breaking manual control.
This issue is present on 5.10.10 too with the exact same behavior. Fans still oscillate around the given speed, but this time the value of pwm1 was stuck at zero despite the fans turning. After trying to manually adjust the speed, I set the control mode to auto again, and the fans never seem to stop once they exit zero RPM mode and start turning when the GPU reaches around 50C. The interesting thing is that FanStopTemp is set to 50 according to UPP, and the fans were turning at an audibly loud speed even after the GPU temp dropped below 50C (38C to be exact), which is not what normally happens in auto mode.
Reverting 8d6e65adc25e23fabbc5293b6cd320195c708dca makes control via pwm and rpm work again. I really don't want to sound entitled or demanding but the problematic commit is known and this regression affects multiple people with different cards, why doesn't anyone do anything about it? The amdgpu driver is really good but a crucial feature being broken for multiple people for a rather long time and nobody doing anything about it doesn't exactly seem right, at least from the perspective of a user.
My 2 cents are that this commit should just be reverted in mainline, as different users have tested different cards from Vega till Navi and no bad behavior has occurred with it reverted.
Does "echo 0 > pwm1_enable" work properly on your boards? That should force the fan to the max speed. Please also make sure your driver has this commit:
drm/amdgpu/pm/smu11: Fix fan set speed bug

Fix fan set speed calculation.

Suggested-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Arunpravin <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

Can you also attach your dmesg output with the attached patch (amd1408.diff) applied?
Can you also attach the output of /sys/kernel/debug/dri/X/amdgpu_firmware_info (replace X with the appropriate number for your card if you have multiple GPUs).
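When reporting results for the checks requested above, it may help to state which control mode the card was in at the time. Below is a minimal sketch with hypothetical helper names (`describe_pwm_mode`, `fan_report`). The mode meanings follow the amdgpu hwmon documentation as far as I know (0: no fan speed control, 1: manual, 2: automatic), which matches the expectation that writing 0 forces the fan to maximum speed.

```shell
# Translate a pwm1_enable value into a human-readable description.
describe_pwm_mode() {
    case $1 in
        0) echo "no fan speed control (fan at full speed)" ;;
        1) echo "manual" ;;
        2) echo "automatic" ;;
        *) echo "unknown" ;;
    esac
}

# Print the current control mode of one hwmon directory.
# $1: hwmon directory, e.g. /sys/class/drm/card0/device/hwmon/hwmon3
fan_report() (
    dir=$1
    mode=$(cat "$dir/pwm1_enable")
    echo "pwm1_enable=$mode ($(describe_pwm_mode "$mode"))"
)

# Example (as root), followed by the two outputs requested above;
# the 0 in the debugfs path stands in for X and is an assumption:
#   fan_report /sys/class/drm/card0/device/hwmon/hwmon3
#   dmesg > dmesg.txt
#   cat /sys/kernel/debug/dri/0/amdgpu_firmware_info > firmware_info.txt
```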