On Arch Linux, using kernel 5.7.11, trying to undervolt a 5700 returns an error.
Hardware description:
CPU: AMD Ryzen 1700
GPU: AMD 5700XT
System Memory: 16GB
Display(s): 1
Type of Display Connection: DP
System information:
Distro name and Version: Arch Linux
Kernel version: 5.7.11
Custom kernel: Zen
AMD package version: No Package
How to reproduce the issue:
Update to kernel 5.7.11
Add boot parameters: amdgpu.ppfeaturemask=0xffffffff
Try to alter the corresponding file with echo "s 1 1900" > /sys/class/drm/card0/device/pp_od_clk_voltage
The write fails with: echo: write error: Invalid argument
Going back to 5.7.10, I can edit the file correctly.
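For anyone who would rather probe this from code than from the shell, here is a minimal Rust sketch (an illustration, not part of the original report; the card index and OD values are taken from the steps above) that attempts the same write and reports the resulting error:

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() {
    // Same file and OD command as the echo above; adjust card0 for your system.
    let path = "/sys/class/drm/card0/device/pp_od_clk_voltage";
    let result = OpenOptions::new()
        .write(true)
        .open(path)
        .and_then(|mut f| f.write_all(b"s 1 1900\n"));
    match result {
        Ok(()) => println!("write succeeded"),
        // On 5.7.11 this should fail with EINVAL (os error 22), matching the shell error.
        Err(e) => eprintln!("write failed: {e}"),
    }
}
```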
I see the same here with a Polaris card (an RX480). Using kernel 5.7.11, I get write errors for the pp_od_clk_voltage file. There is no problem with kernel 5.7.10.
I'm actually working on bisecting this issue right now.
There are two problems:
At some point, something made pp_table writes fail every time. I filed a separate issue for this one as #1243 (closed)
At some point, a change to the sysfs handling code created the EINVAL errors for writing to most of the DPM sysfs interfaces.
I have both a POLARIS10 and NAVI10 card handy, so when I get a fix, I'll be testing with both.
EDIT[1]: I've since figured out that there are actually two problems with the sysfs interface. One causes EINVAL and is fixed as of the latest amd-staging-drm-next, but at that point a new problem appears: an infinite loop. I'm continuing to investigate and work on a fix.
@rropid @alosarjos - applying these two patches on top of the latest amd-staging-drm-next should resolve the issue for you. Though I haven't seen any issues running with these reverted, there may be something I just haven't run into yet; everything seems to be working for me with these two patches while we wait on debugging and a more robust fix that does what these patches intended without the regression(s).
If you're not writing to pp_table, and only to pp_od_clk_voltage and friends, then you will only need the first patch. If you're writing to pp_table, then you'll need the second one as well.
Let me know if this works for you; that'll confirm it's not just me and that I'm looking in the right spot for the "real" fix.
Running the latest amd-staging-drm-next kernel with no patches, writes to pp_table on a Radeon VII work without issues. Writing to pp_od_clk_voltage pegs one thread at 100% and hangs indefinitely, with no usable errors logged.
With the first of your posted patches applied to the amd-staging-drm-next kernel, writes to pp_od_clk_voltage work without errors. I didn't apply the second patch, since there were no issues writing to pp_table on a Radeon VII with the stock amd-staging-drm-next kernel.
Clock controls via writing to pp_od_clk_voltage are also broken with kernel 5.8-rc7. No issues with 5.8-rc6.
Writes to pp_table and hwmon(power_cap, fan1, etc) work without issue.
Reverting the patch to drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c fixed the errors writing to pp_od_clk_voltage and worked without issues on my headless machines, but it broke the display on my workstation. (The display issues turned out to be unrelated to reverting the commit.)
Applying only the amdgpu_set_pp_od_clk_voltage portion of the patch fixed writing to pp_od_clk_voltage, and so far it seems to have no other adverse effects.
Evan got some work in on this; I tested, and I now have a working system with only one of the above patches. You now need only the following to "resolve" this issue:
0001-Revert-drm-amdgpu-fix-system-hang-issue-during-GPU-r.patch (see my comment above for the attached patch)
Update: While Evan's patches to fix the pp_table writes have landed upstream, part of the code from the other reverted commit has seeped elsewhere and it is no longer easily revertible, so for now we're stuck with this issue on mainline and amd-staging-drm-next.
To reiterate, the issue with writing to pp_od_clk_voltage causing an infinite loop (at least seemingly... since it pins one core of the CPU) was introduced in edad8312cbbf9a33c86873fc4093664f150dd5c1. Hopefully we can get some help from the AMD guys on this one, as this particular commit touches so many systems that (for now) it's out of my range of ability to understand fully without spending all day reading through every single system in there, which would likely take me a week.
So, while this is still a kernel bug, I found an interesting tidbit, leading to a workaround.
My custom overclocking tool for smu_v11_0 cards (amdgpu-smu-od) does not encounter this issue, likely because it writes each command, and its newline, in a single write call.
So, that tidbit could help us narrow down where the issue is.
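For reference, here is a minimal Rust sketch of that pattern (an illustration of the approach, not amdgpu-smu-od's actual code): open the sysfs file and emit each command plus its trailing newline in a single write.

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .write(true)
        .open("/sys/class/drm/card0/device/pp_od_clk_voltage")?;
    // File is unbuffered, so each write_all lands as one write(2) carrying the
    // command and its newline together, unlike the tee/`>` paths that hit the hang.
    f.write_all(b"s 1 1900\n")?;
    f.write_all(b"c\n")?; // commit the new OD table, also as a single write
    Ok(())
}
```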
Workaround
For now, if you want to use the pp_od_clk_voltage interface, I wrote a stupid-simple tool, gpu-apply, that doesn't experience this bug. If you don't have Rust, I can provide binaries, but will only do so upon request.
Example usage (run from /sys/class/drm/cardX/device):
I have been running the mainline 5.8 kernel since it was released, and I haven't encountered the bug. I use a simple bash script to adjust power, clocks, and voltage.
The script doesn't have any logic, since each line is just an echo to the relevant sysfs file. I use upp to write a new pp_table prior to adjusting clocks and voltages.
Early on with amdgpu I encountered some bugs with forcing clocks to max values. The only way they would apply was to set the performance level to manual, adjust clocks, and then set the performance level to high. Without setting the performance level to manual and then high, memory clocks would stick at idle speeds with some compute loads.
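In case it helps anyone, that sequence looks roughly like this. This is a sketch in Rust rather than the poster's actual bash script, assuming card0 and using the single-write pattern from earlier in the thread; the OD values are placeholders, not recommendations.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Write one command (plus its newline) to a sysfs file in a single write call.
fn sysfs_write(path: &str, cmd: &str) -> std::io::Result<()> {
    OpenOptions::new()
        .write(true)
        .open(path)?
        .write_all(format!("{cmd}\n").as_bytes())
}

fn main() -> std::io::Result<()> {
    let dev = "/sys/class/drm/card0/device"; // adjust cardX for your system
    // 1. Switch to manual so the forced clocks actually apply.
    sysfs_write(&format!("{dev}/power_dpm_force_performance_level"), "manual")?;
    // 2. Adjust clocks/voltages (example OD table edit, then commit).
    sysfs_write(&format!("{dev}/pp_od_clk_voltage"), "s 1 1900")?;
    sysfs_write(&format!("{dev}/pp_od_clk_voltage"), "c")?;
    // 3. Switch to high so memory clocks don't stick at idle under compute loads.
    sysfs_write(&format!("{dev}/power_dpm_force_performance_level"), "high")?;
    Ok(())
}
```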
I just checked out 5.8 from Linus' tree, and it looks like edad8312cbbf9a33c86873fc4093664f150dd5c1 isn't in there, which is probably why this is working on that kernel for you.
Since it's still in amd-staging-drm-next, it could land in mainline elsewhere at any moment, so we still need to get this sorted out.
@agd5f I'm sure this is already on your radar, but just to clarify since the title might be misleading, this is still an issue on amd-staging-drm-next as of 20200813 (time of writing).
It's bisected: it was introduced in edad8312cbbf. I ran with that patch reverted for a while, but the semaphore pattern it uses has leaked into patches that have come afterwards, so it's not a 100% trivial reversion anymore.
FWIW, it seems that the issue is (for some reason... still digging) only triggered when using tools like tee or the > operator in bash to write to the file. If you write to it directly, in one write call (from C or Rust, say), it seems to be fine.
(I've also tried these from outside of that directory, using fully qualified paths, just to make sure there wasn't some weirdness with the shell keeping /sys/class/drm/cardX/device busy by using it as its working directory.)
That patch touches so many different systems that it's really hard for me to pick apart where the issue would lie. Given the symptom of the writing process pinning one CPU core at 100%, it looks like either:
There's an issue with the sysfs handling code that's causing it to loop infinitely (I know I was able to cause this once, months ago, by using buffered writes with Rust resulting in multiple write calls, but that doesn't explain why tee/> stopped working at that commit).
Or (seemingly more likely given the patch contents) there's a deadlock somewhere with either the reset semaphore or one of the atomic checks, resulting in an infinite spin.
If there's any info you could give that would help me fix this, I'd appreciate it. Navi OD has kinda become my baby in userspace. Also, let me know if you think it would be more prudent to send this whole report and all the information I've gathered to amd-gfx; I can do that, or just link this GitLab issue along with a short summary.
So, they decided to completely revert the commit that introduced the problem (for other reasons), and this patch is no longer needed as of 026acaeac2d205f22c0f682cc1c7b1a85b9ccd00 on amd-staging-drm-next. We can probably close this now.