On Arch Linux, using kernel 5.7.11, trying to undervolt a 5700 returns an error.
Hardware description:
CPU: AMD Ryzen 1700
GPU: AMD 5700XT
System Memory: 16GB
Display(s): 1
Type of Display Connection: DP
System information:
Distro name and Version: Arch Linux
Kernel version: 5.7.11
Custom kernel: Zen
AMD package version: No Package
How to reproduce the issue:
Update to kernel 5.7.11
Add boot parameters: amdgpu.ppfeaturemask=0xffffffff
Try to alter the corresponding file with echo "s 1 1900" > /sys/class/drm/card0/device/pp_od_clk_voltage
The write fails with: echo: write error: Invalid argument
Going back to 5.7.10, I can edit the file correctly.
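For anyone who would rather probe this from code than from the shell, here is a minimal Rust sketch (an illustration, not part of the original report; the card index and OD values are taken from the steps above) that attempts the same write and reports the resulting error:

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() {
    // Same file and OD command as the echo above; adjust card0 for your system.
    let path = "/sys/class/drm/card0/device/pp_od_clk_voltage";
    let result = OpenOptions::new()
        .write(true)
        .open(path)
        .and_then(|mut f| f.write_all(b"s 1 1900\n"));
    match result {
        Ok(()) => println!("write succeeded"),
        // On 5.7.11 this should fail with EINVAL (os error 22), matching the shell error.
        Err(e) => eprintln!("write failed: {e}"),
    }
}
```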
I see the same here with a Polaris card (an RX480). Using kernel 5.7.11, I get write errors for the pp_od_clk_voltage file. There is no problem with kernel 5.7.10.
I'm actually working on bisecting this issue right now.
There are two problems:
At some point, something made pp_table writes fail every time. I filed a separate issue for this one as #1243 (closed)
At some point, a change to the sysfs handling code created the EINVAL errors for writing to most of the DPM sysfs interfaces.
I have both a POLARIS10 and NAVI10 card handy, so when I get a fix, I'll be testing with both.
EDIT[1]: I've since figured out that there are actually two problems with the sysfs interface. One causes EINVAL and is fixed as of the latest amd-staging-drm-next, but at that point a new problem appears: an infinite loop. I'm continuing to investigate and work on a fix.
@rropid @alosarjos - applying these two patches on top of the latest amd-staging-drm-next should resolve the issue for you. Though I haven't seen any issues running with these reverted, there may be something I just haven't run into yet; everything seems to be working for me with these two patches while we wait on debugging and a more robust fix that does what these patches intended without the regression(s).
If you're not writing to pp_table, and only to pp_od_clk_voltage and friends, then you will only need the first patch. If you're writing to pp_table, then you'll need the second one as well.
Let me know if this works for you; that'll confirm it's not just me and that I'm looking in the right spot for the "real" fix.
Running the latest amd-staging-drm-next kernel with no patches, writes to pp_table on a Radeon VII work without issues. Writing to pp_od_clk_voltage pegs one thread at 100% and hangs indefinitely, with no usable errors logged.
With the first of your posted patches applied to the amd-staging-drm-next kernel, writes to pp_od_clk_voltage work without errors. I didn't apply the second patch, since there were no issues writing to pp_table on a Radeon VII with the stock amd-staging-drm-next kernel.
Clock controls via writing to pp_od_clk_voltage are also broken with kernel 5.8-rc7. No issues with 5.8-rc6.
Writes to pp_table and hwmon(power_cap, fan1, etc) work without issue.
Reverting the patch to drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c fixed the errors writing to pp_od_clk_voltage and worked without issues on my headless machines, but it broke the display on my workstation. (The display issues turned out to be unrelated to reverting the commit.)
Applying only the amdgpu_set_pp_od_clk_voltage portion of the patch fixed writing to pp_od_clk_voltage, and so far it seems to have no other adverse effects.
Evan got some work in on this; I tested, and I now have a working system with only one of the above patches. You now need only the following to "resolve" this issue:
0001-Revert-drm-amdgpu-fix-system-hang-issue-during-GPU-r.patch (see my comment above for the attached patch)
Update: While Evan's patches to fix the pp_table writes have landed upstream, part of the code from the other reverted commit has seeped elsewhere and it is no longer easily revertible, so for now we're stuck with this issue on mainline and amd-staging-drm-next.
To reiterate, the issue with writing to pp_od_clk_voltage causing an infinite loop (at least seemingly... since it pins one core of the CPU) was introduced in edad8312cbbf9a33c86873fc4093664f150dd5c1. Hopefully we can get some help from the AMD guys on this one, as this particular commit touches so many systems that (for now) it's out of my range of ability to understand fully without spending all day reading through every single system in there, which would likely take me a week.
So, while this is still a kernel bug, I found an interesting tidbit, leading to a workaround.
My custom overclocking tool for smu_v11_0 cards (amdgpu-smu-od) does not encounter this issue, likely because it writes each command, and its newline, in a single write call.
So, that tidbit could help us narrow down where the issue is.
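For reference, here is a minimal Rust sketch of that pattern (an illustration of the approach, not amdgpu-smu-od's actual code): open the sysfs file and emit each command plus its trailing newline in a single write.

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .write(true)
        .open("/sys/class/drm/card0/device/pp_od_clk_voltage")?;
    // File is unbuffered, so each write_all lands as one write(2) carrying the
    // command and its newline together, unlike the tee/`>` paths that hit the hang.
    f.write_all(b"s 1 1900\n")?;
    f.write_all(b"c\n")?; // commit the new OD table, also as a single write
    Ok(())
}
```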
Workaround
For now, if you want to use the pp_od_clk_voltage interface, I wrote a stupid-simple tool, gpu-apply, that doesn't experience this bug. If you don't have Rust, I can provide binaries, but will only do so upon request.
Example usage (run from /sys/class/drm/cardX/device):
I have been running the mainline 5.8 kernel since it was released, and I haven't encountered the bug. I use a simple bash script to adjust power, clocks, and voltage.
The script doesn't have any logic, since each line is just an echo to the relevant sysfs file. I use upp to write a new pp_table prior to adjusting clocks and voltages.
Early on with amdgpu I encountered some bugs with forcing clocks to max values. The only way they would apply was to set the performance level to manual, adjust clocks, and then set the performance level to high. Without setting the performance level to manual and then high, memory clocks would stick at idle speeds with some compute loads.
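In case it helps anyone, that sequence looks roughly like this. This is a sketch in Rust rather than the poster's actual bash script, assuming card0 and using the single-write pattern from earlier in the thread; the OD values are placeholders, not recommendations.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Write one command (plus its newline) to a sysfs file in a single write call.
fn sysfs_write(path: &str, cmd: &str) -> std::io::Result<()> {
    OpenOptions::new()
        .write(true)
        .open(path)?
        .write_all(format!("{cmd}\n").as_bytes())
}

fn main() -> std::io::Result<()> {
    let dev = "/sys/class/drm/card0/device"; // adjust cardX for your system
    // 1. Switch to manual so the forced clocks actually apply.
    sysfs_write(&format!("{dev}/power_dpm_force_performance_level"), "manual")?;
    // 2. Adjust clocks/voltages (example OD table edit, then commit).
    sysfs_write(&format!("{dev}/pp_od_clk_voltage"), "s 1 1900")?;
    sysfs_write(&format!("{dev}/pp_od_clk_voltage"), "c")?;
    // 3. Switch to high so memory clocks don't stick at idle under compute loads.
    sysfs_write(&format!("{dev}/power_dpm_force_performance_level"), "high")?;
    Ok(())
}
```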
I just checked out 5.8 from Linus' tree, and it looks like edad8312cbbf9a33c86873fc4093664f150dd5c1 isn't in there, which is probably why this is working on that kernel for you.
Since it's still in amd-staging-drm-next, it could land in mainline elsewhere at any moment, so we still need to get this sorted out.
@agd5f I'm sure this is already on your radar, but just to clarify since the title might be misleading, this is still an issue on amd-staging-drm-next as of 20200813 (time of writing).
It's bisected: it was introduced in edad8312cbbf. I ran with that patch reverted for a while, but the semaphore pattern it uses has leaked into patches that have come afterwards, so it's not a 100% trivial reversion anymore.
FWIW, it seems that the issue is (for some reason... still digging) only triggered when using tools like tee or the > operator in bash to write to the file. If you write to it directly, in one write call (from C or Rust, say), it seems to be fine.
(I've also tried these from outside of that directory, using fully qualified paths, just to make sure there wasn't some weirdness with the shell keeping /sys/class/drm/cardX/device busy by using it as its working directory.)
That patch touches so many different systems that it's really hard for me to pick apart where the issue would lie. Given the symptom of the writing process pinning one CPU core at 100%, it looks like either:
There's an issue with the sysfs handling code that's causing it to loop infinitely (I know I was able to cause this once, months ago, by using buffered writes with Rust resulting in multiple write calls, but that doesn't explain why tee/> stopped working at that commit).
Or (seemingly more likely given the patch contents) there's a deadlock somewhere with either the reset semaphore or one of the atomic checks, resulting in an infinite spin.
If there's any info you could give that would help me fix this, I'd appreciate it. Navi OD has kinda become my baby in userspace. Also, let me know if you think it would be more prudent to send this whole report and all the information I've gathered to amd-gfx; I can do that, or just link this GitLab issue along with a short summary.
So, they decided to completely revert the commit that introduced the problem (for other reasons), and this patch is no longer needed as of 026acaeac2d205f22c0f682cc1c7b1a85b9ccd00 on amd-staging-drm-next. We can probably close this now.