Increased-voltage OD on Navi10 causes low voltage on Linux but not Windows
This is gonna be a long writeup, as my investigation has been in the works for months now.
Tooling
I wrote a simple tool to manage my OD settings, amdgpu-smu-od. While the tool itself isn't relevant to this discussion, I'll use its YAML format to communicate my overdrive settings for readability.
I used the Superposition benchmark to generate load on both Windows and Linux.
Mesa: 19.1.3 -> 20.0.0 at various points in time; all final results posted here are from 20.0.0-rc3. Mesa version didn't seem to affect behavior (which is to be expected).
Expected Behavior
Windows works as expected.
When I apply the following settings through the Radeon Wattman GUI:
power_limit: 279000000 # On Windows this is a 55% power limit
sclk:
  - frequency: 800
    voltage: 750
  - frequency: 1505
    voltage: 950
  - frequency: 2211
    voltage: 1217
mclk: 905
I see the following results
State | vddgfx (mV) |
---|---|
Idle | 731 |
Load (Superposition) | 2118 |
Observed behavior (Linux)
Linux behaves differently. The voltage jumps around much more on Linux in general, and the effect is much more pronounced at higher voltage settings.
Case 1
With the following settings
card: /sys/class/drm/card0/device
pp_table: /home/mcoffin/Documents/upp/pp_table-morepower-modded.bin
power_limit: 280000000
sclk:
  - frequency: 800
    voltage: 750
  - frequency: 1505
    voltage: 950
  - frequency: 2180
    voltage: 1190
I see the following behavior
State | vddgfx (mV) |
---|---|
Idle | 731 (stable) |
Load (Superposition) | 1120 - 1160 (unstable) |
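For anyone reproducing this without my tool, the Case 1 curve can also be expressed as plain writes to amdgpu's `pp_od_clk_voltage` sysfs file. The following is only a minimal sketch, assuming the `vc <point> <MHz> <mV>`, `m 1 <MHz>`, and `c` (commit) command syntax described in the kernel's amdgpu documentation for Navi10. `od_commands` is a hypothetical helper written for illustration and is not necessarily how amdgpu-smu-od applies the settings.

```python
def od_commands(sclk_curve, mclk=None):
    """Build the strings to write to pp_od_clk_voltage for a voltage curve.

    sclk_curve: list of (frequency_mhz, voltage_mv) tuples, lowest point first.
    mclk: optional maximum memory clock in MHz.
    Assumption: the curve has three points, indexed 0, 1 and 2.
    """
    if len(sclk_curve) != 3:
        raise ValueError("expected exactly three voltage curve points")
    # One "vc" command per curve point.
    commands = [
        f"vc {point} {freq} {volt}"
        for point, (freq, volt) in enumerate(sclk_curve)
    ]
    if mclk is not None:
        # "m 1" sets the maximum memory clock.
        commands.append(f"m 1 {mclk}")
    # "c" commits the pending changes.
    commands.append("c")
    return commands


# Case 1 curve from the settings above.
for command in od_commands([(800, 750), (1505, 950), (2180, 1190)]):
    print(command)
```

Each string would then be written to `/sys/class/drm/card0/device/pp_od_clk_voltage` as root, one write per command.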
Case 2
With the following settings
card: /sys/class/drm/card0/device
pp_table: /home/mcoffin/Documents/upp/pp_table-morepower-modded.bin
power_limit: 280000000
sclk:
  - frequency: 800
    voltage: 750
  - frequency: 1505
    voltage: 950
  - frequency: 2211
    voltage: 1217
mclk: 905
I see the following behavior
State | vddgfx (mV) |
---|---|
Idle | 731 (stable) |
Load (Superposition) | 1130 - 1180 (unstable) |
Case 3
To see if it was just the higher clocks causing the issue, I tried the same test from Case 2 with the high frequency lowered to 2180 MHz (the same as Case 1) while keeping the voltage at 1217 mV, and got similar results to Case 2.
CSV Data Dumps
While running the Linux tests, I dumped CSV data for time, sclk, mclk, power, Tjunction, and Tmem roughly once per second. Here are the dumps for Cases 1 and 2 above.
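A dump like this can be collected from the amdgpu hwmon interface. The sketch below is not the script I used; it assumes the standard amdgpu hwmon attribute names (`freq1_input`, `freq2_input`, `power1_average`, `temp2_input`, `temp3_input`, `in0_input`), and it adds a `vddgfx_mv` column on top of the fields listed above. The function and column names are made up for illustration.

```python
import csv
import glob
import os
import time

# hwmon attribute -> (CSV column, divisor converting the raw value to that unit)
FIELDS = {
    "freq1_input": ("sclk_mhz", 1_000_000),    # Hz -> MHz
    "freq2_input": ("mclk_mhz", 1_000_000),    # Hz -> MHz
    "power1_average": ("power_w", 1_000_000),  # microwatts -> W
    "temp2_input": ("tjunction_c", 1000),      # millidegrees C -> C
    "temp3_input": ("tmem_c", 1000),           # millidegrees C -> C
    "in0_input": ("vddgfx_mv", 1),             # already in mV
}


def find_hwmon(card="/sys/class/drm/card0/device"):
    """Return the first hwmon directory under the card's device directory."""
    matches = sorted(glob.glob(os.path.join(card, "hwmon", "hwmon*")))
    if not matches:
        raise FileNotFoundError(f"no hwmon directory under {card}")
    return matches[0]


def read_sample(hwmon_dir):
    """Read one sample of every field, converted to the column's unit."""
    sample = {"time": time.time()}
    for attr, (column, divisor) in FIELDS.items():
        with open(os.path.join(hwmon_dir, attr)) as f:
            sample[column] = int(f.read().strip()) / divisor
    return sample


def log_samples(hwmon_dir, out_path, count, interval=1.0):
    """Write `count` samples to a CSV file, sleeping `interval` seconds between them."""
    columns = ["time"] + [column for column, _ in FIELDS.values()]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for _ in range(count):
            writer.writerow(read_sample(hwmon_dir))
            time.sleep(interval)
```

For example, `log_samples(find_hwmon(), "case2.csv", count=120)` would record about two minutes of data while the benchmark runs.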
Logs
- dmesg logs from the 2211 @ 1217mV run - mcoffin-dmesg.log
PowerPlay Table used.
Here is the powerplay table that was used in all tests (Linux and Windows).
Conclusions
There are a few odd issues at play here.
1. Despite the voltage for the low frequency being set to 750mV at 800MHz, both Linux and Windows actually set the voltage to 731mV.
2. On Linux, higher voltage settings result in increased vddgfx instability and lower actual voltages.
While (1) might not be an issue since it happens on Windows as well, (2) is definitely a Linux issue, but since they both relate to voltage control, I've included both observations in this report.
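To put numbers on both observations, here is the gap between the requested and the observed vddgfx, computed from the tables above. This is a rough sketch only: it assumes the load readings should be compared against the top curve point, which only holds if the GPU is actually sitting at its highest sclk state. `shortfall` is just an illustrative helper.

```python
def shortfall(requested_mv, observed_min_mv, observed_max_mv):
    """Return (smallest, largest) gap in mV between requested and observed voltage."""
    return (requested_mv - observed_max_mv, requested_mv - observed_min_mv)


# (requested mV, observed min mV, observed max mV), taken from the tables above.
observations = {
    "Idle (Windows and Linux)": (750, 731, 731),
    "Linux Case 1 load": (1190, 1120, 1160),
    "Linux Case 2 load": (1217, 1130, 1180),
}

for name, (requested, low, high) in observations.items():
    smallest, largest = shortfall(requested, low, high)
    print(f"{name}: {smallest} to {largest} mV below the requested {requested} mV")
```

This works out to 19 mV at idle, 30 to 70 mV for Case 1, and 37 to 87 mV for Case 2.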
Thanks in advance for the help. I've tried to pick through the code to find the issue, but it almost feels like I'm missing something, such as an SMU message type for overriding some voltage parameters, that Windows is using but amdgpu is not.
Let me know what I can do to assist in debugging and fixing!