Renoir/Cezanne GPU power reporting issue
Brief summary of the problem:
The GPU power consumption data reported in /sys/kernel/debug/dri/0/amdgpu_pm_info and lm_sensors (/sys/class/drm/card0/device/hwmon/hwmon6/power1_average) are not consistent and neither seems to be correct.
For example:
$ sensors
amdgpu-pci-0600
Adapter: PCI adapter
vddgfx: 943.00 mV
vddnb: 962.00 mV
edge: +60.0°C
PPT: 2.00 mW
...
$ cat /sys/kernel/debug/dri/0/amdgpu_pm_info
GFX Clocks and Power:
1600 MHz (MCLK)
400 MHz (SCLK)
0 MHz (PSTATE_SCLK)
0 MHz (PSTATE_MCLK)
737 mV (VDDGFX)
962 mV (VDDNB)
0.2 W (average GPU)
...
The GPU power seems to be way too low. Even under heavy load, the reported power won't exceed 2 mW (sensors) or 0.2 W, i.e. 200 mW (amdgpu_pm_info).
Looking at the driver code at https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 and https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L3469-L3471:
r = amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER,
(void *)&query, &size);
pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
if (r)
return r;
/* convert to microwatts */
uw = (query >> 8) * 1000000 + (query & 0xff) * 1000;
return sysfs_emit(buf, "%u\n", uw);
if (!amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER, (void *)&query, &size))
seq_printf(m, "\t%u.%u W (average GPU)\n", query >> 8, query & 0xff);
size = sizeof(value);
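Both should come from the same query value obtained from amdgpu_dpm_read_sensor(). It looks like the lowest 8 bits (query & 0xff) would represent the fractional part in units of milliwatt, and the upper bits (query >> 8) would separately represent an integer watt. (Though, what if the fractional part is smaller than 0.1 W or larger than 0.255 W? Eight bits cannot hold more than 255, and the debugfs format prints the low byte directly after the decimal point, so a low byte of 2 shows up as 0.2 W there but as 2 mW in hwmon.)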
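To make the mismatch concrete, here is a minimal userspace sketch of the two conversions quoted above. The function names are hypothetical and exist only for this illustration; the arithmetic and the format string are copied from the quoted driver lines, and the input values are examples rather than measurements.

```c
#include <stdint.h>
#include <stdio.h>

/* hwmon path: same arithmetic as the quoted power1_average code,
 * i.e. upper bits are whole watts, low byte is taken as milliwatts. */
static uint32_t query_to_microwatts(uint32_t query)
{
	return (query >> 8) * 1000000u + (query & 0xffu) * 1000u;
}

/* debugfs path: same "%u.%u W" format as the quoted amdgpu_pm_info code,
 * i.e. the low byte is printed verbatim after the decimal point. */
static int query_to_debugfs_text(uint32_t query, char *buf, size_t len)
{
	return snprintf(buf, len, "%u.%u W",
			(unsigned int)(query >> 8),
			(unsigned int)(query & 0xffu));
}
```

For a query value of 2, the first function returns 2000 microwatts (2 mW) while the second formats "0.2 W", which lines up with the two readings shown earlier. For a value such as 0x0105 the two paths give 1.005 W and "1.5 W" respectively, so they only agree when the low byte is 0.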
Anyway, this comes from https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c#L1198-L1200:
case AMDGPU_PP_SENSOR_GPU_POWER:
ret = renoir_get_smu_metrics_data(smu,
METRICS_AVERAGE_SOCKETPOWER,
(uint32_t *)data);
*size = 4;
break;
case METRICS_AVERAGE_SOCKETPOWER:
*value = (metrics->CurrentSocketPower << 8) / 1000;
break;
But CurrentSocketPower is already in units of watt, according to https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu12_driver_if.h#L187:
uint16_t CurrentSocketPower; //[W]
So it should be enough to just shift the number left by 8 bits, leaving the lower bits at 0, and I can't really understand why the / 1000 is needed. Even if the data were in units of milliwatt, the result still wouldn't be correct; the error would just be smaller.
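The two encodings discussed above can be compared in a small standalone sketch. It assumes CurrentSocketPower really is in whole watts, as the header comment states; the function names are hypothetical, and the 8 W input mentioned afterwards is an illustrative value, not a measurement.

```c
#include <stdint.h>

/* Current behaviour, as in the quoted renoir_ppt.c line:
 * shift into the "watts" field, then divide by 1000. */
static uint32_t encode_with_division(uint16_t socket_power_w)
{
	return ((uint32_t)socket_power_w << 8) / 1000;
}

/* Behaviour after dropping the division: whole watts in the
 * upper bits, fractional byte left at 0. */
static uint32_t encode_watts_only(uint16_t socket_power_w)
{
	return (uint32_t)socket_power_w << 8;
}
```

With an input of 8 W, encode_with_division() yields 2 (2048 / 1000 with integer division), so the watts field (value >> 8) is 0 and only the low byte is set. encode_watts_only() yields 2048, whose upper bits decode back to 8 with a low byte of 0.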
After removing the / 1000 and rebuilding the driver (effectively reverting https://github.com/torvalds/linux/commit/137aac26a2ed6d8b43a83eb842c5091aeb203b73), the values shown in /sys/kernel/debug/dri/0/amdgpu_pm_info and /sys/class/drm/card0/device/hwmon/hwmon6/power1_average become the same and much more reasonable: 5 W or lower at idle, and up to tens of watts under load. However, I suspect this figure includes power from the CPU part, because it goes up even with CPU-only workloads.
Hardware description:
- CPU: AMD Ryzen 7 5800H with Radeon Graphics
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
- System Memory: 16 GiB
- Display(s): 2
- Type of Display Connection: eDP/DP
System information:
- Distro name and Version: Fedora 37
- Kernel version: 6.0.15-300.fc37.x86_64
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
Run cat /sys/kernel/debug/dri/0/amdgpu_pm_info as root, or run sensors, and look at the reported GPU power.
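For anyone who wants to log the hwmon value while reproducing, the following is a rough sketch of reading a power1_average file from C. It assumes the file holds a single decimal number in microwatts, as the "convert to microwatts" comment in the driver suggests. Both function names are hypothetical, and the path is passed in by the caller because the hwmonN index (hwmon6 on this machine) may differ on other systems.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert the text of a power1_average file (microwatts) to watts.
 * Returns 0 on success, -1 if no number could be parsed. */
static int parse_power_average(const char *text, double *watts)
{
	char *end;
	unsigned long long uw;

	errno = 0;
	uw = strtoull(text, &end, 10);
	if (end == text || errno == ERANGE)
		return -1;
	*watts = (double)uw / 1e6;
	return 0;
}

/* Read the first line of the file at `path` and convert it.
 * Returns 0 on success, -1 on I/O or parse failure. */
static int read_power_watts(const char *path, double *watts)
{
	char line[64];
	FILE *f = fopen(path, "r");
	int got_line;

	if (!f)
		return -1;
	got_line = fgets(line, sizeof line, f) != NULL;
	fclose(f);
	return got_line ? parse_power_average(line, watts) : -1;
}
```

Calling read_power_watts() with the power1_average path from the summary above, once at idle and once under load, should be enough to see whether the reported value stays in the milliwatt range.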