Renoir/Cezanne GPU power reporting issue
Brief summary of the problem:
The GPU power consumption figures reported by sensors (/sys/class/drm/card0/device/hwmon/hwmon6/power1_average) and by /sys/kernel/debug/dri/0/amdgpu_pm_info are not consistent with each other, and neither seems to be correct.
    $ sensors
    amdgpu-pci-0600
    Adapter: PCI adapter
    vddgfx: 943.00 mV
    vddnb: 962.00 mV
    edge: +60.0°C
    PPT: 2.00 mW
    ...

    $ cat /sys/kernel/debug/dri/0/amdgpu_pm_info
    GFX Clocks and Power:
        1600 MHz (MCLK)
        400 MHz (SCLK)
        0 MHz (PSTATE_SCLK)
        0 MHz (PSTATE_MCLK)
        737 mV (VDDGFX)
        962 mV (VDDNB)
        0.2 W (average GPU)
    ...

The GPU power seems far too low. Even under heavy load, the reported value does not exceed 2 mW in sensors or 0.2 W (200 mW) in amdgpu_pm_info.
Looking at the driver code at https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 and https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L3469-L3471:
    r = amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER,
                               (void *)&query, &size);

    pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
    pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);

    if (r)
        return r;

    /* convert to microwatts */
    uw = (query >> 8) * 1000000 + (query & 0xff) * 1000;

    return sysfs_emit(buf, "%u\n", uw);


    if (!amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER, (void *)&query, &size))
        seq_printf(m, "\t%u.%u W (average GPU)\n", query >> 8, query & 0xff);
    size = sizeof(value);

Both outputs come from the same query value obtained from amdgpu_dpm_read_sensor(). It looks like the lowest 8 bits (query & 0xff) represent the fractional part in milliwatts, and the upper bits (query >> 8) separately represent whole watts. (Though what happens if the fractional part is smaller than 0.1 W or larger than 0.255 W? The "%u.%u" format does not zero-pad the fraction, and 8 bits cannot hold a value above 255.)
Anyway, the query value is produced by the code at https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 and https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c#L1198-L1200:
    case AMDGPU_PP_SENSOR_GPU_POWER:
        ret = renoir_get_smu_metrics_data(smu,
                                          METRICS_AVERAGE_SOCKETPOWER,
                                          (uint32_t *)data);
        *size = 4;
        break;


    case METRICS_AVERAGE_SOCKETPOWER:
        *value = (metrics->CurrentSocketPower << 8) / 1000;
        break;

CurrentSocketPower is already in units of watt, according to https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu12_driver_if.h#L187:
    uint16_t CurrentSocketPower; //[W]

So it should be enough to shift the value left by 8 bits, leaving the lower bits at 0, and I can't see why the / 1000 is needed. Even if the data were in milliwatts, the result still wouldn't be correct; the error would just be smaller.
After removing the / 1000 and rebuilding the driver (effectively reverting https://github.com/torvalds/linux/commit/137aac26a2ed6d8b43a83eb842c5091aeb203b73), the values shown by amdgpu_pm_info and sensors (/sys/class/drm/card0/device/hwmon/hwmon6/power1_average) become the same and much more reasonable: 5 W or lower at idle, and up to tens of watts under load. I suspect, though, that this figure includes power drawn by the CPU part, because it goes up even with CPU-only workloads.
- CPU: AMD Ryzen 7 5800H with Radeon Graphics
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
- System Memory: 16 GiB
- Display(s): 2
- Type of Display Connection: eDP/DP
- Distro name and Version: Fedora 37
- Kernel version: 6.0.15-300.fc37.x86_64
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
Run cat /sys/kernel/debug/dri/0/amdgpu_pm_info as root, or run sensors, and look at the reported GPU power.