drm/amd issue #2321
Issue created Dec 28, 2022 by struq@struq

Renoir/Cezanne GPU power reporting issue

Brief summary of the problem:

The GPU power consumption data reported in /sys/kernel/debug/dri/0/amdgpu_pm_info and lm_sensors (/sys/class/drm/card0/device/hwmon/hwmon6/power1_average) are not consistent and neither seems to be correct. For example:

$ sensors
amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:      943.00 mV 
vddnb:       962.00 mV 
edge:         +60.0°C  
PPT:           2.00 mW 
...
$ cat /sys/kernel/debug/dri/0/amdgpu_pm_info
GFX Clocks and Power:
	1600 MHz (MCLK)
	400 MHz (SCLK)
	0 MHz (PSTATE_SCLK)
	0 MHz (PSTATE_MCLK)
	737 mV (VDDGFX)
	962 mV (VDDNB)
	0.2 W (average GPU)
...

The reported GPU power is far too low. Even under heavy load, the value never exceeds about 2 mW in sensors or 0.2 W in amdgpu_pm_info.

Looking at the driver code at https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 and https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L3469-L3471:

	r = amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER,
				   (void *)&query, &size);

	pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
	pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);

	if (r)
		return r;

	/* convert to microwatts */
	uw = (query >> 8) * 1000000 + (query & 0xff) * 1000;

	return sysfs_emit(buf, "%u\n", uw);

	if (!amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_GPU_POWER, (void *)&query, &size))
		seq_printf(m, "\t%u.%u W (average GPU)\n", query >> 8, query & 0xff);
	size = sizeof(value);

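Both outputs should come from the same query value returned by amdgpu_dpm_read_sensor(). The hwmon path treats the lowest 8 bits (query & 0xff) as a fractional part in milliwatts and the upper bits (query >> 8) as whole watts, while the debugfs path prints the same low 8 bits directly after the decimal point. That would explain why the same raw value shows up as 2 mW in one place and 0.2 W in the other. (Either way, 8 bits cannot hold a milliwatt fraction above 255 mW, and the %u.%u format cannot show a fraction below 0.1 W.)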
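To make the mismatch concrete, here is a minimal sketch of the two decodings quoted above, written as plain userspace C. The function names hwmon_microwatts and debugfs_format are hypothetical and only mirror the arithmetic of the two driver snippets; the raw value 2 is inferred from the sensors output, not read from hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* hwmon path: low byte is taken as milliwatts, result is in microwatts. */
static uint32_t hwmon_microwatts(uint32_t query)
{
	return (query >> 8) * 1000000u + (query & 0xff) * 1000u;
}

/* debugfs path: low byte is printed directly after the decimal point. */
static int debugfs_format(char *buf, size_t len, uint32_t query)
{
	return snprintf(buf, len, "%u.%u W",
			(unsigned int)(query >> 8),
			(unsigned int)(query & 0xff));
}
```

With query = 2, hwmon_microwatts() gives 2000 (shown by sensors as 2.00 mW) and debugfs_format() gives "0.2 W", which matches the two readings in this report.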

Anyway, this comes from https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2799-L2811 https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c#L1198-L1200:

	case AMDGPU_PP_SENSOR_GPU_POWER:
		ret = renoir_get_smu_metrics_data(smu,
						  METRICS_AVERAGE_SOCKETPOWER,
						  (uint32_t *)data);
		*size = 4;
		break;

	case METRICS_AVERAGE_SOCKETPOWER:
		*value = (metrics->CurrentSocketPower << 8) / 1000;
		break;

But CurrentSocketPower is already in units of watt, according to https://github.com/torvalds/linux/blob/1b929c02afd37871d5afb9d498426f83432e71c2/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu12_driver_if.h#L187:

  uint16_t CurrentSocketPower;          //[W]

So it should be enough to shift the value left by 8 bits and leave the lower bits at 0; I can't see why the / 1000 is needed. Even if the data were in milliwatts, the result still wouldn't be correct, only less far off, since the low 8 bits would then hold a fraction in steps of 1/256 W rather than the milliwatts that the hwmon code assumes.
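As a rough illustration of the effect of the / 1000, the sketch below encodes a socket power both ways and decodes it with the hwmon formula quoted earlier. The helper names are hypothetical, and the 15 W input is an assumed example value, not a measurement.

```c
#include <stdint.h>

/* Encoding as in the quoted renoir_ppt.c snippet (input assumed to be in W). */
static uint32_t encode_with_div(uint16_t socket_power_w)
{
	return ((uint32_t)socket_power_w << 8) / 1000;
}

/* Encoding with the "/ 1000" removed: whole watts in the upper bits. */
static uint32_t encode_without_div(uint16_t socket_power_w)
{
	return (uint32_t)socket_power_w << 8;
}

/* hwmon decoding from amdgpu_pm.c, result in microwatts. */
static uint32_t decode_microwatts(uint32_t query)
{
	return (query >> 8) * 1000000u + (query & 0xff) * 1000u;
}
```

For an assumed 15 W, encode_with_div() yields 3, which decodes to 3000 uW (3 mW), while encode_without_div() yields 3840, which decodes to 15000000 uW (15 W).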

After removing the / 1000 and rebuilding the driver (effectively reverting https://github.com/torvalds/linux/commit/137aac26a2ed6d8b43a83eb842c5091aeb203b73), the values shown in /sys/kernel/debug/dri/0/amdgpu_pm_info and /sys/class/drm/card0/device/hwmon/hwmon6/power1_average agree and look much more reasonable: 5 W or lower at idle, and up to tens of watts under load. I suspect this figure includes power drawn by the CPU part, though, because it also goes up with CPU-only workloads.

Hardware description:

  • CPU: AMD Ryzen 7 5800H with Radeon Graphics
  • GPU: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c5)
  • System Memory: 16 GiB
  • Display(s): 2
  • Type of Display Connection: eDP/DP

System information:

  • Distro name and Version: Fedora 37
  • Kernel version: 6.0.15-300.fc37.x86_64
  • Custom kernel: N/A
  • AMD official driver version: N/A

How to reproduce the issue:

Run cat /sys/kernel/debug/dri/0/amdgpu_pm_info as root, or run sensors, and look at the reported GPU power.

Edited Dec 28, 2022 by struq