Due to an influx of spam, we have had to impose restrictions on new accounts. Please see this wiki page for instructions on how to get full permissions. Sorry for the inconvenience.
Admin message
The migration is almost done, at least the rest should happen in the background. There are still a few technical difference between the old cluster and the new ones, and they are summarized in this issue. Please pay attention to the TL:DR at the end of the comment.
The full dmesg showing a suspend cycle is attached.dmesg.txt
I have tried a quick hack (hack.patch ) and with it the power consumption during suspend (measured with a revolt Energy Cost Meter) goes from 9.2W to 7.7W. With the hack the other errors are gone and I can start sway after the resumed. The full dmesg with the hack is: dmesg-hack.txt.
One issue that is left, not sure if related, is that before suspend the system was using 11.1W and after returning from suspend it is using 12.2W
Attached files:
Designs
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
This looks specifically like a BIOS problem to me. It's not advertising in the FADT low power idle support, nor is it exporting an ACPI device for amd_pmc to use.
@espindola you should use deep sleep (S3) mode instead.
I can try, but given that S0 does work with the hack I included, and given that:
According to the ACPI specification [1], the ACPI_FADT_LOW_POWER_S0flag merely means that it is better to use low-power S0 idle on thegiven platform than S3 (provided that the latter is supported) and itdoesn't preclude using either of them (which of them will be useddepends on the choices made by user space).
It seems that the BIOS implementation is valid and it should be possible to use S0.
The thing is the driver makes other assumptions about what is going to happen to the system as a result of whether or not it's being suspended using low power idle or not. We skip suspending certain IP blocks because the platform is going to be expected to do actions with them.
To make the change you did in your hack I think this would need to be tested on a few other systems specifically with and without amd_pmc in place so that the platform actions run or don't run.
If I understand it correctly, by the spec s2idle should always work, but on some bios using it with amdgpu can cause a crash. Using ACPI_FADT_LOW_POWER_S0 is a proxy for finding if a bios doesn't have the above bug.
Not sure if there is a more proper solution than explicitly listing known buggy systems :-(
As for testing, all I can think of is adding a boot option as asking for testers.
And note that it is not just about power, with the current setup the gpu doesn't correctly come back from suspend on my system.
The driver needs to behave differently depending on what the behavior of the SBIOS is. ACPI_FADT_LOW_POWER_S0 and the pmc ACPI node are indicators that the SBIOS will behave a certain way and the driver should target that behavior. Failure to align with the expectations of the SBIOS will lead to problems.
If the SBIOS is broken, I'm wondering if we're stuck in a weird place though. amdgpu_acpi_should_gpu_reset returns that the GPU shouldn't reset, but amdgpu_acpi_is_s0ix_active returns that S3 is active.
Maybe we should align amdgpu_acpi_should_gpu_reset along the results of amdgpu_acpi_is_s0ix_active.
I'm not sure if there is much value there. The only reason we would need a GPU reset is because the platform never powers down the GPU. In that case, suspend is largely useless because it uses as much or more power than if the platform were not suspended. For both S3 and S0i3, we expect the platform to ultimately power down the GPU. If that never happens, we should really have a third path in the driver where more or less do nothing as that will likely use less power than the preparations for power down that we currently do.
If that never happens, we should really have a third path in the driver where more or less do nothing as that will likely use less power than the preparations for power down that we currently do
That's what amdgpu_pmops_prepare was supposed to be doing. But I think it only works for the dGPU case w/ BOCO because we only set DPM_FLAG_SMART_PREPARE only in that case. Perhaps something like this is what we need so that code works:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.cindex cd4caaa295282..31a666e1aa5a9 100644--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c@@ -2170,15 +2170,13 @@ static int amdgpu_pci_probe(struct pci_dev *pdev, if (ret) DRM_ERROR("Creating debugfs files failed (%d).\n", ret);+ dev_pm_set_driver_flags(ddev->dev, DPM_FLAG_SMART_PREPARE |+ DPM_FLAG_SMART_SUSPEND |+ DPM_FLAG_MAY_SKIP_RESUME); if (adev->pm.rpm_mode != AMDGPU_RUNPM_NONE) { /* only need to skip on ATPX */ if (amdgpu_device_supports_px(ddev)) dev_pm_set_driver_flags(ddev->dev, DPM_FLAG_NO_DIRECT_COMPLETE);- /* we want direct complete for BOCO */- if (amdgpu_device_supports_boco(ddev))- dev_pm_set_driver_flags(ddev->dev, DPM_FLAG_SMART_PREPARE |- DPM_FLAG_SMART_SUSPEND |- DPM_FLAG_MAY_SKIP_RESUME); pm_runtime_use_autosuspend(ddev->dev); pm_runtime_set_autosuspend_delay(ddev->dev, 5000);
That's because your system doesn't appear to support either S3 or S0i3 so it's not clear in the driver should do for your platform. S0i3 requires a complex set of handshaking with the platform to powerly enter low power states. For S3 the platform powers down all of the device power rails once the drivers has saved their state and the OS tells the platform to enter S3. In your case the platform doesn't appear to do anything so it's not clear would the driver should do.
With this patch I am able to start sway and I don't see any errors in dmesg (dmesg-new-patch.txt, but the power consumption during suspend is now 10.5W, it is 9.2W with vanilla kernel and 7.7W with my patch.
@agd5f what's your thought? I guess with some of the recent commits to help s0ix cases (that are in 6.1.7 now) running s2idle without the platform support instead of skipping entirely does decrease power consumption. Maybe the hack patch isn't as bad as I suspected anymore.
I'm not worried about state. It's not clear to me why doing nothing uses more power than the s0i3 path, especially for other IPs in the system (MMHUB, VCN, DCN, etc.). At runtime, we have all sorts of power and clock gating features enabled that get disabled for s3/s0i3 because we tear down those other IPs in anticipation of power rails being turned off so we can re-initialize them on resume. But I guess if it works, it's fine.