[ 0.000000] DMI: Dell Inc. Precision 3660/, BIOS 2.1.0 01/19/2023
In order to get this further along for debugging, do you think you could put the card into an Intel Alder Lake CRB and reproduce it there as well? Perhaps then you can enable a serial kernel log with initcall_debug and pm_debug_messages on the kernel command line. We can see if that gets any more information about what happens with the failure mode.
Do you mean disabling L0s via PCIe? quirk_disable_aspm_l0s() alone doesn't work because ASPM control isn't granted by the BIOS [0], so that was commented out as a quick hack to test L0s.
After some trial and error I found that changing data |= 0x9 << PCIE_LC_CNTL__LC_L1_INACTIVITY__SHIFT; [0] to data |= 0xe << PCIE_LC_CNTL__LC_L1_INACTIVITY__SHIFT; can work around the issue.
The finding is quite baffling: according to the code, L0s does not seem to be enabled on the AMD ASIC, but somehow disabling L0s on PCIe or increasing L1 latency on the AMD ASIC can work around the issue?
Is "PCIe Link width switching" the same as "Link bandwidth changing"? The bandwidth notification interrupt is not enabled by the Linux kernel.
And I think I found a real fix. Since the root port of the GFX's PCIe switch has _PR3, the card was put into D3cold, hence a GPU reset is required. I will send out a patch soon.
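As a rough illustration of the change above, here is a C sketch of a clear-then-set update of an L1 inactivity field. The shift and mask values are placeholders chosen for the example, not the real PCIE_LC_CNTL layout, and set_l1_inactivity is a hypothetical helper; only the 0x9 and 0xe values come from this thread.

```c
#include <stdint.h>

/* Placeholder field layout: NOT the real PCIE_LC_CNTL definition. */
#define EX_L1_INACTIVITY_SHIFT 12u
#define EX_L1_INACTIVITY_MASK  (0xFu << EX_L1_INACTIVITY_SHIFT)

/* Hypothetical helper: clear the field, then OR in the new value,
 * leaving every other bit of the register untouched. */
static uint32_t set_l1_inactivity(uint32_t data, uint32_t value)
{
    data &= ~EX_L1_INACTIVITY_MASK;
    data |= (value << EX_L1_INACTIVITY_SHIFT) & EX_L1_INACTIVITY_MASK;
    return data;
}
```

Clearing the field first matters here: OR-ing 0xe into a field that already holds 0x9 would give 0xf rather than 0xe.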
PMFW already carries a workaround to disable link dpm in ASPM L1 enabled cases. Regardless, to confirm whether the speed change is causing the issue, please try one of the following:
Load driver with sudo modprobe amdgpu ppfeaturemask=0xffffb
OR
Try with the attached patch fixed_speed.patch
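For reference, a small C sketch of which low bit the mask 0xffffb leaves cleared. It assumes bit 2 (0x4) is the PCIe DPM feature bit, which is how PP_PCIE_DPM_MASK is defined in the amdgpu headers as far as I know; please verify against your kernel tree. EX_PCIE_DPM_BIT and feature_enabled are names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match PP_PCIE_DPM_MASK; check your kernel's headers. */
#define EX_PCIE_DPM_BIT 0x4u

/* Hypothetical helper: is the given feature bit set in the mask? */
static bool feature_enabled(uint32_t ppfeaturemask, uint32_t feature_bit)
{
    return (ppfeaturemask & feature_bit) != 0;
}
```

With these assumptions, feature_enabled(0xffffb, EX_PCIE_DPM_BIT) is false, while the same call with 0xfffff is true.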
The effect of reset is to get rid of the PMFW which is actually supposed to take meaningful action on D3hot or ASPM entry. This can be a workaround, but I think the proper solution is still missing, and most likely it has nothing to do with PR3 support in the path.
Can you please confirm your ASDN hash? As you don't have anything on your kernel command line to override ASPM, I would expect this is a different regression from your original bug. Your original bug has the W/A that you made: nouveau@2b072442, so ASPM shouldn't be used. Without ASPM, the problematic features I mentioned (dynamic lane width switching and dynamic speed change, which AMD uses but Intel doesn't appear to support) will not run.
This other separate regression: does it happen on 6.3.0 too?
@superm1 @lijo I think I found the reason why the issue happens. Since the system is a desktop, the GFX has external power connected, so the GFX remains powered when the power resources are turned off.
For example, the config space is still accessible after the GFX is in D3cold state.
So is there any way to know that the card has external power connected, so we can use that information in the amdgpu_acpi_should_gpu_reset() helper?
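One possible heuristic, sketched in C purely for discussion: a config space read from a device that has really lost power returns all ones, so a valid vendor ID read while the device is nominally in D3cold suggests power was never removed. The helper names are hypothetical, this is not a proposed change to amdgpu_acpi_should_gpu_reset(), and whether such a read is safe at that point in the suspend flow is an open assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* An all-ones vendor ID means the config read got no response. */
static bool config_space_responds(uint16_t vendor_id_read)
{
    return vendor_id_read != 0xFFFFu;
}

/* Hypothetical check: the device is nominally in D3cold, yet it still
 * answers config reads, so its power was probably not cut. */
static bool still_powered_in_d3cold(bool in_d3cold, uint16_t vendor_id_read)
{
    return in_d3cold && config_space_responds(vendor_id_read);
}
```

For example, reading AMD's vendor ID 0x1002 while in_d3cold is true would flag the card as still powered.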
It looks like d3cold only works on systems with the ATPX ACPI method. If that is the case, we should use BACO rather than BOCO for runtime pm. Something like the attached patch.
Additionally, something like the attached two patches is needed to use BACO for s2idle if the platform doesn't actually power down the devices during s0ix.
In principle D3cold should work for the _PR3 case. This issue is a special case because of the following combination:
It's a desktop that utilizes s2idle
Since it's a desktop, the GFX card gets external power directly from the power supply.
The root port has _PR3
So the assumption that turning off the power resources listed by _PR3 powers down the card doesn't hold in this case.
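The combination above can be written down as a simple predicate. This is only a sketch: struct ex_platform and its fields are hypothetical, and the thread has not yet established how the driver could actually discover the external power condition.

```c
#include <stdbool.h>

/* Hypothetical description of the platform; none of these fields
 * exist in the kernel under these names. */
struct ex_platform {
    bool desktop_s2idle;         /* desktop system suspending via s2idle */
    bool gpu_has_external_power; /* card fed directly by the power supply */
    bool root_port_has_pr3;      /* root port exposes _PR3 */
};

/* True when all three conditions from this issue are present, i.e.
 * turning off the _PR3 power resources may leave the card powered. */
static bool d3cold_may_leave_gpu_powered(const struct ex_platform *p)
{
    return p->desktop_s2idle &&
           p->gpu_has_external_power &&
           p->root_port_has_pr3;
}
```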
I am unable to locate any AMD GFX card that doesn't require external power, but I would assume _PR3 D3cold works on such cards.
Can you please test those 3 patches together? They're intended to help another similar issue and it would be good to know how this issue reacts to it (even if not solved).
Attached is a modified version of 0002-drm-amdgpu-enable-BACO-BOCO-PX-for-s2idle.patch. Please apply the first two from Alex and the attached one. A dev_info message is added to show what sort of notification is given to PMFW.
bxcx_notify_on_suspend.patch
Thanks, this confirms that both methods of BACO entry notification to PMFW don't work. The D3cold ACPI calls could have a different impact, as they generate PME.
Could you also send a similar log without ASPM? I will check whether the transition flow differs without ASPM. I guess the patches shouldn't make any difference without ASPM.
[Edit] It looks like non-D0 entry triggers L1 for the link on Linux, and that is causing the trouble. On Windows, since US/DS are at D0, L1 most likely doesn't happen for the real link. Regardless, a dmesg log will help.
I was wondering if maybe the Intel hosts are setting the HASD bit in the link control/status register. If they are, we may be able to key off of that to make some changes in amdgpu based on this info. Can you check?
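If it helps with checking, here is a C sketch that decodes the relevant bits from raw register values. As far as I know, Hardware Autonomous Speed Disable is bit 5 of Link Control 2 and Hardware Autonomous Width Disable is bit 9 of Link Control, matching PCI_EXP_LNKCTL2_HASD and PCI_EXP_LNKCTL_HAWD in the kernel's pci_regs.h. The EX_ names and helper functions are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions believed to match include/uapi/linux/pci_regs.h. */
#define EX_LNKCTL_HAWD  0x0200u /* Link Control: HW Autonomous Width Disable */
#define EX_LNKCTL2_HASD 0x0020u /* Link Control 2: HW Autonomous Speed Disable */

/* Takes the raw 16-bit Link Control 2 register value. */
static bool hw_autonomous_speed_disabled(uint16_t lnkctl2)
{
    return (lnkctl2 & EX_LNKCTL2_HASD) != 0;
}

/* Takes the raw 16-bit Link Control register value. */
static bool hw_autonomous_width_disabled(uint16_t lnkctl)
{
    return (lnkctl & EX_LNKCTL_HAWD) != 0;
}
```

The raw values would come from the root port's PCIe capability; lspci -vv should show the same bits decoded in its LnkCtl and LnkCtl2 lines.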