amdgpu ASPM causes hang during suspend & resume with WX3200 on ADL platform
@agd5f yes, the patch helps to resolve the suspend issue with wx3200.
But why is only the WX3200 affected by ASPM? I tried a W6600 and didn't see the issue.
Are wx3200 and w6600 different series?
Thanks,
Both chips support ASPM, but they are different generations. I guess some hardware difference between those chips does not agree with your platform. Some platforms can have issues with ASPM, as there is an interaction with the PCI hardware on the platform.
@superm1 I checked it again.
With the same machine and WX3200, I can still reproduce the issue with the patch applied.
In the Ubuntu kernel, ASPM is enabled, so amdgpu_aspm stays at its default (-1).
That is the same configuration as before applying the patch,
and the issue can still be triggered.
With the same kernel and amdgpu.aspm=0 forced, the issue goes away.
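To confirm which ASPM setting amdgpu actually booted with in each test, the module parameter can be read back from sysfs. The following is a minimal sketch, assuming the parameter is exposed at `/sys/module/amdgpu/parameters/aspm` on the running kernel; the helper name `read_amdgpu_aspm` is hypothetical and only for illustration.

```python
from pathlib import Path

# Meaning of the amdgpu "aspm" module parameter values discussed in this thread.
ASPM_MODES = {-1: "auto (driver default)", 0: "disabled", 1: "enabled"}


def read_amdgpu_aspm(param_path="/sys/module/amdgpu/parameters/aspm"):
    """Return (raw_value, description) for the amdgpu aspm parameter.

    param_path is an assumption about where the parameter is exposed;
    pass a different path if your kernel differs.
    """
    raw = int(Path(param_path).read_text().strip())
    return raw, ASPM_MODES.get(raw, "unknown")


if __name__ == "__main__":
    try:
        print(read_amdgpu_aspm())
    except OSError as err:
        # The file is missing if amdgpu is not loaded.
        print(f"could not read parameter (is amdgpu loaded?): {err}")
```

Recording this value alongside the kernel commit hash for every suspend test makes it easier to tell a configuration difference from a real regression.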
Did you move the card between PCIe ports or change BIOS settings perhaps between tests? I think we need to understand why it worked before to ensure the right place in the kernel sets the right policy.
Also, did you start from the OEM tree when you tested the patch before? Or was it on top of a mainline kernel? If it was on the mainline kernel, maybe some other patches in the PCIe subsystem that indicate ASPM availability need to be backported to the OEM tree.
I didn't move the card between PCIe ports; I only disabled Secure Boot (and it has been restored to what it was before).
I verified on both the OEM kernel and the mainline kernel.
Let me check the mainline kernel again.
Update: with the mainline kernel (20220303), I can still reproduce the issue.
Did you maybe have any debugging parameters on your kernel command line? Can you reference your journal from the boots where it worked? Maybe it's a regression between the kernel you put the patch on top of and the latest mainline (along with something that was backported to your OEM tree since then too)?
@superm1 currently the target is a clean environment because I have reset it to defaults.
There are no additional options in GRUB.
I reset the mainline branch with "--hard" and there are no other modifications on it.
I just don't get the difference you're seeing. IOW is it possible something went upstream between that commit and now and came back to -stable (and your OEM tree) that caused it to not work?
What if you reset to the commit you were on previously, when you applied the patch and it was working?
Can you check what commit hash you built that kernel on when you tested?
The reason I'm fixating on this is that the ASPM decision should come from outside the GPU driver. So this might be better placed in PCI quirks or in the BIOS. But if you have proof that it worked without either of those at some time, we should find out "why" that stopped working so we can revert that commit.
"acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration"
This means the OS doesn't touch any ASPM functionality. Was there any change in BIOS configuration (reset to defaults, BIOS update, etc.)?
/*
 * Disabling ASPM is intended to prevent the kernel from modifying
 * existing hardware state, not to clear existing state. To that end:
 *   (a) set policy to POLICY_DEFAULT in order to avoid changing state
 *   (b) prevent userspace from changing policy
 */
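Both conditions can be checked from userspace. The sketch below assumes the kernel log text is already available as a string (for example from `dmesg` or the journal) and that the PCIe ASPM policy is exposed at `/sys/module/pcie_aspm/parameters/policy` with the active entry shown in square brackets. The function names are hypothetical helpers, not part of any kernel tooling.

```python
import re
from pathlib import Path

# Substring of the kernel message quoted above.
FADT_MSG = "FADT indicates ASPM is unsupported"


def fadt_disables_aspm(kernel_log: str) -> bool:
    """True if the log contains the 'FADT indicates ASPM is unsupported' message."""
    return FADT_MSG in kernel_log


def active_aspm_policy(policy_text: str):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Assumed sysfs location; absent if PCIe ASPM support is not built in.
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("active policy:", active_aspm_policy(policy_file.read_text()))
    else:
        print("pcie_aspm policy file not found")
```

If the FADT message is present and the active policy is `default`, the link state seen by the GPU is whatever the firmware programmed, which is why BIOS changes between test runs matter here.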
Does the BIOS have an option to turn ASPM on/off? Maybe it's misconfigured in the first place. We could do something like intel_core_rkl_chk in the GPU driver, but what's the right criterion to match? Do all ADL systems fail? Or just this model from this manufacturer?
[ 0.000000] Linux version 5.15.0-051500-generic (kobako@barbatos) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #202110312130 SMP PREEMPT Fri Jan 28 13:03:24 CST 2022
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-051500-generic root=UUID=f33c77fa-74c6-49bf-807c-e6828e535d0f ro drm.debug=0x1ff log_buf_len=16m "dyndbg=file pci-driver.c +p;file pci.c +p"
[ 0.000000] DMI: Dell Inc. Precision 3660/, BIOS 0.14.81 11/08/2021
[…]
[ 2.228238] [drm:check_atom_bios [amdgpu]] ATOMBIOS detected
[ 2.228338] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[ 2.228339] amdgpu: ATOM BIOS: 113-D0155300-100
[…]
Card:
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] (rev 10) (prog-if 00 [VGA controller])
	Subsystem: Dell Lexa XT [Radeon PRO WX 3200]
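When comparing reports from several machines to decide whether the failure is specific to one model or BIOS version, it helps to pull the platform identity out of each boot log in a uniform way. The following is a rough sketch that parses the `DMI:` line shown in the log above; the regular expression and the `parse_dmi_line` helper are illustrative assumptions, and system names that themselves contain a `/` would not be split correctly.

```python
import re

# Matches lines such as:
#   [ 0.000000] DMI: Dell Inc. Precision 3660/, BIOS 0.14.81 11/08/2021
DMI_RE = re.compile(
    r"DMI:\s*(?P<system>.+?)/(?P<board>.*?),\s*BIOS\s+"
    r"(?P<bios_version>\S+)\s+(?P<bios_date>\S+)"
)


def parse_dmi_line(line: str):
    """Return a dict with system, board, bios_version, bios_date, or None."""
    match = DMI_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "[ 0.000000] DMI: Dell Inc. Precision 3660/, BIOS 0.14.81 11/08/2021"
    print(parse_dmi_line(sample))
```

Collecting these fields next to the suspend result for each tested system gives a simple table from which a matching criterion (platform, model, or BIOS version) could be judged.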
@koba, where does it freeze exactly? How did you acquire the Linux logs? Were you able to log in over the network, or on some other console such as a serial console?
The verbose log messages should give some hint about what is going wrong, shouldn’t they?