Lenovo L14/L15/P14s Gen2 fails to resume from s0ix
It enters suspend just fine (breathing effect on lid/power button led). Neither pressing a key nor the power button wakes the machine from suspend. Opening the lid while suspended does trigger a very short spinup of the fan, but fails to wake. A USB keyboard does wake it from suspend.
When rebooting after an S0ix suspend/resume cycle, the machine throws a beep error code (Fan Error) and shuts down if not bypassed (I haven't tried to). It boots fine afterwards.
Note:
It failed to enter suspend at all prior to v5.14.9 (SMU timeout; supposedly resolved by platform/x86: amd-pmc: Increase the response register timeout)
I've attached kernel logs of a suspend/resume cycle. Please let me know what else makes sense.
Thanks for the pointers. Setting iommu=pt (disabling iommu?) resolves the nvme page faults and makes it resume more swiftly, but none of the options enable the keyboard/powerbutton/lid to wake the machine.
Notes:
I've upgraded to v5.14.11 + GPIO patches.
Occasionally I get an IRQ-related WARN in early boot (also present on v5.14.9) - mentioning it because pinctrl_amd seems relevant for S0ix resume:
It's always IRQ9?
Please share output for /proc/interrupts and /sys/kernel/debug/gpio in addition to the request for acpidump so we can better understand what IRQ9 is supposed to be.
You may need patches in 5.15 to get it working. Would you mind trying out latest tip? Two rounds of driver updates to amdgpu are in there.
I successfully booted 5.15 on P14s Gen1 with latest BIOS (1.35) and suspend/resume s2idle, but not 5.14. There's 5.13/5.14 with special patchwork available.
It enters suspend just fine (breathing effect on lid/power button led). Neither pressing a key nor the power button wakes the machine from suspend. Opening the lid while suspended does trigger a very short spinup of the fan, but fails to wake. A USB keyboard does wake it from suspend.
Can you share your acpidump?
This sounds like the EC caused the APU to wake, but the EC's notification didn't make it into the OS. This probably needs Lenovo to dig into it as an EC/BIOS bug and/or explain what they're expecting from Linux that is missing.
You may need patches in 5.15 to get it working. Would you mind trying out latest tip? Two rounds of driver updates to amdgpu are in there. I successfully booted 5.15 on P14s Gen1 with latest BIOS (1.35) and suspend/resume s2idle, but not 5.14. There's 5.13/5.14 with special patchwork available.
You guys have a Gen1 vs Gen2 here, and the EC may work differently between them. The EC is supposed to be setting that GPIO when you press the power button or an internal keyboard key, or on a lid event. This probably needs Lenovo's firmware team to look into.
The only interrupt counts increasing during suspend tests are #9, #78 and #87.
ACPI GPIO #0 (Linux #78) seems to be related to various hardware switches - HWAK (= Hardware wake?) is an EC register queried for the interrupt cause: its contents seem to indicate LIDO/LIDC, ACIN/ACOU or PWRB.
Every time I suspend/resume using a USB keyboard and check /proc/interrupts before and after, the SCI (#9) count increases by 18.
As somewhat expected, if I press any key on the built-in (non-USB) keyboard, press the power button or open the lid during suspend, the SCI count increases by one more (19 total), and ACPI GPIO #0 increases by one as well (regardless of the number of times pressed).
ACPI GPIO #7 increases every cycle - this seems tied to a device labeled RTL8 (Realtek?)
Based on the above, it seems an interrupt is definitely raised and reaches the OS eventually - though I wonder whether that happens during suspend, or only after being awakened by a different source.
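The before/after comparison of /proc/interrupts described above can be scripted. What follows is only a minimal sketch: the function names are hypothetical, and the sample snapshots are made up for illustration (only the +19 on IRQ 9 and the +1 on IRQ 78 mirror the numbers reported in this thread). It sums the per-CPU counters for every interrupt and lists the ones that changed.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:       # some rows carry fewer counters
            if not field.isdigit():
                break
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), desc)
    return totals


def interrupt_deltas(before, after):
    """Return {irq: (delta, description)} for counters that changed."""
    old = parse_interrupts(before)
    new = parse_interrupts(after)
    deltas = {}
    for irq, (count, desc) in new.items():
        delta = count - old.get(irq, (0, ""))[0]
        if delta:
            deltas[irq] = (delta, desc)
    return deltas


# Made-up snapshots; on a real system read /proc/interrupts before
# suspending and again after resuming.
BEFORE = """\
           CPU0       CPU1
  9:        100         20   IO-APIC    9-fasteoi   acpi, pinctrl_amd
 16:         40          2   IO-APIC   16-fasteoi   example-device
 78:          3          0   amd_gpio    0  ACPI:Event
"""
AFTER = """\
           CPU0       CPU1
  9:        115         24   IO-APIC    9-fasteoi   acpi, pinctrl_amd
 16:         40          2   IO-APIC   16-fasteoi   example-device
 78:          4          0   amd_gpio    0  ACPI:Event
"""

for irq, (delta, desc) in interrupt_deltas(BEFORE, AFTER).items():
    print(f"IRQ {irq}: +{delta}  ({desc})")
```

Summing across CPUs is a deliberate simplification: it hides which core serviced the interrupt, which does not matter for spotting which lines fired across a suspend cycle.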
From what you've shared it seems in Lenovo's design that the EC acts like an arbitrator of sorts for when to assert the APU's GPIO00. All the events you mentioned are supposed to come back to GPIO00 for an interrupt.
This evidence also reaffirms why I think this is the EC's issue due to missing a notification or so.
This could be as simple as they expected some driver to call _STA or a _DSM on the way down since that's how it works on Windows ACPI but that doesn't happen with ACPICA in Linux.
it seems an interrupt is definitely raised and reaches the OS eventually
Which I guess makes sense? The EC queued up the event for the next wakeup, you use the USB keyboard to wake up, and then it goes through.
Isn't the EC supposed to actually wake the APU by asserting GPIO00? What if the EC asserts it but the APU (somehow) fails to notice until it's woken up via other means (for example, a USB keyboard / XHCI interrupt?)
Yes, the keyboard wakeup works via the XHCI controller asserting a GPIO. Power button wakeup works via power button press caught by the EC (and debounced) and then EC asserts a GPIO. From the APU's perspective it just looks for that edge trigger from the EC while in s0i3. If the EC doesn't send it while down then we have this issue.
I have another (non-Lenovo) system that wakes from GPIO00 via a power button that is debounced by an EC. It started working with the GPIO patches for 5.15 linked above, which you confirmed you applied. Looking at my /sys/kernel/debug/gpio output for GPIO00:
pin0 Edge trigger| Active high| interrupt is enabled| interrupt is unmasked| enable wakeup in S0i3 state| enable wakeup in S3 state| disable wakeup in S4/S5 state| input is high| 4k pull-up| pull-up is enabled| Pull-down is disabled| output is disabled| debouncing filter (high and low) enabled| debouncing timeout is 0 (us)| 0x1578e0
Contrasting it with yours:
pin0 Edge trigger| Active high| interrupt is enabled| interrupt is unmasked| enable wakeup in S0i3 state| enable wakeup in S3 state| enable wakeup in S4/S5 state| input is high| pull-up is disabled| Pull-down is disabled| output is disabled| debouncing filter (high and low) enabled| debouncing timeout is 0 (us)| 0x5f8e0
The main difference in the configuration is that mine has a 4k pull-up configured and yours doesn't (yours also has wakeup in S4/S5 enabled, where mine has it disabled). This may be a hardware difference in Lenovo's design versus the one I'm using. Given the power button "works" at runtime I don't think that's likely the cause.
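Comparing two of these debugfs lines by eye is error-prone, so here is a small sketch that does it mechanically. The helper names are hypothetical; the two input strings are the GPIO00 lines quoted above. Run on them, it reports the pull-up fields and also the S4/S5 wakeup field, and the raw register values differ in bits 15 and 20.

```python
def parse_pin_line(line):
    """Split one debugfs pin line into (pin name, field list, register value)."""
    head, *middle, reg = [part.strip() for part in line.split("|")]
    pin, first_field = head.split(None, 1)    # "pin0 Edge trigger"
    return pin, [first_field] + middle, int(reg, 16)


def diff_pins(line_a, line_b):
    """Return (fields only in a, fields only in b, XOR of the registers)."""
    _, fields_a, reg_a = parse_pin_line(line_a)
    _, fields_b, reg_b = parse_pin_line(line_b)
    only_a = [f for f in fields_a if f not in fields_b]
    only_b = [f for f in fields_b if f not in fields_a]
    return only_a, only_b, reg_a ^ reg_b


# The two GPIO00 lines from this thread.
MINE = ("pin0 Edge trigger| Active high| interrupt is enabled| interrupt is unmasked| "
        "enable wakeup in S0i3 state| enable wakeup in S3 state| "
        "disable wakeup in S4/S5 state| input is high| 4k pull-up| pull-up is enabled| "
        "Pull-down is disabled| output is disabled| "
        "debouncing filter (high and low) enabled| debouncing timeout is 0 (us)| 0x1578e0")
YOURS = ("pin0 Edge trigger| Active high| interrupt is enabled| interrupt is unmasked| "
         "enable wakeup in S0i3 state| enable wakeup in S3 state| "
         "enable wakeup in S4/S5 state| input is high| pull-up is disabled| "
         "Pull-down is disabled| output is disabled| "
         "debouncing filter (high and low) enabled| debouncing timeout is 0 (us)| 0x5f8e0")

only_mine, only_yours, xor = diff_pins(MINE, YOURS)
changed_bits = [bit for bit in range(32) if (xor >> bit) & 1]
print("only in first :", only_mine)
print("only in second:", only_yours)
print("register bits that differ:", changed_bits)
```

The text fields and the register XOR are two views of the same information, so checking that they agree is a cheap sanity test on the dump itself.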
I wonder if Lenovo really expects the EC registers to be read from the OS even in s0i3. What this would mean is that EC GPE has to be allowed to wake the system so that EC registers can be read and then system needs to go back into s0i3 if it shouldn't be a wake event.
IOW:
Some combination of reverting 7b167c4cb48ee3912f0068b9ea5ea4eacc1a5e36 and other TBD changes.
Adding a quirk for some Lenovo systems to not apply 7b167c4cb48ee3912f0068b9ea5ea4eacc1a5e36 behavior (or have another lenovo specific driver mark as a wakeup source)
I enabled PM debugging: echo 1 > /sys/power/pm_debug_messages, suspended the machine, pressed the power button (fan spins momentarily), then a few seconds later woke the machine using USB.
[36505.903636] PM: suspend entry (s2idle)
[36510.181531] Filesystems sync: 4.277 seconds
[36510.181537] PM: Preparing system for sleep (s2idle)
[36510.182020] Freezing user space processes ... (elapsed 0.002 seconds) done.
[36510.184193] OOM killer disabled.
[36510.184194] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[36510.185280] PM: Suspending system (s2idle)
[36510.185281] printk: Suspending console(s) (use no_console_suspend to debug)
[36510.557335] PM: suspend of devices complete after 371.911 msecs
[36510.557347] PM: start suspend of devices complete after 372.064 msecs
[36510.557866] PM: late suspend of devices complete after 0.514 msecs
[36510.558912] ACPI: EC: interrupt blocked
[36510.681499] PM: noirq suspend of devices complete after 123.416 msecs
[36510.681509] ACPI: \_SB_.PCI0.GPP5: LPI: Constraint not met; min power state:D1 current power state:D0
[36511.093554] PM: suspend-to-idle
[36522.765829] PM: Timekeeping suspended for 12.269 seconds
[36527.245017] PM: Timekeeping suspended for 3.717 seconds
[36527.245246] PM: ACPI non-EC GPE wakeup
[36527.245253] PM: resume from suspend-to-idle
[36527.881972] ACPI: EC: interrupt unblocked
[36527.919329] PM: noirq resume of devices complete after 37.584 msecs
[36527.922912] PM: early resume of devices complete after 3.237 msecs
Note the Timekeeping suspended message being repeated. The first message appears to correlate with my power button press. I then enabled all power category tracepoints (except for power:cpu_idle) and repeated the sequence:
Interesting finding - can you please check /sys/kernel/debug/amd_pmc/smu_fw_info? There is a S0i3 cycle count. When you wakeup from just keyboard it should be "1". When you "wakeup" from power button and nothing happens, but then wakeup from keyboard does it bump up to 2 or higher (based on number of presses?)?
Something else that might be going on here - the EC has some traffic but since it's suspended the kernel is only waking up enough to know it's there and leave it alone. It doesn't mean that the GPIO that should have been used to trigger the wakeup has actually been asserted.
When you wakeup from just keyboard it should be "1"
Ack.
When you "wakeup" from power button and nothing happens, but then wakeup from keyboard does it bump up to 2 or higher (based on number of presses?)?
Indeed, it bumps to 2 (regardless of how many times I press the power button or laptop keyboard).
It doesn't mean that the GPIO that should have been used to trigger the wakeup has actually been asserted.
Sounds fair.
Related: how is it supposed to exit the s2idle loop? I'm not sure which of the checks in acpi_s2idle_wake should return true when a GPIO is asserted? In part because it does seem to check for GPEs, but reading the ACPI specs I got the idea that GPIO signaled events are distinct from GPEs? Aren't we supposed to end up in pinctrl_amd's interrupt handler so it can check whether one of the pins was the wakeup source? How are we supposed to get there?
In part because it does seem to check for GPEs, but reading the ACPI specs I got the idea that GPIO signaled events are distinct from GPEs? Aren't we supposed to end up in pinctrl_amd's interrupt handler so it can check whether one of the pins was the wakeup source? How are we supposed to get there?
Remember there is some hardware here that is programmed. When you see /sys/kernel/debug/gpio you can see the configuration for all the pins and what context they're supposed to do something. Also specifically remember https://github.com/torvalds/linux/commit/acd47b9f28e55b505aedb842131b40904e151d7c fixed things. This sets the IRQ for the controller itself to be a valid wake source. So a GPIO is asserted, controller sees it and if in the right state the controller asserts an IRQ. OS gets the IRQ and does the wakeup handling (calling into that handler).
Probably a good reference is my local system, which has all this working:
I put it to sleep with systemctl suspend. I wake it up with the power button press. Here's what I see with pm_debug_messages set:
[ 8166.257361] PM: suspend-to-idle
[ 8166.245312] PM: Timekeeping suspended for 15.876 seconds
[ 8166.260730] PM: Wakeup unrelated to ACPI SCI
[ 8166.260730] PM: resume from suspend-to-idle
Interesting, your machine appears to have a dedicated IRQ line for the GPIO controller..
Mine doesn't, and it appears to be shared with the ACPI SCI (I assumed that would always be the case as per ACPI spec, although I couldn't find a description/example of interrupt flow for GPIO signaled interrupts)
9-fasteoi acpi, pinctrl_amd
You pointed out we're supposed to leave via the 'SCI is still armed' path.
Which begs the question, how's that supposed to work in my case if the GPIO controller shares the SCI IRQ?
Hmm - that's a great point. Maybe that's the source of this problem.
I did check two other working machines that I have in my possession (HP 635 RN and Lenovo P14s RN). Both are configured the same as the working reference machine I shared the example from: pinctrl_amd is on IRQ7 and acpi is on IRQ9.
So this might be an assumption in the s2idle handling code that needs adjusting for how your machine is doing it.
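To make the shared-IRQ concern concrete, here is a toy model of the wake decision. It is not the kernel's code: the function and parameter names are hypothetical, and the order of checks is an assumption pieced together from the two debug messages seen in the logs in this thread ("Wakeup unrelated to ACPI SCI" and "ACPI non-EC GPE wakeup"). The optional extra check is likewise only an illustration of the kind of adjustment being discussed.

```python
def s2idle_wake_reason(fired_irqs, sci_irq, gpio_irq, gpio_wake_latched,
                       non_ec_gpe=False, check_gpio_when_shared=False):
    """Return a reason to leave the s2idle loop, or None to go back to sleep."""
    if not fired_irqs:
        return None                               # nothing woke us
    if sci_irq not in fired_irqs:
        # Some other wake IRQ fired while the SCI stayed armed.
        return "Wakeup unrelated to ACPI SCI"
    # Hypothetical extra step: ask the GPIO controller when it shares the SCI.
    if check_gpio_when_shared and gpio_irq == sci_irq and gpio_wake_latched:
        return "GPIO wakeup on shared SCI line (hypothetical check)"
    if non_ec_gpe:
        return "ACPI non-EC GPE wakeup"
    return None                                   # treated as spurious


# Reference machine: pinctrl_amd on IRQ7, acpi on IRQ9, power button pressed.
print(s2idle_wake_reason({7}, sci_irq=9, gpio_irq=7, gpio_wake_latched=True))

# Machine from this report: both on IRQ9, power button pressed.
print(s2idle_wake_reason({9}, sci_irq=9, gpio_irq=9, gpio_wake_latched=True))

# Same machine with the hypothetical extra check enabled.
print(s2idle_wake_reason({9}, sci_irq=9, gpio_irq=9, gpio_wake_latched=True,
                         check_gpio_when_shared=True))
```

In this model the second case returns None, which matches the observed behaviour: the press is noticed, but nothing on the SCI path claims it, so the system goes back to sleep until another source wakes it.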
What would be really useful to rule out an EC problem would be to pop a scope on the APU's GPIO00 or EC's GPIO output and confirm whether it asserts, but I guess that's out of the question :)
FWIW, I had done some crude testing of my own (checking & bailing on GPIO00 in s2idle_loop) in the meantime, which reveals BIT(WAKE_STS_OFF) and BIT(INTERRUPT_STS_OFF) are both set - wonder whether it makes sense to explicitly check for the wake bit only.
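For illustration, this is what checking only the wake bit versus either status bit looks like on a raw pin register value. It is a sketch under assumptions: WAKE_STS_OFF and INTERRUPT_STS_OFF are the names used above, but the bit positions 29 and 28 are my assumption about the driver's header and should be verified there; the helper names are hypothetical.

```python
INTERRUPT_STS_OFF = 28   # assumed bit position, verify against the driver header
WAKE_STS_OFF = 29        # assumed bit position, verify against the driver header


def BIT(n):
    return 1 << n


def is_wake_source(reg, wake_only=True):
    """True if the pin latched a wake (or, optionally, any interrupt) status."""
    mask = BIT(WAKE_STS_OFF)
    if not wake_only:
        mask |= BIT(INTERRUPT_STS_OFF)
    return bool(reg & mask)


idle = 0x5f8e0                                   # GPIO00 value from the dump above
both_set = idle | BIT(WAKE_STS_OFF) | BIT(INTERRUPT_STS_OFF)
irq_only = idle | BIT(INTERRUPT_STS_OFF)

for name, reg in (("idle", idle), ("both set", both_set), ("irq only", irq_only)):
    print(name, hex(reg),
          "wake-only:", is_wake_source(reg),
          "either:", is_wake_source(reg, wake_only=False))
```

The two variants only disagree when the interrupt status is latched without the wake status, so the choice matters only if that state can actually occur on the hardware.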
Notes:
It reliably resumes on GPIO wake now - great!
GPIO00 is also asserted on lid close or AC unplug - which means these events also resume from suspend (not sure how that works on other devices?)
At the risk of mixing things up: I ran into a seemingly unrelated suspend failure while testing; it first throws a SMU timeout (timestamp: 691), after which it immediately re-attempts to suspend (timestamp: 698.602768; unsure whether I manually triggered that), now catastrophically locking up in amdgpu: dmesg_smu_timeout.log
FWIW, I had done some crude testing of my own (checking & bailing on GPIO00 in s2idle_loop) in the meantime, which reveals BIT(WAKE_STS_OFF) and BIT(INTERRUPT_STS_OFF) are both set - wonder whether it makes sense to explicitly check for the wake bit only.
GPIO00 is also asserted on lid close or AC unplug - which means these events also resume from suspend (not sure how that works on other devices?)
I guess this is up to Lenovo's EC decisions what to do..
At the risk of mixing things up: I ran into a seemingly unrelated suspend failure while testing; it first throws a SMU timeout (timestamp: 691), after which it immediately re-attempts to suspend (timestamp: 698.602768; unsure whether I manually triggered that), now catastrophically locking up in amdgpu: dmesg_smu_timeout.log
I'd say split that off to another issue. One issue per ticket makes things easier to work with.
The GPU looks pretty unhappy on the way down. I don't expect it to be caused by this patch (other than that you can reliably wake the system now so it's easier to hit other bugs).
I guess this is up to Lenovo's EC decisions what to do..
Shouldn't it just forward hardware events and leave it up to the OS to make decisions on how to handle them?
I'm still puzzled - how is a lid close handled on other platforms? Does it not raise a wake interrupt at all if the machine is already in S0ix? Or does it exit the s2idle loop, and re-enter suspend in some other way?
I'd say split that off to another issue. One issue per ticket makes things easier to work with. The GPU looks pretty unhappy on the way down.
Should I manage to reproduce it reliably I'll do so.
Great. If you can, please add a Tested-by tag to the upstream thread. If you're not familiar with doing it: download the 'raw' mbox file (https://lore.kernel.org/linux-gpio/20211015144332.700-1-mario.limonciello@amd.com/raw), open it in your email client, and reply with Tested-by: Foo Bar <foo@bar.com> at the bottom.
Shouldn't it just forward hardware events and leave it up to the OS to make decisions on how to handle them?
As far as I can tell there isn't a GPIO to the APU for the lid on your system. It seems that Lenovo instead uses APU GPIO00 to signal "something changed!" and then lets the OS figure out the state of everything on the machine through the ASL in _EVT. So from a wakeup perspective, that GPIO being asserted will get the system awake, the EC gets turned back on for processing and everything in the _EVT for GPIO 00 runs. Try running acpi_listen across the suspend cycle and compare /proc/interrupts, and you'll see which ACPI events map to which GPIO.
Looking at your dmesg I do notice this:
[ 30.373371] ACPI: button: The lid device is not compliant to SW_LID.
It's entirely possible another quirk for your system is needed related to how lid handling works.
I'm still puzzled - how is a lid close handled on other platforms? Does it not raise a wake interrupt at all if the machine is already in S0ix? Or does it exit the s2idle loop, and re-enter suspend in some other way?
Some other platforms have the LID on their own GPIO, or a fixed ACPI HW event. Since it's just a single GPIO for any of the wakeup related things it's up to the EC to send that at the right time.
Should I manage to reproduce it reliably I'll do so.