[Rog Ally] Inconsistent sleep behavior

@superm1 As requested here is the issue report.

2023-07-21 12:58:11,627 INFO: Battery BAT0 (ASUSTeK ASUS Battery) is operating at 100.04% of design

greater than 100 eh?

2023-07-21 12:58:36,490 DEBUG: 2023-07-21T12:58:33,691417-05:00 RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]

The IRQ put warnings you see are fixed in newer kernels. They're mostly harmless but I wanted to call them out. Please upgrade past 6.3.9 if you can. The latest 6.4.y perhaps?

2023-07-21 12:58:36,491 DEBUG: 2023-07-21T12:58:42,878818-05:00 amd_pmc AMDI0009:00: Last suspend in deepest state for 8313080us

So the good news is the APU got into the deepest state, so we don't have any driver problems.

2023-07-21 12:58:36,492 ERROR: ACPI BIOS errors found

The bad news is this ACPI BIOS error may be the cause for your problem. This is the part quoted in the logs:

2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:42,879929-05:00 ACPI: \_SB_.PEP_: _DSM function 8 evaluation successful
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:42,880247-05:00 ACPI: \_SB_.PEP_: _DSM function 6 evaluation successful
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:42,880875-05:00 ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SBRG.EC0.LID], AE_NOT_FOUND (20221020/psargs-330)
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:42,880880-05:00 ACPI Error: Aborting method \_SB.PEP._DSM due to previous error (AE_NOT_FOUND) (20221020/psparse-529)
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:42,880884-05:00 ACPI: \_SB_.PEP_: _DSM function 4 evaluation failed

This is possibly the reason that the fans don't turn off. The EC doesn't get notified properly due to the ACPI BIOS errors.

Can I please see a full acpidump? I'll see if I can make more sense of why that happened. But at least from the errors I see I suspect that this needs to be fixed by the BIOS.

Here is the acpidump: acpidump

I could upgrade to a newer kernel if it helps us here, but this will take a bit of work on my end to pull off because in order for this device to have working Wifi, Bluetooth, Asus Keys, etc I need to patch and build the kernel.

to have working Wifi, Bluetooth, Asus Keys, etc I need to patch and build the kernel.

I'm pretty surprised by this. I think the wifi/bluetooth should work in the latest 6.4.y stable. I suspect you need this patch to make it work, which is in the stable trees.

I submitted the patch to get Bluetooth working, which is set for 6.5. The WiFi had an issue where it would stop working and need a hard reset to recover, and if memory serves me correctly sleep/resume was a common cause for this. The other trigger was entering the BIOS before booting into the OS (the BIOS has cloud recovery, so it initializes the chip; probably related). The hid-asus patch has not been upstreamed yet.

I'm not at my computer right now to verify, but last I checked 6.4 using linux-firmware-git had working WiFi sometimes.

I believe it was this patch that fixed the issue. I could be wrong.

https://github.com/ChimeraOS/linux-chimeraos/blob/main/linux/0011-wifi-mt76-mt7921e-fix-init-command-fail-with-enabled-device.patch

Yup; that's the exact same patch. If the UEFI network stack has run for any reason (as you outlined) then it can leave the card in a bad state. This has happened on multiple manufacturers. That patch is backported to 6.4.4 and 6.1.39.

Ok so on the way down these are the functions called:

2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:33,903676-05:00 ACPI: \_SB_.PEP_: _DSM function 3 evaluation successful
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:33,904039-05:00 ACPI: \_SB_.PEP_: _DSM function 5 evaluation successful
2023-07-21 12:58:36,491 DEBUG:	2023-07-21T12:58:33,904456-05:00 ACPI: \_SB_.PEP_: _DSM function 7 evaluation successful

That order (3->5->7) is wrong. It should have been 3->7->5, which is fixed by this commit: https://github.com/torvalds/linux/commit/f198478cfdc8105a1c8d8945918904f0498d19be
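To check the call order in other logs, something like the following Python sketch could pull the _DSM function numbers out of dmesg-style lines. The regular expression and the helper name dsm_sequence are assumptions based only on the message format quoted above; they are not part of the s2idle script or any existing tool.

```python
import re

# Matches lines such as:
#   ACPI: \_SB_.PEP_: _DSM function 3 evaluation successful
#   ACPI: \_SB_.PEP_: _DSM function 4 evaluation failed
DSM_RE = re.compile(r"_DSM function (\d+) evaluation (successful|failed)")


def dsm_sequence(lines):
    """Return [(function_number, succeeded), ...] in the order logged."""
    sequence = []
    for line in lines:
        match = DSM_RE.search(line)
        if match:
            sequence.append((int(match.group(1)), match.group(2) == "successful"))
    return sequence


# Sample lines copied from the suspend path in the report above.
suspend_log = [
    r"ACPI: \_SB_.PEP_: _DSM function 3 evaluation successful",
    r"ACPI: \_SB_.PEP_: _DSM function 5 evaluation successful",
    r"ACPI: \_SB_.PEP_: _DSM function 7 evaluation successful",
]

order = [fn for fn, _ok in dsm_sequence(suspend_log)]
print(order)  # [3, 5, 7]
```

Feeding it the resume-side lines would show the same thing for 8, 6 and 4, with the last entry flagged as failed.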

Here is what 3, 5, and 7 do:

                        Case (0x03)
                        {
                            M000 (0x3E03)
                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
                            Return (Zero)
                        }
                        Case (0x05)
                        {
                            M000 (0x3E05)
                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            Return (Zero)
                        }
                        Case (0x07)
                        {
                            M000 (0x3E07)
                            If (((GGOV (Zero, 0x68) == Zero) && (GGOV (Zero, 0x69) == Zero)))
                            {
                                Notify (\_SB.I2CD.SPKR, 0xA1) // Device-Specific
                            }

                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            Return (Zero)
                        }

I believe the M000 call is a port 80 debug code, so it won't matter if it's out of order.
The M460 call is a BIOS serial debug string, so it also won't matter. \_SB.PCI0.SBRG.EC0.CSEE (0xB7) is notifying the EC as part of function 3. Notify (\_SB.I2CD.SPKR, 0xA1) // Device-Specific is notifying an amplifier (CSC3551), presumably to prevent pops or similar.

It looks like even though the order is wrong, functions 3 5 and 7 do work properly and probably aren't your issue.

On the way back up the order is 8->6->4, and again it's supposed to be 6->8->4.

                        Case (0x08)
                        {
                            M000 (0x3E08)
                            If (((GGOV (Zero, 0x68) == Zero) && (GGOV (Zero, 0x69) == Zero)))
                            {
                                Notify (\_SB.I2CD.SPKR, 0xA0) // Device-Specific
                            }

                            Notify (\_SB.PCI0.GPP7.CADR, One) // Device Check
                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            Return (Zero)
                        }
                        Case (0x06)
                        {
                            M000 (0x3E06)
                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            Return (Zero)
                        }
                        Case (0x04)
                        {
                            M000 (0x3E04)
                            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
                            Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) // Status Change
                            Return (Zero)
                        }

Function 6 is just serial debug and port 80 messages, so it's unlikely that the order patch changes anything.

Notify (\_SB.PCI0.GPP7.CADR, One) // Device Check is notifying this PCI device: Generic system peripheral Genesys Logic, Inc [17A0:9755]

A quick search indicates this is an SD card reader.

The failing call is Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) // Status Change. LID is declared as an external ACPI device in one of the SSDTs: External (_SB_.PCI0.SBRG.EC0_.LID_, DeviceObj), but it is never actually defined anywhere else.

It's a pure BIOS bug to reference an external LID in a handheld. There is no code that runs after this, so unfortunately I don't think it's likely the reason for your behavior.
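The "declared External but never defined" check can be roughly automated over the disassembled tables. The sketch below is only a heuristic: it compares the last path segment of each External (..., DeviceObj) line against Device (...) definitions, and the function names and regular expressions are made up for illustration. It assumes the .dsl text came from iasl -d on the acpidump output.

```python
import re

# External (_SB_.PCI0.SBRG.EC0_.LID_, DeviceObj)
EXTERNAL_RE = re.compile(r"External\s*\(\s*([\\\w.]+)\s*,\s*DeviceObj\s*\)")
# Device (EC0)
DEVICE_RE = re.compile(r"\bDevice\s*\(\s*([\\\w.]+)\s*\)")


def last_segment(path):
    """Last name in an ACPI path, without the trailing '_' padding."""
    return path.split(".")[-1].rstrip("_")


def undefined_external_devices(dsl_texts):
    """Names declared as External DeviceObj with no Device () definition.

    Heuristic only: it matches on the final name segment, so two devices
    with the same short name in different scopes are not told apart.
    """
    declared, defined = set(), set()
    for text in dsl_texts:
        declared.update(last_segment(p) for p in EXTERNAL_RE.findall(text))
        defined.update(last_segment(p) for p in DEVICE_RE.findall(text))
    return sorted(declared - defined)


# Tiny made-up excerpts standing in for the SSDT and DSDT text.
ssdt = (
    "External (_SB_.PCI0.SBRG.EC0_, DeviceObj)\n"
    "External (_SB_.PCI0.SBRG.EC0_.LID_, DeviceObj)\n"
)
dsdt = "Device (EC0)\n{\n}\n"

missing = undefined_external_devices([ssdt, dsdt])
print(missing)  # ['LID']
```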

By chance did everything work properly for the fan and such when you ran the s2idle report with automated wakeup? Or did it also have a problem?

I'm wondering if there might be a state machine tracking bug in their EC based on power button presses versus system initiated suspends?

I need to do more testing, but strangely enough I get different behavior when I boot ChimeraOS from the USB vs from the internal drive. At first I thought it was because of the SSDT override I did to remove the LID lines, but I removed the override and booted from the USB again and the same behavior happens.

So to clarify this is what I am seeing now.

I boot via USB and the gamepad/asus events never disappear when I sleep/wake the device
The system sometimes goes to sleep with a blinking light (seems almost like every 3rd cycle on average)
In both scenarios the fan is either spinning really slowly or it's not spinning at all.

I'll need to install ChimeraOS to the internal NVME again and run some tests because these results are completely different than before.

Also with multiple runs with DSDT overrides to set up the CSC3551 amp and without I did notice something happen once. It threw an error saying the NVME was not configured for S2idle.

2023-07-21 17:47:12,292 ERROR:	❌ NVME Micron Technology Inc  is not configured for s2idle in BIOS

s2idle_report-dsdt-override.txt

s2idle_report-nodsdt-override.txt

s2idle_report-dsdt-override-nvme-failed.txt

2023-07-21 17:47:12,285 ERROR: Kernel ringbuffer has wrapped, unable to accurately validate pre-requisites

The NVME error is because you're not using systemd and your ring buffer wrapped. So don't worry about that.

I've never seen this hardware myself before, so let me ask something perhaps fundamental to your issue.

What driver provides the ASUS N keys? ASUS-WMI?
Does the fan issue only happen when ASUS-WMI is loaded?

That is correct ASUS_WMI/HID_ASUS are responsible for the N keys
The fan issue still occurs with and without the modules loaded.

I did a fresh install of ChimeraOS onto my internal NVME and the issue where the keys would no longer work after sleeping is completely gone. I was running Windows for a little while and I think Asus pushed a new EC update automatically that fixed that issue. That's great, I suppose!

I'll have another user who hasn't run Windows in a while boot into Windows and let it do all the updates to verify this.

So I guess the only issue that remains is the fact that half the time the device fails to sleep.

By chance did everything work properly for the fan and such when you ran the s2idle report with automated wakeup? Or did it also have a problem?

Apologies for the delay I needed to verify some things to be certain, but I ran the cycle 20 times and each time the system went to sleep successfully when using the s2idle script.

When I run echo mem | sudo tee /sys/power/state the device sleeps 100% of the time. Sometimes the firmware will fail to load for the Cirrus amp though meaning there is no sound. If I run systemctl suspend it acts the same way as pressing the power button manually.

[    7.894231] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: Firmware: 400a4 vendor: 0x2 v0.43.1, 2 algorithms
[    7.895450] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: 0: ID cd v29.63.1 XM@94 YM@e
[    7.895459] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: 1: ID f20b v0.1.0 XM@176 YM@0
[    7.895465] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: spk-prot: C:\Users\dchunyi\Documents\Asus_ROG\Project\NR2301\Tuning\20221125\104317F3_221125_V1_A0.bin
[    7.979710] snd_hda_codec_realtek hdaudioC1D0: bound i2c-CSC3551:00-cs35l41-hda.0 (ops cs35l41_hda_comp_ops [snd_hda_scodec_cs35l41])
[    7.979994] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: Firmware version: 3
[    7.979996] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: cirrus/cs35l41-dsp1-spk-prot-104317f3.wmfw: Fri 27 Aug 2021 14:58:19 W. Europe Daylight Time

Audio fails to work when you see this
[  189.903009] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: Firmware: 0 vendor: 0x0 v0.0.0, 0 algorithms
[  189.903024] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: No algorithms
[  189.903030] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Cannot Initialize Firmware. Error: -22
[  189.903404] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: Firmware version: 3
[  189.903409] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: cirrus/cs35l41-dsp1-spk-prot-104317f3.wmfw: Fri 27 Aug 2021 14:58:19 W. Europe Daylight Time
[  190.383930] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: Firmware: 0 vendor: 0x0 v0.0.0, 0 algorithms
[  190.383960] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: No algorithms
[  190.383967] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Cannot Initialize Firmware. Error: -22

It appears that error is symlink related. I deleted the symlink and manually copied the file over as the proper name and it works. The last thing remaining seems to be related to pipewire. I need to let the audio stream go idle for about 20 seconds and then do something to trigger audio for it to work.

I was running Windows for a little while and I think Asus pushed a new EC update automatically that fixed that issue. That's great, I suppose!

You can confirm this by checking the report again. Your initial versions were: BIOS 5.29 (RC71L.322) released 06/30/2023 and EC 3.15.

It appears that error is symlink related. I deleted the symlink and manually copied the file over as the proper name and it works.

That's a bit weird. The kernel should treat them identically.

It appears that error is symlink related. I deleted the symlink and manually copied the file over as the proper name and it works.

OK, I think you'll need to raise this separately with the Cirrus guys on the ALSA lists.

If I run systemctl suspend it acts the same way as pressing the power button manually

So everything is good now for sleep in terms of fans and buttons and such, right? It was just that EC fixed it all?

Not exactly. When I run echo mem | sudo tee /sys/power/state the system will sleep. When I run systemctl suspend it acts the same way as pressing the power button, meaning half the time it turns the display off with the fan running still.

As for the initial report, I must have already updated the EC before making it. I wasn't aware that there was an EC update (I don't think there are any mentions of it anywhere, just the 323 BIOS which I don't have), but I do know that there is a separate prompt that occasionally pops up on Windows and updates something firmware related. I'm still waiting on confirmation from someone else that applying whatever updates Windows offers works for them.

I switched to Windows for a few days after using Linux since launch on the Ally to investigate the sleep behavior because I was getting mixed reports about whether or not it worked on Windows (it worked fine).

As for the symlink situation it almost looks like it's a timing issue or something. I'm not entirely sure. It was a shot in the dark to manually manage the file and to my surprise it worked out.

Another ChimeraOS dev checked their Ally which still has the gamepad/asus key issues when sleep/resuming and they are on EC 3.13 which was likely the version I had recently.

it acts the same way as pressing the power button, meaning half the time it turns the display off with the fan running still.

It won't be possible to capture with the s2idle debugging tool, but can I please see a regular dmesg cycle from specifically when this happened with dynamic debugging first turned on for the pinctrl-amd driver?

Another ChimeraOS dev checked their Ally which still has the gamepad/asus key issues when sleep/resuming and they are on EC 3.13 which was likely the version I had recently.

OK, good to know that part is figured out.

It won't be possible to capture with the s2idle debugging tool, but can I please see a regular dmesg cycle from specifically when this happened with dynamic debugging first turned on for the pinctrl-amd driver?

I'm pretty sure I didn't set up the dynamic debugging correctly, but here is a log that looks like it has more information than before. I was experimenting with modprobe configs and boot params to enable dyndbg for pinctrl-amd and I'm not convinced that any of those methods worked.
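For reference, dynamic debug is normally switched on by writing a query to /sys/kernel/debug/dynamic_debug/control (as root, with debugfs mounted), or with a dyndbg= boot parameter when the driver is built in, in which case modprobe options never apply. The Python sketch below only builds those strings. The helper names are hypothetical, and the module name pinctrl_amd is an assumption (dashes in module names become underscores), so check it against the control file first.

```python
from pathlib import Path

# Standard location of the dynamic debug control file (debugfs must be mounted).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def dyndbg_query(module, enable=True):
    """Build a dynamic debug query that toggles pr_debug/dev_dbg output."""
    flag = "+p" if enable else "-p"
    return f"module {module} {flag}"


def boot_param(module):
    """Kernel command line form, useful when the driver is built in."""
    return f'dyndbg="{dyndbg_query(module)}"'


def apply_query(query, control=CONTROL):
    """Write the query to the control file. Needs root privileges."""
    with open(control, "w") as handle:
        handle.write(query + "\n")


if __name__ == "__main__":
    # Printing only; call apply_query(...) yourself when running as root.
    print(dyndbg_query("pinctrl_amd"))
    print(boot_param("pinctrl_amd"))
```

The same helpers work for the amd_pmc module name seen in the logs, for example dyndbg_query("amd_pmc").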

failed-sleep.log

Also I left the system in the sleep state using /sys/power/state set to "mem" and within 8 or so hours I went from 83% battery down to 72%.

failed-sleep-no-sound.txt

Here we go, I think this got what we need.

When pressing the power button I can get any or all of these messages in the dmesg. I don't seem to ever see them when I use the /sys/power/state method.

[  198.765970] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Cannot Load/Unload firmware during Playback. Retrying...

[  177.798057] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Wake failed, re-enter hibernate: -42
[  177.798257] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Wake failed, re-enter hibernate: -42
[  177.939660] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Wake failed, re-enter hibernate: -42
[  177.939861] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Wake failed, re-enter hibernate: -42
[  178.081087] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Wake failed, re-enter hibernate: -42
[  178.081286] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Wake failed, re-enter hibernate: -42
[  178.082999] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Timed out waking device
[  178.083579] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Timed out waking device

[  133.311940] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Failed to set mailbox cmd 1 (status 0)
[  133.319018] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Failed to set mailbox cmd 1 (status 0)

So a quick update regarding the EC and the gamepad/asus keys problems. A user with the 323 bios and a newer EC than what I have (EC 3.16) has the problems still. So when they get the chance they will be downgrading the bios to 322 to match mine to see if the issue remains.

This is a bit frustrating but we're trying to figure out the true reason why the problem "disappeared" for me. I was thinking it's likely related to a configuration I had changed on Windows with Armoury Crate that worked around the issue. Hopefully with trial and error we'll find answers.

I still don't see anything in your logs for a power button press from the pinctrl-amd driver.

The most important message you should be looking for in your logs is:

[ 69.500476] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

I see that a few times, including again at the end of your logs

[ 1517.820258] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

What this means is that something is keeping the APU from going into the deepest state. What you can do is turn on dynamic debugging for the amd-pmc driver and look for what bits are active at suspend time. Hopefully just one bit is different and that will be a hint at what's actually different about your failures.
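To make it easier to spot which cycles failed in a long dmesg capture, a small Python sketch like this could tally the two amd_pmc messages quoted in this thread. The function name and the result format are made up for illustration; only the two message strings come from the logs above.

```python
import re

# "Last suspend in deepest state for 8313080us"
DEEPEST_RE = re.compile(r"Last suspend in deepest state for (\d+)us")
FAILED_TEXT = "Last suspend didn't reach deepest state"


def classify_suspends(lines):
    """Return one ("ok", seconds) or ("failed", None) entry per suspend cycle."""
    results = []
    for line in lines:
        match = DEEPEST_RE.search(line)
        if match:
            # The kernel reports microseconds; convert to seconds.
            results.append(("ok", int(match.group(1)) / 1_000_000))
        elif FAILED_TEXT in line:
            results.append(("failed", None))
    return results


# Sample lines taken from the logs earlier in this issue.
sample = [
    "amd_pmc AMDI0009:00: Last suspend in deepest state for 8313080us",
    "[ 69.500476] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state",
]

results = classify_suspends(sample)
failed = sum(1 for status, _ in results if status == "failed")
print(f"{failed} of {len(results)} cycles did not reach the deepest state")
```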

Would you mind moving up to 6.4.y? Like I said the WLAN patches are there now, so hopefully it's just porting your asus-wmi changes.

I'd like to make sure that we are looking at something that can still be fixed. 6.3.y is EOL.

We got a build up with 6.4.5 and I got a log here where sleep succeeds and also fails.

ally-6.4.5-dmesg.txt

Unfortunately the idle mask is identical between your "good" and "bad" attempts.

[   82.258242] [3325] amd_pmc: amd_pmc AMDI0009:00: SMU idlemask s0i3: 0xffb3eb5
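Comparing the masks by eye gets tedious over many cycles. A short Python sketch, assuming the "SMU idlemask s0i3" line format shown above, can XOR two masks and list the bit positions that differ. What each bit means is not covered here and depends on the SoC; the helper names are hypothetical.

```python
import re

# "amd_pmc AMDI0009:00: SMU idlemask s0i3: 0xffb3eb5"
IDLEMASK_RE = re.compile(r"SMU idlemask s0i3: (0x[0-9a-fA-F]+)")


def parse_idlemask(line):
    """Return the idle mask as an int, or None if the line has no mask."""
    match = IDLEMASK_RE.search(line)
    return int(match.group(1), 16) if match else None


def differing_bits(mask_a, mask_b):
    """Bit positions (0 = least significant) where the two masks differ."""
    diff = mask_a ^ mask_b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


good = parse_idlemask("amd_pmc AMDI0009:00: SMU idlemask s0i3: 0xffb3eb5")
bad = parse_idlemask("amd_pmc AMDI0009:00: SMU idlemask s0i3: 0xffb3eb5")
print(differing_bits(good, bad))  # [] because the masks are identical
```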

But it does appear to me that on the failed attempt the big notable difference is that amplifier driver has some failures. So that may explain the problem.

[   89.199996] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Failed to read MBOX STS: -121
[   89.200184] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Failed to read MBOX STS: -121
.
[   93.132323] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state

I suggest bringing this to the maintainers for that driver for comments.

I'm not ruling out the amp entirely, but that is a separate issue I'm tracking that occurs when sleep works or not and only sometimes. Let me do a few more sleep cycles to see if we can get more information.

Hmm...6.4.5 seems to be having way more issues than with 6.3.9. I'll need to do some more testing with this. Our new deployment wiped the "fix" I had for the firmware so the symlink is causing this problem again.

[   86.731275] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: Firmware: 0 vendor: 0x0 v0.0.0, 0 algorithms
[   86.731298] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: DSP1: No algorithms
[   86.731303] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Cannot Initialize Firmware. Error: -22

and

[   72.985015] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Timed out waking device
[   72.993082] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Wake failed, re-enter hibernate: -42

I'm not ruling out the amp entirely, but that is a separate issue I'm tracking that occurs when sleep works or not and only sometimes. Let me do a few more sleep cycles to see if we can get more information.

Unfortunately any component can keep a system from getting into the deepest state. So the amp does sound plausible to me.

Hmm...6.4.5 seems to be having way more issues than with 6.3.9. I'll need to do some more testing with this. Our new deployment wiped the "fix" I had for the firmware so the symlink is causing this problem again.

Feel free to CC me in any discussion with the amp driver folks. These amps are used in a lot of other laptops too, but this is the first time I've heard of problems like you're describing.

So I've tested this patch set and this resolves all the audio related issues I was seeing. The system still has an issue where sometimes it fails to go to sleep though.

https://lore.kernel.org/lkml/ab5a2372-c377-a738-4bce-7f67efd656c1@opensource.cirrus.com/T/

Thanks for sharing. Glad to hear that progress has been made.

The cases that it fails to go to sleep with that series, does it match the bugzilla mentioned there as well or is it something different?

Let me know if you want me to look through a log for any analysis of the case.

No problem

I don't have any issues with the system freezing or audio not working when using the patches. I'll do more thorough testing tomorrow with it to see if letting the system sleep for longer than a few seconds changes anything.

I don't have any of the dynamic debugs enabled, but the dmesg looks to be the same as before minus the cs35l41 mbox and firmware loading issues. I'll poke around with this tomorrow as well to see if I find any more leads.

Also, there is still an unknown variable that changes the Asus Keyboard events that we've been having issues trying to replicate on purpose. There have been about five people who managed to get the Ally into the state where the Asus keys are always available. We've documented every single step we've made to get to this state, even if the notes wouldn't make sense, with no luck.

Any suggestions on how exactly we could troubleshoot this? I've been looking into dumping the EC, but the methods that work on other devices don't work on this. The system is using a BGA EC chip called the IT5125VG-192 which makes things complicated..

There are no Super I/O devices found when probing for 0x2E/0x4E.

Also, there is still an unknown variable that changes the Asus Keyboard events that we've been having issues trying to replicate on purpose. There have been about five people who managed to get the Ally into the state where the Asus keys are always available. We've documented every single step we've made to get to this state, even if the notes wouldn't make sense, with no luck.

Any suggestions on how exactly we could troubleshoot this? I've been looking into dumping the EC, but the methods that work on other devices don't work on this. The system is using a BGA EC chip called the IT5125VG-192 which makes things complicated.. There are no Super I/O devices found when probing for 0x2E/0x4E.

This to me likely points to a mistake in the WMI driver. I'd suggest decoding the MOF file and double checking everything matches the driver.

I'm not all too familiar with how to decode or manage MOF files, but I'll do some research and figure it out. At a glance it looks like the Asus WMI is defined in the DSDT and I'm finding many of the offsets listed in the asus-wmi.h header in the kernel. Some aren't there and I'm having to figure out what is what.

You can see what I'm looking at by searching for IIA0 in this .dsl file.

dsdt-ally.dsl

Use the wmi-bmof driver to get the binary mof, and you can use this to decompile it.

Then cross reference all the structures in the ASUS WMI driver against it; hopefully the problem is that something there is amiss.

I was able to use bmf2mof to create this which has human readable text. If you attempt to use bmfdec on this new file it says invalid input? The GUID here actually matches what I see in the asus-wmi.c file in the kernel.

https://github.com/torvalds/linux/blob/c1a515d3c0270628df8ae5f5118ba859b85464a2/drivers/platform/x86/asus-wmi.c#L57

[Dynamic, Provider("WmiProv"), WMI, GUID("{97845ED0-4E6D-11DE-8A39-0800200C9A66}")]
class AsusAtkWmi_WMNB {
  [key, read] string InstanceName;
  [read] boolean Active;

  [WmiMethodId(1414090313), Implemented] void INIT([in] uint32 State, [out] uint32 Status);
  [WmiMethodId(1398035266), Implemented] void BSTS([out] uint32 instance_key_status);
  [WmiMethodId(1314211411), Implemented] void SFUN([out] uint32 Report_supported_functions_applications);
  [WmiMethodId(1196377175), Implemented] void WDOG([in] uint32 watchdog_timer_control, [out] uint32 timer_or_status);
  [WmiMethodId(1229865547), Implemented] void KBNI([out] uint32 simulated_keyboard_notification);
  [WmiMethodId(1195656019), Implemented] void SCDG([in] uint32 Get_specific_calibration_data, [in] uint32 offset, [out] uint32 data);
  [WmiMethodId(1128616019), Implemented] void SPEC([in] uint32 Get_specification_or_model_type, [out] uint32 result);
  [WmiMethodId(1381389135), Implemented] void OSVR([in] uint32 Inform_BIOS_current_OS_version, [out] uint32 Status);
  [WmiMethodId(1397900630), Implemented] void VERS([in] uint32 get_implemented_version, [in] uint32 Application_version, [out] uint32 implemented_version_in_BIOS);
  [WmiMethodId(1145261127), Implemented] void GLCD([out] uint32 return_panel_EDID);
  [WmiMethodId(1230392897), Implemented] void ANVI([in] uint32 SLP_mode, [out] uint32 result);
  [WmiMethodId(1179080525), Implemented] void MWGF([in] uint32 Device_ID, [in] uint32 Control_Flag, [out] uint32 result);
  [WmiMethodId(1398035268), Implemented] void DSTS([in] uint32 Device_ID, [out] uint32 device_status);
  [WmiMethodId(1398162756), Implemented] void DEVS([in] uint32 Device_ID, [in] uint32 Control_status, [out] uint32 result);
};
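The decimal WmiMethodId values in this MOF are the four-character method names packed into a 32-bit little-endian integer, which is how the kernel's method ID constants line up with it (for example DSTS is 0x53545344). A quick Python sketch for converting in both directions follows; the helper names are made up for illustration.

```python
import struct


def method_id_to_name(method_id):
    """Decode a 32-bit WMI method ID into its four-character name."""
    return struct.pack("<I", method_id).decode("ascii")


def name_to_method_id(name):
    """Encode a four-character method name as a 32-bit method ID."""
    return struct.unpack("<I", name.encode("ascii"))[0]


# IDs copied from the decompiled MOF above.
for method_id in (1414090313, 1398035268, 1398162756, 1230392897):
    print(f"{method_id} = {method_id:#010x} = {method_id_to_name(method_id)}")

print(hex(name_to_method_id("DSTS")))  # 0x53545344
```

This makes it quicker to see which MOF methods have a matching constant in the driver and which ones, such as ANVI, need a closer look.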

As I'm cross-examining my information, I'm seeing that SMIF (Sleep Mode Information?) is set to 0x04 by ANVI. If you look at the MOF declaration, its SLP_mode argument tells you that this is indeed sleep mode.

  [WmiMethodId(1230392897), Implemented] void ANVI([in] uint32 SLP_mode, [out] uint32 result);

DSDT

            Method (ANVI, 1, Serialized)
            {
                SMIF = 0x04
                Return (ASMI (Arg0))
            }

I don't think asus-wmi handles this at all.

Yup, that's totally the kind of thing I thought might be missing. If that's the case add suspend/resume callbacks to the asus-wmi driver to notify that kind of thing.

I'm hoping this is the case. I did an EC update that was available on Windows and now my Keyboard Events disappear again when suspending/waking the device. There are a total of 3 different cases.

State 1: Sleep/Suspend results in the keyboard disappearing forever until you reboot (or cold boot)
State 2: Sleep/Suspend results in the keyboard disappearing every other cycle
State 3: Sleep/Suspend never has any issues. The keyboard is always available.

This has been consistent with multiple units.

I'll report back if I get back to state 3 while messing with the callbacks.

The issue seems to be directly related to the sleep states of the PCI XHCI adapter. I've spent a considerable amount of time going through the DSDT and issuing every variation of WMI ACPI calls available, and even broke down the individual methods to know mostly what they do, and there isn't anything related to doing a WMI callback.

I've forced S3 by hacking the DSDT to add support, and the N-Keys stay across sleep cycles when using "platform" with `/sys/power/pm_test`. If I don't use platform, the system suspends and then the UEFI splash screen shows indefinitely after it wakes itself up.

I think there is a driver on Windows that handles the D3 Cold transitions correctly based on some of the documentation I have found.

I noticed there are quirks for the Surface tablet that can handle a PCI reset to cut the power to the wifi device to get it working, is it possible to do something similar here for testing?

\_SB_.PCI0.GP17.XHC0.RHUB.PRT3 is what the N-Key (Asus Keyboard) is connected to and the gamepad is connected to \_SB_.PCI0.GP17.XHC0.RHUB.PRT2 and this doesn't disappear when you cycle sleep/resume.

https://github.com/ruineka/rog-ally-re/blob/main/debug-commands.txt https://github.com/ruineka/rog-ally-re/blob/main/acpi-notes.txt

@superm1 The N-Key (Asus Keyboard) disappears when the system goes to sleep and the EC flush and GPE events are logged. During a sleep cycle where the N-Key remains, there are no logs indicating that this has happened. I was looking at the Steam Deck's DSDT and I noticed it has a lot of GPE wakeup notifications that the Ally does not have. As a matter of fact it looks like the GPEs on the Ally are a bit of a mess in general by comparison.

Keyboard disappears when logs show this.

[   74.922792] ACPI: EC: ACPI EC GPE status set
[   74.922805] ACPI: EC: ACPI EC GPE dispatched
[   74.923473] ACPI: EC: ACPI EC work flushed
[   74.923475] ACPI: PM: Rearming ACPI SCI for wakeup
[   74.923567] ACPI: EC: ACPI EC GPE status set
[   74.923576] ACPI: PM: Rearming ACPI SCI for wakeup
[   75.460058] ACPI: PM: Wakeup unrelated to ACPI SCI

Keyboard pops up again when the log has this

[  125.055022] ACPI: EC: interrupt blocked
[  125.182659] ACPI: \_SB_.PCI0.GP19.XHC2: LPI: Constraint not met; min power state:D3hot current power state:D0
[  128.139644] ACPI: PM: Wakeup unrelated to ACPI SCI
[  128.143115] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SBRG.EC0.LID], AE_NOT_FOUND (20230331/psargs-330)
[  128.143123] ACPI Error: Aborting method \_SB.PEP._DSM due to previous error (AE_NOT_FOUND) (20230331/psparse-529)
[  128.143149] clocksource:                       'acpi_pm' wd_nsec: 0 wd_now: 568ae wd_last: 5e4130 mask: ffffff
[  128.143157] clocksource:                       Clocksource 'tsc' skewed 3060334370 ns (3060 ms) over watchdog 'acpi_pm' interval of 0 ns (0 ms)
[  128.143371] ACPI: EC: interrupt unblocked
[  128.143822] clocksource: Switched to clocksource acpi_pm

https://github.com/ruineka/rog-ally-re/blob/3a0491d30d3980e03ddfba9e615e1b0bbdd0a99a/ally-sleep-analysis.txt#L43

The Steam Deck has this

    Scope (_GPE)
    {
        Method (_L08, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            If ((TBEN == Zero))
            {
                Notify (\_SB.PCI0.GPP0, 0x02) // Device Wake
                Notify (\_SB.PCI0.GPP1, 0x02) // Device Wake
            }

            Notify (\_SB.PCI0.GPP5, 0x02) // Device Wake
            Notify (\_SB.PCI0.GP17, 0x02) // Device Wake
            Notify (\_SB.PCI0.GP18, 0x02) // Device Wake
        }

        Method (_L0D, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            Notify (\_SB.PCI0.GPP2, 0x02) // Device Wake
        }

        Method (_L0E, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            Notify (\_SB.PCI0.GPP4, 0x02) // Device Wake
        }

        Method (_L0F, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            Notify (\_SB.PCI0.GPP3, 0x02) // Device Wake
        }

        Method (_L19, 0, NotSerialized)  // _Lxx: Level-Triggered GPE, xx=0x00-0xFF
        {
            Notify (\_SB.PCI0.GP17.XHC0, 0x02) // Device Wake
            Notify (\_SB.PCI0.GP17.XHC1, 0x02) // Device Wake
            Notify (\_SB.PCI0.GP19.XHC2, 0x02) // Device Wake
        }
    }

added Phoenix label

mentioned in issue #2808

@superm1 I need to add some info to this: I requested some advice from some of the ASUS engineers I am in contact with and while I didn't get much information out of them I got this:

In FW 308, a new feature was introduced wherein the MCU would disconnect the USB when the system screen turns off (the first setup when entering Modern Standby). This was implemented to address the issue caused by XINPUT not supporting USB selective suspend, which caused the system to get stuck and be unable to enter Modern Standby. However, the feature is not present in FW 305.

I was also asked which of E3F32452-FEBC-43CE-9039-932122D37721 and 11E00D56-CE64-47CE-837B-1F898F9AA461 would be executing on sleep. I couldn't easily work this out myself, and it looks like support for both was added in https://github.com/torvalds/linux/commit/e555c85792bd5f9828a2fd2ca9761f70efb1c77b

I have had much the same result as Matthew did, and forwarded mine and his logs to ASUS. You can also see the various DSDT dumps here https://gitlab.com/asus-linux/reverse-engineering/-/tree/master/rog-ally?ref_type=heads

Given what I've just been told, it seems that a solution is within reach.

@superm1 I think I understand what is going on here now. First some background info:

ASUS has worked around an XInput issue that prevents suspend in Windows by making the MCU disconnect the USB hub that the gamepad and N-Key are on.
They have made the MCU do this when the ACPI_LPS0_DSM_UUID_MICROSOFT _DSM is called with ACPI_LPS0_SCREEN_OFF.

So I've done some reading of the DSDT I've dumped and I see in the MS ID block:

Case (0x03)
{
    M000 (0x3E03)
    M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
    \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
    Return (Zero)
}
Case (0x04)
{
    M000 (0x3E04)
    M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
    \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
    Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) // Status Change
    Return (Zero)
}

0x03 = suspend, 0x04 = resume. Following M000(arg) leads me down a rabbit hole of hex that looks like setting timers (and memory regions plus vars). \_SB.PCI0.SBRG.EC0.CSEE (arg) is the connection state of USB hub 0, 0xB7 = disconnect.

What I suspect is happening is that not enough time is being given for this disconnect process to finish. I think this because if we poke it manually in userland and then suspend, the resume brings the hub back completely. And when I try a patch to the asus-wmi driver that makes this call plus a small msleep, the devices return fine. It must be done in resume_early to ensure the hub is active before other drivers need it (like hid-asus).

DSDT dump here

I'm going to do a test of putting an msleep(2000) after the s2idle.c block:

	/* Screen off */
	if (lps0_dsm_func_mask > 0)
		acpi_sleep_run_lps0_dsm(acpi_s2idle_vendor_amd() ?
					ACPI_LPS0_SCREEN_OFF_AMD :
					ACPI_LPS0_SCREEN_OFF,
					lps0_dsm_func_mask, lps0_dsm_guid);

	if (lps0_dsm_func_mask_microsoft > 0)
		acpi_sleep_run_lps0_dsm(ACPI_LPS0_SCREEN_OFF,
				lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);

Quite likely the Ally will still need the 0xB8 call to CSEE on early resume, to ensure the hub is enabled early enough to prevent drivers and userland from seeing a detach/attach event.

While I think I have solved the immediate issue by using prepare and resume_early in asus-wmi, I thought it prudent to write my findings here.

This is a great finding, and it certainly sounds plausible.

It's not the first time that we've seen bugs where "Linux is too fast" in the suspend sequence or resume sequence.

You can see an artificial delay is injected in the amd-pmc driver, for example on Cezanne. This is because Linux races with firmware. The proper fix would be in the firmware, but you never see the race on Windows, so it's a tough case to make for fixing it in firmware.

You'll notice that the methods for screen off, LPS0 entry and Modern Standby entry don't really correspond well to the actions: Linux does all 3 back to back, whereas in Windows they actually mark certain milestones in the suspend sequence. If ASUS actually expects a certain amount of time to pass between them, that gap definitely doesn't exist today in Linux.

I think your timing experiment will be enlightening but I don't think we can artificially slow it down for everyone without a spec to lean on. So I would ask if you could instead have one of the Asus drivers register an LPS0 hook for this case. If it finds this system then inject a delay into the process. You can again model how amd pmc does it, like I said that's exactly what it does.

Ah nice to know I'm not losing my marbles. I actually wrote a patch already https://lore.kernel.org/all/20231124082749.23353-1-luke@ljones.dev/ but I'll go ahead with your suggestion for v2 as this sounds exactly like what I was looking for.

@superm1 I can't use the LPS0 method as the call I need to make needs to be done before the screen-off.

If you explicitly make calls before LPS0 phase then yeah patching like your V1 makes sense.

But isn't the root of this issue timing? Can't you just inject more timing between the LPS0 calls the way I suggested?

I guess to add to my comment; is it timing between screen off and lps0 or is it timing between screen off command and actually suspending?

If it's the former then I think doing something in asus-wmi's PM ops callbacks is unfortunately the best bet.
If it's the latter then you should be able to register an LPS0 prepare() callback that just adds an msleep to the process. This should prevent the system from actually going into hardware sleep for that duration of time.
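The difference between the two cases can be illustrated with a small user-space model. This is only a sketch and not kernel code: `model_prepare_late`, `ally_prepare_delay` and the step names are hypothetical, and the ordering (firmware notifications first, registered prepare() callbacks afterwards) is an assumption about how the s2idle path behaves.

```c
#include <assert.h>
#include <string.h>

/* User-space model only; none of these names exist in the kernel. */
#define MAX_EVENTS 8

struct trace {
	const char *events[MAX_EVENTS];
	unsigned int delay_ms[MAX_EVENTS]; /* simulated time spent inside each step */
	int count;
};

static void trace_add(struct trace *t, const char *name, unsigned int ms)
{
	if (t->count < MAX_EVENTS) {
		t->events[t->count] = name;
		t->delay_ms[t->count] = ms;
		t->count++;
	}
}

/* Hypothetical stand-in for a driver's LPS0 prepare() callback. */
typedef void (*prepare_fn)(struct trace *t);

static void ally_prepare_delay(struct trace *t)
{
	trace_add(t, "prepare delay", 1000); /* where an msleep(1000) would sit */
}

/* Assumed order: firmware notifications back to back, then callbacks. */
static void model_prepare_late(struct trace *t, prepare_fn cb)
{
	trace_add(t, "screen off", 0);
	trace_add(t, "LPS0 entry", 0);
	if (cb)
		cb(t);
	trace_add(t, "hardware sleep", 0);
}

static int find_event(const struct trace *t, const char *name)
{
	for (int i = 0; i < t->count; i++)
		if (strcmp(t->events[i], name) == 0)
			return i;
	return -1;
}

/* Sum of the simulated delays that fall strictly between two steps. */
static long delay_between(const struct trace *t, const char *from, const char *to)
{
	int a = find_event(t, from);
	int b = find_event(t, to);
	long total = 0;

	if (a < 0 || b < 0 || b <= a)
		return -1;
	for (int i = a + 1; i < b; i++)
		total += t->delay_ms[i];
	return total;
}
```

Under that assumed ordering, a delay in prepare() lands between the screen off command and hardware sleep, but adds nothing between screen off and LPS0 entry. If the firmware needs the gap in the second place, a prepare() callback cannot provide it.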

From what I can tell it's "between screen off command and actually suspending". If the testing so far is any indication.

I tried to find how to add a prepare() hook but couldn't. I could see only enough to do this in pci/quirks.c. Maybe I'm missing something?

/*
 * ASUS ROG Ally
 */
static void asus_rog_usb0_connect_suspend(struct pci_dev *dev)
{
    if (dmi_match(DMI_BOARD_NAME, "RC71L")) {
        pci_info(dev, "ASUS ROG Ally found PCI quirk for suspend\n");
        /* sleep required to ensure USB0 is disabled before sleep continues */
        if (ACPI_FAILURE(acpi_execute_simple_method(NULL, "\\_SB.PCI0.SBRG.EC0.CSEE", 0xB7)))
            pci_info(dev, "ASUS ROG Ally failed to set USB hub power off\n");
        else
            msleep(1000);
    }
}

static void asus_rog_usb0_connect_resume_early(struct pci_dev *dev)
{
    if (dmi_match(DMI_BOARD_NAME, "RC71L")) {
        pci_info(dev, "ASUS ROG Ally found PCI quirk for resume\n");
        /* required to ensure USB0 is enabled before drivers notice */
        if (ACPI_FAILURE(acpi_execute_simple_method(NULL, "\\_SB.PCI0.SBRG.EC0.CSEE", 0xB8)))
            pci_info(dev, "ASUS ROG Ally failed to set USB hub power on\n");
        else
            msleep(1000);
    }
}

DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_AMD, 0x15b9, asus_rog_usb0_connect_suspend);
DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_AMD, 0x15b9, asus_rog_usb0_connect_resume_early);

Sorry, I've done so much today I'm forgetting things.

Yes I created a PM with acpi_register_lps0_dev etc and prepare(), adding an msleep of various lengths. It unfortunately did not work. It seems like the pause needs to be directly after the screen off, and I'm not sure if there is a race with other things.

After this didn't work I tried the PCI thing above. The only thing that seems to work for us on Linux is making that same call in ACPI but much earlier than where the screen-off makes it.

Yes I created a PM with acpi_register_lps0_dev etc and prepare(), adding an msleep of various lengths.

With the acpi_register_lps0_dev() and prepare() approach could you tell whether it ran before or after amd-pmc? Can you add a debugging statement to your prepare() callback to confirm?

The only thing that seems to work for us on Linux is making that same call in ACPI but much earlier than where the screen-off makes it.

It would be nice to get confirmation what's actually happening on the other end of that ACPI call (if ASUS will share it). That could help explain the dependency on where the wait is injected. Or maybe it's possible to query a register or an ASL variable to confirm something happened and is finished for this case.
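If such a register or ASL variable turns out to exist, a fixed msleep could be replaced by polling with a timeout. The following is a minimal user-space sketch of that pattern only. `wait_until_done`, `ready_after` and `no_sleep` are hypothetical names, and the "done" check is a stub because nothing in this thread confirms which status the firmware actually exposes.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*done_fn)(void *ctx);        /* hypothetical "is it finished?" query */
typedef void (*sleep_fn)(unsigned int ms); /* injected so the model needs no real sleep */

/*
 * Poll done() every step_ms until it reports true or timeout_ms has passed.
 * Returns the simulated time waited in ms, or -1 on timeout or bad arguments.
 */
static long wait_until_done(done_fn done, void *ctx, sleep_fn sleep_ms,
			    unsigned int step_ms, unsigned int timeout_ms)
{
	unsigned int waited = 0;

	if (!done || !sleep_ms || step_ms == 0)
		return -1;

	for (;;) {
		if (done(ctx))
			return (long)waited;
		if (waited >= timeout_ms)
			return -1;
		sleep_ms(step_ms);
		waited += step_ms;
	}
}

/* Example stub: reports "done" once the counter in ctx has run down to zero. */
static bool ready_after(void *ctx)
{
	int *polls_left = ctx;

	return (*polls_left)-- <= 0;
}

/* Example stub: no real sleeping, the caller only tracks simulated time. */
static void no_sleep(unsigned int ms)
{
	(void)ms;
}
```

With a real status source, the wait would end as soon as the hub has detached or reattached, instead of relying on a delay that is tuned by trial and error.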

As this is an ongoing issue and also affects the Ally X, I'm going to consolidate the information I know so far here.

Both Ally and Ally X have an MCU powersave option which can be set via a WMI call.
This sets something in the MCU which is affected by \_SB.PCI0.SBRG.EC0.CSEE.
CSEE appears to signal something.
And when this is done the timing is critical - it looks to me like a notification is missed when CSEE is completed?

The related code is this and should be similar for Ally (this is from X)

ElseIf ((Arg0 == ToUUID ("11e00d56-ce64-47ce-837b-1f898f9aa461") /* Unknown UUID */))
{
    Switch (ToInteger (Arg2))
    {
        Case (Zero)
        {
            Switch (ToInteger (Arg1))
            {
                Case (Zero)
                {
                    M460 ("    Return (Buffer (2) {0xF9, 0x01})\n", Zero, Zero, Zero, Zero, Zero, Zero)
                    Return (Buffer (0x02)
                    {
                         0xF9, 0x01                                       // ..
                    })
                }
                Default
                {
                    M460 ("    Return (Buffer (1) {0x00})\n", Zero, Zero, Zero, Zero, Zero, Zero)
                    Return (Buffer (One)
                    {
                         0x00                                             // .
                    })
                }

            }
        }
        Case (0x03)
        {
            M000 (0x3E03)
            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
            \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
            Return (Zero)
        }
        Case (0x04)
        {
            M000 (0x3E04)
            M460 ("    Return (0x00)\n", Zero, Zero, Zero, Zero, Zero, Zero)
            \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
            Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) // Status Change
            Return (Zero)
        }

and this

        Method (CSEE, 1, Serialized)
        {
            If (ECAV ())
            {
                Acquire (MU4T, 0xFFFF)
                CMD = Arg0
                EDA1 = Arg0
                ECAC ()
                Release (MU4T)
                Return (Zero)
            }

            Return (Ones)
        }

and there is a loop check here:

        Method (ECAC, 0, NotSerialized)
        {
            MFUN = 0x30
            SFUN = One
            LEN = 0x10
            EROR = 0xFF
            CUNT = One
            While ((CUNT < 0x06))
            {
                ISMI (0x9C)
                If ((EROR != Zero))
                {
                    CUNT += One
                }
                Else
                {
                    Break
                }
            }
        }

and many other things to trace through.
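To make the retry behaviour of ECAC easier to reason about, here is a user-space C model of the loop. It is a sketch under assumptions: `ecac_model` and the stub functions are hypothetical, and it assumes that ISMI (0x9C) is what updates EROR, which the ASL excerpt suggests but does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ISMI (0x9C): returns the resulting EROR value. */
typedef uint8_t (*smi_fn)(void *ctx);

/*
 * Mirrors the ASL loop: CUNT starts at 1 and the loop runs while CUNT < 6,
 * so at most five SMI attempts are made. Returns the number of attempts;
 * *eror holds the final status (0 means success in this model).
 */
static int ecac_model(smi_fn ismi, void *ctx, uint8_t *eror)
{
	unsigned int cunt = 1;
	int attempts = 0;

	*eror = 0xFF;
	while (cunt < 0x06) {
		*eror = ismi(ctx);
		attempts++;
		if (*eror != 0)
			cunt++;
		else
			break;
	}
	return attempts;
}

/* Example stub: the EC never acknowledges the command. */
static uint8_t always_fail(void *ctx)
{
	(void)ctx;
	return 0xFF;
}

/* Example stub: fails for the number of calls stored in ctx, then succeeds. */
static uint8_t fail_n_times(void *ctx)
{
	int *remaining = ctx;

	return (*remaining)-- > 0 ? 0xFF : 0;
}
```

Note that in the ASL above ECAC has no return value and CSEE returns Zero whenever ECAV () is true, so a caller of CSEE cannot tell whether the five attempts succeeded or were exhausted.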

So when the existing kernel patch calls this CSEE method it tries to do so very early in suspend with pm_op asus_hotk_prepare and early in resume with asus_hotk_resume_early. In these calls the msleep() is forced.

What I've found is that the behaviour of the MCU varies heavily with the length of the delay. At one point we were doing the remove and bring-back very early with a very short delay to prevent devices getting lost, but it was unreliable (300-600ms I think). It was then changed to 1500ms to let the devices fully detach (which is what the MCU does), then wait for reattach. It is now looking like this time is still not long enough.

The UUID above is ACPI_LPS0_DSM_UUID_MICROSOFT:

/* Microsoft platform agnostic UUID */
#define ACPI_LPS0_DSM_UUID_MICROSOFT      "11e00d56-ce64-47ce-837b-1f898f9aa461"
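On the question of which functions can run under this UUID: the `Case (Zero)` branch in the DSDT excerpt above returns `Buffer (2) {0xF9, 0x01}`, which by _DSM convention is a little-endian bitmask of supported function indexes. A small user-space sketch of decoding it follows. The helper names and the `LPS0_*` constants are illustrative; the values 3 and 4 are taken from the screen off and screen on cases discussed earlier in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names; indexes match Case (0x03) and Case (0x04) in the DSDT. */
#define LPS0_SCREEN_OFF 3
#define LPS0_SCREEN_ON  4

/* Assemble the function-0 return buffer into a bitmask, low byte first. */
static uint64_t dsm_mask_from_buffer(const uint8_t *buf, size_t len)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < len && i < 8; i++)
		mask |= (uint64_t)buf[i] << (8 * i);
	return mask;
}

/* Bit N set means function index N is advertised as supported. */
static int dsm_supports(uint64_t mask, unsigned int func)
{
	return func < 64 && ((mask >> func) & 1);
}
```

For `{0xF9, 0x01}` the mask is 0x1F9, so bits 0 and 3 through 8 are set: functions 3 to 8 are advertised, while 1 and 2 are not.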
