amdgpu: when detaching MST hub during suspend, MST will not work after resume (bisected)


changed title from amdgpu: when detaching Thinkpad MST hub during suspend, MST will not work after resume (bisected) to amdgpu: when detaching MST hub during suspend, MST will not work after resume (bisected)

The issue does not specifically seem to be related to the Thinkpad dock, but also occurs on other docks/devices with MST hubs. For instance, I encountered it also today with the Thinkpad being directly attached via USB-C to a Dell U2722DE monitor. This monitor integrates a USB hub + a DisplayPort hub, with a DisplayPort output to daisy-chain another monitor. The steps to reproduce are the same, and the same failure is reported in dmesg when the system wakes up from suspend:

[  114.528017] [drm:dm_late_init [amdgpu]] *ERROR* DM_MST: Failed to start MST
[  114.528259] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <dm> failed -5
[  114.528422] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -5
[  114.528432] amdgpu 0000:07:00.0: PM: failed to resume async: error -5
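For anyone who wants to check their own logs for the same signature, here is a minimal sketch that scans dmesg text for the "Failed to start MST" error followed by a failed PM resume. The function name and the sample variable are made up for illustration; the message strings are copied from the excerpt above.

```python
import re

# Message fragments copied from the dmesg excerpt in this report.
MST_FAILURE = re.compile(r"DM_MST: Failed to start MST")
RESUME_FAILURE = re.compile(r"PM: failed to resume(?: async)?: error (-?\d+)")
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")

SAMPLE_DMESG = """
[  114.528017] [drm:dm_late_init [amdgpu]] *ERROR* DM_MST: Failed to start MST
[  114.528259] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <dm> failed -5
[  114.528422] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -5
[  114.528432] amdgpu 0000:07:00.0: PM: failed to resume async: error -5
"""


def find_mst_resume_failures(dmesg_text):
    """Return (timestamp, error) pairs for resume failures that follow an MST start failure."""
    results = []
    mst_failed = False
    for line in dmesg_text.splitlines():
        if MST_FAILURE.search(line):
            mst_failed = True
            continue
        match = RESUME_FAILURE.search(line)
        if match and mst_failed:
            stamp = TIMESTAMP.match(line)
            results.append((float(stamp.group(1)) if stamp else None, int(match.group(1))))
            mst_failed = False
    return results


print(find_mst_resume_failures(SAMPLE_DMESG))
```

To use it on a live system, pass the output of `dmesg` (or a saved log file) as the text argument.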

This describes my problem perfectly. I'd like to add a bit more information and a few more cases to this thread:

I use a ThinkPad USB-C Dock Gen 2 at home and at work. Putting the laptop (Lenovo ThinkPad P14s Gen 1 AMD) into sleep (s0ix or S3), disconnecting it from its dock at home, transporting it to work without waking it, and re-attaching it to the identical ThinkPad USB-C Dock Gen 2 at work mostly works.

If the laptop is put to sleep (assuming that does not crash it), the dock is disconnected, and the laptop is later woken up without any dock attached, it crashes or gets stuck on a black screen and has to be hard rebooted.

Sometimes putting the laptop to sleep on the dock and waking it up (~8h later) results in working display output but broken USB input. What then works is switching the USB-C port used for the dock's upstream cable:

In some cases the laptop won't go to sleep while attached to the dock; after multiple attempts to put it to sleep, it crashes and does a hard reboot.

What seems to add to this issue is using 2xQHD Dell monitors on my dock at home and 2xWQHD displays at work - a resolution change while sleeping also seems to break things.

Thanks for confirming I'm not the only one with this issue!

I can confirm on my device that the different ports are also creating issues. However, I'm not entirely sure if that is the same issue/has the same root cause (at least in my case). When the "failed to start MST" message pops up in dmesg, just unplugging and replugging will usually work reliably, even if staying with port 5 (as per your image).

However, sometimes (and for me this hasn't been very reproducible), port 5 will refuse to accept any display connections at all (but some USB devices work), but this doesn't leave any distinctive trace in dmesg. Even replugging or suspending/waking up will not help in that case, but it will start working again at some point (after a few hours or so). Port 6 will work fine. And I can't exactly say for sure that this was introduced by the same kernel commit as the other issue, since it's hard to reproduce. It could just as well have been introduced by some UEFI/firmware upgrade in the last half year.

Yes, exactly. These issues are hard to reproduce in most cases. What has been consistent is that sleeping the device, be it s0ix or S3 sleep, has been a complete disaster and barely ever works. Adding USB docks to the equation produces even more issues.

I am not really sure if this is solely a vendor (AMD) issue or more a total firmware failure on Lenovo's side. In the past 9 years of using ThinkPads, I have not had as many issues with any Lenovo laptop as I have had with my Lenovo P14s Gen 1 AMD. I don't even want to get started on the lackluster sub-48h sleep time / high battery draw when sleep does work.

Sorry for this mostly offtopic rage but a laptop with these kinds of issues is really frustrating when using it for your day to day work.

I will try to add log files as issues occur!

Would it be possible to check whether this also happens in s2idle or whether it's exclusive to S3? You can try to change the sleep mode from "Linux" to "Windows 10" in your BIOS setup or using fwupd (on a new enough fwupd version).
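As a side note on confirming which suspend mode is actually in use after changing the firmware setting: the kernel lists the available modes in /sys/power/mem_sleep and marks the selected one with square brackets. A minimal sketch for reading that marker (the function name is illustrative):

```python
def active_sleep_mode(mem_sleep_contents):
    """Return the bracketed entry, e.g. 'deep' from 's2idle [deep]'."""
    for token in mem_sleep_contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


# Example strings in the format the kernel uses for /sys/power/mem_sleep.
print(active_sleep_mode("s2idle [deep]"))
print(active_sleep_mode("[s2idle]"))
```

On a real system the input would come from reading /sys/power/mem_sleep; "deep" corresponds to S3, "s2idle" to suspend-to-idle.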

added S3 label

Thanks for the suggestion. I have just tried it out with s2idle by switching to the Windows mode in UEFI and the issue (described in my original post) persists and is reproducible in exactly the same way, with the same dmesg messages after waking up from suspend:

[349570.686749] [drm:dm_late_init [amdgpu]] *ERROR* DM_MST: Failed to start MST
[349570.687005] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <dm> failed -5
[349570.687193] amdgpu 0000:07:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0x1b0 returns -5
[349570.687199] amdgpu 0000:07:00.0: PM: failed to resume async: error -5

(this is on kernel 5.19.13)

Off-topic: I was positively surprised though that s2idle works stably at all now -- that was very different a year ago, and at least it's running stably now without any more crashes than S3. Even battery drain in s2idle mode is comparable with S3 now (which doesn't mean it's good, even S3 battery drain is terrible on this device at over 1% per hour). There are still some issues though, like the device waking up when closing the lid or detaching a power supply, and it takes really long (12secs) to wake up and display an image (compared with ~3secs in S3 and <1sec for my 10-year-old Intel-based Thinkpad running the same OS installation...).

added s0ix label

Thanks for the suggestion. I have just tried it out with s2idle by switching to the Windows mode in UEFI and the issue (described in my original post) persists and is reproducible in exactly the same way, with the same dmesg messages after waking up from suspend:

Thanks for checking it in s2idle. I suspect we want to revert that commit from stable, but I'd rather first make sure that we have this sorted in the newer kernels. There are some pretty big changes that happened in 6.1-rc1, and it has some known regressions. Can you please try to upgrade to 6.1-rc4 after it tags this weekend and apply these two commits:

Off-topic: I was positively surprised though that s2idle works stably at all now -- that was very different a year ago, and at least it's running stably now without any more crashes than S3.

Yeah; it's been a big focus area. There are a lot of patches that were developed for it all around the kernel.

Even battery drain in s2idle mode is comparable with S3 now (which doesn't mean it's good, even S3 battery drain is terrible on this device at over 1% per hour).

Most people on Lenovo laptops report that s2idle has better power consumption than S3. If you're seeing power consumption issues, then when you check the MST problem on 6.1-rc4, please also add pm_debug_messages amd_pmc.dyndbg to your kernel command line and share a dmesg from a suspend cycle. I'd like to see how much time is actually spent in the deepest state during a suspend cycle, and this will expose extra debugging information.
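Before capturing the requested log, it can help to verify that both options actually made it onto the running kernel's command line. A small sketch, assuming the command line is passed in as a string (on a live system it can be read from /proc/cmdline); the function name and the sample command line are made up:

```python
# The two options requested in the comment above.
REQUESTED_OPTIONS = ("pm_debug_messages", "amd_pmc.dyndbg")


def missing_debug_options(cmdline):
    """Return the requested options that do not appear on the given kernel command line."""
    # Compare on the part before '=' so that 'amd_pmc.dyndbg=+p' also counts.
    present = {token.split("=", 1)[0] for token in cmdline.split()}
    return [opt for opt in REQUESTED_OPTIONS if opt not in present]


# Hypothetical command line for illustration only.
example = "root=/dev/sda2 rw quiet pm_debug_messages"
print(missing_debug_options(example))
```

An empty list means both options are present.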

There are still some issues though, like the device waking up when closing the lid or detaching a power supply

This is a little bit surprising on older hardware, but I have a suspicion. In addition to your debug logs for the power consumption can you please also share the contents of # cat /sys/bus/pci/drivers/amdgpu/0000:04:00.0/fw_version/smc_version?

and it takes really long (12secs) to wake up and display an image (compared with ~3secs in S3 and <1sec for my 10-year-old Intel-based Thinkpad running the same OS installation...).

This I believe is likely a BIOS bug. We saw this on a number of models and root caused it, but Lenovo needs to roll out a firmware update to fix it properly.

A workaround was landed in 5.18-rc7.
Alternatively you can try iommu=pt on your kernel command line on older kernels.

Can you please try to upgrade to 6.1-rc4 after it tags this weekend and apply these two commits:

Thanks, I'll make sure to try that out once it drops.

This is a little bit surprising on older hardware, but I have a suspicion. In addition to your debug logs for the power consumption can you please also share the contents of # cat /sys/bus/pci/drivers/amdgpu/0000:04:00.0/fw_version/smc_version?

$ cat /sys/bus/pci/drivers/amdgpu/0000:07:00.0/fw_version/smc_fw_version
0x00403d00

(I've adapted to the correct PCI address; smc_version does not exist for me)

This I believe is likely a BIOS bug. We saw this on a number of models and root caused it, but Lenovo needs to roll out a firmware update to fix it properly. A workaround was landed in 5.18-rc7.

Note that I'm on version 5.19.13, so the workaround does not work for me? However, the patch you linked only makes an entry for the 21A0 machine type of the P14s G2 AMD notebook, and there is also another type 21A1 (which mine is an example of). I don't exactly know the differences between the two and if the difference matters for whatever detection is done for the quirks, though. Since the quirks match on DMI information, I believe the difference will matter: dmidecode lists 21A1 as part of the product name, but not 21A0.
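To illustrate why the machine type matters, here is a rough sketch of substring-style matching on the DMI product name. This is not the kernel's quirk code; the table, the function name, and the example product name are hypothetical, and only the 21A0 and 21A1 type codes come from this thread.

```python
# Hypothetical quirk table: substrings to look for in the DMI product name.
QUIRK_PRODUCT_SUBSTRINGS = ["21A0"]


def quirk_applies(product_name, table=QUIRK_PRODUCT_SUBSTRINGS):
    """Return True if any table entry occurs in the product name."""
    return any(entry in product_name for entry in table)


# Made-up product name that starts with the 21A1 machine type.
product = "21A1EXAMPLE"
print(quirk_applies(product))                    # table only lists 21A0
print(quirk_applies(product, ["21A0", "21A1"]))  # table extended with 21A1
```

With only 21A0 in the table, a 21A1 machine is not matched, which is consistent with the workaround not taking effect here.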

I'll try with iommu=pt.

$ cat /sys/bus/pci/drivers/amdgpu/0000:07:00.0/fw_version/smc_fw_version
0x00403d00

Okay, I think you need 0x00404200 to pick up the fix I'm thinking of. I'll confirm it when I see your 6.1-rc4 debug logs and see if I can hypothesize a workaround based on what I see.

Note that I'm on version 5.19.13, so the workaround does not work for me? However, the patch you linked only makes an entry for the 21A0 machine type of the P14s G2 AMD notebook, and there is also another type 21A1 (which mine is an example of). I don't exactly know the differences between the two and if the difference matters for whatever detection is done for the quirks, though. Since the quirks match on DMI information, I believe the difference will matter: dmidecode lists 21A1 as part of the product name, but not 21A0.

Yeah yours is not in the list. When you build your 6.1-rc4 kernel try to add a new entry for your system to the list.

I've just tried 6.1-rc4 with the two patches you linked and my 21A1 device added to the quirks file. The positive first: indeed, resume from s2idle is much faster now with the quirk (~2secs).

However, after resume, graphics output was screwed up. At first, the only thing changing on the lockscreen was the mouse pointer movement, while keyboard interaction and mouse clicks etc. didn't propagate to the output. After switching to another tty and back, additional framebuffer corruption appeared. Restarting sddm was not successful. Attached is the dmesg and an Xorg log from one of my attempts to restart X after the problem appeared.

I'm now back on a stable kernel, but I'm happy to test more tonight if you have any suggestions.

I've just tried 6.1-rc4 with the two patches you linked and my 21A1 device added to the quirks file. The positive first: indeed, resume from s2idle is much faster now with the quirk (~2secs).

OK, that's good. Would you mind sending your patch to the mailing lists? You can add my tag to your patch Suggested-by: Mario Limonciello <mario.limonciello@amd.com>

However, after resume, graphics output was screwed up.

Is this unique to when the MST hub is connected across suspend?

Attached is the dmesg and an Xorg log from one of my attempts to restart X after the problem appeared.

To confirm the waking from lid or power adapter issue and the power consumption issue, can you please add those two things I suggested to the kernel command line and share a suspend cycle?

Thanks.

Is this unique to when the MST hub is connected across suspend?

No, it also appears with nothing apart from a power supply attached to the notebook. Here are some more configurations I have tested.

6.1rc4 with patches and no additional kernel options: issue and logs as described above, GPU does not work after resume, switching to another tty is possible
plain 6.1rc4 with no patches and pm_debug_messages amd_pmc.dyndbg: same issue as above. Here's a dmesg as requested. Note: from looking at the dmesg, I just realized that the kernel config I was using doesn't have CONFIG_AMD_PMC enabled. I should probably add that. I will post a fresh dmesg once the build has finished.
plain 6.1rc4 with no patches and pm_debug_messages amd_pmc.dyndbg iommu=pt: similar issue, but not even switching to another tty is possible. The only thing I was able to do was magic-sysrq to hard-reboot. So it appears there is some other bug with iommu=pt.

All of these are with s2idle. Another issue consistent across all of these configurations: wakeup via the power button does not work in 6.1rc4, only via the keyboard (this still works fine in my stable 5.19.13 config). I suppose this is an unrelated problem, but I personally don't really have the time nor energy to debug and report this to whatever the right issue tracker is.

Would you mind sending your patch to the mailing lists?

Okay, sure. Can't promise I will get to this today though.

I just realized that the kernel config I was using doesn't have CONFIG_AMD_PMC enabled

Originally I was worried about a very big regression based on the first comments in your message, but that significantly changes the suspend flow for s2idle.

All of these are with s2idle. Another issue consistent across all of these configurations: wakeup via the power button does not work in 6.1rc4, only via the keyboard (this still works fine in my stable 5.19.13 config). I suppose this is an unrelated problem, but I personally don't really have the time nor energy to debug and report this to whatever the right issue tracker is.

Without CONFIG_AMD_PMC, I'm not surprised by this. I'll need to see another debug log with pm_debug_messages and amd-pmc dynamic debugging when you have that enabled.

[ 30.003673] amdgpu 0000:07:00.0: amdgpu: Power consumption will be higher as the kernel has not been compiled with CONFIG_AMD_PMC.

Your power consumption will be DRAMATICALLY higher without this.
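A quick way to avoid this pitfall is to check the kernel configuration before testing. The sketch below parses config text in the usual kconfig format ("CONFIG_X=y", "CONFIG_X=m", or "# CONFIG_X is not set"); the function name is illustrative, and reading the config from /proc/config.gz only works if the kernel was built to expose it.

```python
import re


def kconfig_state(config_text, option):
    """Return 'y', 'm', or another assigned value; 'n' if explicitly unset; None if absent."""
    assigned = re.search(rf"^{re.escape(option)}=(.+)$", config_text, re.MULTILINE)
    if assigned:
        return assigned.group(1)
    if re.search(rf"^# {re.escape(option)} is not set$", config_text, re.MULTILINE):
        return "n"
    return None


# Illustrative config fragments.
print(kconfig_state("CONFIG_AMD_PMC=m\n", "CONFIG_AMD_PMC"))
print(kconfig_state("# CONFIG_AMD_PMC is not set\n", "CONFIG_AMD_PMC"))
```

On a running system, the text could be obtained with `gzip.open("/proc/config.gz", "rt").read()`, assuming that file exists.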

Okay, sure. Can't promise I will get to this today though.

Yeah whenever you get to it is fine. As you made and tested the patch I wanted to give you the opportunity to do so. If you have problems submitting or would prefer not to, let me know and I'll send it up instead.

Okay, with CONFIG_AMD_PMC, waking up from suspend works fine. Sorry about this -- I shouldn't have used the dusty config bundled with the linux-git AUR package and am now using the current config used for linux by Arch. Here's the dmesg for unpatched 6.1rc4 with pm_debug_messages amd_pmc.dyndbg.

I'll try with the patches again next to check whether the original MST issue has improved.

Your power consumption will be DRAMATICALLY higher without this.

(note that my original complaints about power consumption were on the official 5.19 Arch kernel, which should have this option enabled)

[ 122.632428] amd_pmc AMDI0005:00: Last suspend in deepest state for 572540us

OK, looks like you're getting to the deepest state now. With those debug messages in place, any time the power consumption seems high, look for multiple Timekeeping suspended messages. We should look at the circumstances of any of those to determine why the power consumption would be high. I.e., did the system wake up but not go back to the deepest state?
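The check described above can be roughly automated. This sketch pulls the deepest-state residency out of the amd_pmc message and counts "Timekeeping suspended" occurrences in a dmesg dump; the function name is made up, and the message strings are taken from this thread.

```python
import re

DEEPEST_STATE = re.compile(r"Last suspend in deepest state for (\d+)us")


def summarize_suspend(dmesg_text):
    """Collect deepest-state residencies (microseconds) and count timekeeping suspensions."""
    residencies_us = [int(m.group(1)) for m in DEEPEST_STATE.finditer(dmesg_text)]
    suspensions = dmesg_text.count("Timekeeping suspended")
    return {"deepest_state_us": residencies_us, "timekeeping_suspended": suspensions}


# Line quoted from the log discussed above.
sample = "[ 122.632428] amd_pmc AMDI0005:00: Last suspend in deepest state for 572540us"
print(summarize_suspend(sample))
```

More than one "Timekeeping suspended" message within a single suspend attempt would be the cue to look at what woke the system in between.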

[ 133.450940] PM: noirq resume of devices complete after 10248.540 msecs

Once you add your patch back in, this will drop.

[ 122.632428] PM: Triggering wakeup from IRQ 9
[ 122.632720] PM: Triggering wakeup from IRQ 1
[ 122.632428] PM: Triggering wakeup from IRQ 0

This matches my suspicion for why the system wakes up from lid event or power supply. When you plug in the adapter or toggle the lid the EC will wake the APU from the deepest state (that's IRQ9). The kernel goes and runs some ACPI code and then gets data from the EC. It's supposed to go back to sleep, but if another IRQ is active at that time then it will wakeup.
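To see which IRQs show up as wake sources across several suspend cycles, the "Triggering wakeup" messages can be tallied. A minimal sketch; the function name is illustrative, and the message format is taken from the log lines quoted above.

```python
import re
from collections import Counter

WAKE_IRQ = re.compile(r"PM: Triggering wakeup from IRQ (\d+)")


def wakeup_irqs(dmesg_text):
    """Count how often each IRQ number appears as a wakeup trigger."""
    return Counter(int(m.group(1)) for m in WAKE_IRQ.finditer(dmesg_text))


sample = (
    "[ 122.632428] PM: Triggering wakeup from IRQ 9\n"
    "[ 122.632720] PM: Triggering wakeup from IRQ 1\n"
    "[ 122.632428] PM: Triggering wakeup from IRQ 0\n"
)
print(wakeup_irqs(sample))
```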

SMU program 0 version is 64.61.0

There is a firmware bug with the SMC/SMU/PMFW (they're used interchangeably in this context) firmware on your system. The fix lands in 64.66.0, which Lenovo would need to roll out.
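The dotted SMU version and the hexadecimal smc_fw_version value reported earlier appear to be the same number in two notations. The sketch below assumes one byte per component (major, minor, patch in the three low bytes); that layout is an inference from the values in this thread, not something taken from driver documentation.

```python
def decode_smc_version(raw_hex):
    """Split a value like '0x00403d00' into (major, minor, patch), assuming one byte each."""
    value = int(raw_hex, 16)
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)


current = decode_smc_version("0x00403d00")   # value read from sysfs in this thread
required = decode_smc_version("0x00404200")  # minimum version suggested in this thread
print(current, required, current >= required)
```

Under that assumption, 0x00403d00 decodes to 64.61.0 and 0x00404200 to 64.66.0, which lines up with the versions named above.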

My idea to work around this is to modify /sys/bus/platform/drivers/i8042/i8042/serio0/power/wakeup from enabled to disabled. See if that avoids a visible wakeup on plug in/out AC adapter.
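For reference, a sketch of how that attribute could be flipped from a script. The power/wakeup sysfs attribute accepts the strings "enabled" and "disabled"; writing to the real path requires root. The helper name is made up, and the demonstration runs against a scratch file standing in for the sysfs attribute so it can be tried safely.

```python
import tempfile
from pathlib import Path

# Path quoted in the comment above; writing to it needs root privileges.
I8042_WAKEUP = "/sys/bus/platform/drivers/i8042/i8042/serio0/power/wakeup"


def set_wakeup(attr_path, enabled):
    """Write 'enabled' or 'disabled' to a wakeup attribute and return the previous setting."""
    path = Path(attr_path)
    previous = path.read_text().strip()
    path.write_text("enabled\n" if enabled else "disabled\n")
    return previous


# Dry run on a temporary file instead of the real sysfs attribute.
with tempfile.TemporaryDirectory() as scratch:
    fake_attr = Path(scratch) / "wakeup"
    fake_attr.write_text("enabled\n")
    previous = set_wakeup(fake_attr, enabled=False)
    current = fake_attr.read_text().strip()

print(previous, "->", current)
```

On the actual machine the call would be `set_wakeup(I8042_WAKEUP, enabled=False)`; the setting does not persist across reboots.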

Note: this is specific to certain generations of hardware. A similar bug occurs in Rembrandt systems with IRQ9/IRQ7. This is a different root cause but similar end result.

Thanks for the explanations! I'll watch out for Timekeeping suspended when I notice anything odd.

My idea to work around this is to modify /sys/bus/platform/drivers/i8042/i8042/serio0/power/wakeup from enabled to disabled. See if that avoids a visible wakeup on plug in/out AC adapter.

Sadly, this does not seem to have changed anything.

Sadly, this does not seem to have changed anything.

Dang; I guess it's because the IRQ9 was active already before so the IRQ1 isn't actually the "wake source to be ignored". I can't think of any other cleaner workaround for this then. You would need to ask Lenovo when they'll upgrade the firmware to that minimum version I suggested which has this fix.

Yeah -- I've done a few more tests with the erroneous wakeups by power detach/lid events and the only wakeup trigger that shows up every time is IRQ9. If I interpret the log correctly, it will always wake up first from IRQ9, then go back to sleep (emitting a Timekeeping suspended message) and then get another trigger from IRQ9, this time waking up fully.

I'll just hope that Lenovo updates it at some point. For now, this isn't too annoying, as my DE is configured to suspend again anyway when the lid is closed and no power is attached, so there's not much risk of it accidentally draining the battery. Thanks for all your help with this!

Okay, next issue: with the kernel 6.1rc4 (no matter if patched or not), graphics output will freeze completely already when I attach the MST hub (with no suspend or detaching involved!). Switching to another tty doesn't affect screen output. I've managed to pull a dmesg via SSH, which shows some NULL dereference bug.

This does not manifest with every USB-C dock -- a Huawei monitor attached via USB-C (which also isn't affected by the original bug and apparently doesn't contain an MST hub) works just fine.

Just to make sure I'm not screwing stuff up again, here's also the config of the running kernel obtained from /proc/config.gz.

Yeah; I suspect you're now hitting #2171 (closed) which @lyudess is still working on a solution for. @lyudess, do you need anything else to help that?

Not yet! I'm making steady progress now that I've actually got a hacked up patch to get this issue to reproduce. Unfortunately this issue is turning out to be downright bizarre - it seems like we're somehow getting MST "streams" in amdgpu's driver that seem to both be marked as MST, but clearly don't actually have an initialized mst_mgr on them which seems to be leading to the bug we're seeing here.

Stuff like that smells like a use after free. Do you have KASAN on your build enabled that would trip up on it hopefully?

I do, had to fix one KASAN bug that's been neglected in amdgpu for ages to get it to stop complaining on driver load, but unfortunately it hasn't turned up anything of use. Could definitely use other AMD engineers looking at this if they have the time too, because trying to follow what DC is even doing at any given point is honestly challenging enough. It seems like "stream" is a struct that references some active video stream, and it's somehow pointing to a connector without an initialized mst_mgr, which might be the use after free? I'm not sure where it's even getting its connector assignment from :|, it seems like maybe we create these during the atomic check?

I'm not familiar with this code myself, but I'll ask around if we can get some more eyes on it. Would certainly be better to not have to revert that change, but we're getting close to the end of the cycle, so I'm getting worried.

(Keeping in mind my above comment) I was just looking at the stack traces and I think that there is a disconnect with HDCP. AFAICT there is a separate workqueue for HDCP, and the aconnector it references when hdcp_update_display runs doesn't get updated when a display is disconnected. If that's all correct, then maybe what is missing is an explicit call to dm_dp_mst_connector_destroy to destroy that ref.

i gotta be honest I will genuinely be upset if that change gets reverted with how long it's been on the list and the fact this could have been tested at any point prior in the multiple weeks it was waiting there!!!!!!!!!!!!!!!!!!!!!!!!! :|

I know; I would be upset too. I'm asking around to get some more eyes on this specific problem.

I'm going to continue this discussion over email; there is far more frustration here than just one bug, and I'd like to press AMD to make real, actual plans to start cleaning things up.

Check out issue #2210 (closed). I have a similar system with R7 6800U, Lenovo 40AY and Huawei MateView X. Long story short, use a DUAL MST dock to get things to work for now if you can't wait for a proper MST patch to be submitted and tested.

As for why the display works fine on its own: it turns out this generation of Huawei monitors all use DP 1.2 with 4 lanes, so there is no USB 3.0 over Type-C, only USB 2.0. With 4 lanes, no DSC or MST tricks are needed for a single 2560x3840 stream at 60Hz, so the MST bug does not affect those displays.

I ended up with a CalDigit SOHO, and it works happily now. There are still stability issues, but that's a different story (see issue #2068), and the frequency at which it reproduces is low enough that it is not a high priority problem.

I am seeing the same crashes on an HP ProBook x360 435 G8 with an iiyama that supports MST.
It works fine when booted with the monitor connected (even in USB3.0 mode) but crashes the system when even just connecting the monitor while the system is booted.
I tried my best to crash Windows or at least break the functionality but it is rock-solid there.

added DC label

mentioned in commit nouveau@53e16a6e
