Displays behind MST hubs non-functional (regression in kernel 6.1)
No image on the display, and we can find this error in dmesg:
[ 40.504804] amdgpu 0000:63:00.0: [drm] *ERROR* No payload for [MST PORT:0000000006da5d29] found in mst state 00000000b0e556d6
The next system suspend fails - power button is blinking (i.e. LPI works) but the eDP is still lit up.
Could you get a log for this with drm.debug=0x116 log_buf_len=50M added to your kernel commandline so I can see where things are breaking with the MST logic here? I might be able to write a patch beforehand though, but figured I should get that info anyway in case I can't figure it out
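For anyone wondering what drm.debug=0x116 actually turns on, here is a minimal sketch that decodes the mask. The bit names are assumed to follow the DRM_UT_* debug categories in include/drm/drm_print.h; the helper name decode_drm_debug is made up for illustration.

```python
from typing import List

# Assumption: bit values mirror the DRM_UT_* categories in
# include/drm/drm_print.h of recent kernels.
DRM_DEBUG_BITS = {
    0x001: "CORE",
    0x002: "DRIVER",
    0x004: "KMS",
    0x008: "PRIME",
    0x010: "ATOMIC",
    0x020: "VBL",
    0x040: "STATE",
    0x080: "LEASE",
    0x100: "DP",
    0x200: "DRMRES",
}


def decode_drm_debug(mask: int) -> List[str]:
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_drm_debug(0x116))  # the mask requested in this comment
```

Under that assumption, 0x116 selects the driver, KMS, atomic and DP messages, which is where the MST payload handling logs end up.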
Mhhhhhh, not sure how exactly this is happening quite yet but so far it seems like there's a good chance either amdgpu is disabling a CRTC twice somehow, or there's some part of the atomic state we're not checking in the MST helpers to prevent us from removing the payload too early. Since earlier up in the log I see:
[ 45.165735] amdgpu 0000:63:00.0: [drm] *ERROR* No payload for [MST PORT:000000008eceac25] found in mst state 00000000a93ed26b
[ 45.165737] amdgpu 0000:63:00.0: [drm:drm_atomic_helper_check_modeset [drm_kms_helper]] [CONNECTOR:129:DP-8] driver check failed
[ 45.165745] [drm:amdgpu_dm_atomic_check [amdgpu]] drm_atomic_helper_check_modeset() failed
[ 45.165912] [drm:amdgpu_dm_atomic_check [amdgpu]] Atomic check failed with err: -22
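The "err: -22" at the end of that log is a negative errno value. A small sketch for turning such a line into a symbolic name, using only the Python standard library (the helper name errno_name is hypothetical):

```python
import errno
import re
from typing import Optional

# Log line copied from the dmesg excerpt in this comment.
LINE = ("[ 45.165912] [drm:amdgpu_dm_atomic_check [amdgpu]] "
        "Atomic check failed with err: -22")


def errno_name(line: str) -> Optional[str]:
    """Extract 'err: -N' from a log line and map N to its errno symbol."""
    match = re.search(r"err: -(\d+)", line)
    if match is None:
        return None
    return errno.errorcode.get(int(match.group(1)))


if __name__ == "__main__":
    print(errno_name(LINE))
```

Here the atomic check is failing with EINVAL, which matches the "driver check failed" message just above it.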
Mid-way through typing this comment I was looking at amdgpu's driver code, and immediately noticed a problem I apparently introduced by mistake. I'm not totally sure if this is going to fix it, but it's definitely a patch we need regardless. Would you mind trying it?
Looks like a different bug luckily, but this is definitely a weird one. I'll submit the amd patch in just a bit, have you tried running KASAN to see if you can get any more info on the new issue?
I suspect that this issue is closely related to #1926 (closed). I've seen the telltale error messages for both issues pop up while experimenting with my new ThinkPad T14 Gen 2 AMD, running EndeavourOS with a KDE Plasma Wayland session.
The issue seems to stem from the exact combination of AMD integrated graphics, USB-C display output, KDE, Wayland and a high resolution/high refresh rate monitor:
The display works directly over HDMI.
The display works on GNOME with Wayland.
The display works on KDE with X11.
Displays seem to work if they are purposely downscaled, i.e. to 1080p from 4K.
I have access to a wide variety of potentially affected monitors and noticed a wide variety of results:
A Dell 4K USB-C monitor works about half the time, doesn't work the other half, and doesn't work coming out of sleep mode.
A Samsung 4K USB-C monitor causes the system to become unresponsive and the laptop display to start flickering until unplugged. I notice that the display rapidly connects and disconnects in the display settings.
An LG UltraGear 1440p 144hz monitor connected over a ThinkPad Thunderbolt 3 dock doesn't work at all. It works over the HDMI port (but is capped at 100hz due to monitor limitations over HDMI.)
A Sceptre ultra-wide 1440p 144hz monitor connected over a ThinkPad Ultra Dock works.
Two typical 1080p monitors connected over a ThinkPad Thunderbolt 3 dock (with MST) work.
Let me know if there are any logs I can provide to help or if there's someone I can chuck a beer at to get this looked at.
If any of these docking stations are MST hubs, it could be that the hub only supports DP 1.2, but the monitors may require DP 1.4 to work properly at their optimal settings.
The Thunderbolt 3 dock is an MST hub— I've previously had to mess around with the DP settings on that LG monitor, but setting it to either 1.2 or 1.4 yields the same result. Two of the monitors listed connect directly over USB-C, and the last one worked fine.
Every scenario can successfully connect with either a different computer or one of the workaround scenarios I describe above.
Wayne pointed out to me that I had the wrong fix, so I'm trying again today to see if I can come up with a fix for this. Will post as soon as I have it ready
So - after trying and failing to reproduce this locally, it occurred to me that we all may have made a mistake here. I realized there was a patch I received and applied from Intel before:
5d832b6694e0 ("drm/dp_mst: Avoid deleting payloads for connectors staying enabled")
It's possible this isn't the actual cause of the problem but I think it's worth double checking y'all are testing a branch with this commit added if y'all wouldn't mind, it should be in drm-tip.
I'm happy to try out anything you need. Unfortunately, I missed what you are saying, as I'm not incredibly familiar with kernel or GPU driver development— what are you asking me to do here? Are you asking me to build amdgpu with a certain patch applied that isn't in mainline yet? Or are you asking me to build something without this patch applied, as it was mainlined?
Further direction would be much appreciated... Sorry for not knowing what I'm doing ;)
(If there's an FAQ, IRC group or mailing list I can hop on to receive any support necessary to accomplish this, I can go there instead— I'm positive that every one of you all are sick of answering these kinds of questions.)
No, it's totally fine! Basically, it would help if you could try a newer kernel. I think the first kernel to have the fixes I'd like you to try should be v6.1-rc1, so if you could try a kernel at least that new, that'd be appreciated.
Thanks for letting me know - unfortunately I did some searching today and while I have many amd machines apparently not a single one is thunderbolt capable :(.
I think I've got enough info to go off from this, but it'd be helpful if you could run the dmesg you got through ./scripts/decode_stacktrace.sh in the kernel source tree so it can decode the line numbers for the symbols in the splat.
I have a bad feeling though this is going to end up being something to do with amdgpu not passing -EDEADLK down through functions correctly as a result of using bool function signatures all over the place, but we'll see
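To make that concern concrete, here is a toy model (plain Python, not kernel code; every name in it is made up for illustration) of why a bool return type hides -EDEADLK from the caller that is supposed to back off and retry:

```python
import errno


class ContendedOnce:
    """Toy lock: reports a deadlock on the first attempt, succeeds after backoff."""

    def __init__(self) -> None:
        self.contended = True

    def acquire(self) -> int:
        # Kernel-style return value: 0 on success, negative errno on failure.
        return -errno.EDEADLK if self.contended else 0

    def backoff(self) -> None:
        self.contended = False


def helper_bool(lock: ContendedOnce) -> bool:
    # Collapsing the result to a bool throws away *which* error occurred.
    return lock.acquire() == 0


def helper_int(lock: ContendedOnce) -> int:
    # Passing the errno through keeps -EDEADLK visible to the caller.
    return lock.acquire()


def check_with_bool(lock: ContendedOnce) -> int:
    # The caller only knows "it failed", so it cannot tell that a retry would work.
    return 0 if helper_bool(lock) else -errno.EINVAL


def check_with_int(lock: ContendedOnce) -> int:
    # The caller sees -EDEADLK, backs off and retries, as a modeset would.
    while True:
        ret = helper_int(lock)
        if ret == -errno.EDEADLK:
            lock.backoff()
            continue
        return ret


if __name__ == "__main__":
    print(check_with_bool(ContendedOnce()), check_with_int(ContendedOnce()))
```

In the real DRM code the backoff and retry happen at the top of the atomic commit path rather than inside a loop like this, so treat the structure as a simplification; the point is only that the error code has to survive every intermediate function.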
Ah, seems I missed a couple of spots that were still using booleans so we still weren't handling deadlocks. Mind giving this patch a shot? lyudess/linux@78ede1d0
Agh! Sorry, I went to close vim and realized I forgot to save one of the files before committing. This patch shouldn't miss anything: lyudess/linux@1bf066b6
Poke, any testing for this? I don't have hardware that can reproduce this locally, but I'm happy to post this patch upstream once someone confirms it actually fixes the thing
I did try the 54f3ae37f commit on my laptop, but I still have hard locks when I try to use 2 HDMI monitors on an external USB-C dock on my ThinkPad T14s.
Specs:
OS: Arch Linux x86_64
Host: 21CQ000GUS ThinkPad T14s Gen 3
Kernel: 6.1.0-rc3-1-mainline-custom
Uptime: 4 hours, 39 mins
Packages: 1359 (pacman), 10 (flatpak)
Shell: zsh 5.9
Resolution: 1920x1080
DE: GNOME 43.0
WM: Mutter
WM Theme: Adwaita
Theme: adw-gtk3-dark [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 7 PRO 6850U with Radeon Graphics (16) @ 4.768GHz
GPU: AMD ATI Radeon 680M
Memory: 16040MiB / 30851MiB
Kernel log:
Oct 31 18:56:26 thinkryzen kernel: [drm] DP Alt mode state on HPD: 1
Oct 31 18:56:26 thinkryzen kernel: [drm] DM_MST: starting TM on aconnector: 00000000c7dc8ca9 [id: 95]
Oct 31 18:56:26 thinkryzen boltd[555]: probing: started [1000]
Oct 31 18:56:26 thinkryzen kernel: [drm] Downstream port present 1, type 2
Oct 31 18:56:26 thinkryzen kernel: [drm] Downstream port present 1, type 2
Oct 31 18:56:27 thinkryzen kernel: ------------[ cut here ]------------

Oct 31 18:55:56 thinkryzen kernel: audit: type=1131 audit(1667260556.724:185): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=geoclue comm="systemd" ex>
Oct 31 18:55:58 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:55:58 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:55:58 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:55:58 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:56:00 thinkryzen kernel: [drm] DP Alt mode state on HPD: 1
Oct 31 18:56:01 thinkryzen kernel: Registered IR keymap rc-cec
Oct 31 18:56:01 thinkryzen kernel: rc rc0: DP-2 as /devices/pci0000:00/0000:00:08.1/0000:33:00.0/rc/rc0
Oct 31 18:56:01 thinkryzen kernel: input: DP-2 as /devices/pci0000:00/0000:00:08.1/0000:33:00.0/rc/rc0/input36
Oct 31 18:56:01 thinkryzen boltd[555]: probing: started [1000]
Oct 31 18:56:01 thinkryzen systemd-logind[484]: Watching system buttons on /dev/input/event10 (DP-2)
Oct 31 18:56:03 thinkryzen dbus-daemon[1433]: [session uid=1000 pid=1433] Activating via systemd: service name='org.gtk.vfs.Metadata' unit='gvfs-metadata.ser>
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: new full-speed USB device number 10 using xhci_hcd
Oct 31 18:56:03 thinkryzen systemd[1409]: Starting Virtual filesystem metadata service...
Oct 31 18:56:03 thinkryzen dbus-daemon[1433]: [session uid=1000 pid=1433] Successfully activated service 'org.gtk.vfs.Metadata'
Oct 31 18:56:03 thinkryzen systemd[1409]: Started Virtual filesystem metadata service.
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: config 1 has an invalid interface number: 3 but max is 2
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: config 1 has an invalid interface number: 3 but max is 2
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: config 1 has an invalid interface number: 3 but max is 2
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: config 1 has no interface number 2
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: New USB device found, idVendor=262a, idProduct=9023, bcdDevice= 0.01
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: Product: DAC Audio
Oct 31 18:56:03 thinkryzen kernel: usb 9-1.3: Manufacturer: E+ Corp.
Oct 31 18:56:03 thinkryzen kernel: hid-generic 0003:262A:9023.0007: No inputs registered, leaving
Oct 31 18:56:03 thinkryzen kernel: hid-generic 0003:262A:9023.0007: hidraw4: USB HID v1.00 Device [E+ Corp. DAC Audio] on usb-0000:34:00.4-1.3/input0
Oct 31 18:56:04 thinkryzen mtp-probe[2466]: checking bus 9, device 10: "/sys/devices/pci0000:00/0000:00:08.3/0000:34:00.4/usb9/9-1/9-1.3"
Oct 31 18:56:04 thinkryzen mtp-probe[2466]: bus: 9, device: 10 was not an MTP device
Oct 31 18:56:04 thinkryzen mtp-probe[2473]: checking bus 9, device 10: "/sys/devices/pci0000:00/0000:00:08.3/0000:34:00.4/usb9/9-1/9-1.3"
Oct 31 18:56:04 thinkryzen mtp-probe[2473]: bus: 9, device: 10 was not an MTP device
Oct 31 18:56:04 thinkryzen systemd[1409]: Reached target Sound Card.
Oct 31 18:56:04 thinkryzen dbus-daemon[483]: [system] Activating via systemd: service name='org.freedesktop.Avahi' unit='dbus-org.freedesktop.Avahi.service' >
Oct 31 18:56:04 thinkryzen dbus-daemon[483]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.Avahi.service': Unit dbus-org.freedesktop.>
Oct 31 18:56:04 thinkryzen rtkit-daemon[1141]: Supervising 7 threads of 3 processes of 1 users.
Oct 31 18:56:04 thinkryzen rtkit-daemon[1141]: Successfully made thread 2475 of process 1632 owned by '1000' RT at priority 5.
Oct 31 18:56:04 thinkryzen rtkit-daemon[1141]: Supervising 8 threads of 3 processes of 1 users.
Oct 31 18:56:06 thinkryzen boltd[555]: probing: timeout, done: [2095147] (2000000)
Oct 31 18:56:24 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:56:24 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:56:24 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:56:24 thinkryzen gnome-shell[1494]: meta_display_get_monitor_geometry: assertion 'monitor >= 0 && monitor < n_logical_monitors' failed
Oct 31 18:56:26 thinkryzen kernel: [drm] DP Alt mode state on HPD: 1
Oct 31 18:56:26 thinkryzen kernel: [drm] DM_MST: starting TM on aconnector: 00000000c7dc8ca9 [id: 95]
Oct 31 18:56:26 thinkryzen boltd[555]: probing: started [1000]
Oct 31 18:56:26 thinkryzen kernel: [drm] Downstream port present 1, type 2
Oct 31 18:56:26 thinkryzen kernel: [drm] Downstream port present 1, type 2
Oct 31 18:56:27 thinkryzen kernel: ------------[ cut here ]------------
Oct 31 18:56:27 thinkryzen kernel: WARNING: CPU: 0 PID: 1494 at drivers/gpu/drm/drm_modeset_lock.c:317 drm_modeset_lock+0xcd/0xe0
Oct 31 18:56:27 thinkryzen kernel: Modules linked in: rfcomm ccm michael_mic xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink >
Oct 31 18:56:27 thinkryzen kernel: crc32_pclmul amdgpu uvcvideo snd_rpl_pci_acp6x polyval_clmulni snd_intel_dspcfg videobuf2_vmalloc mac80211 snd_intel_sdw_>
Oct 31 18:56:27 thinkryzen kernel: i8042 nvme_common xhci_pci_renesas serio
Oct 31 18:56:27 thinkryzen kernel: CPU: 0 PID: 1494 Comm: gnome-shell Not tainted 6.1.0-rc3-1-mainline-custom #1 a2aaa7fb81f8dd1feee143f3d88c83d5287c37bf
Oct 31 18:56:27 thinkryzen kernel: Hardware name: LENOVO 21CQ000GUS/21CQ000GUS, BIOS R22ET55W (1.25 ) 09/14/2022
Oct 31 18:56:27 thinkryzen kernel: RIP: 0010:drm_modeset_lock+0xcd/0xe0
Oct 31 18:56:27 thinkryzen kernel: Code: ff ff ff eb d5 e8 e3 5f 49 00 eb 91 0f 0b e9 75 ff ff ff 83 f8 8e 74 c0 83 f8 dd 75 bd 48 89 6b 18 c7 43 20 00 00 00>
Oct 31 18:56:27 thinkryzen kernel: RSP: 0018:ffffaab506697728 EFLAGS: 00010286
Oct 31 18:56:27 thinkryzen kernel: RAX: 0000000000000000 RBX: ffffaab506697c80 RCX: 0000000000000000
Oct 31 18:56:27 thinkryzen kernel: RDX: ffff952053c18000 RSI: ffffaab506697c80 RDI: ffff952055b46550
Oct 31 18:56:27 thinkryzen kernel: RBP: ffff952055b46550 R08: ffff952055b46540 R09: ffffaab506697a14
Oct 31 18:56:27 thinkryzen kernel: R10: ffff9520aa541a00 R11: ffff952055afe800 R12: ffff952055b46578
Oct 31 18:56:27 thinkryzen kernel: R13: ffff95200cd4e400 R14: ffff9520ca280000 R15: ffff952055b46000
Oct 31 18:56:27 thinkryzen kernel: FS: 00007f33767dab00(0000) GS:ffff95271ee00000(0000) knlGS:0000000000000000
Oct 31 18:56:27 thinkryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 31 18:56:27 thinkryzen kernel: CR2: 00007fd79c016008 CR3: 0000000153c08000 CR4: 0000000000750ef0
Oct 31 18:56:27 thinkryzen kernel: PKRU: 55555554
Oct 31 18:56:27 thinkryzen kernel: Call Trace:
Oct 31 18:56:27 thinkryzen kernel: <TASK>
Oct 31 18:56:27 thinkryzen kernel: drm_atomic_get_private_obj_state+0x5c/0x160
Oct 31 18:56:27 thinkryzen kernel: compute_mst_dsc_configs_for_link+0x59/0x9e0 [amdgpu eb77160797c5f4cb03fc88190c3f041934cb0319]
I'm willing to test if there are other commits I can try. The problem reproduces 100% of the time.
If I use a single HDMI connection I don't have crashes, but sometimes the image on the monitor goes dark for a few seconds. I did try Windows for a little bit and it seemed to work, so I'm guessing this is not a hardware problem.
Could you post the entire kernel log, preferably after booting with drm.debug=0x16 log_buf_len=50M added to the kernel commandline? (fwiw, you can just use dmesg or journalctl --dmesg). It's hard to tell if there's still some spot I'm missing in the patch from this backtrace, since it seems like we're still hitting the same issue of the lock list somehow not being empty when we expect it to be.
If you're willing to as well and know how: it'd definitely be really useful to have someone test this with KASAN (inline instrumentation specifically) turned on. Since unfortunately it's still not clear to me if list_empty() here is actually hitting a populated list or if it's hitting uninitialized memory.
Here's the full kernel log by journalctl, but without setting the drm.debug and log_buf_len. I'll add those later today and do another test and report back here with new logs.
About KASAN, I've never tried that before but I should be able to do it.
Any tips on using it? I've quickly read that I have to rebuild the kernel with CONFIG_KASAN=y, but I'm not sure if there are any specifics, such as whitelisting some of the files in the stacktrace.
Booted with eDP (internal LCD) + 1 external HDMI plugged on the USB C hub
Plugged 2nd HDMI monitor on the same USB C hub -> crash
Tried also booting with both HDMI attached but it crashes when the graphical UI would be shown.
Caps-lock is still responding, but I can't go to a terminal (Ctrl + Alt + Fn).
Something very screwy is happening here: this doesn't seem to be a deadlock like I thought it might be. I'm still doing a bit of digging. What's really confusing to me about this issue is that it almost looks like the lock we're trying to acquire is somehow already in the lock acquisition list, yet ww_mutex_lock() still returns 0 when we grab the actual ww lock, despite the fact that if it's in the acquisition list we should get -EALREADY.
@khfeng @nicolas.frenay if you guys could get a dmesg with lockdep turned on as well, including the bits for wound/wait mutexes, that would be useful in the meantime, as maybe lockdep will tell us something new about this. Or if you have any guesses on where we could go wrong grabbing the modesetting lock for a private object here, that'd help too!
Also, pushed another patch to my branch. I don't know if this is going to fix things as it's still not even really clear to me what's happening here. Please give another try with these two patches applied (on top of one another):
Here's the latest log with the last patch, lockdeps (I hope I flagged the right configs) and kernel params.
This time I didn't filter only dmesg, in case that helps.
For me it's working now.
Here's the log from the moment the DP cable is hotplugged into the dock (kernel has lockdep set, compiled with KASAN, and kernel command line has drm.debug=0x16).
dmesg_hotplug.txt
On the system I am using, the issue persists on 6.1-rc1 with above two commits.
It took a while to bisect, and the offending commit is 4d07b0bc403403438d9cf88450506240c5faf92f "drm/display/dp_mst: Move all payload info into the atomic state".
Could you double check with the commits on top of drm-tip or a more recent rc? Might be worth it, seeing as it sounds like it fixed stuff on @superm1's issue.
Maybe I was just lucky because this is pretty straightforward topology with one external monitor. I'll try some more and see if I can trigger anything.
But it also seems that @nicolas.frenay still has some problems even with 6.1-rc3 with those two commits.
@lyudess to me all of this stuff is looking independent of the USB4 fabric. Actually @nicolas.frenay is reproducing it using DP-alt mode according to above logs.
Can you have a try with some of the docks w/ MST you have on hand, even with older hardware? Anything Titan Ridge or newer should fall back to DP-alt mode. It seems like we're probably at "hardware independent amdgpu issue with dp mst helpers" at this point.
Oh yeah it's definitely helper related at this point! The USB-C statement was just a guess at the start, now though I'm more convinced it's specifically DSC related (or rather, it only triggers on setups capable of DSC regardless of if it's being used or not is my guess). Basically the problem is the only devices I have that can do DSC are USB-C only as I actually did give quite a number of tries at reproducing this already with no luck :(.
There is a machine in RH's lab that may be able to reproduce this that I can try getting access to, we'll have to see.
@superm1 I only have problems when I plug a second HDMI monitor on the USB-C dock. Plugging just one monitor on either port of the USB-C dock works ok. I do have some image flickering but I don't think it's related to this issue.
Is there any other test or logs you would like me to provide? I might be able to try a different dock, and if it works I'll have to return this problematic one, which would be bad as I wouldn't be able to reproduce it anymore.
I'm unfortunately unfamiliar with the DRM code.
Also FWIW I'll make a point to ask my OEM contact about the machine I mentioned tomorrow, I might be able to get someone to give me a ride up to the office so we can get this done sooner rather than later.
The other option is if I can just find a DSC capable monitor with a DP hookup, I -think- that'd likely also hit the issue if I'm guessing correctly.
Thanks!
In the interim, maybe those patches should at least get out on the mailing list to have a shot to improve the situation without DSC for 6.1-rc?
Sorry for the spam, just had an idea: I'll also see if I can give a shot at just hacking together a patch for amdgpu to force it to try to consider DSC which in theory hopefully should hit it.
And yeah - I'll send out the patches tomorrow as well - both of these patches we kind of want regardless of if they fix the issue or not anyway.
BINGO got the backtrace :). Digging in now, will make sure the fixes I've got so far are on the ML at least by tomorrow and will figure out whatever else is still broken here.
agh no i didn't but i'm getting there sorry for the false alarm!
Some additional info that might be relevant:
I've used my computer today with the debug kernel and two different USB-C hubs. For the first few hours I used the hub I reported the bugs with earlier, but with only one of its HDMI ports in use so that it wouldn't crash.
As this device had HDMI inconsistencies (blank screens every few minutes), I switched to another USB-C device that has only 1 HDMI port but gives a consistent image, and used it for some hours.
I then had a crash while coding in IntelliJ, unrelated to plugging in another HDMI device. I went to the kernel logs and found that a few drm_modeset_drop_locks traces had been thrown during the day.
First hours: eDP (internal display) + USB dock with 2 HDMI ports, using 1 HDMI port to an LCD monitor (image blanks every few mins) + external LCD using USB-C port.
Last hours: eDP (internal display) + USB dock with 1 HDMI port, using 1 HDMI port to an LCD monitor (image consistent, no blanking) + external LCD using USB-C port.
And here's ddcutil detect:
Invalid display
   I2C bus: /dev/i2c-14
   DRM connector: card0-eDP-1
   EDID synopsis:
      Mfg id: IVO - UNK
      Model:
      Product code: 35908 (0x8c44)
      Serial number:
      Binary serial number: 0 (0x00000000)
      Manufacture year: 2021, Week: 0
   DDC communication failed
   This is an eDP laptop display. Laptop displays do not support DDC/CI.

Display 1
   I2C bus: /dev/i2c-15
   DRM connector: card0-DP-1
   EDID synopsis:
      Mfg id: DZX - UNK
      Model: K1301R
      Product code: 4880 (0x1310)
      Serial number: 000000000000
      Binary serial number: 269488144 (0x10101010)
      Manufacture year: 2019, Week: 32
   VCP version: 2.2

Display 2
   I2C bus: /dev/i2c-16
   DRM connector: card0-DP-2
   EDID synopsis:
      Mfg id: LEN - Lenovo Group Limited
      Model: T24i-2L
      Product code: 25264 (0x62b0)
      Serial number: VKNB5961
      Binary serial number: 16843009 (0x01010101)
      Manufacture year: 2022, Week: 6
   VCP version: 2.2
K1301R is plugged directly via USB-C to my laptop.
T24i-2L is plugged on the USB-C hub (either the 2 port or the 1 port)
Finally hit it! I should be able to come up with a fix from here, will submit the rest of the patches I've got as well by the end of the day regardless
FWIW, issue 2068 seems to be alleviated by adding iommu=pt to the kernel command line, but it is not 100% reliable. It reduces multiple crashes per day to a few crashes per week.
The crash doesn't seem to be triggered by high GPU usage or temperature. I can play video games all day long without triggering it. However, sparse but simple OpenGL commands (like CAD) trigger it from time to time. This bug was never triggered in a browser or on the desktop, even though GNOME uses hardware acceleration here and there.
For issue 2210, my issue was gone after using a dual MST hub, but a triple MST hub still doesn't work.
As for 6.1 without DSC, no, it doesn't work at all. When the hub supports DSC and the driver is loaded with dcdebugmask=0x04, it constantly crashes on 6.1 and fails to output signal on 6.0.
OK. I don't have a fix yet but I figured I should let y'all know I'm definitely making progress and may finally know what the problem is, and if my theory is correct, wow, it is not at all what I was expecting to run into.
YEP. Cool, we get two bug fixes here too!!! And also this is a genuinely impressive bug, because I'm literally unsure how we were managing to avoid crashing the kernel at all.
SO: while everyone's bisects, understandably, led us to 4d07b0bc403403438d9cf88450506240c5faf92f I don't think that's actually where the bug was introduced, but merely where it became noticeable. As the actual problem is any kind of code that does this:
mutex_lock(&aconnector->mst_mgr.lock);
The reason this isn't correct is really subtle: aconnector is amdgpu_dm_connector, which is shared for all connector types including MST. However, aconnector->mst_mgr is only used for the actual root MST connector in a topology - which we'll basically never really be considering for DSC. Otherwise, it's simply completely zero-initialized! However, in pre_validate_dsc() (along with any other DSC state computation functions) -> pre_compute_mst_dsc_configs_for_state() guess what we do?
Bingo! Ignoring that this code doesn't look like it should work at all now: somehow, it does. Well, ish, the moment you turn on lockdep you'll start getting warnings:
This is the wild part, I think that this code has only been working up until now because aconnector is zero-initialized. And I guess mutex_lock() manages not to explode on totally zero-initialized memory! You can actually still hit this warning before 4d07b0bc403403438d9cf88450506240c5faf92f, suggesting the bug was actually introduced way back in 8c20a1ed9b4f. Anyway, the reason this starts crashing after 4d07b0bc403403438d9cf88450506240c5faf92f is likely due to the fact there's a lot more data we're reading from the topology manager after that commit. In particular, grabbing the modesetting locks for MST and potentially adding in the atomic state.
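A toy model of why locking through the wrong topology manager is also a race, not just a lockdep warning. This is plain Python, not driver code: the class names and the root back-reference are hypothetical stand-ins for the amdgpu structures, and Python objects cannot reproduce the zero-initialized memory aspect.

```python
import threading
from typing import Optional


class TopologyManager:
    """Stand-in for an MST topology manager that owns a lock."""

    def __init__(self) -> None:
        self.lock = threading.Lock()


class Connector:
    """Stand-in for a connector that embeds a manager of its own.

    Only the root connector's embedded manager is the real one; port
    connectors carry an unused copy plus a (hypothetical) reference to root.
    """

    def __init__(self, root: Optional["Connector"] = None) -> None:
        self.mst_mgr = TopologyManager()
        self.root = root


def wrong_lock(conn: Connector) -> threading.Lock:
    # Mirrors the broken pattern: lock whatever manager is embedded in
    # *this* connector, even if it is not the topology's real manager.
    return conn.mst_mgr.lock


def right_lock(conn: Connector) -> threading.Lock:
    # Resolve the root connector first, then use its manager's lock.
    owner = conn.root if conn.root is not None else conn
    return owner.mst_mgr.lock


root = Connector()
port_a = Connector(root=root)
port_b = Connector(root=root)

if __name__ == "__main__":
    # Two ports on one hub: do they serialize on the same lock?
    print(wrong_lock(port_a) is wrong_lock(port_b))
    print(right_lock(port_a) is right_lock(port_b))
```

With the broken pattern each port connector takes a different lock, so nothing is actually serialized against the rest of the topology.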
SO - I finally now have a patch, and I am much more confident this one should fix the issue now as it seems to on my setup. Keep in mind I did all of this without an actual DSC capable setup and instead just hacked up some patches to get amdgpu to hit the broken codepaths until I could figure out what the problem is :). So, definitely need someone to confirm this fully fixes the problem before submitting:
Just apply the patches from me onto your kernel and let me know if this fixes the problem. Fingers crossed!
EDIT: ALSO, probably worth mentioning that if anyone from amd has been running into inconsistent success with modesets and DSC - this patch should technically also fix a race condition, so it's very much worth trying this to see if it fixes such issues.
Ok, reporting back with some bad news @lyudess , or maybe we're going a little bit further.
Unless I made some mistake on my build, I'm still experiencing the crash when plugging in a second HDMI monitor on the USB hub, but the log is different this time:
Nov 09 21:54:32 thinkryzen kernel: [drm] DP Alt mode state on HPD: 1
Nov 09 21:54:32 thinkryzen kernel: [drm] DM_MST: starting TM on aconnector: 000000001c1d46b1 [id: 95]
Nov 09 21:54:32 thinkryzen boltd[562]: probing: started [1000]
Nov 09 21:54:32 thinkryzen kernel: [drm] Downstream port present 1, type 2
Nov 09 21:54:32 thinkryzen kernel: [drm] Downstream port present 1, type 2
Nov 09 21:54:35 thinkryzen boltd[562]: probing: timeout, done: [2971957] (2000000)
While testing I had dmesg -w running, and the last thing I was able to read on my screen was the "starting TM on aconnector" line, so the crash might have happened at the boltd "probing: started" line above.
I'm gonna do some additional tests here to see if I can get more info.
It would be nice to have someone else test those patches, in case I'm having a different issue now.
Just to make sure we're on the same page, I'm running linux 6.1 RC4 with these patches applied:
Can you ssh in from another machine while this has happened? Or after you boot back up do you have anything from that last boot's journal and/or /var/lib/systemd/pstore from the crash?
Just an FYI this is actually good output to see :), if your kernel isn't segfaulting then we've definitely got the bug fixed, or at the very least are on the right track and there's multiple bugs here (considering how many patches I've racked up now, that doesn't seem too unlikely).
My one concern is we could be seeing something in DRM hanging here. I'm not sure that would actually make a difference to thunderbolt though? (unfortunately I'm not totally sure how DRM/thunderbolt work together, most of the time it's all just DP from the DRM driver's perspective - perhaps with extra bw control knobs)
@superm1 @lyudess
I can SSH to the machine after the image freeze and have some additional info.
When I plug the HDMI connector, the log is as I've mentioned before:
[ 919.249174] [drm] DP Alt mode state on HPD: 1
[ 919.334194] [drm] DM_MST: starting TM on aconnector: 000000003ff8c623 [id: 94]
[ 919.502545] [drm] Downstream port present 1, type 2
[ 919.577925] [drm] Downstream port present 1, type 2
The frozen LCD screen shows up to the DM_MST line above.
But this time I waited longer on SSH to see if anything else happened:
[ 1106.439271] INFO: task fwupd:2275 blocked for more than 122 seconds.
I don't suspect fwupd is to blame here but rather is a victim. It's worth mentioning though it will probe over DP aux to look for synaptics MST hubs. At least until this other bug is sorted out I think it's worth masking or stopping that service to make sure it doesn't get involved with a race to make things more complicated.
OK - this is another bug but one I kinda had a feeling might happen (hence mentioning timeouts before). Knowing now that the mutex_lock(&aconnector->mst_mgr.lock); line was broken, that'd imply that until now we haven't been acquiring &mst_mgr.lock at all - which means the way this code locks here might not even work in the real world and there's been a deadlock hiding here. I assume fwupd does aux probing, and then the deadlock likely happens between whatever locking the block device uses and the MST topology lock.
#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set
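If you'd rather not eyeball your .config against that list, a small sketch like the following can check it. The option names in WANTED are picked from the fragment above as an assumption about which ones matter most for lockdep and ww mutex debugging; adjust to taste. The function names are made up.

```python
import re
from typing import Dict, List, Sequence

# Assumption: a minimal subset of the options from the fragment above.
WANTED = (
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_DEBUG_WW_MUTEX_SLOWPATH",
    "CONFIG_LOCKDEP",
)

SAMPLE = """\
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
"""


def parse_config(text: str) -> Dict[str, str]:
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        set_match = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if set_match:
            options[set_match.group(1)] = set_match.group(2)
            continue
        unset_match = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset_match:
            options[unset_match.group(1)] = "n"
    return options


def missing_options(text: str, wanted: Sequence[str] = WANTED) -> List[str]:
    """Return the wanted options that are not built in (=y)."""
    options = parse_config(text)
    return [name for name in wanted if options.get(name) != "y"]


if __name__ == "__main__":
    print(missing_options(SAMPLE))
```

To run it against a real build, read the .config from the kernel build directory and pass its contents to missing_options().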
Also: feel free to turn KASAN off at this point, will likely make things go faster and also we might need it off anyway to stop another KASAN bug in amdgpu I'm seeing (use-after-free in handle_hpd_irq_helper(), not sure it's related but it'll turn off lock debugging if you hit it).
And finally: if you could run the dmesg through the ./scripts/decode_stacktrace.sh script in the kernel source repo that'd be super useful! You'll likely only be able to do this if you have access to the uncompressed vmlinux file from the kernel build. You can do it by running:
$DMESG_CMD | ./scripts/decode_stacktrace.sh $VMLINUX auto $KMOD_DIR
Where $DMESG_CMD is the command you're using to get the dmesg (dmesg, journalctl, etc.), $VMLINUX is the path to vmlinux, and $KMOD_DIR is the directory with all of your .ko files.
If you can't get the last step working, feel free to post the dmesg anyway. Running that script helps, though, since it'll give us actual line numbers from the kernel source instead of just symbol offsets.
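As a concrete illustration of the decode step, here is a small wrapper sketch. The function name, the `journalctl -k -b` log source, and the `/lib/modules/$(uname -r)` module directory are assumptions for the example; it has to be run from the top of the kernel source tree so that `./scripts/decode_stacktrace.sh` resolves.

```shell
# Hypothetical wrapper: decode a kernel log read from stdin.
# $1 = path to the uncompressed vmlinux, $2 = directory holding the .ko files.
decode_dmesg() {
    vmlinux=$1
    kmod_dir=$2
    if [ ! -f "$vmlinux" ]; then
        echo "vmlinux not found: $vmlinux" >&2
        return 1
    fi
    ./scripts/decode_stacktrace.sh "$vmlinux" auto "$kmod_dir"
}

# Usage (paths are assumptions):
#   journalctl -k -b | decode_dmesg ./vmlinux "/lib/modules/$(uname -r)" > dmesg-decoded.txt
```

The early check is there because a missing vmlinux is the most likely reason this step fails on a distro kernel.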
Sorry for flip flopping on this, but I kept thinking about this after getting off work and realized I think I might already know what the problem is :). I will post a patch to try in just a moment
Just tried this last patch, and also disabled fwupd.service temporarily.
Good news: I'm not experiencing gpu crashes anymore.
Bad news: when I plug both HDMI monitors into the USB dock I lose image on all 3 screens, but if I unplug one of them, they work again.
Following a sequence of events, it's a little more complicated:
Booted with eDP + USB-Hub with 1 HDMI: video output on both monitors OK
Plugged in additional HDMI monitor on USB-Hub: video output still working on the original 2 monitors, but not on the newly plugged one
Removing HDMI and replugging them all gets to no video output on all monitors, but system does not crash.
The order in which I plug them doesn't matter. As long as I have 3 monitors, they black out. The one exception is just after the initial boot: when I plug in the second HDMI, it just doesn't output on that one, but the other two keep working. Kind of weird.
While playing around with different HDMI combos, I captured the following log, with stacktraces:
crash-v1c-decoded.txt
So, I think this is good progress, but there might be a catch somewhere. Might even be elsewhere.
In the part below, I don't remember seeing this [drm] crtc[1] needs mode_changed log before.
[ 233.924007] [drm] DP Alt mode state on HPD: 1
[ 234.116868] Registered IR keymap rc-cec
[ 234.117176] rc rc0: DP-2 as /devices/pci0000:00/0000:00:08.1/0000:33:00.0/rc/rc0
[ 234.117605] input: DP-2 as /devices/pci0000:00/0000:00:08.1/0000:33:00.0/rc/rc0/input57
[ 258.404001] [drm] DP Alt mode state on HPD: 1
[ 258.481702] [drm] DM_MST: starting TM on aconnector: 00000000d349c024 [id: 94]
[ 258.648748] [drm] Downstream port present 1, type 2
[ 258.735750] [drm] Downstream port present 1, type 2
[ 258.866885] [drm] crtc[1] needs mode_changed
Thanks for all the effort. I can help with testing tomorrow.
Thank you for all of the testing! When you get to this tomorrow, it would be good if I could see a log of you reproducing this with drm.debug=0x116 log_buf_len=50M added on the kernel cmdline from the version of the kernel where this worked, and one from the current kernel. The whole kernel logs for each one as well.
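For reference, drm.debug is a bitmask of the categories in the kernel's `enum drm_debug_category`. The sketch below decodes a mask into category names so it's clear what a given value turns on; the function name is made up, and the bit order is taken from that enum (CORE is bit 0, DP is bit 8).

```shell
# Hypothetical helper: decode a drm.debug bitmask into DRM debug category names.
decode_drm_debug() {
    mask=$(( $1 ))
    names=""
    bit=0
    for name in CORE DRIVER KMS PRIME ATOMIC VBL STATE LEASE DP DRMRES; do
        if [ $(( mask & (1 << bit) )) -ne 0 ]; then
            names="${names:+$names }$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo "$names"
}

# Usage:
#   decode_drm_debug 0x116    # prints: DRIVER KMS ATOMIC DP
```

So 0x116 enables driver, KMS, atomic, and DP helper messages, which is also why a large log_buf_len is requested alongside it.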
@lyudess I no longer see the kernel splat, but once the TBT dock (which connects to an HDMI monitor) is plugged in, the screen freezes indefinitely until the dock is unplugged.
KH was testing DPIA (DP tunneling) but NF was testing DP-Alt mode. There are two cases where DPIA has problems, and those two patches help with them.
They're both going to be submitted to 6.1-fixes next week, so it's best to test with them in place. Without them you can potentially end up with stack corruption and hangs, which could make confirming the fixes you've developed far more difficult.
@khfeng Could you get a log from a kernel where this worked with the various kernel cmdline options added? Would like to see what we're doing differently that might be causing that atomic check failure
@lyudess just tested with the latest patch, and when I plug the second HDMI I get no output on both external monitors (using USB hub) but still have output on the eDP, with no crashes.
Interestingly, if I open gnome's display settings, I can see all 3 displays, even though I only have output on the laptop LCD.
The first errors on the log happen when I plug the second HDMI as usual.
The last error (at around 373.643691) happens when I unplug this monitor (did this for testing, just in case it helps).
Unplugging the HDMI2 gives me back image on eDP + HDMI1 external via usb.
I couldn't test yesterday, sorry about that.
Also, I don't have a successful reproducer kernel version, as this is a new laptop and I was running kernel 6.x since day one. I can try using an older kernel, but not sure if that would work.
@nicolas.frenay @khfeng could either one of you get the same logs but with drm.debug=0x16 log_buf_len=50M added as well to the kernel commandline? It's possible this is still an issue with the MST helpers