Unplugging an external monitor after Xorg starts makes i915 fail to detect that monitor again
Summary
Unplugging an external monitor after Xorg has started makes i915 fail to detect that monitor when it is plugged back in and, if an NVIDIA GPU is present, throws that GPU off the bus.
Description
I'm using a notebook with both integrated and dedicated NVIDIA graphics. It has a USB-C DP output wired to the integrated graphics, and I connect a monitor to it through a USB-C to DP adapter hub.
Unplugging a monitor that was plugged in before Xorg was initialized makes the i915 modesetting driver stop detecting monitor plug/unplug events, so the monitor is never recognized when it is plugged in again. Moreover, after stopping Xorg, the following kernel error message is produced:
[drm] *ERROR* [ENCODER:316:DDI TC1/PHY TC1][DPRX] Failed to enable link training
Xorg also takes noticeably longer to exit, leaving you at a black screen in the meantime.
Lastly, if the NVIDIA drivers are also loaded, then in addition to the issues described above, that final stage of stopping Xorg fails irrecoverably: the GPU is thrown off the bus, the system becomes unresponsive, and a hard reset is the only way to reboot.
ACPI Error: Aborting method \IPCS due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \MCUI due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \SPCX due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \_SB.PC00.PGSC due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \_SB.PC00.PGOF due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \_SB.PC00.PEG1.NPOF due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._OFF due to previous error (AE_AML_LOOP_TIMEOUT) (20211217/psparse-529)
nvidia 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
NVRM: GPU at PCI:0000:01:00: GPU-1effe615-acbe-39f2-f56e-00ae11ba086c
NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
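When going through a long kernel log, it can help to pull out only the lines that match the failures quoted above. The following is a minimal sketch, not part of the original report: the regular expressions are derived from the messages shown here, and the category names (`i915_link_training`, `acpi_loop_timeout`, and so on) are hypothetical labels chosen for illustration.

```python
import re

# Patterns derived from the kernel messages quoted in this report.
# The dictionary keys are hypothetical labels, not kernel terminology.
SIGNATURES = {
    "i915_link_training": re.compile(
        r"\[drm\] \*ERROR\* .*Failed to enable link training"
    ),
    "acpi_loop_timeout": re.compile(
        r"ACPI Error: Aborting method \S+ due to previous error "
        r"\(AE_AML_LOOP_TIMEOUT\)"
    ),
    "nvidia_off_bus": re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:]+\): 79,"),
    "d3cold_stuck": re.compile(r"can't change power state from D3cold to D0"),
}


def classify(lines):
    """Return {label: [matching lines]} for the failure signatures above."""
    hits = {label: [] for label in SIGNATURES}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[label].append(line.rstrip("\n"))
    return hits
```

One way to use it, assuming the log was saved to a file first, is `classify(open("dmesg.txt"))`; a non-empty `nvidia_off_bus` list would indicate the second, unrecoverable scenario.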
System
- OS: Arch Linux x86_64
- Using drm-tip kernel: 5.19.0-rc1-1-drm-tip-git-gcf217645e823
- Computer: Dell G15 5511 - Intel i5 11300H/NVIDIA RTX 3050
- Connector: DP through USB-C adapter
More context
- I have attached a long log with every step I did in order to make the crash happen. For every step described above, I also wrote a marker to /dev/kmsg so it is clear what exactly I was doing when the subsequent log messages were produced; just look for log entries prefixed with unknown:.
- There are two log files, one for a boot without the NVIDIA modules and one with them. Although I have provided reproduction steps for both scenarios, I'm pretty confident that the former (system without NVIDIA modules) exposes the root cause, so we could focus on figuring that out, and I think fixing it will also fix the latter. As I'm not a subject-matter expert, I'm providing both logs in case you want to check them out.
- I tested with kernels ranging from 5.10 to drm-tip (5.19-rc1), and all of them show this behavior.
- Lastly, I think this issue is probably related to this one, but I can't really confirm that.
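For anyone who wants to annotate their own kernel log the same way, the /dev/kmsg markers mentioned above could be written with a small helper like the one below. This is a sketch under assumptions: the function name `log_step` and the `repro-step` tag are hypothetical and are not the format used in the attached logs, and writing to /dev/kmsg normally requires root.

```python
def log_step(message, tag="repro-step", path="/dev/kmsg"):
    """Write one marker line to the kernel log and return the line written.

    `tag` and the function name are hypothetical choices for this sketch.
    The `path` parameter exists only so the helper can be pointed at a
    regular file when trying it out without root.
    """
    line = f"{tag}: {message}\n"
    # Unbuffered binary mode, so the whole marker goes out in a single
    # write() call.
    with open(path, "wb", buffering=0) as kmsg:
        kmsg.write(line.encode("utf-8"))
    return line
```

Calling `log_step("unplugging monitor")` right before each manual action leaves a marker in the kernel log just ahead of whatever messages that action triggers.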
Steps to reproduce
As I mentioned, having NVIDIA drivers loaded only makes this failure unrecoverable, so I'll list steps to reproduce for both scenarios: with NVIDIA drivers and without them. Both scenarios are consistently reproducible using the steps below.
Scenario 1: Without NVIDIA drivers loaded
- Boot up your computer normally without NVIDIA drivers, and make sure you also blacklist nouveau to better isolate the root cause.
- Make sure your system doesn't automatically start Xorg through a display manager such as GDM. Ideally, go straight to the terminal.
- Plug the external monitor into the USB-C port.
- Start Xorg with startx (ideally with a single application like xterm).
- See the output of xrandr -q and notice that the external monitor at DP-1 is reported as connected.
- Unplug the monitor.
- See the output of xrandr -q and notice that the monitor has been correctly marked as disconnected.
- Plug the monitor back in.
- See the output of xrandr -q and notice that the monitor has NOT been detected.
- Exit Xorg.
- Notice how the screen freezes for a few (~5) seconds before you eventually get back to the console.
- Notice several log messages reporting timeouts and, finally, a failure to enable link training.
In this scenario, you will still have your system responsive.
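The repeated xrandr -q checks in the steps above can be made less error-prone by extracting only the connection state of each output. The sketch below is illustrative and not part of the original report: the function names are hypothetical, and the parser assumes the usual xrandr -q layout in which each output line starts with the output name followed by `connected`, `disconnected`, or `unknown connection`.

```python
import re
import subprocess

# Per-output header lines of `xrandr -q` start with the output name and its
# connection state; mode lines are indented and therefore do not match.
OUTPUT_LINE = re.compile(r"^(\S+) (connected|disconnected|unknown connection)\b")


def parse_outputs(xrandr_text):
    """Return {output_name: state} parsed from `xrandr -q` text."""
    states = {}
    for line in xrandr_text.splitlines():
        match = OUTPUT_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


def query_outputs():
    """Run `xrandr -q` (needs a running X session) and parse its output."""
    result = subprocess.run(
        ["xrandr", "-q"], capture_output=True, text=True, check=True
    )
    return parse_outputs(result.stdout)
```

After plugging the monitor back in, `query_outputs().get("DP-1")` staying at `disconnected` would correspond to the failure described in this scenario.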
Scenario 2: With NVIDIA drivers loaded
- Boot up your computer normally with NVIDIA drivers loaded.
- Make sure your system doesn't automatically start Xorg through a display manager such as GDM. Ideally, go straight to the terminal.
- Plug the external monitor into the USB-C port.
- Start Xorg with startx (ideally with a single application like xterm).
- See the output of xrandr -q and notice that the external monitor at DP-1-1 is reported as connected.
- Unplug the monitor.
- Run xrandr -q and notice that it takes a while (about 5 seconds) before the monitor is reported as disconnected. The GPU has already been thrown off the bus at this point, so the issue is already reproduced, but we can continue with the same steps as in the scenario without NVIDIA drivers.
- Plug the monitor back in.
- See the output of xrandr -q and notice that the monitor has NOT been detected.
- Exit Xorg.
- Notice how the screen freezes and the keyboard becomes unresponsive.
In this scenario, your system becomes unresponsive and you need to force a reboot (even rebooting over SSH won't work).
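To check whether the kernel itself still sees the connector, independently of Xorg, the connector state exposed under /sys/class/drm can be read directly. This is a sketch, not something done in the original report: the function name is hypothetical, and it assumes the usual sysfs layout of one `cardN-CONNECTOR/status` file per connector. The sysfs connector names may not match the names Xorg prints.

```python
from pathlib import Path


def connector_status(drm_root="/sys/class/drm"):
    """Return {connector_dir_name: status} from the DRM sysfs tree.

    `drm_root` is a parameter only so the function can be tried against a
    fake directory tree; the default is the real sysfs location.
    """
    statuses = {}
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        statuses[status_file.parent.name] = status_file.read_text().strip()
    return statuses
```

Comparing this mapping before the unplug, after the unplug, and after the replug would show whether the connector state changes on the kernel side at each step.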