DP-MST is crashing on ICL with current drm-tip (machine check exception)
Severe regression in ICL with recent drm-tip:
[ 50.513634] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 6: ba00000011000402
[ 50.539139] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f5774e3335a>
[ 50.558910] mce: [Hardware Error]: TSC 1c1194e091
[ 50.573229] mce: [Hardware Error]: PROCESSOR 0:706e5 TIME 1577458276 SOCKET 0 APIC 0 microcode 1e
[ 50.599756] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 50.620046] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 50.640857] Kernel panic - not syncing: Fatal machine check
[ 51.708383] Shutting down cpus with NMI
[ 51.719860] Kernel Offset: disabled
[ 51.764341] Rebooting in 5 seconds..
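For reference, the Bank 6 status word from the log can be split into its architectural fields. The snippet below is only a rough sketch, assuming the generic IA32_MCi_STATUS bit layout from the Intel SDM (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63..57, MCA error code in bits 15:0, model-specific code in bits 31:16). The helper name `decode_mci_status` is made up for illustration, and it does not attempt to interpret the model-specific code; `mcelog --ascii` remains the proper tool for that.

```python
# Assumed layout: architectural IA32_MCi_STATUS flag bits per the Intel SDM.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # IA32_MCi_MISC holds additional info
    "ADDRV": 58,  # IA32_MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes (hypothetical helper)."""
    flags = {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Status value taken from the "Bank 6" line in the log above.
decoded = decode_mci_status(0xBA00000011000402)
for name, is_set in decoded["flags"].items():
    print(f"{name:5s} = {int(is_set)}")
print(f"MCA error code      = {decoded['mca_error_code']:#06x}")
print(f"Model-specific code = {decoded['model_specific_code']:#06x}")
```

Under that assumed layout, UC and PCC are both set and ADDRV is clear, which is consistent with the "Processor context corrupt" line and with no faulting address being reported.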
Can be reproduced by hotplug or some plane tests like kms_atomic_transition.
I was testing my own patches and started hitting this issue after rebasing. Together with Jani we tracked down which build introduced it, and ended up here:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7596/git-log-oneline.txt
Build 7595 still works.
Commits in this build should be analyzed:
ed91b14e8d65 drm-tip: 2019y-12m-18d-14h-50m-24s UTC integration manifest
773b4b54351c drm/i915: Move stuff from haswell_crtc_disable() into encoder .post_disable()
f5271ee50d28 drm/i915: Pass old crtc state to intel_crtc_vblank_off()
cfb627c44851 drm/i915: Pass old crtc state to skylake_scaler_disable()
17bef9baf339 drm/i915: Nuke .post_pll_disable() for DDI platforms
6a6d79de4d19 drm/i915: Call hsw_fdi_link_train() directly()
74cb2751d42e drm/i915: Introduce intel_plane_state_reset()
979e94c1d64a drm/i915: Introduce intel_crtc_state_reset()
6643453987c4 drm/i915: Introduce intel_crtc_{alloc,free}()
f44bfa7fbfbb drm/i915: s/intel_crtc/crtc/ in intel_crtc_init()
ab2dd990f4ab drm: Add __drm_atomic_helper_crtc_state_reset() & co.
a51894d26ffe drm-tip: 2019y-12m-18d-13h-04m-32s UTC integration manifest
I propose setting the severity to major: after checking, it also completely bricks a production Dell XPS laptop. I think this is also a signal that we really need to improve our testing in the MST area, as regressions this severe should not go completely unnoticed.