Kernel 5.5.2 - i915_gem_evict_something, i915_gem_gtt_insert hang with no recovery/reset
I experienced frequent i915 hangs on 5.4.x that would occasionally recover. I decided to try 5.5.2, but things got much worse: the system does not recover from the hang below, there are no RCS0 resets or similar, and a hard power-off is required.
This is on Coffee Lake.
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 630 (Mobile) [8086:3e9b] (rev 02)
[ 0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.5.2-200.fc31.x86_64 root=/dev/mapper/fedora-root ro rootflags=discard i915.enable_guc=3 l1tf=flush tsc=reliable systemd.unified_cgroup_hierarchy=0
[ 0.331103] Kernel command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.5.2-200.fc31.x86_64 root=/dev/mapper/fedora-root ro rootflags=discard i915.enable_guc=3 l1tf=flush tsc=reliable systemd.unified_cgroup_hierarchy=0
[ 2.177288] i915 0000:00:02.0: Incompatible option enable_guc=3 - GuC submission is N/A
[ 2.178731] fb0: switching to inteldrmfb from EFI VGA
[ 2.179788] i915 0000:00:02.0: vgaarb: deactivate vga console
[ 2.181779] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 2.181781] [drm] Driver supports precise vblank timestamp query.
[ 2.182489] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 2.182775] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[ 2.512763] [drm] GuC communication enabled
[ 2.523661] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[ 2.523663] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
[ 2.525210] [drm] Initialized i915 1.6.0 20191101 for 0000:00:02.0 on minor 0
[ 2.585515] fbcon: i915drmfb (fb0) is primary device
[ 2.593555] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
[ 6.758175] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
[ 8.313170] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
Feb 09 13:52:40 hostname kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 09 13:52:40 hostname kernel: rcu: 14-....: (60000 ticks this GP) idle=cb6/1/0x4000000000000002 softirq=113308/113308 fqs=14684
Feb 09 13:52:40 hostname kernel: (t=60000 jiffies g=182065 q=10805)
Feb 09 13:52:40 hostname kernel: NMI backtrace for cpu 14
Feb 09 13:52:40 hostname kernel: CPU: 14 PID: 2000 Comm: Xorg Tainted: G U OEL 5.5.2-200.fc31.x86_64 #1
Feb 09 13:52:40 hostname kernel: Hardware name: Dell Inc. Precision 5540/0V030K, BIOS 1.5.0 12/25/2019
Feb 09 13:52:40 hostname kernel: Call Trace:
Feb 09 13:52:40 hostname kernel: <IRQ>
Feb 09 13:52:40 hostname kernel: dump_stack+0x66/0x90
Feb 09 13:52:40 hostname kernel: nmi_cpu_backtrace.cold+0x14/0x53
Feb 09 13:52:40 hostname kernel: ? lapic_can_unplug_cpu.cold+0x3e/0x3e
Feb 09 13:52:40 hostname kernel: nmi_trigger_cpumask_backtrace+0xdb/0xdd
Feb 09 13:52:40 hostname kernel: rcu_dump_cpu_stacks+0x92/0xc0
Feb 09 13:52:40 hostname kernel: rcu_sched_clock_irq.cold+0x1e5/0x3cd
Feb 09 13:52:40 hostname kernel: update_process_times+0x24/0x50
Feb 09 13:52:40 hostname kernel: tick_sched_handle+0x22/0x60
Feb 09 13:52:40 hostname kernel: tick_sched_timer+0x38/0x80
Feb 09 13:52:40 hostname kernel: ? tick_sched_do_timer+0x70/0x70
Feb 09 13:52:40 hostname kernel: __hrtimer_run_queues+0xf6/0x270
Feb 09 13:52:40 hostname kernel: hrtimer_interrupt+0x10e/0x240
Feb 09 13:52:40 hostname kernel: smp_apic_timer_interrupt+0x6c/0x130
Feb 09 13:52:40 hostname kernel: apic_timer_interrupt+0xf/0x20
Feb 09 13:52:40 hostname kernel: </IRQ>
Feb 09 13:52:40 hostname kernel: RIP: 0010:i915_gem_evict_something+0x145/0x450 [i915]
Feb 09 13:52:40 hostname kernel: Code: 89 42 08 48 89 10 48 8b 04 24 4c 89 ea 4c 89 e7 48 8b b0 f0 01 00 00 48 89 74 24 08 e8 14 c2 e0 c3 84 c0 74 21 48 8b 74 24 08 <48> 8b 04 24 4c 89 a0 f0 01 00 00 4d 89 ae 00 02 00 00 49 89 b6 08
Feb 09 13:52:40 hostname kernel: RSP: 0018:ffffb59e40db7940 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Feb 09 13:52:40 hostname kernel: RAX: 0000000000000001 RBX: ffff96025aa1d800 RCX: 0000000000000000
Feb 09 13:52:40 hostname kernel: RDX: ffff9603270b4c40 RSI: ffff9603270b4c40 RDI: ffff9603270b4400
Feb 09 13:52:40 hostname kernel: RBP: ffffb59e40db7960 R08: ffff9603286058e8 R09: ffff9603286058e8
Feb 09 13:52:40 hostname kernel: R10: 000000000000040c R11: ffff9601fb3be300 R12: ffff9603270b4400
Feb 09 13:52:40 hostname kernel: R13: ffff9603286058e8 R14: ffff9603270b4200 R15: ffff9603270b5ac0
Feb 09 13:52:40 hostname kernel: i915_gem_gtt_insert+0x174/0x250 [i915]
Feb 09 13:52:40 hostname kernel: i915_vma_pin+0x62f/0x6f0 [i915]
Feb 09 13:52:40 hostname kernel: i915_gem_object_pin+0x12d/0x1a0 [i915]
Feb 09 13:52:40 hostname kernel: i915_gem_object_pin_to_display_plane+0xa9/0xf0 [i915]
Feb 09 13:52:40 hostname kernel: intel_pin_and_fence_fb_obj+0x9d/0x1c0 [i915]
Feb 09 13:52:40 hostname kernel: intel_plane_pin_fb+0x44/0xd0 [i915]
Feb 09 13:52:40 hostname kernel: intel_prepare_plane_fb+0xe0/0x310 [i915]
Feb 09 13:52:40 hostname kernel: drm_atomic_helper_prepare_planes+0x8a/0x110 [drm_kms_helper]
Feb 09 13:52:40 hostname kernel: intel_atomic_commit+0xd9/0x350 [i915]
Feb 09 13:52:40 hostname kernel: drm_atomic_helper_page_flip+0x5c/0xc0 [drm_kms_helper]
Feb 09 13:52:40 hostname kernel: drm_mode_page_flip_ioctl+0x54b/0x5d0 [drm]
Feb 09 13:52:40 hostname kernel: ? drm_mode_cursor2_ioctl+0x10/0x10 [drm]
Feb 09 13:52:40 hostname kernel: drm_ioctl_kernel+0xaa/0xf0 [drm]
Feb 09 13:52:40 hostname kernel: drm_ioctl+0x208/0x390 [drm]
Feb 09 13:52:40 hostname kernel: ? drm_mode_cursor2_ioctl+0x10/0x10 [drm]
Feb 09 13:52:40 hostname kernel: ? enqueue_hrtimer+0x36/0x90
Feb 09 13:52:40 hostname kernel: ? timerqueue_del+0x1e/0x40
Feb 09 13:52:40 hostname kernel: ? __remove_hrtimer+0x35/0x70
Feb 09 13:52:40 hostname kernel: do_vfs_ioctl+0x461/0x6d0
Feb 09 13:52:40 hostname kernel: ? do_setitimer+0xad/0x1f0
Feb 09 13:52:40 hostname kernel: ? __x64_sys_setitimer+0xa3/0xf0
Feb 09 13:52:40 hostname kernel: ksys_ioctl+0x5e/0x90
Feb 09 13:52:40 hostname kernel: __x64_sys_ioctl+0x16/0x20
Feb 09 13:52:40 hostname kernel: do_syscall_64+0x5b/0x1c0
Feb 09 13:52:40 hostname kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 09 13:52:40 hostname kernel: RIP: 0033:0x7fae4bd1238b
Feb 09 13:52:40 hostname kernel: Code: 0f 1e fa 48 8b 05 fd 9a 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 9a 0c 00 f7 d8 64 89 01 48
Feb 09 13:52:40 hostname kernel: RSP: 002b:00007fff1190b9e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Feb 09 13:52:40 hostname kernel: RAX: ffffffffffffffda RBX: 00007fff1190baa0 RCX: 00007fae4bd1238b
Feb 09 13:52:40 hostname kernel: RDX: 00007fff1190baa0 RSI: 00000000c01864b0 RDI: 000000000000000f
Feb 09 13:52:40 hostname kernel: RBP: 00000000c01864b0 R08: 0000000000000ac8 R09: 000055c9c4d31740
Feb 09 13:52:40 hostname kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000