[skl/iris] gpu hang while using blender
I hadn't used blender in a few months, so I don't know when this started, but for the last 2 weeks I've experienced GPU hangs that seem to happen at random: sometimes as soon as I do something in blender, other times only after hours of use.
This is on a Skylake GT2 (191B) running on iris. I have just tried i965 for about an hour and haven't been able to reproduce the issue, but given how rarely it reproduces, that doesn't necessarily mean anything.
If it's iris-specific, then it might not even be a regression: iris may always have had this bug, and I just wasn't using it as my daily driver until mesa 20.0 made it the default. I'm not sure how long it had been since I last did anything in blender.
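For anyone who wants to repeat the iris vs. i965 comparison, here is a minimal sketch of how a driver can be forced for a single program. It relies on Mesa's MESA_LOADER_DRIVER_OVERRIDE environment variable and assumes the installed Mesa build still ships i965; the run_with_driver helper name is made up for this example.

```shell
# Sketch: launch one program with a specific Mesa driver forced.
# MESA_LOADER_DRIVER_OVERRIDE is read by Mesa's loader; run_with_driver
# is a hypothetical helper, not part of any tool.
run_with_driver() {
    driver="$1"
    shift
    # The variable is set only for the launched command.
    MESA_LOADER_DRIVER_OVERRIDE="$driver" "$@"
}

# Example (assumes blender is in PATH):
#   run_with_driver i965 blender
#   run_with_driver iris blender
```

The renderer string shown by the application (or by glxinfo, if available) should confirm which driver was actually loaded.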
Reproduced on:
- linux 5.8.0 - 5.8.3
- mesa 20.1.4 - 20.1.6
- blender 2.83.4 - 2.83.5
on any of these:
- sway 1.5 + xwayland 1.20.8
- i3 4.18.2 + xserver 1.20.8
- gnome-shell 3.36.5 + xserver 1.20.8
dmesg excerpt (full dmesg attached):
[ 4794.394424] i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=26 end=27) time 14665 us, min 1073, max 1079, scanline start 216, end 81
[ 4820.980419] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 4820.980432] i915 0000:00:02.0: [drm] blender[2588] context reset due to GPU hang
[ 4821.014670] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:87f99eb9, in blender [2588]
[ 4821.014673] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4821.014674] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 4821.014675] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 4821.014676] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4821.014677] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 4821.014678] GPU crash dump saved to /sys/class/drm/card0/error
[ 4821.015173] ------------[ cut here ]------------
[ 4821.015181] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:4580 default_wake_function+0x16/0x30
[ 4821.015182] Modules linked in: fuse ccm algif_aead des_generic libdes ecb arc4 libarc4 algif_skcipher cmac md4 8021q algif_hash garp mrp af_alg stp llc bbswitch(OE) nls_iso8859_1 nls_cp437 vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic ecc crc16 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc snd_hda_codec_hdmi brcmfmac snd_hda_codec_realtek joydev mousedev snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp hid_multitouch coretemp brcmutil snd_hda_intel kvm_intel snd_intel_dspcfg snd_hda_codec ee1004 cfg80211 iTCO_wdt mei_hdcp mei_wdt intel_pmc_bxt iTCO_vendor_support snd_hda_core intel_rapl_msr kvm dell_wmi wmi_bmof snd_hwdep mxm_wmi intel_wmi_thunderbolt dell_laptop ledtrig_audio snd_pcm dell_smbios snd_timer psmouse irqbypass dell_wmi_descriptor rapl intel_cstate dcdbas intel_uncore snd i2c_i801 pcspkr input_leds mei_me soundcore i2c_smbus rfkill processor_thermal_device mei intel_lpss_pci intel_hid sparse_keymap
[ 4821.015225] intel_rapl_common intel_lpss intel_pch_thermal idma64 i2c_hid intel_soc_dts_iosf battery int3403_thermal wmi dell_smo8800 int3400_thermal int3402_thermal int340x_thermal_zone evdev ac acpi_thermal_rel mac_hid pkcs8_key_parser typec_displayport typec usb_storage usbmon wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic tun dell_smm_hwmon loop crypto_user acpi_call(OE) ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys dm_mod trusted tpm rng_core hid_generic usbhid hid rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas xhci_hcd rtsx_pci i8042 serio i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm intel_agp intel_gtt agpgart
[ 4821.015268] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.8.3-zen1-1-zen #1
[ 4821.015269] Hardware name: Dell Inc. XPS 15 9550/0X2P13, BIOS 1.13.1 12/12/2019
[ 4821.015273] RIP: 0010:default_wake_function+0x16/0x30
[ 4821.015276] Code: e8 2f 79 48 00 eb 99 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f7 c2 fe ff ff ff 75 09 48 8b 7f 08 e9 1a f9 ff ff <0f> 0b 48 8b 7f 08 e9 0f f9 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
[ 4821.015277] RSP: 0018:ffffafb3c003cd78 EFLAGS: 00010086
[ 4821.015279] RAX: ffffffffb1709850 RBX: ffffafb3c0343de8 RCX: ffffafb3c003cd90
[ 4821.015281] RDX: 00000000fffffffb RSI: 0000000000000003 RDI: ffffafb3c0343de8
[ 4821.015282] RBP: ffff9c16e75cf568 R08: 0000000000000001 R09: 0000000000000001
[ 4821.015283] R10: ffff9c15c04fbd00 R11: 0000000000000001 R12: ffffafb3c0343e00
[ 4821.015285] R13: 0000000000000046 R14: ffffafb3c003cd90 R15: ffff9c16e75cf560
[ 4821.015287] FS: 0000000000000000(0000) GS:ffff9c1775c80000(0000) knlGS:0000000000000000
[ 4821.015288] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4821.015290] CR2: 00005605f67da653 CR3: 00000001b080a002 CR4: 00000000003606e0
[ 4821.015291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4821.015293] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4821.015294] Call Trace:
[ 4821.015296] <IRQ>
[ 4821.015301] autoremove_wake_function+0xe/0x30
[ 4821.015349] __i915_sw_fence_complete.part.0+0x147/0x1c0 [i915]
[ 4821.015390] dma_i915_sw_fence_wake_timer+0x49/0x70 [i915]
[ 4821.015430] signal_irq_work+0x2de/0x3f0 [i915]
[ 4821.015439] ? usb_hcd_submit_urb+0xc6/0xdb0
[ 4821.015444] irq_work_run_list+0x53/0x70
[ 4821.015447] irq_work_run+0x26/0x50
[ 4821.015450] __sysvec_irq_work+0x2d/0xe0
[ 4821.015454] sysvec_irq_work+0x41/0xe0
[ 4821.015458] asm_sysvec_irq_work+0x12/0x20
[ 4821.015462] RIP: 0010:__do_softirq+0x8e/0x33c
[ 4821.015464] Code: e8 57 ea 2d ff c7 44 24 18 0a 00 00 00 48 c7 c7 27 c9 9f b2 e8 d3 95 d7 ff 65 66 c7 05 b9 ba c2 4d 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 48 c7 c3 c0 50 e0 b2 41 0f bc c7 89 c1 83 c1 01 89
[ 4821.015465] RSP: 0018:ffffafb3c003cfa0 EFLAGS: 00000292
[ 4821.015467] RAX: 0000000000000001 RBX: ffff9c177454dd00 RCX: 000000000000001f
[ 4821.015468] RDX: 0000000000000000 RSI: ffffffffb29fc927 RDI: ffffffffb2994336
[ 4821.015470] RBP: ffffafb3c00d7d50 R08: 000004627ae567af R09: 0000000000000000
[ 4821.015471] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c176da08800
[ 4821.015472] R13: ffffffffb1735570 R14: 0000000000000000 R15: 0000000000000001
[ 4821.015476] ? handle_level_irq+0x1a0/0x1a0
[ 4821.015481] ? handle_irq_event+0x78/0xb0
[ 4821.015485] ? handle_level_irq+0x1a0/0x1a0
[ 4821.015487] asm_call_on_stack+0xf/0x20
[ 4821.015489] </IRQ>
[ 4821.015492] do_softirq_own_stack+0x5f/0x80
[ 4821.015495] irq_exit_rcu+0xc5/0x120
[ 4821.015499] common_interrupt+0xd1/0x200
[ 4821.015502] asm_common_interrupt+0x1e/0x40
[ 4821.015505] RIP: 0010:cpuidle_enter_state+0xc9/0x840
[ 4821.015507] Code: e8 1c 31 7c ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 f2 05 00 00 31 ff e8 ee e7 83 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e4 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
[ 4821.015509] RSP: 0018:ffffafb3c00d7e68 EFLAGS: 00000246
[ 4821.015510] RAX: ffff9c1775c80000 RBX: ffff9c1775cb6800 RCX: 000000000000001f
[ 4821.015511] RDX: 0000000000000000 RSI: ffffffffb2961d90 RDI: ffffffffb296c0d1
[ 4821.015513] RBP: ffffffffb2ec9f20 R08: 000004627ae5511f R09: 0000000000000417
[ 4821.015514] R10: 0000000000000417 R11: 0000000000000007 R12: 0000000000000006
[ 4821.015515] R13: ffff9c1775cb6800 R14: 0000000000000006 R15: 000004627ae5511f
[ 4821.015520] ? cpuidle_enter_state+0xa4/0x840
[ 4821.015523] cpuidle_enter+0x29/0x40
[ 4821.015528] do_idle+0x1fb/0x2d0
[ 4821.015532] cpu_startup_entry+0x19/0x20
[ 4821.015536] start_secondary+0x1b8/0x200
[ 4821.015540] secondary_startup_64+0xb6/0xc0
[ 4821.015544] ---[ end trace 64581018dc260fbd ]---
[ 4821.748418] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[ 4821.748431] i915 0000:00:02.0: [drm] blender[2588] context reset due to GPU hang
[ 4821.763778] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dfbfff, in blender [2588]
[ 4857.926273] ------------[ cut here ]------------
[ 4857.926279] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:2474 ttwu_queue_wakelist+0xbd/0xd0
[ 4857.926279] Modules linked in: fuse ccm algif_aead des_generic libdes ecb arc4 libarc4 algif_skcipher cmac md4 8021q algif_hash garp mrp af_alg stp llc bbswitch(OE) nls_iso8859_1 nls_cp437 vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic ecc crc16 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc snd_hda_codec_hdmi brcmfmac snd_hda_codec_realtek joydev mousedev snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp hid_multitouch coretemp brcmutil snd_hda_intel kvm_intel snd_intel_dspcfg snd_hda_codec ee1004 cfg80211 iTCO_wdt mei_hdcp mei_wdt intel_pmc_bxt iTCO_vendor_support snd_hda_core intel_rapl_msr kvm dell_wmi wmi_bmof snd_hwdep mxm_wmi intel_wmi_thunderbolt dell_laptop ledtrig_audio snd_pcm dell_smbios snd_timer psmouse irqbypass dell_wmi_descriptor rapl intel_cstate dcdbas intel_uncore snd i2c_i801 pcspkr input_leds mei_me soundcore i2c_smbus rfkill processor_thermal_device mei intel_lpss_pci intel_hid sparse_keymap
[ 4857.926304] intel_rapl_common intel_lpss intel_pch_thermal idma64 i2c_hid intel_soc_dts_iosf battery int3403_thermal wmi dell_smo8800 int3400_thermal int3402_thermal int340x_thermal_zone evdev ac acpi_thermal_rel mac_hid pkcs8_key_parser typec_displayport typec usb_storage usbmon wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic tun dell_smm_hwmon loop crypto_user acpi_call(OE) ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys dm_mod trusted tpm rng_core hid_generic usbhid hid rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas xhci_hcd rtsx_pci i8042 serio i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm intel_agp intel_gtt agpgart
[ 4857.926361] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W OE 5.8.3-zen1-1-zen #1
[ 4857.926362] Hardware name: Dell Inc. XPS 15 9550/0X2P13, BIOS 1.13.1 12/12/2019
[ 4857.926364] RIP: 0010:ttwu_queue_wakelist+0xbd/0xd0
[ 4857.926365] Code: 64 09 00 b8 01 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 c3 31 c0 40 f6 c5 08 74 ee 48 c7 c2 40 c3 02 00 83 7c 11 04 01 77 e0 eb 85 <0f> 0b 31 c0 eb d8 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44
[ 4857.926366] RSP: 0018:ffffafb3c003ccf0 EFLAGS: 00010046
[ 4857.926367] RAX: 0000000000000001 RBX: ffff9c176ddc0000 RCX: ffff9c1775c80000
[ 4857.926368] RDX: 000000000002c340 RSI: ffffffffb2961d90 RDI: ffffffffb296c0d1
[ 4857.926369] RBP: 00000000ffffffff R08: 0000046b12f84355 R09: ffff9c177479c720
[ 4857.926369] R10: 0000000000000005 R11: 0000000000000005 R12: 0000000000000001
[ 4857.926370] R13: 0000000000000001 R14: ffff9c176ddc07ac R15: 000000000002c340
[ 4857.926371] FS: 0000000000000000(0000) GS:ffff9c1775c80000(0000) knlGS:0000000000000000
[ 4857.926372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4857.926372] CR2: 00007f975bdb800c CR3: 00000001b080a006 CR4: 00000000003606e0
[ 4857.926373] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4857.926374] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4857.926374] Call Trace:
[ 4857.926376] <IRQ>
[ 4857.926379] try_to_wake_up+0x1a8/0x610
[ 4857.926382] autoremove_wake_function+0xe/0x30
[ 4857.926428] __i915_sw_fence_complete.part.0+0x147/0x1c0 [i915]
[ 4857.926445] dma_i915_sw_fence_wake_timer+0x49/0x70 [i915]
[ 4857.926463] signal_irq_work+0x2de/0x3f0 [i915]
[ 4857.926466] irq_work_run_list+0x53/0x70
[ 4857.926467] irq_work_run+0x26/0x50
[ 4857.926469] __sysvec_irq_work+0x2d/0xe0
[ 4857.926471] sysvec_irq_work+0x41/0xe0
[ 4857.926473] asm_sysvec_irq_work+0x12/0x20
[ 4857.926475] RIP: 0010:__do_softirq+0x8e/0x33c
[ 4857.926476] Code: e8 57 ea 2d ff c7 44 24 18 0a 00 00 00 48 c7 c7 27 c9 9f b2 e8 d3 95 d7 ff 65 66 c7 05 b9 ba c2 4d 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 48 c7 c3 c0 50 e0 b2 41 0f bc c7 89 c1 83 c1 01 89
[ 4857.926476] RSP: 0018:ffffafb3c003cfa0 EFLAGS: 00000292
[ 4857.926477] RAX: 0000000000000001 RBX: ffff9c177454dd00 RCX: 000000000000001f
[ 4857.926478] RDX: 0000000000000000 RSI: ffffffffb29fc927 RDI: ffffffffb2994336
[ 4857.926478] RBP: ffffafb3c00d7d50 R08: 0000046b12f773c5 R09: 0000000000000000
[ 4857.926479] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c176da08800
[ 4857.926479] R13: ffffffffb1735570 R14: 0000000000000000 R15: 0000000000000001
[ 4857.926481] ? handle_level_irq+0x1a0/0x1a0
[ 4857.926484] ? handle_irq_event+0x78/0xb0
[ 4857.926485] ? handle_level_irq+0x1a0/0x1a0
[ 4857.926486] asm_call_on_stack+0xf/0x20
[ 4857.926487] </IRQ>
[ 4857.926489] do_softirq_own_stack+0x5f/0x80
[ 4857.926491] irq_exit_rcu+0xc5/0x120
[ 4857.926492] common_interrupt+0xd1/0x200
[ 4857.926493] asm_common_interrupt+0x1e/0x40
[ 4857.926496] RIP: 0010:cpuidle_enter_state+0xc9/0x840
[ 4857.926496] Code: e8 1c 31 7c ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 f2 05 00 00 31 ff e8 ee e7 83 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 e4 01 00 00 49 63 d4 4c 2b 7c 24 10 48 8d 04 52 48
[ 4857.926497] RSP: 0018:ffffafb3c00d7e68 EFLAGS: 00000246
[ 4857.926498] RAX: ffff9c1775c80000 RBX: ffff9c1775cb6800 RCX: 000000000000001f
[ 4857.926498] RDX: 0000000000000000 RSI: ffffffffb2961d90 RDI: ffffffffb296c0d1
[ 4857.926499] RBP: ffffffffb2ec9f20 R08: 0000046b12f767ef R09: 0000000000000749
[ 4857.926499] R10: 0000000000000749 R11: 0000000000000007 R12: 0000000000000006
[ 4857.926500] R13: ffff9c1775cb6800 R14: 0000000000000006 R15: 0000046b12f767ef
[ 4857.926502] ? cpuidle_enter_state+0xa4/0x840
[ 4857.926504] cpuidle_enter+0x29/0x40
[ 4857.926505] do_idle+0x1fb/0x2d0
[ 4857.926507] cpu_startup_entry+0x19/0x20
[ 4857.926509] start_secondary+0x1b8/0x200
[ 4857.926512] secondary_startup_64+0xb6/0xc0
[ 4857.926513] ---[ end trace 64581018dc260fbe ]---
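Since the hang lines are buried between the two stack traces above, this small sketch pulls just the hang events out of a saved dmesg log. It assumes the log was saved to a file and that the lines follow the "GPU HANG: ecode ..., in ..." format shown in the excerpt; hang_summary is a hypothetical helper name.

```shell
# Sketch: print "timestamp ecode process" for every GPU HANG line
# in a saved dmesg log. hang_summary is a made-up helper name.
hang_summary() {
    sed -n 's/^\[ *\([0-9.]*\)\].*GPU HANG: ecode \([^,]*\), in \(.*\)$/\1 \2 \3/p' "$1"
}

# Example (dmesg.txt is an assumed file name):
#   dmesg > dmesg.txt
#   hang_summary dmesg.txt
```

This makes it easier to see how many hangs occurred and whether the ecode differs between them, as it does for the two hangs in this excerpt.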
I don't really know how to read aubinator's output, but I did notice these two errors, which sound relevant:
$ aubinator_error_decode blender-gpu-hang.drm-dump | grep unknown
0xfff59040: unknown instruction 1e9e02d1
0xfff5a5a4: unknown instruction 791b0002