After the system has been running for a few hours, the display freezes with a "GPU HANG" error in the kernel log
1. A clear subject describing the issue:
After the system has been running for a few hours, the display freezes.
2. Steps to reproduce the issue:
On Ubuntu 16.04 LTS, leave the following set of applications running (similar to our production environment):
- glxgears (mesa-utils)
- nw.js (https://nwjs.io/)
- a simple custom OpenGL application
- guvcview (http://guvcview.sourceforge.net/)
3. How often do the steps listed above trigger the issue?
Sometimes after a few hours, sometimes after several hours
4. Bug information:
- system architecture: x86_64
- kernel version: 5.5.3-050503-generic
- Linux distribution: Ubuntu 16.04.1 LTS xenial
- Machine or mother board model: see "dmidecode.txt" in the zip attachments
- Behavior:
When the display freezes, we find mainly three types of errors in the kernel logs:
A) GPU HANG (see gpu_hang.zip in attachments):
In this case the error dump (from /sys/class/drm/card0/error) and the intel_reg_dump output are also attached in the zip.
[84027.714807] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[84027.715686] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[84027.715820] [drm:intel_engine_reset [i915]] Failed to reset rcs0, ret=-110
[84034.947357] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[84034.948187] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[84034.948297] [drm:intel_engine_reset [i915]] Failed to reset rcs0, ret=-110
[84038.019550] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[84038.020394] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[84038.020517] [drm:intel_engine_reset [i915]] Failed to reset rcs0, ret=-110
[84041.091051] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, stopped heartbeat on rcs0
[84041.091057] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[84041.091058] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[84041.091060] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[84041.091062] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[84041.091063] GPU crash dump saved to /sys/class/drm/card0/error
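For anyone trying to capture the same data, here is a minimal sketch of saving the error dump right after a hang. It only assumes the sysfs path printed in the log above; the function name `save_error_state` and the output file naming are hypothetical, and reading the node normally requires root.

```python
import shutil
import time
from pathlib import Path

# Path taken from the kernel log above ("GPU crash dump saved to ...").
ERROR_NODE = Path("/sys/class/drm/card0/error")


def save_error_state(src=ERROR_NODE, out_dir="."):
    """Copy the i915 error state to a timestamped file and return its path.

    Hypothetical helper: the name and the file naming scheme are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / f"i915_error_{time.strftime('%Y%m%d-%H%M%S')}.txt"
    # sysfs files do not report a reliable size, so stream the content
    # instead of relying on stat().
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `save_error_state(out_dir="/tmp/gpu_hang")` as root after a freeze should leave a copy of the dump that can be zipped and attached.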
B) refcount_t: saturated; leaking memory + BUG: unable to handle page fault for address: 000000003d0647ae (see gpu_freeze_1.zip in attachments):
[55531.285391] ------------[ cut here ]------------
[55531.285395] refcount_t: saturated; leaking memory.
[55531.285457] WARNING: CPU: 0 PID: 3806 at lib/refcount.c:19 refcount_warn_saturate+0x8e/0xf0
[55531.285459] Modules linked in: tcp_diag inet_diag intel_rapl_msr mei_hdcp intel_rapl_common intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core intel_pmc_ipc x86_pkg_temp_thermal coretemp kvm_intel snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda kvm snd_sof snd_hda_ext_core intel_cstate nls_iso8859_1 intel_rapl_perf snd_soc_acpi_intel_match snd_soc_acpi ledtrig_audio snd_soc_core snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core lpc_ich uvcvideo snd_usb_audio videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_seq_midi snd_usbmidi_lib snd_seq_midi_event videobuf2_common snd_hwdep videodev joydev input_leds mc snd_pcm snd_rawmidi hid_multitouch intel_lpss_pci intel_lpss snd_seq intel_xhci_usb_role_switch idma64 virt_dma roles mac_hid snd_seq_device snd_timer snd soundcore mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp
[55531.285512] libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid mmc_block crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915 drm_kms_helper aesni_intel crypto_simd cryptd glue_helper igb sdhci_pci syscopyarea cqhci sysfillrect sysimgblt sdhci fb_sys_fops ahci dca i2c_algo_bit libahci drm video pinctrl_broxton pinctrl_intel
[55531.285545] CPU: 0 PID: 3806 Comm: cat Not tainted 5.5.3-050503-generic #202002110832
[55531.285547] Hardware name: QubicaAMF ST05/0C06, BIOS 1.07.1 05/14/2019
[55531.285551] RIP: 0010:refcount_warn_saturate+0x8e/0xf0
[55531.285554] Code: 8f 5f 2d 01 01 e8 87 18 b6 ff 0f 0b 5d c3 80 3d 81 5f 2d 01 00 75 b1 48 c7 c7 d0 8e 3d 92 c6 05 71 5f 2d 01 01 e8 67 18 b6 ff <0f> 0b 5d c3 80 3d 5e 5f 2d 01 00 75 91 48 c7 c7 28 8f 3d 92 c6 05
[55531.285556] RSP: 0018:ffffb2f5c0debc00 EFLAGS: 00010286
[55531.285559] RAX: 0000000000000000 RBX: ffffb2f5c0debd08 RCX: 0000000000000007
[55531.285561] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9822bbc19800
[55531.285562] RBP: ffffb2f5c0debc00 R08: 0000000000b0e12f R09: 0000000000000004
[55531.285564] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9822b3a34000
[55531.285566] R13: 0000000080000000 R14: 0000000000000001 R15: ffff98218374a938
[55531.285568] FS: 00007f224f125700(0000) GS:ffff9822bbc00000(0000) knlGS:0000000000000000
[55531.285570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55531.285572] CR2: 00000000018d2c20 CR3: 000000017609c000 CR4: 00000000003406f0
[55531.285574] Call Trace:
[55531.285658] per_file_stats+0x1cf/0x1f0 [i915]
[55531.285721] ? gpu_state_release+0x50/0x50 [i915]
[55531.285726] idr_for_each+0x60/0xd0
[55531.285789] print_context_stats+0x1b9/0x350 [i915]
[55531.285796] ? seq_vprintf+0x35/0x50
[55531.285799] ? seq_printf+0x53/0x70
[55531.285861] i915_gem_object_info+0x54/0x60 [i915]
[55531.285865] seq_read+0xdc/0x470
[55531.285870] full_proxy_read+0x5c/0x90
[55531.285874] __vfs_read+0x1b/0x40
[55531.285877] vfs_read+0xab/0x160
[55531.285880] ksys_read+0x67/0xe0
[55531.285883] __x64_sys_read+0x1a/0x20
[55531.285888] do_syscall_64+0x57/0x1b0
[55531.285894] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[55531.285897] RIP: 0033:0x7f224ec51260
[55531.285900] Code: 0b 31 c0 48 83 c4 08 e9 ae fe ff ff 48 8d 3d 27 b4 09 00 e8 b2 1e 02 00 66 90 83 3d e9 24 2d 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e de 01 00 48 89 04 24
[55531.285902] RSP: 002b:00007ffe18623e58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[55531.285904] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f224ec51260
[55531.285906] RDX: 0000000000020000 RSI: 00007f224f128000 RDI: 0000000000000003
[55531.285908] RBP: 0000000000020000 R08: ffffffffffffffff R09: 0000000000000000
[55531.285909] R10: 000000000000037b R11: 0000000000000246 R12: 00007f224f128000
[55531.285911] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000020000
[55531.285915] ---[ end trace 0bcec234ef6b4317 ]---
....
[55531.294867] BUG: unable to handle page fault for address: 000000003d0647ae
[55531.294877] #PF: supervisor read access in kernel mode
[55531.294879] #PF: error_code(0x0000) - not-present page
[55531.294882] PGD 1795f9067 P4D 1795f9067 PUD 1795f8067 PMD 0
[55531.294889] Oops: 0000 [#1] SMP NOPTI
[55531.294895] CPU: 0 PID: 3755 Comm: glxgears Tainted: G W 5.5.3-050503-generic #202002110832
[55531.294898] Hardware name: QubicaAMF ST05/0C06, BIOS 1.07.1 05/14/2019
[55531.294908] RIP: 0010:kmem_cache_alloc+0x85/0x220
[55531.294912] Code: 65 49 8b 50 08 65 4c 03 05 e8 75 d6 6e 4d 8b 30 4d 85 f6 0f 84 7c 01 00 00 41 8b 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 bd
[55531.294915] RSP: 0018:ffffb2f5c0ba3998 EFLAGS: 00010202
[55531.294918] RAX: 000000003d0647ae RBX: 000000003d0647ae RCX: 0000000000312a9d
[55531.294921] RDX: 0000000000312a9c RSI: 0000000000000dc0 RDI: 0000000000035e70
[55531.294923] RBP: ffffb2f5c0ba39c8 R08: ffff9822bbc35e70 R09: ffff9821a0e90e40
[55531.294926] R10: 0000000000000000 R11: ffff9822aecef040 R12: 0000000000000dc0
[55531.294928] R13: ffff9822ba8acfc0 R14: 000000003d0647ae R15: ffff9822ba8acfc0
[55531.294932] FS: 00007fed72858740(0000) GS:ffff9822bbc00000(0000) knlGS:0000000000000000
[55531.294934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55531.294937] CR2: 000000003d0647ae CR3: 0000000179442000 CR4: 00000000003406f0
[55531.294939] Call Trace:
[55531.295029] ? i915_vma_instance+0xe0/0x4e0 [i915]
[55531.295099] i915_vma_instance+0xe0/0x4e0 [i915]
[55531.295165] eb_lookup_vmas+0x655/0xb00 [i915]
[55531.295231] i915_gem_do_execbuffer+0x3c1/0xe70 [i915]
[55531.295242] ? recalibrate_cpu_khz+0x10/0x10
[55531.295247] ? ktime_get_mono_fast_ns+0x4e/0xa0
[55531.295304] ? intel_runtime_pm_put_unchecked+0x33/0x40 [i915]
[55531.295309] ? kvmalloc_node+0x7b/0x90
[55531.295315] ? __kmalloc_node+0x30d/0x320
[55531.295380] i915_gem_execbuffer2_ioctl+0x1eb/0x3d0 [i915]
[55531.295449] ? i915_gem_madvise_ioctl+0x120/0x2c0 [i915]
[55531.295513] ? i915_gem_execbuffer_ioctl+0x2d0/0x2d0 [i915]
[55531.295550] drm_ioctl_kernel+0xae/0xf0 [drm]
[55531.295578] drm_ioctl+0x234/0x3d0 [drm]
[55531.295643] ? i915_gem_execbuffer_ioctl+0x2d0/0x2d0 [i915]
[55531.295648] ? vfs_writev+0xc3/0xf0
[55531.295654] do_vfs_ioctl+0x458/0x6d0
[55531.295660] ? __sys_recvmsg+0x59/0xa0
[55531.295664] ksys_ioctl+0x67/0x90
[55531.295669] __x64_sys_ioctl+0x1a/0x20
[55531.295674] do_syscall_64+0x57/0x1b0
[55531.295680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[55531.295684] RIP: 0033:0x7fed71addf47
[55531.295688] Code: 00 00 00 48 8b 05 51 6f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
[55531.295691] RSP: 002b:00007ffed7f3c3a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[55531.295694] RAX: ffffffffffffffda RBX: 00007fed72830040 RCX: 00007fed71addf47
[55531.295697] RDX: 00007ffed7f3c400 RSI: 0000000040406469 RDI: 0000000000000004
[55531.295699] RBP: 00007ffed7f3c400 R08: 0000000000000004 R09: 0000000000000001
[55531.295702] R10: 00007fed72830040 R11: 0000000000000246 R12: 0000000040406469
[55531.295704] R13: 0000000000000004 R14: 00007fed6f2d5f40 R15: 00007fed728550e8
[55531.295708] Modules linked in: tcp_diag inet_diag intel_rapl_msr mei_hdcp intel_rapl_common intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core intel_pmc_ipc x86_pkg_temp_thermal coretemp kvm_intel snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda kvm snd_sof snd_hda_ext_core intel_cstate nls_iso8859_1 intel_rapl_perf snd_soc_acpi_intel_match snd_soc_acpi ledtrig_audio snd_soc_core snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core lpc_ich uvcvideo snd_usb_audio videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_seq_midi snd_usbmidi_lib snd_seq_midi_event videobuf2_common snd_hwdep videodev joydev input_leds mc snd_pcm snd_rawmidi hid_multitouch intel_lpss_pci intel_lpss snd_seq intel_xhci_usb_role_switch idma64 virt_dma roles mac_hid snd_seq_device snd_timer snd soundcore mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp
[55531.295759] libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid mmc_block crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915 drm_kms_helper aesni_intel crypto_simd cryptd glue_helper igb sdhci_pci syscopyarea cqhci sysfillrect sysimgblt sdhci fb_sys_fops ahci dca i2c_algo_bit libahci drm video pinctrl_broxton pinctrl_intel
[55531.295791] CR2: 000000003d0647ae
[55531.295796] ---[ end trace 0bcec234ef6b4319 ]---
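As a side note, the user-space registers at the end of the trace above (ORIG_RAX 0x10, i.e. the ioctl syscall, with RSI 0x40406469) can be decoded to check which ioctl glxgears was issuing. The sketch below assumes the generic Linux `_IOC` bit layout used on x86_64; the two DRM constants are quoted from the kernel uapi headers as I understand them, and `decode_ioctl` is a hypothetical helper name.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields (generic layout)."""
    return {
        "nr": request & 0xFF,                # bits 0-7: command number
        "type": chr((request >> 8) & 0xFF),  # bits 8-15: "magic" character
        "size": (request >> 16) & 0x3FFF,    # bits 16-29: argument size in bytes
        "dir": (request >> 30) & 0x3,        # bits 30-31: 1 = write, 2 = read
    }


# Assumed values from the uapi headers (drm.h / i915_drm.h).
DRM_COMMAND_BASE = 0x40
DRM_I915_GEM_EXECBUFFER2 = 0x29

fields = decode_ioctl(0x40406469)  # RSI value from the trace above
print(fields)
print(fields["nr"] == DRM_COMMAND_BASE + DRM_I915_GEM_EXECBUFFER2)
```

Under these assumptions the request decodes to DRM type 'd' with command number 0x69, which is consistent with the i915_gem_execbuffer2_ioctl frame in the call trace.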
C) general protection fault: 0000 [#1] SMP NOPTI (see gpu_freeze_2.zip in attachments):
[ 170.539569] general protection fault: 0000 [#1] SMP NOPTI
[ 170.539578] CPU: 0 PID: 3325 Comm: glxgears Not tainted 5.5.3-050503-generic #202002110832
[ 170.539580] Hardware name: QubicaAMF ST05/0C06, BIOS 1.07.1 05/14/2019
[ 170.539587] RIP: 0010:kmem_cache_alloc+0x85/0x220
[ 170.539590] Code: 65 49 8b 50 08 65 4c 03 05 e8 75 b6 49 4d 8b 30 4d 85 f6 0f 84 7c 01 00 00 41 8b 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 bd
[ 170.539592] RSP: 0018:ffffbba840747970 EFLAGS: 00010202
[ 170.539595] RAX: 16f888bb12336eba RBX: 16f888bb12336eba RCX: 0000000000027e8a
[ 170.539596] RDX: 0000000000027e89 RSI: 0000000000000cc0 RDI: 00000000000324f0
[ 170.539598] RBP: ffffbba8407479a0 R08: ffff90597bc324f0 R09: ffff905977ea8148
[ 170.539600] R10: 0000000000000cc0 R11: ffff90596ee80000 R12: 0000000000000cc0
[ 170.539601] R13: ffff90597b19b340 R14: 16f888bb12336eba R15: ffff90597b19b340
[ 170.539604] FS: 00007fdcd8d90740(0000) GS:ffff90597bc00000(0000) knlGS:0000000000000000
[ 170.539605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 170.539607] CR2: 00007fc770dfd0ec CR3: 0000000176e4a000 CR4: 00000000003406f0
[ 170.539608] Call Trace:
[ 170.539679] ? i915_active_ref+0x68/0x190 [i915]
[ 170.539725] i915_active_ref+0x68/0x190 [i915]
[ 170.539772] __i915_vma_move_to_active+0x36/0x40 [i915]
[ 170.539817] i915_vma_move_to_active+0x2b/0x170 [i915]
[ 170.539892] eb_submit+0xff/0x490 [i915]
[ 170.539935] i915_gem_do_execbuffer+0x955/0xe70 [i915]
[ 170.539995] ? i915_gem_gtt_pwrite_fast+0x128/0x440 [i915]
[ 170.539999] ? __kmalloc_node+0x30d/0x320
[ 170.540039] i915_gem_execbuffer2_ioctl+0x1eb/0x3d0 [i915]
[ 170.540082] ? i915_gem_madvise_ioctl+0x120/0x2c0 [i915]
[ 170.540122] ? i915_gem_execbuffer_ioctl+0x2d0/0x2d0 [i915]
[ 170.540147] drm_ioctl_kernel+0xae/0xf0 [drm]
[ 170.540164] drm_ioctl+0x234/0x3d0 [drm]
[ 170.540205] ? i915_gem_execbuffer_ioctl+0x2d0/0x2d0 [i915]
[ 170.540210] ? __switch_to_asm+0x34/0x70
[ 170.540212] ? __switch_to_asm+0x40/0x70
[ 170.540216] do_vfs_ioctl+0x458/0x6d0
[ 170.540219] ? __schedule+0x2e0/0x760
[ 170.540222] ksys_ioctl+0x67/0x90
[ 170.540225] __x64_sys_ioctl+0x1a/0x20
[ 170.540229] do_syscall_64+0x57/0x1b0
[ 170.540232] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 170.540234] RIP: 0033:0x7fdcd8015f47
[ 170.540237] Code: 00 00 00 48 8b 05 51 6f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
[ 170.540238] RSP: 002b:00007fffcb122f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 170.540241] RAX: ffffffffffffffda RBX: 00007fdcd8d68040 RCX: 00007fdcd8015f47
[ 170.540242] RDX: 00007fffcb122fb0 RSI: 0000000040406469 RDI: 0000000000000004
[ 170.540244] RBP: 00007fffcb122fb0 R08: 0000000000000004 R09: 0000000000000001
[ 170.540245] R10: 00007fdcd8d68040 R11: 0000000000000246 R12: 0000000040406469
[ 170.540247] R13: 0000000000000004 R14: 00007fdcd580df40 R15: 00007fdcd8d8d0e8
[ 170.540249] Modules linked in: tcp_diag inet_diag intel_rapl_msr mei_hdcp snd_sof_pci intel_rapl_common snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda intel_telemetry_pltdrv snd_sof intel_punit_ipc intel_telemetry_core intel_pmc_ipc x86_pkg_temp_thermal coretemp kvm_intel snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi ledtrig_audio kvm intel_cstate intel_rapl_perf snd_soc_core snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_hda_codec uvcvideo snd_hda_core joydev snd_usb_audio snd_usbmidi_lib snd_hwdep snd_seq_midi videobuf2_vmalloc snd_seq_midi_event snd_rawmidi videobuf2_memops snd_seq snd_pcm videobuf2_v4l2 videobuf2_common videodev input_leds hid_multitouch mc snd_seq_device intel_lpss_pci snd_timer lpc_ich intel_lpss idma64 virt_dma intel_xhci_usb_role_switch roles snd soundcore mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
[ 170.540290] scsi_transport_iscsi autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c hid_generic usbhid hid crct10dif_pclmul crc32_pclmul mmc_block ghash_clmulni_intel i915 drm_kms_helper aesni_intel crypto_simd syscopyarea cryptd glue_helper igb sysfillrect sysimgblt dca fb_sys_fops sdhci_pci ahci cqhci i2c_algo_bit sdhci libahci drm video pinctrl_broxton pinctrl_intel
[ 170.540378] ---[ end trace 1392097859148919 ]---
[ 170.540383] RIP: 0010:kmem_cache_alloc+0x85/0x220
[ 170.540385] Code: 65 49 8b 50 08 65 4c 03 05 e8 75 b6 49 4d 8b 30 4d 85 f6 0f 84 7c 01 00 00 41 8b 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 bd
[ 170.540387] RSP: 0018:ffffbba840747970 EFLAGS: 00010202
[ 170.540389] RAX: 16f888bb12336eba RBX: 16f888bb12336eba RCX: 0000000000027e8a
[ 170.540390] RDX: 0000000000027e89 RSI: 0000000000000cc0 RDI: 00000000000324f0
[ 170.540392] RBP: ffffbba8407479a0 R08: ffff90597bc324f0 R09: ffff905977ea8148
[ 170.540394] R10: 0000000000000cc0 R11: ffff90596ee80000 R12: 0000000000000cc0
[ 170.540395] R13: ffff90597b19b340 R14: 16f888bb12336eba R15: ffff90597b19b340
[ 170.540397] FS: 00007fdcd8d90740(0000) GS:ffff90597bc00000(0000) knlGS:0000000000000000
[ 170.540399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 170.540400] CR2: 00007fc770dfd0ec CR3: 0000000176e4a000 CR4: 00000000003406f0
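To tell quickly which of the three failure types a given freeze belongs to, a small log scanner can help. This is only a sketch: the patterns are copied from the excerpts in this report, the A/B/C labels follow the list above, and `classify_kernel_log` is a hypothetical name.

```python
import re

# Signatures copied from the three excerpts in this report.
SIGNATURES = {
    "A": re.compile(r"GPU HANG: ecode"),
    "B": re.compile(
        r"refcount_t: saturated; leaking memory|BUG: unable to handle page fault"
    ),
    "C": re.compile(r"general protection fault"),
}


def classify_kernel_log(lines):
    """Return the sorted list of failure types (A, B, C) found in the log lines."""
    found = set()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(label)
    return sorted(found)
```

It accepts any iterable of lines, so it can be fed an open log file (for example a saved `dmesg` output) directly.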