[CI][BAT] [ICL only] igt@* - incomplete - timeout/system hang?
Submitted by Martin Peres @mupuf
Assigned to LAKSHMINARAYANA VUDUM @l4kshmi
Link to original bug (#107713)
- Bugzilla Migration User added CI feature: display/Other platform: ICL priority::high severity::normal + 1 deleted label
Martin Peres@mupuf
said: Increasing the priority of ICL bugs.
Martin Peres@mupuf
said: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_109/fi-icl-u/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-90.html
<4>[ 114.464229] WARNING: CPU: 5 PID: 2382 at kernel/rcu/tree_plugin.h:342 rcu_note_context_switch+0x6d/0x670
<4>[ 114.464232] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 ax88179_178a usbnet mii x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core e1000e snd_pcm prime_numbers
<4>[ 114.464261] CPU: 5 PID: 2382 Comm: kms_rotation_cr Tainted: G UD W 4.19.0-rc3-g5fd5d0c19c4a-drmtip_109+ #1
<4>[ 114.464268] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2313.A01.1808012121 08/01/2018
<4>[ 114.464274] RIP: 0010:rcu_note_context_switch+0x6d/0x670
<4>[ 114.464279] Code: c9 74 10 44 8b 83 7c 08 00 00 45 85 c0 0f 84 c7 01 00 00 40 84 ed 8b 83 78 03 00 00 0f 85 ef 00 00 00 85 c0 0f 8e ef 00 00 00 <0f> 0b 80 bb 7c 03 00 00 00 0f 84 fa 01 00 00 e8 6f d4 ff ff e8 fa
<4>[ 114.464285] RSP: 0018:ffffa2b94074b400 EFLAGS: 00010002
<4>[ 114.464292] RAX: 0000000000000001 RBX: ffff9541a18d0040 RCX: 0000000080000003
<4>[ 114.464299] RDX: 0000000000000002 RSI: ffffffffa010405f RDI: ffffffffa00a3a47
<4>[ 114.464303] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
<4>[ 114.464308] R10: ffffa2b94074b280 R11: ffffffffa0246be0 R12: ffff9541a18d0040
<4>[ 114.464313] R13: ffff9541f0761dd8 R14: 0000000000021dc0 R15: 0000000000000000
<4>[ 114.464320] FS: 00007fb1180ff980(0000) GS:ffff9541f0740000(0000) knlGS:0000000000000000
<4>[ 114.464325] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 114.464330] CR2: 00007fb10a9e8000 CR3: 00000004a165c004 CR4: 0000000000760ee0
<4>[ 114.464336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 114.464341] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 114.464345] PKRU: 55555554
<4>[ 114.464351] Call Trace:
<4>[ 114.464359] __schedule+0xbc/0xb40
<4>[ 114.464364] ? wait_for_common+0x116/0x1f0
<4>[ 114.464370] schedule+0x2d/0x90
<4>[ 114.464374] schedule_timeout+0x236/0x4f0
<4>[ 114.464379] ? lock_acquire+0xa6/0x1c0
<4>[ 114.464383] ? wait_for_common+0x48/0x1f0
<4>[ 114.464389] ? wait_for_common+0x116/0x1f0
<4>[ 114.464393] wait_for_common+0x13a/0x1f0
<4>[ 114.464398] ? wake_up_q+0x70/0x70
<4>[ 114.464404] virt_efi_set_variable+0x11b/0x170
<4>[ 114.464410] ? efi_call_virt_check_flags+0x80/0x80
<4>[ 114.464418] efivar_entry_set_safe+0xea/0x1d0
<4>[ 114.464426] ? efi_pstore_write+0x105/0x150
<4>[ 114.464433] ? efi_pstore_write+0xa2/0x150
<4>[ 114.464439] efi_pstore_write+0x105/0x150
<4>[ 114.464450] pstore_dump+0x12b/0x350
<4>[ 114.464461] kmsg_dump+0x87/0x1c0
<4>[ 114.464467] oops_end+0x3e/0x90
<4>[ 114.464473] general_protection+0x1e/0x30
<4>[ 114.464480] RIP: 0010:__list_del_entry_valid+0x25/0x90
<4>[ 114.464485] Code: c3 0f 1f 40 00 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2e <48> 8b 32 48 39 fe 75 3a 48 8b 50 08 48 39 f2 75 48 b8 01 00 00 00
<4>[ 114.464491] RSP: 0018:ffffa2b94074ba38 EFLAGS: 00010002
<4>[ 114.464499] RAX: fffff5d4d250f988 RBX: 0000000000000000 RCX: dead000000000200
<4>[ 114.464504] RDX: fffb9541ee800480 RSI: 00000000ffffffff RDI: fffff5d4d2998d08
<4>[ 114.464510] RBP: ffffa2b94074bb00 R08: 00000000f33181c6 R09: ffff9541e6635e48
<4>[ 114.464514] R10: fffff5d4d2998d08 R11: 0000000000000000 R12: 0000000000000008
<4>[ 114.464518] R13: fffff5d4d250f980 R14: fffff5d4d2998d00 R15: ffff9541ee80f880
<4>[ 114.464527] get_partial_node.isra.29+0x178/0x460
<4>[ 114.464534] ? __lock_acquire+0x3c8/0x1b50
<4>[ 114.464541] ? ___slab_alloc.constprop.34+0x1af/0x390
<4>[ 114.464546] ___slab_alloc.constprop.34+0x1af/0x390
<4>[ 114.464603] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464612] ? lock_acquire+0xa6/0x1c0
<4>[ 114.464655] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464668] ? __slab_alloc.isra.27.constprop.33+0x3d/0x70
<4>[ 114.464674] __slab_alloc.isra.27.constprop.33+0x3d/0x70
<4>[ 114.464716] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464723] kmem_cache_alloc_trace+0x228/0x290
<4>[ 114.464764] i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464812] ____i915_gem_object_get_pages+0x1d/0xa0 [i915]
<4>[ 114.464854] __i915_gem_object_get_pages+0x59/0xb0 [i915]
<4>[ 114.464895] i915_gem_set_domain_ioctl+0x35e/0x430 [i915]
<4>[ 114.464937] ? i915_gem_obj_prepare_shmem_write+0x280/0x280 [i915]
<4>[ 114.464944] drm_ioctl_kernel+0x7c/0xf0
<4>[ 114.464950] drm_ioctl+0x2e6/0x3a0
<4>[ 114.464993] ? i915_gem_obj_prepare_shmem_write+0x280/0x280 [i915]
<4>[ 114.465003] ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4>[ 114.465009] do_vfs_ioctl+0xa0/0x6d0
<4>[ 114.465014] ? __task_pid_nr_ns+0xb9/0x1f0
<4>[ 114.465020] ksys_ioctl+0x35/0x60
<4>[ 114.465025] __x64_sys_ioctl+0x11/0x20
<4>[ 114.465029] do_syscall_64+0x55/0x190
<4>[ 114.465034] entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[ 114.465037] RIP: 0033:0x7fb1177b45d7
<4>[ 114.465041] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4>[ 114.465044] RSP: 002b:00007fff9561df58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[ 114.465049] RAX: ffffffffffffffda RBX: 000055f8561e7f74 RCX: 00007fb1177b45d7
<4>[ 114.465052] RDX: 00007fff9561dfac RSI: 00000000400c645f RDI: 0000000000000003
<4>[ 114.465056] RBP: 00007fff9561dfac R08: 000055f8561e7f78 R09: 000055f8561e7f74
<4>[ 114.465061] R10: 00000000ffffffd9 R11: 0000000000000246 R12: 00000000400c645f
<4>[ 114.465066] R13: 0000000000000003 R14: 000055f8561e7f88 R15: 0000000000000000
<4>[ 114.465076] irq event stamp: 38933664
<4>[ 114.465082] hardirqs last enabled at (38933663): [<ffffffff9f0025ed>] do_syscall_64+0xd/0x190
<4>[ 114.465089] hardirqs last disabled at (38933664): [<ffffffff9f1fdea9>] __slab_alloc.isra.27.constprop.33+0x19/0x70
<4>[ 114.465096] softirqs last enabled at (38933496): [<ffffffff9fc0031d>] __do_softirq+0x31d/0x483
<4>[ 114.465102] softirqs last disabled at (38933489): [<ffffffff9f0901f9>] irq_exit+0xa9/0xc0
<4>[ 114.465109] WARNING: CPU: 5 PID: 2382 at kernel/rcu/tree_plugin.h:342 rcu_note_context_switch+0x6d/0x670
<4>[ 114.465115] ---[ end trace f5b9d6df661f800e ]---
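A minimal sketch for checking the register dump of the `__list_del_entry_valid` oops above, which Chris Wilson reads below as slab memory corruption. Reading the `Code:` bytes, `48 b9 00 01 00 00 00 00 ad de` loads the list-poison constant 0xdead000000000100 into RCX for comparison, so RCX=dead000000000200 is simply the second poison constant being compared. The faulting instruction `<48> 8b 32` appears to be `mov (%rdx),%rsi`, i.e. a dereference of the entry's prev pointer held in RDX. On x86-64, dereferencing a non-canonical address raises a general protection fault rather than a page fault, which matches the `general_protection` frame. The sketch assumes 48-bit virtual addresses (4-level paging) and the default x86-64 `CONFIG_ILLEGAL_POINTER_VALUE` of 0xdead000000000000; the helper names are hypothetical.

```python
# Hypothetical helpers for triaging x86-64 oops register dumps.
# Assumptions: 48-bit virtual addresses (4-level paging) and the default
# x86-64 CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000.

VA_BITS = 48
LIST_POISON1 = 0xDEAD000000000100  # stored in entry->next by list_del()
LIST_POISON2 = 0xDEAD000000000200  # stored in entry->prev by list_del()


def noncanonical_bits(addr, va_bits=VA_BITS):
    """Return the bit positions in 63..va_bits that disagree with the sign bit.

    A canonical address has bits 63..va_bits all equal to bit va_bits-1,
    so an empty list means the address is canonical.
    """
    sign = (addr >> (va_bits - 1)) & 1
    return [b for b in range(va_bits, 64) if ((addr >> b) & 1) != sign]


def describe(value):
    """Give a short classification of a 64-bit register value."""
    if value == LIST_POISON1:
        return "LIST_POISON1"
    if value == LIST_POISON2:
        return "LIST_POISON2"
    bad = noncanonical_bits(value)
    if bad:
        return f"non-canonical, bits {bad} disagree with bit {VA_BITS - 1}"
    return "canonical"


# Register values copied from the __list_del_entry_valid frame above.
regs = {
    "RAX": 0xFFFFF5D4D250F988,
    "RCX": 0xDEAD000000000200,
    "RDX": 0xFFFB9541EE800480,
    "RDI": 0xFFFFF5D4D2998D08,
}

for name, value in regs.items():
    print(f"{name}: {value:016x} -> {describe(value)}")
```

Only RDX comes out non-canonical, and it breaks the sign-extension pattern in a single bit (bit 50). That fits the corruption reading, but by itself it does not say what overwrote the pointer.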
But this one is even funkier (enjoy making sense of this!):
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_109/fi-icl-u/igt@perf@invalid-oa-exponent.html
<4>[ 127.333330] NMI backtrace for cpu 0
[...]
<3>[ 131.214804] [drm:gen8_de_irq_handler [i915]] ERROR Fault errors on pipe A: 0x00000180
[...]
<3>[ 132.630167] [drm:gen8_de_irq_handler [i915]] ERROR The master control interrupt lied (DE PIPE)!
Chris Wilson@ickle
said: (In reply to Martin Peres from comment 3)
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_109/fi-icl-u/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-90.html
<4>[ 114.464229] WARNING: CPU: 5 PID: 2382 at kernel/rcu/tree_plugin.h:342 rcu_note_context_switch+0x6d/0x670
...
<4>[ 114.464480] RIP: 0010:__list_del_entry_valid+0x25/0x90
<4>[ 114.464485] Code: c3 0f 1f 40 00 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57 08 48 39 c8 74 26 48 b9 00 02 00 00 00 00 ad de 48 39 ca 74 2e <48> 8b 32 48 39 fe 75 3a 48 8b 50 08 48 39 f2 75 48 b8 01 00 00 00
<4>[ 114.464491] RSP: 0018:ffffa2b94074ba38 EFLAGS: 00010002
<4>[ 114.464499] RAX: fffff5d4d250f988 RBX: 0000000000000000 RCX: dead000000000200
<4>[ 114.464504] RDX: fffb9541ee800480 RSI: 00000000ffffffff RDI: fffff5d4d2998d08
<4>[ 114.464510] RBP: ffffa2b94074bb00 R08: 00000000f33181c6 R09: ffff9541e6635e48
<4>[ 114.464514] R10: fffff5d4d2998d08 R11: 0000000000000000 R12: 0000000000000008
<4>[ 114.464518] R13: fffff5d4d250f980 R14: fffff5d4d2998d00 R15: ffff9541ee80f880
<4>[ 114.464527] get_partial_node.isra.29+0x178/0x460
<4>[ 114.464534] ? __lock_acquire+0x3c8/0x1b50
<4>[ 114.464541] ? ___slab_alloc.constprop.34+0x1af/0x390
<4>[ 114.464546] ___slab_alloc.constprop.34+0x1af/0x390
<4>[ 114.464603] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464612] ? lock_acquire+0xa6/0x1c0
<4>[ 114.464655] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464668] ? __slab_alloc.isra.27.constprop.33+0x3d/0x70
<4>[ 114.464674] __slab_alloc.isra.27.constprop.33+0x3d/0x70
<4>[ 114.464716] ? i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
<4>[ 114.464723] kmem_cache_alloc_trace+0x228/0x290
<4>[ 114.464764] i915_gem_object_get_pages_gtt+0xa4/0x620 [i915]
This is indicative of severe memory corruption: someone has overwritten the slabs.
The system is toast and we can expect all manner of wacky bugs. A few KASAN runs are required, maybe even followed by kmemcheck. If they all check out, I'm afraid we're telling the GPU to do funky things (unless we can pin it on some other hw).
Martin Peres@mupuf
said: Another suspend test failing: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-icl-u/igt@kms_plane@plane-panning-bottom-right-suspend-pipe-c-planes.html
Martin Peres@mupuf
said: Look, ma, another one!
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_112/fi-icl-u/igt@kms_vblank@pipe-a-ts-continuation-dpms-suspend.html
Jani Saarinen@jani.saarinen
said: Same for 117:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_117/fi-icl-u/igt@kms_vblank@pipe-c-ts-continuation-dpms-suspend.html
Francesco Balestrieri@baleboy
said: Chris asked for "A few kasan runs required, maybe followed by even a kmemcheck". Who can help with that?
Martin Peres@mupuf
closed a related bug: *** Bug 107901 has been marked as a duplicate of this bug. ***
Jani Saarinen@jani.saarinen
said: Waiting for a new CI run with the new FW.
Martin Peres@mupuf
said: (In reply to Jani Saarinen from comment 10)
Waiting for a new CI run with the new FW.
Seems like it fixed the S3 issue. Now we are hitting more issues. This one looks scary:
<1>[ 452.562270] BUG: unable to handle kernel paging request at 00000001000102cd
<6>[ 452.562273] PGD 0 P4D 0
<4>[ 452.562277] Oops: 0002 [#2] PREEMPT SMP PTI
<4>[ 452.562280] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G UD W 4.19.0-rc8-g53197f72a64b-drmtip_131+ #1
<4>[ 452.562281] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.2352.A01.1808281852 08/28/2018
<4>[ 452.562291] RIP: 0010:expire_timers+0x69/0x190
<4>[ 452.562293] Code: 37 00 0f 1f 44 00 00 65 8b 05 93 5f ef 47 89 c0 49 0f a3 45 00 0f 82 b2 00 00 00 48 8b 03 48 8b 53 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 f6 43 22 20 48 c7 43 08 00 00 00 00 48 89 ef 4c 89 33
<4>[ 452.562295] RSP: 0018:ffffa10470683ed0 EFLAGS: 00010006
<4>[ 452.562297] RAX: 00000001000102c5 RBX: ffffb4e001a0fdb8 RCX: 0000000000000103
<4>[ 452.562299] RDX: ffffa10470683f08 RSI: 0000000000000001 RDI: 00000000ffffffff
<4>[ 452.562300] RBP: ffffa10470699980 R08: 0000000000000000 R09: 0000000000000000
<4>[ 452.562301] R10: ffffa10470683ed0 R11: ffffffffba744000 R12: ffffa10470683f08
<4>[ 452.562303] R13: ffffffffb934d3b0 R14: dead000000000200 R15: 0000000000000002
<4>[ 452.562304] FS: 0000000000000000(0000) GS:ffffa10470680000(0000) knlGS:0000000000000000
<4>[ 452.562306] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 452.562307] CR2: 00000001000102cd CR3: 000000036d210006 CR4: 0000000000760ee0
<4>[ 452.562308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 452.562310] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 452.562311] PKRU: 55555554
<4>[ 452.562312] Call Trace:
<4>[ 452.562313] <IRQ>
<4>[ 452.562317] run_timer_softirq+0xc7/0x170
<4>[ 452.562321] ? recalibrate_cpu_khz+0x10/0x10
<4>[ 452.562323] ? ktime_get+0x84/0x100
<4>[ 452.562327] __do_softirq+0xd8/0x483
<4>[ 452.562332] irq_exit+0xa9/0xc0
<4>[ 452.562334] smp_apic_timer_interrupt+0x9c/0x240
<4>[ 452.562336] apic_timer_interrupt+0xf/0x20
<4>[ 452.562338] </IRQ>
<4>[ 452.562341] RIP: 0010:cpuidle_enter_state+0xab/0x340
<4>[ 452.562343] Code: 44 00 00 31 ff e8 65 dd 93 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 70 02 00 00 31 ff e8 1e 81 9a ff e8 e9 52 9e ff fb 4c 29 fb <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
<4>[ 452.562344] RSP: 0018:ffffb4e00010fe90 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
<4>[ 452.562347] RAX: 0000000000000000 RBX: 0000000006143a37 RCX: 000000000000001f
<4>[ 452.562348] RDX: 000000695ed012c2 RSI: ffffffffb9080d10 RDI: ffffffffb8782447
<4>[ 452.562349] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
<4>[ 452.562351] R10: ffffb4e00010fe70 R11: ffffffffba6d7cc8 R12: ffffa1046cfa4a68
<4>[ 452.562352] R13: ffffffffb929c538 R14: 0000000000000000 R15: 0000006958bbd88b
<4>[ 452.562357] ? cpuidle_enter_state+0xa7/0x340
<4>[ 452.562362] do_idle+0x1f3/0x260
<4>[ 452.562365] cpu_startup_entry+0x6a/0x70
<4>[ 452.562369] start_secondary+0x19d/0x1f0
<4>[ 452.562371] secondary_startup_64+0xa4/0xb0
<4>[ 452.562376] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core btusb btrtl snd_pcm btbcm btintel e1000e cdc_ether usbnet mii bluetooth ecdh_generic prime_numbers
<0>[ 452.562410] Dumping ftrace buffer:
<0>[ 452.562412] (ftrace buffer empty)
<4>[ 452.562414] CR2: 00000001000102cd
<4>[ 452.562416] ---[ end trace 02eb919dc2c5c0a7 ]---
Jani Saarinen@jani.saarinen
said: BIOS ICLSFWR1.R00.2352 is old, should be 2392.
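A small sketch of how a triage script could flag the outdated firmware pointed out here, by pulling the BIOS build out of the oops `Hardware name:` line. It assumes the third dot-separated field of the version string (2352 in ICLSFWR1.R00.2352.A01...) is a build number that can be compared as an integer, and takes 2392 from this comment as the minimum wanted; the function names and threshold constant are illustrative, not part of any CI tooling.

```python
import re

# Illustrative threshold taken from the comment above; not an official value.
MIN_ICL_BIOS_BUILD = 2392

# Matches e.g. "BIOS ICLSFWR1.R00.2352.A01.1808281852" and captures "2352".
BIOS_RE = re.compile(r"BIOS\s+\w+\.R\d+\.(\d+)\.")


def bios_build(hardware_line):
    """Return the BIOS build number from a 'Hardware name:' line, or None."""
    m = BIOS_RE.search(hardware_line)
    return int(m.group(1)) if m else None


def bios_outdated(hardware_line, minimum=MIN_ICL_BIOS_BUILD):
    """True if the line names a BIOS build older than `minimum`."""
    build = bios_build(hardware_line)
    return build is not None and build < minimum


line = ("Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U "
        "DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.2352.A01.1808281852 08/28/2018")
print(bios_build(line), bios_outdated(line))
```

As the next comment shows, failures continued on 2392, so a check like this only filters out one source of noise rather than explaining the hangs.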
LAKSHMINARAYANA VUDUM@l4kshmi
said: (In reply to Jani Saarinen from comment 12)
BIOS ICLSFWR1.R00.2352 is old, should be 2392.
Still failing with BIOS ICLSFWR1.R00.2392.A04.1809260455 09/26/2018:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_134/fi-icl-u/igt@kms_frontbuffer_tracking@fbc-suspend.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5042/fi-icl-u/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html
Martin Peres@mupuf
said: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-icl-u/igt@kms_cursor_legacy@pipe-a-single-move.html
No real indication as to what went wrong...
Francesco Balestrieri@baleboy
said: Waiting for Imre's power-well-related patches in the hope that they will help here (the logs at least mention power wells). We need to check the status once those land.
Francesco Balestrieri@baleboy
said: This should be the series: https://patchwork.freedesktop.org/series/51765/
Francesco Balestrieri@baleboy
said: Last seen in BAT three days ago.
Martin Peres@mupuf
said: No good logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5147/shard-iclb4/igt@pm_rpm@universal-planes-dpms.html
Mika Kuoppala said: Just a note that this is not the first time we have seen a null pointer dereference with bit 16 set on ICL:
[ 6151.054027] BUG: unable to handle kernel paging request at 00000001000000b4
[ 452.562270] BUG: unable to handle kernel paging request at 00000001000102cd
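A sketch for comparing the two faulting addresses quoted above: it lists the bits set in each value and separates them from a low offset, a common way to spot a "NULL plus offset" pointer carrying stray high bits. It also checks that, in the `expire_timers` oops earlier in the thread, CR2 equals RAX + 8, which fits the faulting store `<48> 89 50 08` (read here as `mov %rdx,0x8(%rax)`) writing through the bogus pointer in RAX. The 16-bit split and the helper name are illustrative choices, not anything from the original report.

```python
def set_bits(value):
    """Return the indices of all bits set in a non-negative integer."""
    return [b for b in range(value.bit_length()) if (value >> b) & 1]


# Faulting addresses quoted above, keyed by their dmesg timestamps.
faults = {
    "6151.054027": 0x00000001000000B4,
    "452.562270": 0x00000001000102CD,
}

LOW_MASK = 0xFFFF  # illustrative split: treat the low 16 bits as the "offset"

for ts, addr in faults.items():
    high = [b for b in set_bits(addr) if b >= 16]
    print(f"[{ts}] {addr:#018x}: high bits set {high}, "
          f"low offset {addr & LOW_MASK:#x}")

# From the expire_timers oops: CR2 (faulting address) vs RAX (base register).
RAX = 0x00000001000102C5
CR2 = 0x00000001000102CD
print("CR2 - RAX =", CR2 - RAX)
```

Printing the high bits side by side makes it easy to check which stray bits the two faults share before looking for a common source.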