kernel BUG at mm/memory.c:2183 when media tests run in parallel
#3468 (closed) has roughly the same backtrace, but because that issue covers many other problems and platforms, I'm filing this one separately. Feel free to mark it as a duplicate of #3468 (closed) if that seems appropriate. In my case, however, this happens only on GEN9 Atom machines, not on GEN9 Core machines (#3468 (closed) is also tagged for Core machines).
Setup
- HW: BXT J4205
- OS: ClearLinux & Ubuntu
- SW: Git versions of the drm-tip kernel, Mesa, Weston, and Xwayland, plus a slightly older Git version of the media stack and FFmpeg
Test-case triggering the BUG:
- Run 50 parallel instances of the following command:
ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -an -y 0030_HD22_1.0.h264
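One possible way to launch the parallel instances is sketched below. This is not the harness used for the report: `run_parallel` is a hypothetical helper name, and the sketch assumes a POSIX shell. It starts N background copies of a command, waits for all of them, and reports how many exited with a non-zero status.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test setup):
# start COUNT copies of a command in the background, wait for all
# of them, and print how many exited with a non-zero status.
run_parallel() {
    count=$1
    shift
    pids=""
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" &                      # launch one instance in the background
        pids="$pids $!"             # remember its PID for the wait below
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "instances=$count failed=$failed"
    [ "$failed" -eq 0 ]             # return success only if every instance succeeded
}

# Example invocation with the command from this report (commented out so
# that sourcing this file does not start any transcoding):
# run_parallel 50 ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 \
#     -c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 \
#     -c:v h264_qsv -b:v 800K -vf scale_qsv=w=352:h=240,fps=15 \
#     -compression_level 4 -an -y 0030_HD22_1.0.h264
```

As in the test case above, every instance writes to the same output file. A non-zero `failed` count only means that some ffmpeg processes exited abnormally; the kernel log still has to be checked for the BUG itself.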
Because of #3457 (closed), I'm not sure when this started. #3468 (closed) was filed two weeks ago, which is also when test execution started timing out. Last week, SSH connections to the test machines started dropping as well.
Dmesg output:
[10887.466150] kernel BUG at mm/memory.c:2183!
[10887.466162] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[10887.466168] CPU: 0 PID: 7775 Comm: ffmpeg Tainted: G U 5.13.0-rc3-CI-Nightly #1
[10887.466174] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017
[10887.466177] RIP: 0010:remap_pfn_range_notrack+0x30f/0x440
[10887.466188] Code: e8 96 d7 e0 ff 84 c0 0f 84 27 01 00 00 48 ba 00 f0 ff ff ff ff 0f 00 4c 89 e0 48 c1 e0 0c 4d 85 ed 75 96 48 21 d0 31 f6 eb a9 <0f> 0b 48 39 37 0f 85 0e 01 00 00 48 8b 0c 24 48 39 4f 08 0f 85 00
[10887.466193] RSP: 0018:ffffc90006e33c50 EFLAGS: 00010286
[10887.466198] RAX: 800000000000002f RBX: 00007f5e01800000 RCX: 0000000000000028
[10887.466201] RDX: 0000000000000001 RSI: ffffea0000000000 RDI: 0000000000000000
[10887.466204] RBP: ffffea000033fea8 R08: 800000000000002f R09: ffff8881072256e0
[10887.466207] R10: ffffc9000b84fff8 R11: 0000000017dab000 R12: 0000000000089f9f
[10887.466210] R13: 800000000000002f R14: 00007f5e017e4000 R15: ffff88800cffaf20
[10887.466213] FS: 00007f5e04849640(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
[10887.466216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10887.466220] CR2: 00007fd9b191a2ac CR3: 00000001829ac000 CR4: 00000000003506f0
[10887.466223] Call Trace:
[10887.466233] vm_fault_gtt+0x1ca/0x5d0 [i915]
[10887.466381] ? ktime_get+0x38/0x90
[10887.466389] __do_fault+0x37/0x90
[10887.466395] __handle_mm_fault+0xc46/0x1200
[10887.466402] handle_mm_fault+0xce/0x2a0
[10887.466407] do_user_addr_fault+0x1c5/0x660
[10887.466412] ? exit_to_user_mode_prepare+0x134/0x160
[10887.466419] exc_page_fault+0x63/0x130
[10887.466427] ? asm_exc_page_fault+0x8/0x30
[10887.466433] asm_exc_page_fault+0x1e/0x30
[10887.466438] RIP: 0033:0x7f5e06b40051
[10887.466443] Code: eb 17 0f 1f 80 00 00 00 00 41 83 c7 01 83 c2 01 44 39 f9 0f 84 de 00 00 00 89 d0 25 ff 01 00 00 48 69 c0 e0 02 00 00 48 01 f8 <48> 39 b0 08 01 00 00 74 12 83 bb 0c 04 00 00 01 75 cd 48 3b b0 40
[10887.466448] RSP: 002b:00007f5e04848790 EFLAGS: 00010206
[10887.466452] RAX: 00007f5e017f6828 RBX: 000055ca6495d690 RCX: 0000000000000004
[10887.466455] RDX: 0000000000000067 RSI: 000055ca649220b0 RDI: 00007f5e017e4008
[10887.466458] RBP: 00007f5e04848a70 R08: 000055ca6495d690 R09: 00007f5e088075f0
[10887.466460] R10: 00000000012cb860 R11: 0000000000000246 R12: 000055ca64921170
[10887.466463] R13: 000055ca64921170 R14: 000055ca648229d0 R15: 0000000000000000
[10887.466468] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel snd_intel_dspcfg crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm realtek i2c_i801 mei_me lpc_ich i2c_smbus mei pinctrl_broxton
[10887.466511] ---[ end trace e0363311618c484b ]---
[10887.466517] RIP: 0010:remap_pfn_range_notrack+0x30f/0x440
[10887.466524] Code: e8 96 d7 e0 ff 84 c0 0f 84 27 01 00 00 48 ba 00 f0 ff ff ff ff 0f 00 4c 89 e0 48 c1 e0 0c 4d 85 ed 75 96 48 21 d0 31 f6 eb a9 <0f> 0b 48 39 37 0f 85 0e 01 00 00 48 8b 0c 24 48 39 4f 08 0f 85 00
[10887.466528] RSP: 0018:ffffc90006e33c50 EFLAGS: 00010286
[10887.466532] RAX: 800000000000002f RBX: 00007f5e01800000 RCX: 0000000000000028
[10887.466535] RDX: 0000000000000001 RSI: ffffea0000000000 RDI: 0000000000000000
[10887.466538] RBP: ffffea000033fea8 R08: 800000000000002f R09: ffff8881072256e0
[10887.466541] R10: ffffc9000b84fff8 R11: 0000000017dab000 R12: 0000000000089f9f
[10887.466543] R13: 800000000000002f R14: 00007f5e017e4000 R15: ffff88800cffaf20
[10887.466547] FS: 00007f5e04849640(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
[10887.466551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10887.466554] CR2: 00007fd9b191a2ac CR3: 00000001829ac000 CR4: 00000000003506f0
[10887.466558] note: ffmpeg[7775] exited with preempt_count 1
[10887.470646] BUG: unable to handle page fault for address: ffffc90006e33d30
[10887.470661] #PF: supervisor read access in kernel mode
[10887.470665] #PF: error_code(0x0000) - not-present page
[10887.470669] PGD 100000067 P4D 100000067 PUD 1001b9067 PMD 112c53067 PTE 0
[10887.470678] Oops: 0000 [#2] PREEMPT SMP PTI
[10887.470683] CPU: 0 PID: 7541 Comm: ffmpeg Tainted: G UD 5.13.0-rc3-CI-Nightly #1
[10887.470688] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.40 07/14/2017
[10887.470691] RIP: 0010:__ww_mutex_lock.isra.0+0x57d/0x7e0
[10887.470701] Code: fc ff ff e8 2c 9f 50 ff e9 20 fc ff ff f6 c2 04 0f 84 72 fb ff ff 48 89 d1 83 e1 03 e9 8e fb ff ff 48 85 c0 74 0d 49 8b 57 08 <48> 2b 50 08 48 85 d2 7f 23 48 8b 44 24 48 49 39 c5 75 12 e9 de fd
[10887.470706] RSP: 0018:ffffc9000697b9e0 EFLAGS: 00010286
[10887.470710] RAX: ffffc90006e33d28 RBX: ffff888182882b80 RCX: ffff88817af6b800
[10887.470713] RDX: 00000000002dc341 RSI: ffff88800cc83210 RDI: ffff888182882b80
[10887.470716] RBP: ffffc9000697ba80 R08: 0000000000000000 R09: ffff8880426f8000
[10887.470719] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800cc83200
[10887.470722] R13: ffff88800cc83210 R14: ffffc9000697ba20 R15: ffffc9000697bbe0
[10887.470725] FS: 00007f5e08a06b80(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
[10887.470729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10887.470732] CR2: ffffc90006e33d30 CR3: 00000001829ac000 CR4: 00000000003506f0
[10887.470735] Call Trace:
[10887.470745] eb_validate_vmas+0x1e6/0x640 [i915]
[10887.470894] ? intel_timeline_pin+0xa1/0xe0 [i915]
[10887.471010] i915_gem_do_execbuffer+0xbbf/0x2010 [i915]
[10887.471129] ? update_load_avg+0x78/0x650
[10887.471139] ? _raw_spin_unlock_irqrestore+0x1b/0x30
[10887.471146] ? try_to_wake_up+0x7a/0x4a0
[10887.471151] ? asm_sysvec_call_function_single+0x12/0x20
[10887.471157] ? wake_page_function+0x59/0x90
[10887.471163] ? __wake_up_common+0x7a/0x140
[10887.471171] i915_gem_execbuffer2_ioctl+0x106/0x250 [i915]
[10887.471292] ? i915_gem_do_execbuffer+0x2010/0x2010 [i915]
[10887.471409] drm_ioctl_kernel+0xaa/0xf0
[10887.471419] drm_ioctl+0x1ec/0x390
[10887.471425] ? i915_gem_do_execbuffer+0x2010/0x2010 [i915]
[10887.471543] ? __schedule+0x255/0x8a0
[10887.471548] ? tracing_record_taskinfo_skip+0x4e/0x50
[10887.471554] __x64_sys_ioctl+0x72/0xb0
[10887.471560] do_syscall_64+0x40/0xb0
[10887.471567] entry_SYSCALL_64_after_hwframe+0x44/0xae
[10887.471573] RIP: 0033:0x7f5e091ab8ab
[10887.471579] Code: 4c 89 e0 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 75 0d 00 f7 d8 64 89 01 48
[10887.471583] RSP: 002b:00007fffe0c79708 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[10887.471589] RAX: ffffffffffffffda RBX: 00007fffe0c797a0 RCX: 00007f5e091ab8ab
[10887.471592] RDX: 00007fffe0c797a0 RSI: 00000000c0406469 RDI: 0000000000000005
[10887.471595] RBP: 00000000c0406469 R08: 000055ca648b0620 R09: 0000000000000000
[10887.471598] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ca64827b40
[10887.471601] R13: 0000000000000005 R14: 000055ca65679280 R15: 0000000000000000
[10887.471606] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul snd_hda_intel snd_intel_dspcfg crc32_pclmul snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm realtek i2c_i801 mei_me lpc_ich i2c_smbus mei pinctrl_broxton
[10887.471639] CR2: ffffc90006e33d30
[10887.471644] ---[ end trace e0363311618c484c ]---
[10887.471648] RIP: 0010:remap_pfn_range_notrack+0x30f/0x440
[10887.471655] Code: e8 96 d7 e0 ff 84 c0 0f 84 27 01 00 00 48 ba 00 f0 ff ff ff ff 0f 00 4c 89 e0 48 c1 e0 0c 4d 85 ed 75 96 48 21 d0 31 f6 eb a9 <0f> 0b 48 39 37 0f 85 0e 01 00 00 48 8b 0c 24 48 39 4f 08 0f 85 00
[10887.471659] RSP: 0018:ffffc90006e33c50 EFLAGS: 00010286
[10887.471663] RAX: 800000000000002f RBX: 00007f5e01800000 RCX: 0000000000000028
[10887.471666] RDX: 0000000000000001 RSI: ffffea0000000000 RDI: 0000000000000000
[10887.471669] RBP: ffffea000033fea8 R08: 800000000000002f R09: ffff8881072256e0
[10887.471672] R10: ffffc9000b84fff8 R11: 0000000017dab000 R12: 0000000000089f9f
[10887.471675] R13: 800000000000002f R14: 00007f5e017e4000 R15: ffff88800cffaf20
[10887.471678] FS: 00007f5e08a06b80(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
[10887.471682] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10887.471685] CR2: ffffc90006e33d30 CR3: 00000001829ac000 CR4: 00000000003506f0
[10887.471689] note: ffmpeg[7541] exited with preempt_count 2
[10947.718711] rcu: INFO: rcu_preempt self-detected stall on CPU
[10947.718740] rcu: 3-....: (60000 ticks this GP) idle=5ca/1/0x4000000000000000 softirq=1428744/1428744 fqs=14877
...
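To get a quick overview of which faults a long log like the one above contains, a small filter can help. This is a minimal sketch, assuming the kernel log has been saved to a file; `summarize_oops` is a hypothetical helper name. It keeps only the BUG, Oops, and RCU stall headers plus the kernel-mode `RIP:` lines (code segment `0010`).

```shell
#!/bin/sh
# Hypothetical helper: print only the fault headers and kernel-mode
# RIP lines from a saved kernel log, dropping register dumps and
# user-space RIP lines (code segment 0033).
summarize_oops() {
    grep -E 'kernel BUG at|BUG: unable to handle|Oops:|RIP: 0010:|rcu: INFO:' "$1"
}

# Example (commented out; dmesg.txt is an assumed file name):
# dmesg > dmesg.txt
# summarize_oops dmesg.txt
```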