[gfxhub0] retry page fault from Chromium use
System information
inxi -GSC -xx output:
System: Host: c1d7b983e36d Kernel: 5.4.0-45-generic x86_64 bits: 64 gcc: 7.5.0 Console: tty 0
Distro: Ubuntu 18.04.4 LTS
CPU: Dual core AMD Ryzen Embedded V1202B with Radeon Vega Gfx (-MT-MCP-) arch: Zen rev.0 cache: 1024 KB
flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm) bmips: 9183
clock speeds: min/max: 1600/2300 MHz 1: 1602 MHz 2: 1500 MHz 3: 1541 MHz 4: 1554 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
bus-ID: 05:00.0 chip-ID: 1002:15dd
Display Server: X.org 1.20.8 driver: amdgpu tty size: 270x59 Advanced Data: N/A for root out of X
- OS: Ubuntu 18.04.4 LTS
- GPU: [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] [1002:15dd] (rev 85)
- Kernel version: Linux c1d7b983e36d 5.4.0-45-generic #49~18.04.2-Ubuntu SMP Wed Aug 26 16:29:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Mesa version: OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.8
- Xserver version: X.Org X Server 1.20.8
- Desktop manager and compositor: None (Chromium in kiosk mode)
Describe the issue
This machine runs Chromium in kiosk mode 24/7. Every couple of weeks it crashes with the kernel log below:
Jan 08 19:35:35 c1d7b983e36d kernel: amdgpu 0000:05:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process chromium-browse pid 3024 thread chromium-b:cs0 pid 3162)
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: in page starting at address 0x0000800102a40000 from client 27
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: MORE_FAULTS: 0x1
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: WALKER_ERROR: 0x0
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: PERMISSION_FAULTS: 0x5
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: MAPPING_ERROR: 0x0
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: RW: 0x1
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32773, for process chromium-browse pid 3024 thread chromium-b:cs0 pid 3162)
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: in page starting at address 0x0000800102a41000 from client 27
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00341051
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: MORE_FAULTS: 0x1
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: WALKER_ERROR: 0x0
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: PERMISSION_FAULTS: 0x5
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: MAPPING_ERROR: 0x0
Jan 08 19:35:46 c1d7b983e36d kernel: amdgpu 0000:05:00.0: RW: 0x1
<snip - lines above are repeated many times>
Jan 08 19:35:56 c1d7b983e36d kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=5819802, emitted seq=5819805
Jan 08 19:35:56 c1d7b983e36d kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium-browse pid 3024 thread chromium-b:cs0 pid 3162
Jan 08 19:35:56 c1d7b983e36d kernel: amdgpu 0000:05:00.0: GPU reset begin!
Jan 08 19:35:56 c1d7b983e36d kernel: amdgpu 0000:05:00.0: GPU reset succeeded, trying to resume
Jan 08 19:35:56 c1d7b983e36d kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
Jan 08 19:35:56 c1d7b983e36d kernel: [drm] PSP is resuming...
Jan 08 19:35:56 c1d7b983e36d kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Jan 08 19:35:56 c1d7b983e36d kernel: [drm] psp command failed and response status is (0x7)
Jan 08 19:36:05 c1d7b983e36d kernel: show_signal_msg: 231 callbacks suppressed
Jan 08 19:36:05 c1d7b983e36d kernel: GpuWatchdog[3182]: segfault at 0 ip 000055f162597c07 sp 00007fe841122700 error 6 in chromium-browser[55f15d5fd000+903c000]
Jan 08 19:36:05 c1d7b983e36d kernel: Code: 7d b7 00 79 09 48 8b 7d a0 e8 05 1b 7f fe 8b 83 00 01 00 00 85 c0 0f 84 91 00 00 00 48 8b 03 48 89 df be 01 00 00 00 ff 50 68 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 47 d5 99 04 01 80 7d 87 00
Jan 08 19:36:20 c1d7b983e36d kernel: [drm] psp command failed and response status is (0x0)
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load asd failed!
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Jan 08 19:36:20 c1d7b983e36d kernel: [drm] Skip scheduling IBs!
Jan 08 19:36:20 c1d7b983e36d kernel: ------------[ cut here ]------------
Jan 08 19:36:20 c1d7b983e36d kernel: WARNING: CPU: 3 PID: 2748 at /build/linux-hwe-5.4-6nUBUV/linux-hwe-5.4-5.4.0/include/linux/dma-fence.h:533 drm_sched_resubmit_jobs+0x14c/0x160 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat br_netfilter bridge stp llc aufs overlay joydev edac_mce_amd kvm_amd ccp kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio k1
Jan 08 19:36:20 c1d7b983e36d kernel: async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage amdgpu crct10dif_pclmul amd_iommu_v2 gpu_sched crc32_pclmul i2c_algo_bit ghash_clm
Jan 08 19:36:20 c1d7b983e36d kernel: CPU: 3 PID: 2748 Comm: kworker/3:1 Not tainted 5.4.0-45-generic #49~18.04.2-Ubuntu
Jan 08 19:36:20 c1d7b983e36d kernel: Hardware name: Advantech Co Ltd. DPX-W258/DPX-W258, BIOS W2580000D60X016 12/13/2019
Jan 08 19:36:20 c1d7b983e36d kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: RIP: 0010:drm_sched_resubmit_jobs+0x14c/0x160 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: Code: 41 5c 41 5d 41 5e 41 5f 5d c3 49 8b 47 10 31 c9 48 c7 80 80 00 00 00 00 00 00 00 49 8b 7d 70 31 c0 83 e7 01 74 04 0f 0b eb bf <0f> 0b eb c7 0f 0b eb 8a 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Jan 08 19:36:20 c1d7b983e36d kernel: RSP: 0018:ffffbf64c1eefd28 EFLAGS: 00010246
Jan 08 19:36:20 c1d7b983e36d kernel: RAX: 0000000000000000 RBX: ffff9c7b07507980 RCX: 0000000000000000
Jan 08 19:36:20 c1d7b983e36d kernel: RDX: ffff9c7aa253bf00 RSI: ffff9c7b109fa5f8 RDI: 0000000000000000
Jan 08 19:36:20 c1d7b983e36d kernel: RBP: ffffbf64c1eefd60 R08: 0000000000000604 R09: 0000000000000004
Jan 08 19:36:20 c1d7b983e36d kernel: R10: ffffbf64c1eefc98 R11: 0000000000000001 R12: 0000000000000001
Jan 08 19:36:20 c1d7b983e36d kernel: R13: ffff9c7aa253bec0 R14: ffff9c7b064bd400 R15: ffff9c7b109fa400
Jan 08 19:36:20 c1d7b983e36d kernel: FS: 0000000000000000(0000) GS:ffff9c7b17ac0000(0000) knlGS:0000000000000000
Jan 08 19:36:20 c1d7b983e36d kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 08 19:36:20 c1d7b983e36d kernel: CR2: 000000c420093010 CR3: 00000001cb45e000 CR4: 00000000003406e0
Jan 08 19:36:20 c1d7b983e36d kernel: Call Trace:
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu_device_gpu_recover+0x38e/0xa50 [amdgpu]
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu_job_timedout+0x116/0x140 [amdgpu]
Jan 08 19:36:20 c1d7b983e36d kernel: drm_sched_job_timedout+0x44/0x90 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: ? __schedule+0x29b/0x720
Jan 08 19:36:20 c1d7b983e36d kernel: ? drm_sched_job_timedout+0x44/0x90 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: process_one_work+0x20f/0x400
Jan 08 19:36:20 c1d7b983e36d kernel: worker_thread+0x34/0x410
Jan 08 19:36:20 c1d7b983e36d kernel: kthread+0x121/0x140
Jan 08 19:36:20 c1d7b983e36d kernel: ? process_one_work+0x400/0x400
Jan 08 19:36:20 c1d7b983e36d kernel: ? kthread_park+0x90/0x90
Jan 08 19:36:20 c1d7b983e36d kernel: ret_from_fork+0x22/0x40
Jan 08 19:36:20 c1d7b983e36d kernel: ---[ end trace 1cca7ed603c38222 ]---
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu 0000:05:00.0: couldn't schedule ib on ring <gfx>
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu 0000:05:00.0: GPU reset(3) failed
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu 0000:05:00.0: GPU reset end with ret = -22
Jan 08 19:36:20 c1d7b983e36d kernel: [drm] Skip scheduling IBs!
Jan 08 19:36:20 c1d7b983e36d kernel: ------------[ cut here ]------------
Jan 08 19:36:20 c1d7b983e36d kernel: WARNING: CPU: 3 PID: 225 at /build/linux-hwe-5.4-6nUBUV/linux-hwe-5.4-5.4.0/include/linux/dma-fence.h:533 drm_sched_main+0x2d5/0x310 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat br_netfilter bridge stp llc aufs overlay joydev edac_mce_amd kvm_amd ccp kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio k1
Jan 08 19:36:20 c1d7b983e36d kernel: async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage amdgpu crct10dif_pclmul amd_iommu_v2 gpu_sched crc32_pclmul i2c_algo_bit ghash_clm
Jan 08 19:36:20 c1d7b983e36d kernel: CPU: 3 PID: 225 Comm: gfx Tainted: G W 5.4.0-45-generic #49~18.04.2-Ubuntu
Jan 08 19:36:20 c1d7b983e36d kernel: Hardware name: Advantech Co Ltd. DPX-W258/DPX-W258, BIOS W2580000D60X016 12/13/2019
Jan 08 19:36:20 c1d7b983e36d kernel: RIP: 0010:drm_sched_main+0x2d5/0x310 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: Code: 3b e9 ff e9 f5 fe ff ff 48 89 de 4c 89 ef e8 32 fb ff ff e9 e5 fe ff ff 49 8b 56 70 45 31 ed 31 c0 83 e2 01 74 04 0f 0b eb a8 <0f> 0b eb b4 48 8d 75 a8 4c 89 ef 48 89 45 98 e8 27 6c 7a d3 48 8b
Jan 08 19:36:20 c1d7b983e36d kernel: RSP: 0018:ffffbf64c04fbe98 EFLAGS: 00010246
Jan 08 19:36:20 c1d7b983e36d kernel: RAX: 0000000000000000 RBX: ffff9c7b109f8858 RCX: 0000000000000018
Jan 08 19:36:20 c1d7b983e36d kernel: RDX: 0000000000000000 RSI: 0000000000000282 RDI: 0000000000000282
Jan 08 19:36:20 c1d7b983e36d kernel: RBP: ffffbf64c04fbf00 R08: 0000000000000628 R09: 0000000000000004
Jan 08 19:36:20 c1d7b983e36d kernel: R10: ffff9c7a4963e040 R11: 0000000000000001 R12: ffff9c7b07507980
Jan 08 19:36:20 c1d7b983e36d kernel: R13: 0000000000000000 R14: ffff9c7aa253ae40 R15: ffff9c7b07507b18
Jan 08 19:36:20 c1d7b983e36d kernel: FS: 0000000000000000(0000) GS:ffff9c7b17ac0000(0000) knlGS:0000000000000000
Jan 08 19:36:20 c1d7b983e36d kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 08 19:36:20 c1d7b983e36d kernel: CR2: 000000c420093010 CR3: 00000001cb45e000 CR4: 00000000003406e0
Jan 08 19:36:20 c1d7b983e36d kernel: Call Trace:
Jan 08 19:36:20 c1d7b983e36d kernel: ? wait_woken+0x80/0x80
Jan 08 19:36:20 c1d7b983e36d kernel: kthread+0x121/0x140
Jan 08 19:36:20 c1d7b983e36d kernel: ? drm_sched_start+0x130/0x130 [gpu_sched]
Jan 08 19:36:20 c1d7b983e36d kernel: ? kthread_park+0x90/0x90
Jan 08 19:36:20 c1d7b983e36d kernel: ret_from_fork+0x22/0x40
Jan 08 19:36:20 c1d7b983e36d kernel: ---[ end trace 1cca7ed603c38223 ]---
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu 0000:05:00.0: couldn't schedule ib on ring <gfx>
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jan 08 19:36:20 c1d7b983e36d kernel: amdgpu 0000:05:00.0: couldn't schedule ib on ring <gfx>
Jan 08 19:36:20 c1d7b983e36d kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
<snip - lines above are repeated many times>
-- Reboot --
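For anyone reading the trace, the MORE_FAULTS / WALKER_ERROR / PERMISSION_FAULTS / MAPPING_ERROR / RW lines are just bit fields of the raw VM_L2_PROTECTION_FAULT_STATUS value. Below is a minimal sketch that reproduces them from 0x00341051. The bit positions are an assumption on my part (my reading of the register layout for this GPU generation, not something printed in the log), and the names `FIELDS` and `decode_fault_status` are made up for illustration. Bits not listed are simply left undecoded.

```python
# Assumed layout of VM_L2_PROTECTION_FAULT_STATUS: (field name, shift, width).
# These positions are an assumption; they are chosen to agree with the
# per-field values the kernel prints in the log above.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw fault status value into the named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # Raw value taken from the log lines above.
    for name, value in decode_fault_status(0x00341051).items():
        print(f"{name}: {value:#x}")
```

With this layout the decoded values agree with what the kernel printed (MORE_FAULTS 0x1, WALKER_ERROR 0x0, PERMISSION_FAULTS 0x5, MAPPING_ERROR 0x0, RW 0x1), and the VMID field comes out as 3, matching `vmid:3` in the "retry page fault" line.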
Regression
Did it used to work? No, it has always done this.