GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]

Rémi Verschelde uploaded an attachment:

Attachment 143951, "dmesg output after GPU crash in game StarCrawlers with kernel 5.0.7-desktop from Mageia 7":
dmesg-starcrawlers-gpu-crash_mageia-linux-5.0.7-desktop.log

Rémi Verschelde uploaded an attachment:

Built with the same .config as Mageia's 5.0.7-desktop kernel, see next attachment.

Attachment 143952, "dmesg output after GPU crash in game For The King with kernel 5.0-rc1 built from amd-staging-drm-next":
dmesg-ftk-gpu-crash_amd-staging-drm-next.log

Rémi Verschelde uploaded an attachment:

Attachment 143953, "/proc/config.gz from Mageia's kernel 5.0.7-desktop, used for custom amd-staging-drm-next build":
kernel-config-mageia-5.0.7-desktop.txt

Rémi Verschelde said:

Pasting some relevant output from attachment 143951 so that relevant keywords can be found by Bugzilla searches.

[  325.087186] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
[  325.087187] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
[  325.087188] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087189] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087224] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087225] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087226] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087226] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087227] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087228] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.089212] mce: CPU7: Core temperature/speed normal
[  325.089213] mce: CPU0: Package temperature/speed normal
[  325.089214] mce: CPU3: Core temperature/speed normal
[  325.089214] mce: CPU4: Package temperature/speed normal
[  325.089215] mce: CPU7: Package temperature/speed normal
[  325.089215] mce: CPU3: Package temperature/speed normal
[  325.089248] mce: CPU6: Package temperature/speed normal
[  325.089248] mce: CPU5: Package temperature/speed normal
[  325.089249] mce: CPU2: Package temperature/speed normal
[  325.089250] mce: CPU1: Package temperature/speed normal
[  565.312183] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0040d508 for process  pid 0 thread  pid 0
[  565.312194] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00169208
[  565.312200] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312209] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 1479176, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312219] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00405508 for process  pid 0 thread  pid 0
[  565.312224] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312229] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312236] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312244] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00485508 for process  pid 0 thread  pid 0
[  565.312248] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312252] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312258] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)

<snip>

[  565.312378] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00785508 for process  pid 0 thread  pid 0
[  565.312383] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312387] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312393] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  575.625913] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117668, emitted seq=117670
[  575.625950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  575.625953] amdgpu 0000:01:00.0: GPU reset begin!
[  575.626419] amdgpu: [powerplay]
                last message was failed ret is 65535
[  575.626420] amdgpu: [powerplay]
                failed to send message 281 ret is 65535
[  575.636259] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
[  575.651311] amdgpu: [powerplay]
                last message was failed ret is 65535
[  575.651312] amdgpu: [powerplay]
                failed to send message 133 ret is 65535
[  575.651316] amdgpu: [powerplay]
                last message was failed ret is 65535
[  575.651316] amdgpu: [powerplay]
                failed to send message 310 ret is 65535
[  575.651317] amdgpu: [powerplay]
                last message was failed ret is 65535
[  575.651317] amdgpu: [powerplay]
                failed to send message 5e ret is 65535

<snip>

[  575.651340] amdgpu: [powerplay]
                last message was failed ret is 65535
[  575.651341] amdgpu: [powerplay]
                failed to send message 84 ret is 65535
[  575.651341] amdgpu: [powerplay] Failed to force to switch arbf0!
[  575.651342] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[  575.651360] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[  575.769673] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  575.769740] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  575.888355] cp is busy, skip halt cp
[  576.007183] rlc is busy, skip halt rlc
[  576.008188] amdgpu 0000:01:00.0: GPU pci config reset
[  576.126260] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset failed with err r, -22 for drm dev, 0000:01:00.0
[  576.127736] Asynchronous wait on fence drm_sched:gfx:1ca87 timed out (hint:submit_notify+0x0/0x58 [i915])
[  576.127768] Asynchronous wait on fence drm_sched:gfx:1ca82 timed out (hint:submit_notify+0x0/0x58 [i915])
[  576.127788] Asynchronous wait on fence i915:Xorg[3673]/0:6455 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])
[  581.126683] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[  581.126734] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D654 (len 62, WS 0, PS 0) @ 0xD670
[  581.126754] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC42B
[  581.126755] [drm] asic atom init failed!
[  581.126765] amdgpu 0000:01:00.0: GPU reset(2) failed
[  581.126766] amdgpu 0000:01:00.0: GPU reset end with ret = -22
[  581.126777] [drm] Skip scheduling IBs!
[  581.126782] [drm] Skip scheduling IBs!
[  581.126784] [drm] Skip scheduling IBs!
[  581.126785] [drm] Skip scheduling IBs!
[  581.126786] [drm] Skip scheduling IBs!
[  581.126787] [drm] Skip scheduling IBs!
[  581.126789] [drm] Skip scheduling IBs!
[  581.126790] [drm] Skip scheduling IBs!
[  581.126791] [drm] Skip scheduling IBs!
[  591.487678] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117670, emitted seq=117670
[  591.487716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  591.487719] amdgpu 0000:01:00.0: GPU reset begin!
[  591.488418] amdgpu: [powerplay]
                last message was failed ret is 65535
[  591.488419] amdgpu: [powerplay]
                failed to send message 281 ret is 65535
[  591.488495] WARNING: CPU: 2 PID: 666 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:788 dm_suspend+0x4e/0x60 [amdgpu]
[  591.488496] Modules linked in: cmac rfcomm ccm msr ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_raw nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat nf_nat_ipv4 xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter af_packet bnep binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat dm_mirror dm_region_hash dm_log dm_mod snd_hda_codec_hdmi arc4 joydev
[  591.488509]  intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm hid_sensor_incl_3d hid_sensor_gyro_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio irqbypass hid_multitouch crc32_pclmul crc32c_intel ghash_clmulni_intel spi_pxa2xx_platform 8250_dw iwlmvm hid_sensor_hub aesni_intel iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_realtek hid_generic aes_x86_64 input_leds tpm_crb crypto_simd cryptd snd_hda_codec_generic glue_helper ledtrig_audio intel_cstate psmouse intel_uncore iwlwifi snd_hda_intel thermal snd_hda_codec uvcvideo btusb snd_hda_core btbcm videobuf2_vmalloc btrtl videobuf2_memops videobuf2_v4l2 btintel videobuf2_common cfg80211 snd_hwdep videodev snd_pcm bluetooth media snd_timer intel_rapl_perf pinctrl_sunrisepoint ucsi_acpi typec_ucsi usbhid typec tpm_tis pinctrl_intel intel_wmi_thunderbolt snd tpm_tis_core hp_wmi soundcore tpm wmi_bmof idma64 ecdh_generic
[  591.488521]  int3400_thermal battery virt_dma button acpi_thermal_rel rtsx_pci_ms intel_vbtn i2c_i801 acpi_pad hp_wireless ac rfkill sparse_keymap int3403_thermal memstick mei_me mei intel_lpss_pci intel_pch_thermal intel_lpss processor_thermal_device intel_ishtp_hid int340x_thermal_zone intel_soc_dts_iosf evdev nvram sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt autofs4 amdgpu xhci_pci rtsx_pci_sdmmc xhci_hcd mmc_block mmc_core usbcore serio_raw chash amd_iommu_v2 rtsx_pci gpu_sched intel_ish_ipc ttm intel_ishtp usb_common i915 i2c_hid hid i2c_algo_bit drm_kms_helper wmi video drm
[  591.488549] CPU: 2 PID: 666 Comm: kworker/2:2 Not tainted 5.0.7-desktop-4.mga7 #1
[  591.488550] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018
[  591.488552] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  591.488627] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[  591.488627] Code: 00 48 89 83 70 cb 00 00 e8 af fc ff ff 48 89 df e8 67 75 00 00 48 8b bb 60 b3 00 00 be 08 00 00 00 e8 16 8f 0a 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[  591.488628] RSP: 0018:ffffb50201f97d20 EFLAGS: 00010282
[  591.488629] RAX: ffffffffc08a3e00 RBX: ffff93f4a35c0000 RCX: 0000000000000012
[  591.488629] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff93f4a35c0000
[  591.488629] RBP: ffff93f4a35ccb98 R08: 0000000000000492 R09: 0000000000000004
[  591.488630] R10: 0000000000000000 R11: 0000000000000001 R12: ffff93f4a35c0000
[  591.488630] R13: ffffffffc09e25a0 R14: 0000000000000000 R15: ffff93f4a35c3498
[  591.488631] FS:  0000000000000000(0000) GS:ffff93f4b1c80000(0000) knlGS:0000000000000000
[  591.488631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  591.488632] CR2: 00007f8c18a40a38 CR3: 000000033220e002 CR4: 00000000003606e0
[  591.488632] Call Trace:
[  591.488676]  amdgpu_device_ip_suspend_phase1+0x94/0xc0 [amdgpu]
[  591.488721]  amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[  591.488796]  amdgpu_device_pre_asic_reset+0x9e/0x260 [amdgpu]
[  591.488817]  amdgpu_device_gpu_recover+0x87/0x7e0 [amdgpu]
[  591.488828]  ? drm_err+0x72/0x90 [drm]
[  591.488882]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[  591.488884]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[  591.488887]  process_one_work+0x200/0x400
[  591.488888]  worker_thread+0x2d/0x3d0
[  591.488889]  ? process_one_work+0x400/0x400
[  591.488891]  kthread+0x112/0x130
[  591.488892]  ? kthread_create_on_node+0x60/0x60
[  591.488894]  ret_from_fork+0x35/0x40
[  591.488895] ---[ end trace 356c1ae357df635c ]---
[  591.499325] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
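
A small sketch for reading the numbers in the excerpt above. The helper names `decode_errno` and `fault_page` are hypothetical, not part of any kernel tooling. The errno table covers only the two Linux values that appear (negated) in this log, and the regular expression assumes the "VM fault ... at page N" wording shown above.

```python
import re

# Linux errno values that appear (negated) in the log excerpt.
# Only these two are listed; this is not a complete table.
LINUX_ERRNO = {22: "EINVAL", 110: "ETIMEDOUT"}


def decode_errno(code: int) -> str:
    """Map a kernel return code such as -110 to its Linux errno name."""
    return LINUX_ERRNO.get(abs(code), f"unknown ({code})")


def fault_page(line: str):
    """Return the page number from a 'VM fault ... at page N' line, or None."""
    match = re.search(r"VM fault .* at page (\d+)", line)
    return int(match.group(1)) if match else None


sample = ("amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) "
          "at page 1479176, write from ...")

print(decode_errno(-110))       # ETIMEDOUT
print(decode_errno(-22))        # EINVAL
print(hex(fault_page(sample)))  # 0x169208
```

The page number in the first fault line is the decimal form of the VM_CONTEXT1_PROTECTION_FAULT_ADDR value 0x00169208. In the same way, page 4294967295 is 0xFFFFFFFF and the powerplay return value 65535 is 0xFFFF.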

Rémi Verschelde said:

Tried another game (Northgard) today, same issue. I manually enabled the performance CPU governor, but it didn't prevent the GPU crash, which happened about 10 minutes into the game.

I'm attaching the dmesg, journalctl and Xorg.1.log taken right after the crash (before deadlock).

Rémi Verschelde uploaded an attachment:

Worth noting: unlike For The King and StarCrawlers, Northgard is not a Unity3D game. It uses the Haxe/Heaps engine.

Attachment 143958, "dmesg output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
dmesg-northgard-kernel-5.0.7-desktop.log

Rémi Verschelde uploaded an attachment:

As can be seen in these logs, I'm running Plasma 5/KWin. Some messages from plasmashell regarding temperature/sensors are likely due to the widget I use to monitor the CPU and GPU temperatures in the taskbar.

Attachment 143959, "journalctl -b output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
journalctl-northgard-kernel-5.0.7-desktop.log

Rémi Verschelde uploaded an attachment:

This seems to cover only the startup of the computer. The rest of the log seems to be in Xorg.1.log; I guess DRI_PRIME=1 does that.

Attachment 143960, "Xorg.0.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
Xorg.0-northgard-kernel-5.0.7-desktop.log

Rémi Verschelde uploaded an attachment:

Attachment 143961, "Xorg.1.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
Xorg.1-northgard-kernel-5.0.7-desktop.log

Alex Behling said:

In my experience, this seems to be a thermal problem. I have the exact same hardware configuration, running the latest Arch Linux kernel.

$ uname -a
Linux lexnote 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux

If I leave the system with the default PM settings (the Performance or Balance profile makes no difference), sooner or later I get lock-ups in any game or application that puts a higher load on the GPU.

EXAMPLE DMESG OUTPUT:

[Do Jul 25 23:33:45 2019] amdgpu 0000:01:00.0: GPU pci config reset
[Do Jul 25 23:33:53 2019] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[Do Jul 25 23:33:53 2019] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[Do Jul 25 23:33:53 2019] [drm] schedsdma0 is not ready, skipping
[Do Jul 25 23:33:53 2019] [drm] schedsdma1 is not ready, skipping
[Do Jul 25 23:33:59 2019] WARNING: CPU: 1 PID: 20969 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:891 dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Modules linked in: msr fuse 8021q garp mrp stp llc ccm snd_hda_codec_hdmi hid_sensor_gyro_3d hid_sensor_accel_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_incl_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub intel_ishtp_loader intel_ishtp_hid arc4 iwlmvm mousedev cdc_ether usbnet r8152 xpad ff_memless joydev mii mac80211 uvcvideo btusb videobuf2_vmalloc hid_logitech_hidpp videobuf2_memops btrtl btbcm nls_iso8859_1 videobuf2_v4l2 btintel nls_cp437 videobuf2_common bluetooth vfat fat videodev media spi_pxa2xx_platform ecdh_generic iTCO_wdt 8250_dw hid_multitouch ecc mei_hdcp iTCO_vendor_support iwlwifi intel_rapl hp_wmi x86_pkg_temp_thermal wmi_bmof intel_powerclamp intel_wmi_thunderbolt coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm psmouse input_leds snd_hda_intel cfg80211 irqbypass intel_cstate snd_hda_codec snd_hda_core intel_uncore snd_hwdep intel_rapl_perf snd_pcm
[Do Jul 25 23:33:59 2019] rtsx_pci_ms memstick snd_timer pcspkr mei_me intel_ish_ipc processor_thermal_device snd idma64 int3403_thermal ucsi_acpi i2c_i801 soundcore typec_ucsi rfkill intel_lpss_pci mei tpm_crb int340x_thermal_zone intel_pch_thermal intel_ishtp intel_soc_dts_iosf intel_lpss i2c_hid typec wmi tpm_tis tpm_tis_core tpm rng_core intel_vbtn battery sparse_keymap hp_wireless evdev mac_hid int3400_thermal acpi_thermal_rel ac pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 algif_skcipher af_alg hid_logitech_dj hid_generic usbhid hid dm_crypt crct10dif_pclmul crc32_pclmul dm_mod crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 ahci libahci aesni_intel libata aes_x86_64 crypto_simd cryptd xhci_pci glue_helper scsi_mod xhci_hcd rtsx_pci i8042 serio amdgpu amd_iommu_v2 gpu_sched ttm i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[Do Jul 25 23:33:59 2019] agpgart
[Do Jul 25 23:33:59 2019] CPU: 1 PID: 20969 Comm: kworker/1:2 Tainted: G OE 5.2.1-arch1-1-ARCH #1
[Do Jul 25 23:33:59 2019] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018
[Do Jul 25 23:33:59 2019] Workqueue: pm pm_runtime_work
[Do Jul 25 23:33:59 2019] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Code: 00 48 89 83 70 e9 00 00 e8 9f fc ff ff 48 89 df e8 97 83 00 00 48 8b bb 70 cf 00 00 be 08 00 00 00 e8 b6 9a 08 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[Do Jul 25 23:33:59 2019] RSP: 0018:ffffb286869cfcb8 EFLAGS: 00010282
[Do Jul 25 23:33:59 2019] RAX: ffffffffc0675ed0 RBX: ffffa1b9e0d30000 RCX: ffffffffc073e980
[Do Jul 25 23:33:59 2019] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] RBP: ffffa1b9e0d3e998 R08: 0000000000000001 R09: 0000000000000018
[Do Jul 25 23:33:59 2019] R10: fefefefefefefeff R11: 0000000000000000 R12: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa1b9ebc8bd80
[Do Jul 25 23:33:59 2019] FS: 0000000000000000(0000) GS:ffffa1b9eea40000(0000) knlGS:0000000000000000
[Do Jul 25 23:33:59 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 25 23:33:59 2019] CR2: 00007f1d9c02b000 CR3: 0000000469058002 CR4: 00000000003606e0
[Do Jul 25 23:33:59 2019] Call Trace:
[Do Jul 25 23:33:59 2019] amdgpu_device_ip_suspend_phase1+0x8e/0xc0 [amdgpu]
[Do Jul 25 23:33:59 2019] amdgpu_device_suspend+0x234/0x390 [amdgpu]
[Do Jul 25 23:33:59 2019] amdgpu_pmops_runtime_suspend+0x41/0xb0 [amdgpu]
[Do Jul 25 23:33:59 2019] pci_pm_runtime_suspend+0x5b/0x150
[Do Jul 25 23:33:59 2019] ? __switch_to_asm+0x40/0x70
[Do Jul 25 23:33:59 2019] vga_switcheroo_runtime_suspend+0x25/0xb0
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] __rpm_callback+0x7b/0x130
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] rpm_callback+0x2a/0x90
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] rpm_suspend+0x136/0x610
[Do Jul 25 23:33:59 2019] pm_runtime_work+0x94/0xa0
[Do Jul 25 23:33:59 2019] process_one_work+0x1d1/0x3e0
[Do Jul 25 23:33:59 2019] worker_thread+0x4a/0x3d0
[Do Jul 25 23:33:59 2019] kthread+0xfd/0x130
[Do Jul 25 23:33:59 2019] ? process_one_work+0x3e0/0x3e0
[Do Jul 25 23:33:59 2019] ? kthread_park+0x90/0x90
[Do Jul 25 23:33:59 2019] ret_from_fork+0x35/0x40
[Do Jul 25 23:33:59 2019] ---[ end trace 69a711ec632dab70 ]---
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable SCLK DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable voltage DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Failed to force to switch arbf0!
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[Do Jul 25 23:33:59 2019] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[Do Jul 25 23:34:00 2019] cp is busy, skip halt cp
[Do Jul 25 23:34:00 2019] rlc is busy, skip halt rlc
[Do Jul 25 23:34:00 2019] amdgpu 0000:01:00.0: GPU pci config reset

If I reduce the maximum CPU frequency to 2.2GHz, which keeps the temperatures of the CPU cores and the GPU just below 60 degrees Celsius, the problem no longer occurs.

$ sudo cpupower frequency-set -u 2.2GHz
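
To check whether a frequency cap like this keeps temperatures under the 60 degree mark, the sensors can be polled through the kernel's hwmon sysfs interface, which reports `temp*_input` values in millidegrees Celsius. The following is a minimal sketch under that assumption. The function names are hypothetical, the 60 degree limit is simply the figure quoted above, and which hwmon entries belong to the CPU or to amdgpu varies from machine to machine.

```python
from pathlib import Path


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a hwmon temp*_input reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def read_temperatures(root: str = "/sys/class/hwmon") -> dict:
    """Return {sensor path: degrees Celsius} for every readable temp*_input file."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            temps[str(sensor)] = millidegrees_to_celsius(sensor.read_text())
        except (OSError, ValueError):
            # Some sensors fail to read or return non-numeric data; skip them.
            continue
    return temps


def over_limit(temps: dict, limit_c: float = 60.0) -> dict:
    """Return only the sensors at or above limit_c."""
    return {name: value for name, value in temps.items() if value >= limit_c}


# Print any sensor currently at or above the limit.
for name, value in over_limit(read_temperatures()).items():
    print(f"{name}: {value:.1f} C")
```

Running this in a loop while a game is active would show whether the crash correlates with a sensor crossing the limit, though it cannot prove that temperature is the cause.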

Utku Helvacı (tuxutku) uploaded an attachment:

Kernel 5.3.0-rc1 worked fine and had just fixed a long-standing regression on the RX 540 GPU. Updating to 5.3.0-rc2 causes the GPU to be disabled after launching a single application on it: the GPU works fine until the application is closed, and after that DRI_PRIME=1 no longer works.

Attachment 144926, "journalctl -b0 output on kernel 5.3.0-rc2 from ubuntu mainline repository, with a system with rx 540 gpu":
rx_540_5.3.0-rc2.txt

Utku Helvacı (tuxutku) said:

As it turns out, this is not a bug in the kernel but in AMD's ACO compiler, so it's irrelevant here.

This issue hasn't had any activity since 2019-11-19. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.

closed

mentioned in issue #2892

Submitted by Rémi Verschelde
