GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]
Submitted by Rémi Verschelde
Assigned to Default DRI bug account
Link to original bug (#110413)
Description
Created attachment 143950
lspci -vvv output for HP Spectre 360x
My HP Spectre x360 laptop bought in March 2019 comes with KabyLake G HD Graphics 630 and a discrete AMD Radeon RX Vega M GL GPU.
I only enable the Radeon GPU when needed to play graphics-intensive games with DRI_PRIME=1, and so far I have experienced many GPU deadlocks with the following symptoms:
- Temperatures rise, the CPUs are throttled. Framerate drops when this happens.
- Later on, GPU faults are reported in dmesg, the game's rendering freezes (but music continues playing). I am still able to alt+tab back to desktop or open a terminal, but the game's process can't be killed. If I'm monitoring temperatures, lm_sensors always reports a bogus 511°C temperature for the AMD dGPU at this point, before breaking.
- Any subsequent attempt at using the AMD GPU will cause a system deadlock, and I need to force shutdown with the power button.
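The bogus 511°C reading is an easy signal to watch for while testing. Below is a minimal Python sketch, assuming the sensors are exposed through the standard hwmon sysfs interface (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius). The 150 °C plausibility threshold and the function names are illustrative assumptions, not something taken from the amdgpu driver or from lm_sensors.

```python
from pathlib import Path

# Assumption: readings at or above this value are treated as implausible.
# The threshold is illustrative, not taken from the amdgpu driver.
BOGUS_THRESHOLD_C = 150.0


def millidegrees_to_celsius(raw):
    """Convert a hwmon temp*_input value (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def is_plausible(temp_c, limit_c=BOGUS_THRESHOLD_C):
    """Return False for readings such as the 511 degC seen after the crash."""
    return -40.0 <= temp_c < limit_c


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Map '<chip name>/<sensor file>' to degrees Celsius for each sensor."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        try:
            name = (chip / "name").read_text().strip()
        except OSError:
            continue
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                value = millidegrees_to_celsius(sensor.read_text())
            except (OSError, ValueError):
                # A sensor can fail to read, e.g. while the dGPU is powered down.
                continue
            temps[f"{name}/{sensor.name}"] = value
    return temps


if __name__ == "__main__":
    for label, temp in read_hwmon_temps().items():
        note = "" if is_plausible(temp) else "  <-- implausible reading"
        print(f"{label}: {temp:.1f} degC{note}")
```

Running this in a loop alongside a game would show whether the implausible value appears before or after the first GPU fault in dmesg.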
My testing so far has covered:
- Unity3D games like For The King or StarCrawlers. The crash happens mid-game, not in a strictly reproducible manner, but seems related to CPU temperature/throttling.
* I could also reproduce the crash with SuperTuxKart, not in-game but when alt-tabbing back to desktop.
* I could not get the crash yet with glmark2. With For The King, I can reliably get a crash within 1 to 10 minutes in-game when playing with "High" or "Dream" graphics quality.
- Kernel 5.0.x (up to 5.0.7) from Mageia 7 (Cauldron), e.g. 5.0.7-desktop-4.mga7.
* I also tried `git://people.freedesktop.org/~agd5f/linux -b amd-staging-drm-next` at b07c394a327fc9e435ee03288584c111fa73d963, but I still got the same symptoms. dmesg output was in part different though, more spammy.
* Following discussions in bug 109692, I tried the patches provided by Andrey Grodzovsky in bug 109692 comment 34, but they did not solve the issue for me.
- Mesa 19.0.0 to 19.0.2 built against LLVM 7.0.1.
- Suspecting the CPU temperature/throttling as a trigger, I'm using https://github.com/kitsunyan/intel-undervolt to undervolt the CPU Cache by -100 mV and set the CPU limit temperature to 80°C instead of 100°C. This has helped with throttling issues I had during code compilation, but no visible change on my GPU crashes that I can tell. I can disable this undervolting when doing tests if required.
I found various bug reports which might well be duplicates, but I'm opening my own to avoid hijacking discussions on what may or may not be the same root cause: bug 109461, bug 109466, bug 109692 (I installed Shadow of the Tomb Raider but haven't checked if I can reproduce this one's symptoms yet), bug 109819.
I attach some relevant logs on the system and the bug. Please ask for anything else you may need.
**Attachment 143950**, "lspci -vvv output for HP Spectre 360x":
lspci-vvv.log
Activity
- Bugzilla Migration User added AMDgpu bugzilla labels
Rémi Verschelde uploaded an attachment:
Attachment 143951, "dmesg output after GPU crash in game StarCrawlers with kernel 5.0.7-desktop from Mageia 7":
dmesg-starcrawlers-gpu-crash_mageia-linux-5.0.7-desktop.log
Rémi Verschelde uploaded an attachment:
Built with the same .config as Mageia's 5.0.7-desktop kernel, see next attachment.
Attachment 143952, "dmesg output after GPU crash in game For The King with kernel 5.0-rc1 built from amd-staging-drm-next":
dmesg-ftk-gpu-crash_amd-staging-drm-next.log
Rémi Verschelde uploaded an attachment:
Attachment 143953, "/proc/config.gz from Mageia's kernel 5.0.7-desktop, used for custom amd-staging-drm-next build":
kernel-config-mageia-5.0.7-desktop.txt
Rémi Verschelde said:
Pasting some relevant output from attachment 143951 so that relevant keywords can be found by Bugzilla searches.
[ 325.087186] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087187] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087188] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087189] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087224] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087225] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087226] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087226] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087227] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.087228] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)<br> [ 325.089212] mce: CPU7: Core temperature/speed normal<br> [ 325.089213] mce: CPU0: Package temperature/speed normal<br> [ 325.089214] mce: CPU3: Core temperature/speed normal<br> [ 325.089214] mce: CPU4: Package temperature/speed normal<br> [ 325.089215] mce: CPU7: Package temperature/speed normal<br> [ 325.089215] mce: CPU3: Package temperature/speed normal<br> [ 325.089248] mce: CPU6: Package temperature/speed normal<br> [ 325.089248] mce: CPU5: Package temperature/speed normal<br> [ 325.089249] mce: CPU2: Package temperature/speed normal<br> [ 325.089250] mce: CPU1: Package temperature/speed normal<br> [ 565.312183] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0040d508 for process pid 0 thread pid 0<br> [ 565.312194] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00169208<br> [ 565.312200] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF<br> [ 565.312209] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 
0) at page 1479176, write from '\xff\xff\xff\xff' (0xffffffff) (511)<br> [ 565.312219] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00405508 for process pid 0 thread pid 0<br> [ 565.312224] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0xFFFFFFFF<br> [ 565.312229] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF<br> [ 565.312236] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)<br> [ 565.312244] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00485508 for process pid 0 thread pid 0<br> [ 565.312248] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0xFFFFFFFF<br> [ 565.312252] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF<br> [ 565.312258] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)<br> <br> <snip><br> <br> [ 565.312378] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00785508 for process pid 0 thread pid 0<br> [ 565.312383] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0xFFFFFFFF<br> [ 565.312387] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF<br> [ 565.312393] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)<br> [ 575.625913] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117668, emitted seq=117670<br> [ 575.625950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162<br> [ 575.625953] amdgpu 0000:01:00.0: GPU reset begin!<br> [ 575.626419] amdgpu: [powerplay] <br> last message was failed ret is 65535<br> [ 575.626420] amdgpu: [powerplay] <br> failed to send message 281 ret is 65535 <br> [ 575.636259] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110<br> [ 575.651311] amdgpu: [powerplay] <br> last message 
was failed ret is 65535<br> [ 575.651312] amdgpu: [powerplay] <br> failed to send message 133 ret is 65535 <br> [ 575.651316] amdgpu: [powerplay] <br> last message was failed ret is 65535<br> [ 575.651316] amdgpu: [powerplay] <br> failed to send message 310 ret is 65535 <br> [ 575.651317] amdgpu: [powerplay] <br> last message was failed ret is 65535<br> [ 575.651317] amdgpu: [powerplay] <br> failed to send message 5e ret is 65535 <br> <br> <snip><br> <br> [ 575.651340] amdgpu: [powerplay] <br> last message was failed ret is 65535<br> [ 575.651341] amdgpu: [powerplay] <br> failed to send message 84 ret is 65535 <br> [ 575.651341] amdgpu: [powerplay] Failed to force to switch arbf0!<br> [ 575.651342] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!<br> [ 575.651360] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22<br> [ 575.769673] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)<br> [ 575.769740] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed<br> [ 575.888355] cp is busy, skip halt cp<br> [ 576.007183] rlc is busy, skip halt rlc<br> [ 576.008188] amdgpu 0000:01:00.0: GPU pci config reset<br> [ 576.126260] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset failed with err r, -22 for drm dev, 0000:01:00.0<br> [ 576.127736] Asynchronous wait on fence drm_sched:gfx:1ca87 timed out (hint:submit_notify+0x0/0x58 [i915])<br> [ 576.127768] Asynchronous wait on fence drm_sched:gfx:1ca82 timed out (hint:submit_notify+0x0/0x58 [i915])<br> [ 576.127788] Asynchronous wait on fence i915:Xorg[3673]/0:6455 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])<br> [ 581.126683] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting<br> [ 581.126734] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D654 (len 62, WS 0, PS 0) @ 0xD670<br> [ 581.126754] 
[drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC42B<br> [ 581.126755] [drm] asic atom init failed!<br> [ 581.126765] amdgpu 0000:01:00.0: GPU reset(2) failed<br> [ 581.126766] amdgpu 0000:01:00.0: GPU reset end with ret = -22<br> [ 581.126777] [drm] Skip scheduling IBs!<br> [ 581.126782] [drm] Skip scheduling IBs!<br> [ 581.126784] [drm] Skip scheduling IBs!<br> [ 581.126785] [drm] Skip scheduling IBs!<br> [ 581.126786] [drm] Skip scheduling IBs!<br> [ 581.126787] [drm] Skip scheduling IBs!<br> [ 581.126789] [drm] Skip scheduling IBs!<br> [ 581.126790] [drm] Skip scheduling IBs!<br> [ 581.126791] [drm] Skip scheduling IBs!<br> [ 591.487678] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117670, emitted seq=117670<br> [ 591.487716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162<br> [ 591.487719] amdgpu 0000:01:00.0: GPU reset begin!<br> [ 591.488418] amdgpu: [powerplay] <br> last message was failed ret is 65535<br> [ 591.488419] amdgpu: [powerplay] <br> failed to send message 281 ret is 65535 <br> [ 591.488495] WARNING: CPU: 2 PID: 666 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:788 dm_suspend+0x4e/0x60 [amdgpu]<br> [ 591.488496] Modules linked in: cmac rfcomm ccm msr ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_raw nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat nf_nat_ipv4 xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp 
nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter af_packet bnep binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat dm_mirror dm_region_hash dm_log dm_mod snd_hda_codec_hdmi arc4 joydev<br> [ 591.488509] intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm hid_sensor_incl_3d hid_sensor_gyro_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio irqbypass hid_multitouch crc32_pclmul crc32c_intel ghash_clmulni_intel spi_pxa2xx_platform 8250_dw iwlmvm hid_sensor_hub aesni_intel iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_realtek hid_generic aes_x86_64 input_leds tpm_crb crypto_simd cryptd snd_hda_codec_generic glue_helper ledtrig_audio intel_cstate psmouse intel_uncore iwlwifi snd_hda_intel thermal snd_hda_codec uvcvideo btusb snd_hda_core btbcm videobuf2_vmalloc btrtl videobuf2_memops videobuf2_v4l2 btintel videobuf2_common cfg80211 snd_hwdep videodev snd_pcm bluetooth media snd_timer intel_rapl_perf pinctrl_sunrisepoint ucsi_acpi typec_ucsi usbhid typec tpm_tis pinctrl_intel intel_wmi_thunderbolt snd tpm_tis_core hp_wmi soundcore tpm wmi_bmof idma64 ecdh_generic<br> [ 591.488521] int3400_thermal battery virt_dma button acpi_thermal_rel rtsx_pci_ms intel_vbtn i2c_i801 acpi_pad hp_wireless ac rfkill sparse_keymap int3403_thermal memstick mei_me mei intel_lpss_pci intel_pch_thermal intel_lpss processor_thermal_device intel_ishtp_hid int340x_thermal_zone intel_soc_dts_iosf evdev nvram sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt autofs4 amdgpu xhci_pci rtsx_pci_sdmmc xhci_hcd mmc_block mmc_core usbcore serio_raw chash amd_iommu_v2 rtsx_pci gpu_sched intel_ish_ipc ttm intel_ishtp usb_common i915 i2c_hid hid i2c_algo_bit drm_kms_helper wmi video drm<br> [ 591.488549] CPU: 2 PID: 666 Comm: 
kworker/2:2 Not tainted 5.0.7-desktop-4.mga7 #1<br> [ 591.488550] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018<br> [ 591.488552] Workqueue: events drm_sched_job_timedout [gpu_sched]<br> [ 591.488627] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]<br> [ 591.488627] Code: 00 48 89 83 70 cb 00 00 e8 af fc ff ff 48 89 df e8 67 75 00 00 48 8b bb 60 b3 00 00 be 08 00 00 00 e8 16 8f 0a 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00<br> [ 591.488628] RSP: 0018:ffffb50201f97d20 EFLAGS: 00010282<br> [ 591.488629] RAX: ffffffffc08a3e00 RBX: ffff93f4a35c0000 RCX: 0000000000000012<br> [ 591.488629] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff93f4a35c0000<br> [ 591.488629] RBP: ffff93f4a35ccb98 R08: 0000000000000492 R09: 0000000000000004<br> [ 591.488630] R10: 0000000000000000 R11: 0000000000000001 R12: ffff93f4a35c0000<br> [ 591.488630] R13: ffffffffc09e25a0 R14: 0000000000000000 R15: ffff93f4a35c3498<br> [ 591.488631] FS: 0000000000000000(0000) GS:ffff93f4b1c80000(0000) knlGS:0000000000000000<br> [ 591.488631] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br> [ 591.488632] CR2: 00007f8c18a40a38 CR3: 000000033220e002 CR4: 00000000003606e0<br> [ 591.488632] Call Trace:<br> [ 591.488676] amdgpu_device_ip_suspend_phase1+0x94/0xc0 [amdgpu]<br> [ 591.488721] amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]<br> [ 591.488796] amdgpu_device_pre_asic_reset+0x9e/0x260 [amdgpu]<br> [ 591.488817] amdgpu_device_gpu_recover+0x87/0x7e0 [amdgpu]<br> [ 591.488828] ? drm_err+0x72/0x90 [drm]<br> [ 591.488882] amdgpu_job_timedout+0xfc/0x120 [amdgpu]<br> [ 591.488884] drm_sched_job_timedout+0x39/0x60 [gpu_sched]<br> [ 591.488887] process_one_work+0x200/0x400<br> [ 591.488888] worker_thread+0x2d/0x3d0<br> [ 591.488889] ? process_one_work+0x400/0x400<br> [ 591.488891] kthread+0x112/0x130<br> [ 591.488892] ? 
kthread_create_on_node+0x60/0x60<br> [ 591.488894] ret_from_fork+0x35/0x40<br> [ 591.488895] ---[ end trace 356c1ae357df635c ]---<br> [ 591.499325] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110<br>
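To line up the sequence of events in logs like the one above (VM faults, then the ring timeout, then the failed reset), something like the following Python sketch could be used. The message patterns are copied from the excerpt. The labels and function names are hypothetical, and the timestamp format assumes plain `dmesg` output with one message per line.

```python
import re

# Patterns copied from messages in the excerpt; the labels are made up here.
PATTERNS = {
    "vm_fault": re.compile(r"GPU fault detected"),
    "ring_timeout": re.compile(r"\*ERROR\* ring \w+ timeout"),
    "reset_begin": re.compile(r"GPU reset begin!"),
    "reset_failed": re.compile(r"GPU reset\(\d+\) failed|ASIC reset failed"),
}

# Matches the "[  565.312183] message" layout of plain dmesg output.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def classify(line):
    """Return (timestamp, label) for a recognised line, otherwise None."""
    match = TIMESTAMP.match(line.strip())
    if match is None:
        return None
    timestamp, message = float(match.group(1)), match.group(2)
    for label, pattern in PATTERNS.items():
        if pattern.search(message):
            return timestamp, label
    return None


def summarize(lines):
    """Map each label to the timestamp of its first occurrence."""
    first_seen = {}
    for line in lines:
        hit = classify(line)
        if hit is not None and hit[1] not in first_seen:
            first_seen[hit[1]] = hit[0]
    return first_seen
```

Passing an open log file to `summarize()` gives one timestamp per event type, which makes it easier to compare the delay between the first fault and the reset attempt across different games or kernels.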
Rémi Verschelde said:
Tried another game (Northgard) today, same issue. I manually enabled the performance CPU governor, but it didn't prevent the GPU crash, which happened ~10 min into the game.
I'm attaching the dmesg, journalctl and Xorg.1.log taken right after the crash (before deadlock).
Rémi Verschelde uploaded an attachment:
Worth noting: Northgard is not a Unity3D game, unlike For The King and StarCrawlers. It uses the Haxe/Heaps engine.
Attachment 143958, "dmesg output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
dmesg-northgard-kernel-5.0.7-desktop.log
Rémi Verschelde uploaded an attachment:
As can be seen in these logs, I'm running Plasma 5/KWin. Some messages from plasmashell regarding temperature/sensors are likely due to the widget I use to monitor the CPU and GPU temperatures in the taskbar.
Attachment 143959, "journalctl -b output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
journalctl-northgard-kernel-5.0.7-desktop.log
Rémi Verschelde uploaded an attachment:
This seems to only cover the startup of the computer. The rest of the log seems to be in Xorg.1.log, I guess DRI_PRIME=1 does that.
Attachment 143960, "Xorg.0.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
Xorg.0-northgard-kernel-5.0.7-desktop.log
Rémi Verschelde uploaded an attachment:
Attachment 143961, "Xorg.1.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7":
Xorg.1-northgard-kernel-5.0.7-desktop.log
Alex Behling said:
From my experience this seems to be a thermal problem. I have the exact same hardware configuration running the latest Arch Linux kernel.
$ uname -a
Linux lexnote 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux
If I leave the system with the default PM settings (profile Performance or Balance, it doesn't matter), sooner or later I get lock-ups in any game or application with higher GPU load.
EXAMPLE DMESG OUTPUT:
[Do Jul 25 23:33:45 2019] amdgpu 0000:01:00.0: GPU pci config reset
[Do Jul 25 23:33:53 2019] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[Do Jul 25 23:33:53 2019] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[Do Jul 25 23:33:53 2019] [drm] schedsdma0 is not ready, skipping
[Do Jul 25 23:33:53 2019] [drm] schedsdma1 is not ready, skipping
[Do Jul 25 23:33:59 2019] WARNING: CPU: 1 PID: 20969 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:891 dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Modules linked in: msr fuse 8021q garp mrp stp llc ccm snd_hda_codec_hdmi hid_sensor_gyro_3d hid_sensor_accel_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_incl_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub intel_ishtp_loader intel_ishtp_hid arc4 iwlmvm mousedev cdc_ether usbnet r8152 xpad ff_memless joydev mii mac80211 uvcvideo btusb videobuf2_vmalloc hid_logitech_hidpp videobuf2_memops btrtl btbcm nls_iso8859_1 videobuf2_v4l2 btintel nls_cp437 videobuf2_common bluetooth vfat fat videodev media spi_pxa2xx_platform ecdh_generic iTCO_wdt 8250_dw hid_multitouch ecc mei_hdcp iTCO_vendor_support iwlwifi intel_rapl hp_wmi x86_pkg_temp_thermal wmi_bmof intel_powerclamp intel_wmi_thunderbolt coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm psmouse input_leds snd_hda_intel cfg80211 irqbypass intel_cstate snd_hda_codec snd_hda_core intel_uncore snd_hwdep intel_rapl_perf snd_pcm
[Do Jul 25 23:33:59 2019] rtsx_pci_ms memstick snd_timer pcspkr mei_me intel_ish_ipc processor_thermal_device snd idma64 int3403_thermal ucsi_acpi i2c_i801 soundcore typec_ucsi rfkill intel_lpss_pci mei tpm_crb int340x_thermal_zone intel_pch_thermal intel_ishtp intel_soc_dts_iosf intel_lpss i2c_hid typec wmi tpm_tis tpm_tis_core tpm rng_core intel_vbtn battery sparse_keymap hp_wireless evdev mac_hid int3400_thermal acpi_thermal_rel ac pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 algif_skcipher af_alg hid_logitech_dj hid_generic usbhid hid dm_crypt crct10dif_pclmul crc32_pclmul dm_mod crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 ahci libahci aesni_intel libata aes_x86_64 crypto_simd cryptd xhci_pci glue_helper scsi_mod xhci_hcd rtsx_pci i8042 serio amdgpu amd_iommu_v2 gpu_sched ttm i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[Do Jul 25 23:33:59 2019] agpgart
[Do Jul 25 23:33:59 2019] CPU: 1 PID: 20969 Comm: kworker/1:2 Tainted: G OE 5.2.1-arch1-1-ARCH #1
[Do Jul 25 23:33:59 2019] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018
[Do Jul 25 23:33:59 2019] Workqueue: pm pm_runtime_work
[Do Jul 25 23:33:59 2019] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Code: 00 48 89 83 70 e9 00 00 e8 9f fc ff ff 48 89 df e8 97 83 00 00 48 8b bb 70 cf 00 00 be 08 00 00 00 e8 b6 9a 08 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[Do Jul 25 23:33:59 2019] RSP: 0018:ffffb286869cfcb8 EFLAGS: 00010282
[Do Jul 25 23:33:59 2019] RAX: ffffffffc0675ed0 RBX: ffffa1b9e0d30000 RCX: ffffffffc073e980
[Do Jul 25 23:33:59 2019] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] RBP: ffffa1b9e0d3e998 R08: 0000000000000001 R09: 0000000000000018
[Do Jul 25 23:33:59 2019] R10: fefefefefefefeff R11: 0000000000000000 R12: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa1b9ebc8bd80
[Do Jul 25 23:33:59 2019] FS: 0000000000000000(0000) GS:ffffa1b9eea40000(0000) knlGS:0000000000000000
[Do Jul 25 23:33:59 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 25 23:33:59 2019] CR2: 00007f1d9c02b000 CR3: 0000000469058002 CR4: 00000000003606e0
[Do Jul 25 23:33:59 2019] Call Trace:
[Do Jul 25 23:33:59 2019] amdgpu_device_ip_suspend_phase1+0x8e/0xc0 [amdgpu]
[Do Jul 25 23:33:59 2019] amdgpu_device_suspend+0x234/0x390 [amdgpu]
[Do Jul 25 23:33:59 2019] amdgpu_pmops_runtime_suspend+0x41/0xb0 [amdgpu]
[Do Jul 25 23:33:59 2019] pci_pm_runtime_suspend+0x5b/0x150
[Do Jul 25 23:33:59 2019] ? __switch_to_asm+0x40/0x70
[Do Jul 25 23:33:59 2019] vga_switcheroo_runtime_suspend+0x25/0xb0
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] __rpm_callback+0x7b/0x130
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] rpm_callback+0x2a/0x90
[Do Jul 25 23:33:59 2019] ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019] rpm_suspend+0x136/0x610
[Do Jul 25 23:33:59 2019] pm_runtime_work+0x94/0xa0
[Do Jul 25 23:33:59 2019] process_one_work+0x1d1/0x3e0
[Do Jul 25 23:33:59 2019] worker_thread+0x4a/0x3d0
[Do Jul 25 23:33:59 2019] kthread+0xfd/0x130
[Do Jul 25 23:33:59 2019] ? process_one_work+0x3e0/0x3e0
[Do Jul 25 23:33:59 2019] ? kthread_park+0x90/0x90
[Do Jul 25 23:33:59 2019] ret_from_fork+0x35/0x40
[Do Jul 25 23:33:59 2019] ---[ end trace 69a711ec632dab70 ]---
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable SCLK DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable voltage DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Failed to force to switch arbf0!
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[Do Jul 25 23:33:59 2019] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[Do Jul 25 23:34:00 2019] cp is busy, skip halt cp
[Do Jul 25 23:34:00 2019] rlc is busy, skip halt rlc
[Do Jul 25 23:34:00 2019] amdgpu 0000:01:00.0: GPU pci config reset
If I reduce the maximum CPU frequency to 2.2 GHz, keeping the temperatures of the CPU cores and the GPU just below 60 degrees Celsius, the problem no longer occurs.
$ sudo cpupower frequency-set -u 2.2GHz
Utku Helvacı (tuxutku) uploaded an attachment:
Kernel 5.3.0-rc1 was fine and had just fixed a long-standing regression on the RX 540 GPU. Updating to 5.3.0-rc2 causes the GPU to be disabled after launching a single application with it: the GPU works fine until the application is closed, then DRI_PRIME=1 no longer works.
Attachment 144926, "journalctl -b0 output on kernel 5.3.0-rc2 from ubuntu mainline repository, with a system with rx 540 gpu":
rx_540_5.3.0-rc2.txt
Utku Helvacı (tuxutku) said:
As it turns out, this is not a bug in the kernel but in AMD's ACO compiler, so it's irrelevant here.
- Developer
This issue hasn't had any activity since 2019-11-19. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.
- Mario Limonciello closed
- Jonas Costa mentioned in issue #2892