Sometimes the discrete graphics card (Navi 14 - 5500M) stops working
Brief summary of the problem:
When the problem occurs while a game is running, the game slows to a crawl, and after that the discrete graphics card cannot be used until the next reboot. The card also no longer shows up in the output of DRI_PRIME=1 glxinfo or vulkaninfo.
Hardware description:
- CPU: AMD Ryzen 7 5800H
- GPU (integrated): 07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [1002:1638] (rev c5)
- GPU (discrete): 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] [1002:7340] (rev c1)
- System Memory: 16 GB
- Display(s): 1x 1920x1080 @ 144 Hz
- Type of Display Connection: eDP-1
System information:
- Distribution: Arch Linux
- Kernel version: 5.15.6-arch2-1
- Custom kernel: the problem also happens (maybe more often?) with the zen kernel, 5.15.6-zen2-1-zen
- Driver version: Mesa 21.3.0
- Desktop environment: KDE Plasma 5.23.4, Wayland session (I will try to run Xorg for a while to see if the problem happens there too)
How to reproduce the issue:
Unfortunately, I don't know how to reproduce it reliably.
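Since the warning in the log is raised from the runtime suspend path (pm_runtime_work, amdgpu_pmops_runtime_suspend), recording the runtime power management state of the discrete card when the problem shows up might help correlate the two. The following is only a minimal sketch: it assumes the standard PCI sysfs attributes power/control and power/runtime_status, uses the address 0000:03:00.0 taken from the log, and the function name runtime_pm_status is made up for illustration.

```python
from pathlib import Path


def runtime_pm_status(pci_addr="0000:03:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Read the runtime PM attributes of one PCI device from sysfs.

    Returns a dict with the keys "control" and "runtime_status";
    a value is None when the attribute cannot be read.
    """
    power_dir = Path(sysfs_root) / pci_addr / "power"
    result = {}
    for name in ("control", "runtime_status"):
        try:
            result[name] = (power_dir / name).read_text().strip()
        except OSError:
            # Device absent, attribute missing, or not running on Linux.
            result[name] = None
    return result


if __name__ == "__main__":
    # 0000:03:00.0 is the discrete GPU address that appears in the dmesg log.
    print(runtime_pm_status())
```

Running it once while the card works and once after the failure would show whether the reported state differs between the two cases.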
Dmesg log:
[sab dic 4 13:46:44 2021] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[sab dic 4 13:46:44 2021] [drm] PSP is resuming...
[sab dic 4 13:46:44 2021] [drm] reserve 0x900000 from 0x800f400000 for PSP TMR
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[sab dic 4 13:46:44 2021] [drm] kiq ring mec 2 pipe 1 q 0
[sab dic 4 13:46:44 2021] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[sab dic 4 13:46:44 2021] [drm] JPEG decode initialized successfully.
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[sab dic 4 13:46:44 2021] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[sab dic 4 13:46:53 2021] ------------[ cut here ]------------
[sab dic 4 13:46:53 2021] WARNING: CPU: 2 PID: 448 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:479 amdgpu_bo_move+0x466/0x7e0 [amdgpu]
[sab dic 4 13:46:53 2021] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device intel_rapl_msr intel_rapl_common hid_sensor_accel_3d hid_sensor_als hid_sensor_gyro_3d hid_sensor_magn_3d hid_sensor_prox hid_sensor_trigger industrialio_triggered_buffer snd_acp3x_rn snd_soc_dmic snd_acp3x_pdm_dma kfifo_buf snd_hda_codec_realtek hid_sensor_iio_common edac_mce_amd industrialio snd_soc_core iwlmvm snd_hda_codec_generic snd_compress ledtrig_audio kvm_amd snd_hda_codec_hdmi ac97_bus mousedev hid_multitouch joydev uvcvideo snd_hda_intel mac80211 kvm snd_pcm_dmaengine msi_wmi videobuf2_vmalloc sparse_keymap snd_intel_dspcfg snd_intel_sdw_acpi wmi_bmof libarc4 irqbypass videobuf2_memops hid_sensor_hub snd_hda_codec btusb amdgpu lzo_rle crct10dif_pclmul videobuf2_v4l2 btrtl crc32_pclmul snd_hda_core r8169 iwlwifi ghash_clmulni_intel btbcm snd_hwdep videobuf2_common aesni_intel btintel snd_pcm crypto_simd realtek bluetooth cryptd snd_timer gpu_sched snd_pci_acp5x videodev sp5100_tco mdio_devres
[sab dic 4 13:46:53 2021] vfat snd drm_ttm_helper rapl snd_rn_pci_acp3x ecdh_generic fat psmouse pcspkr cfg80211 mc usbhid k10temp i2c_piix4 crc16 amd_sfh snd_pci_acp3x libphy soundcore ttm ccp rfkill tpm_crb mac_hid wmi video tpm_tis i2c_hid_acpi tpm_tis_core i2c_hid acpi_cpufreq tpm soc_button_array rng_core pinctrl_amd amd_pmc ipmi_devintf ipmi_msghandler crypto_user fuse zram bpf_preload ip_tables x_tables btrfs blake2b_generic serio_raw libcrc32c atkbd crc32c_generic libps2 xor raid6_pq i8042 xhci_pci crc32c_intel xhci_pci_renesas serio
[sab dic 4 13:46:53 2021] CPU: 2 PID: 448 Comm: kworker/2:2 Not tainted 5.15.6-arch2-1 #1 cfba5f24b926d50e4fcc5026b2bafd12217f3134
[sab dic 4 13:46:53 2021] Hardware name: Micro-Star International Co., Ltd. Bravo 15 B5DD/MS-158K, BIOS E158KAMS.105 05/20/2021
[sab dic 4 13:46:53 2021] Workqueue: pm pm_runtime_work
[sab dic 4 13:46:53 2021] RIP: 0010:amdgpu_bo_move+0x466/0x7e0 [amdgpu]
[sab dic 4 13:46:53 2021] Code: 89 ef e8 ad 69 8a ff 41 89 c0 85 c0 0f 85 4a fe ff ff 48 8b b5 70 01 00 00 48 8b bd 48 01 00 00 e8 1f cc ff ff e9 ea fd ff ff <0f> 0b 41 b8 ea ff ff ff e9 25 fe ff ff 83 f8 02 0f 85 6e fc ff ff
[sab dic 4 13:46:53 2021] RSP: 0018:ffffb86a014fbb30 EFLAGS: 00010202
[sab dic 4 13:46:53 2021] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff979d48cfeac0
[sab dic 4 13:46:53 2021] RDX: ffffb86a014fbcd0 RSI: 0000000000000001 RDI: 0000000000000002
[sab dic 4 13:46:53 2021] RBP: ffff979d59614458 R08: ffffb86a014fbc38 R09: 0000000000000000
[sab dic 4 13:46:53 2021] R10: ffff979d25f96f08 R11: 0000000000000000 R12: ffff979d48cfeac0
[sab dic 4 13:46:53 2021] R13: ffffb86a014fbcd0 R14: ffffb86a014fbc38 R15: ffff979d23ea5270
[sab dic 4 13:46:53 2021] FS: 0000000000000000(0000) GS:ffff97a01e680000(0000) knlGS:0000000000000000
[sab dic 4 13:46:53 2021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[sab dic 4 13:46:53 2021] CR2: 000013a600e13000 CR3: 00000001beb6e000 CR4: 0000000000750ee0
[sab dic 4 13:46:53 2021] PKRU: 55555554
[sab dic 4 13:46:53 2021] Call Trace:
[sab dic 4 13:46:53 2021] <TASK>
[sab dic 4 13:46:53 2021] ? kmem_cache_alloc_trace+0x190/0x310
[sab dic 4 13:46:53 2021] ? unmap_mapping_pages+0xa2/0x130
[sab dic 4 13:46:53 2021] ttm_bo_handle_move_mem+0x8d/0x190 [ttm 37b22b071aea884b59976171a78824a07a4e64ff]
[sab dic 4 13:46:53 2021] ttm_mem_evict_first+0x276/0x460 [ttm 37b22b071aea884b59976171a78824a07a4e64ff]
[sab dic 4 13:46:53 2021] ? kfree+0x384/0x400
[sab dic 4 13:46:53 2021] ttm_resource_manager_evict_all+0xa2/0x1d0 [ttm 37b22b071aea884b59976171a78824a07a4e64ff]
[sab dic 4 13:46:53 2021] amdgpu_device_suspend+0x73/0xc0 [amdgpu 51b0b14928c80512d86d3896dd88e580472f265b]
[sab dic 4 13:46:53 2021] amdgpu_pmops_runtime_suspend+0x99/0x140 [amdgpu 51b0b14928c80512d86d3896dd88e580472f265b]
[sab dic 4 13:46:53 2021] pci_pm_runtime_suspend+0x5e/0x180
[sab dic 4 13:46:53 2021] ? dequeue_entity+0xc6/0x470
[sab dic 4 13:46:53 2021] ? pci_dev_put+0x20/0x20
[sab dic 4 13:46:53 2021] __rpm_callback+0x44/0x120
[sab dic 4 13:46:53 2021] ? pci_dev_put+0x20/0x20
[sab dic 4 13:46:53 2021] rpm_callback+0x5f/0x70
[sab dic 4 13:46:53 2021] ? pci_dev_put+0x20/0x20
[sab dic 4 13:46:53 2021] rpm_suspend+0x177/0x750
[sab dic 4 13:46:53 2021] ? __schedule+0x339/0x1540
[sab dic 4 13:46:53 2021] pm_runtime_work+0x94/0xa0
[sab dic 4 13:46:53 2021] process_one_work+0x1e8/0x3c0
[sab dic 4 13:46:53 2021] worker_thread+0x50/0x3c0
[sab dic 4 13:46:53 2021] ? process_one_work+0x3c0/0x3c0
[sab dic 4 13:46:53 2021] kthread+0x132/0x160
[sab dic 4 13:46:53 2021] ? set_kthread_struct+0x50/0x50
[sab dic 4 13:46:53 2021] ret_from_fork+0x22/0x30
[sab dic 4 13:46:53 2021] </TASK>
[sab dic 4 13:46:53 2021] ---[ end trace 3980420df5973d8c ]---
[sab dic 4 13:46:53 2021] [drm] free PSP TMR buffer
[sab dic 4 13:48:42 2021] audit: type=1334 audit(1638622122.533:96): prog-id=31 op=LOAD
[sab dic 4 13:48:42 2021] audit: type=1334 audit(1638622122.533:97): prog-id=32 op=LOAD
[sab dic 4 13:48:42 2021] audit: type=1334 audit(1638622122.536:98): prog-id=33 op=LOAD
[sab dic 4 13:48:42 2021] audit: type=1130 audit(1638622122.576:99): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[sab dic 4 13:48:49 2021] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[sab dic 4 13:48:49 2021] [drm] PSP is resuming...
[sab dic 4 13:48:52 2021] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[sab dic 4 13:48:52 2021] [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[sab dic 4 13:48:52 2021] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
[sab dic 4 13:48:52 2021] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
[sab dic 4 13:48:52 2021] amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:55 param:0x00000000 message:ReenableAcDcInterrupt?
[sab dic 4 13:48:52 2021] amdgpu 0000:03:00.0: amdgpu: Ack AC/DC interrupt Failed!
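To pick the relevant lines out of a longer dmesg capture, a small filter along these lines could be used. It is a sketch under assumptions: the marker strings are chosen by hand from the messages in the log above, not from any official list, and the function name suspicious_lines is hypothetical.

```python
# Substrings that mark the problematic lines in the log above.
# This list is an assumption based on this one log, not an exhaustive set.
MARKERS = ("*ERROR*", "WARNING:", "failed", "Failed")


def suspicious_lines(dmesg_text):
    """Return the lines of a dmesg dump that contain any of the markers."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in MARKERS)
    ]


if __name__ == "__main__":
    import sys

    # Example: dmesg -T | python filter_dmesg.py
    for line in suspicious_lines(sys.stdin.read()):
        print(line)
```

Applied to the log above, this keeps the amdgpu_bo_move warning, the PSP resume errors, and the two "failed" messages that follow them, and drops the ring setup and audit lines.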