[bisected] amdgpu graphics acceleration causing system crashes on 22f3bcfb or later
I tried running the latest main branch on my machine and hit serious problems. Anything that uses graphics acceleration, from running glxgears to starting sddm, triggers the crash.
I first noticed it when sddm started at boot: the display rendered garbage for a moment, then went black and did not come back. The kernel log shows the GPU entering a reset loop:
Started bpfilter
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2, emitted seq=4
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1062 thread Xorg:cs0 pid 1189
amdgpu 0000:65:00.0: amdgpu: GPU reset begin!
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
------------[ cut here ]------------
sched: CPU 1 need_resched set for > 100999520 ns (101 ticks) without schedule
WARNING: CPU: 1 PID: 578 at kernel/sched/debug.c:1086 resched_latency_warn+0x50/0x60
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter bpfilter overlay snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof snd_sof_utils snd_sof_xtensa_dsp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core snd_hda_scodec_cs35l41_spi snd_compress regmap_spi ac97_bus snd_hda_scodec_cs35l41_i2c uvcvideo snd_pci_ps snd_hda_intel snd_hda_scodec_cs35l41 snd_rpl_pci_acp6x snd_acp_pci videobuf2_vmalloc snd_intel_dspcfg snd_hda_cs_dsp_ctls uvc btusb snd_pci_acp6x videobuf2_memops hid_multitouch cs_dsp iwlmvm snd_hda_codec btrtl snd_pci_acp5x snd_soc_cs35l41_lib videobuf2_v4l2 snd_rn_pci_acp3x btbcm intel_rapl_msr btintel snd_hwdep intel_rapl_common mac80211 videodev snd_hda_core regmap_i2c bluetooth iwlwifi snd_acp_config i2c_hid_acpi snd_pcm videobuf2_common i2c_hid serial_multi_instantiate snd_soc_acpi ecdh_generic mc asus_nb_wmi kvm_amd wmi_bmof
nvidia_wmi_ec_backlight snd_timer nvidia_drm(PO) ecc k10temp i2c_piix4 snd_pci_acp3x ucsi_acpi cfg80211 snd joydev nvidia_modeset(PO) rtsx_pci typec_ucsi soundcore typec thermal ac input_leds tpm_crb tpm_tis tpm_tis_core i2c_designware_platform amd_pmc i2c_designware_core usbip_host usbip_core nvidia_uvm(PO) nvidia(PO) crypto_user acpi_call(O) fuse bpf_preload ip_tables btrfs blake2b_generic libcrc32c xor raid6_pq dm_crypt trusted asn1_encoder tee tpm hid_asus asus_wmi ledtrig_audio sparse_keymap platform_profile rfkill usbhid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_buddy drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt crct10dif_pclmul atkbd crc32_pclmul polyval_clmulni libps2 polyval_generic vivaldi_fmap serio_raw ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd drm cryptd cec ccp battery i8042 video wmi nvme nvme_core nvme_common
CPU: 1 PID: 578 Comm: kworker/u32:5 Tainted: P O 6.3.9-zen-1-zen-g0a0760279f2e #1 6fc2b2e7f59010a63bcc724fde5809f73c8f5390
Hardware name: ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA402XZ_GA402XZ/GA402XZ, BIOS GA402XZ.310 06/05/2023
Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
RIP: 0010:resched_latency_warn+0x50/0x60
Code: 48 63 d5 48 c7 c0 c0 15 03 00 89 ee 48 c7 c7 98 85 55 9e 48 8b 14 d5 c0 68 66 9e 8b 8c 02 70 04 00 00 48 89 da e8 90 d8 fb ff <0f> 0b 5b 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48
RSP: 0018:ffffa9b6801e4ea0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000006052160 RCX: 0000000000000027
RDX: ffff919d9e85f488 RSI: 0000000000000001 RDI: ffff919d9e85f480
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000003 R11: ffffffff9f03a4c8 R12: 0000000000000001
R13: 00000000000f41ec R14: ffff919d9e862320 R15: ffff919d9e862300
FS: 0000000000000000(0000) GS:ffff919d9e840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5f300010b8 CR3: 00000002bee43000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
<IRQ>
? resched_latency_warn+0x50/0x60
? __warn+0x7d/0x130
? resched_latency_warn+0x50/0x60
? report_bug+0x169/0x1a0
? prb_read_valid+0x17/0x20
? handle_bug+0x36/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? resched_latency_warn+0x50/0x60
scheduler_tick+0x315/0x340
update_process_times+0x7f/0x90
tick_sched_handle+0x22/0x60
tick_sched_timer+0x5f/0x70
? tick_sched_do_timer+0x90/0x90
__hrtimer_run_queues+0x108/0x2a0
hrtimer_interrupt+0xf3/0x220
__sysvec_apic_timer_interrupt+0x4e/0x120
sysvec_apic_timer_interrupt+0x69/0x90
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x16/0x20
RIP: 0010:delay_halt_mwaitx+0x39/0x40
Code: 48 89 d1 65 48 03 05 ae e2 33 62 0f 01 fa b8 ff ff ff ff b9 02 00 00 00 48 39 c6 48 0f 46 c6 48 89 c3 b8 f0 00 00 00 0f 01 fb <5b> c3 0f 1f 44 00 00 0f 1f 44 00 00 ff 25 fd 12 99 00 0f 1f 44 00
RSP: 0018:ffffa9b681627a80 EFLAGS: 00000293
RAX: 00000000000000f0 RBX: 0000000000004dfd RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000004dfd RDI: 00000020bd453818
RBP: 00000020bd453818 R08: ffffffffc0b8e5e0 R09: 0000000000000001
R10: 0000000000000003 R11: ffffffff9f03a4c8 R12: 000000000000001e
R13: 000000000000001c R14: ffff9196867f8388 R15: ffff9196867f82e0
delay_halt+0x35/0x50
amdgpu_fence_wait_polling+0x27/0x60 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
mes_v11_0_submit_pkt_and_poll_completion.constprop.0+0x156/0x200 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
mes_v11_0_unmap_legacy_queue+0x7b/0xc0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_mes_unmap_legacy_queue+0x60/0xa0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_gfx_disable_kcq+0x98/0xf0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
gfx_v11_0_hw_fini+0x46/0x150 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_device_ip_suspend_phase2+0xfa/0x190 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
? amdgpu_device_ip_suspend_phase1+0x68/0xd0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_device_ip_suspend+0x2e/0x60 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_device_pre_asic_reset+0xcf/0x290 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_device_gpu_recover+0x484/0xcf0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
amdgpu_job_timedout+0x147/0x1f0 [amdgpu c49cefa0d37a6bab00a61d743e1dba99cfb38aa3]
drm_sched_job_timedout+0x6c/0xf0 [gpu_sched b825f5fdbd96e4b89e34829b0e07cfaaac579270]
process_one_work+0x1c0/0x3d0
worker_thread+0x4e/0x390
? process_one_work+0x3d0/0x3d0
kthread+0xd7/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
---[ end trace 0000000000000000 ]---
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
amdgpu 0000:65:00.0: amdgpu: MODE2 reset
amdgpu 0000:65:00.0: amdgpu: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
amdgpu 0000:65:00.0: amdgpu: SMU is resuming...
amdgpu 0000:65:00.0: amdgpu: SMU is resumed successfully!
[drm] DMUB hardware initialized: version=0x08000E00
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:264
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:272
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:280
[drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:288
[drm] kiq ring mec 3 pipe 1 q 0
[drm] VCN decode and encode initialized successfully(under DPG Mode).
amdgpu 0000:65:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
amdgpu 0000:65:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
amdgpu 0000:65:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 1
amdgpu 0000:65:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 1
amdgpu 0000:65:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
amdgpu 0000:65:00.0: amdgpu: recover vram bo from shadow start
amdgpu 0000:65:00.0: amdgpu: recover vram bo from shadow done
[drm] ring gfx_32770.1.1 was added
[drm] Skip scheduling IBs!
[drm] ring compute_32770.2.2 was added
[drm] ring sdma_32770.3.3 was added
[drm] ring gfx_32770.1.1 test pass
[drm] ring gfx_32770.1.1 ib test pass
[drm] ring compute_32770.2.2 test pass
[drm] ring compute_32770.2.2 ib test pass
[drm] ring sdma_32770.3.3 test pass
[drm] ring sdma_32770.3.3 ib test pass
amdgpu 0000:65:00.0: amdgpu: GPU reset(2) succeeded!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6, emitted seq=7
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1062 thread Xorg:cs0 pid 1189
amdgpu 0000:65:00.0: amdgpu: GPU reset begin!
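For anyone trying to confirm the same reset loop in their own log, here is a minimal sketch that pulls the ring timeouts and reset attempts out of saved kernel log text. The helper names (`find_ring_timeouts`, `count_reset_attempts`) are made up for illustration, and the regex only assumes the message format shown in the log above; other kernel versions may word these lines differently.

```python
import re
from typing import List, NamedTuple


class RingTimeout(NamedTuple):
    ring: str
    signaled: int
    emitted: int


# Matches lines such as:
#   ... *ERROR* ring gfx_0.0.0 timeout, signaled seq=2, emitted seq=4
_TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def find_ring_timeouts(log_text: str) -> List[RingTimeout]:
    """Return every ring timeout reported in the log, in order of appearance."""
    return [
        RingTimeout(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _TIMEOUT_RE.finditer(log_text)
    ]


def count_reset_attempts(log_text: str) -> int:
    """Count how many times the driver started a GPU reset."""
    return log_text.count("GPU reset begin!")


if __name__ == "__main__":
    import sys

    text = sys.stdin.read()
    for t in find_ring_timeouts(text):
        print(f"{t.ring}: signaled={t.signaled} emitted={t.emitted}")
    print(f"reset attempts: {count_reset_attempts(text)}")
```

Saved as a script (the file name is up to you), it can be fed a log on stdin, for example the output of `journalctl -k`. More than one reset attempt for the same ring, as in the log above, suggests the GPU is hanging again right after each recovery.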
I looked through the Mesa issue tracker and thought #9249 (closed) might be related, but my bisection lands on a different commit:
# bad: [07597596588973cea5bfe064ecc4017dd24357be] rusticl/program: skip linking compiled binaries
# good: [12b123fdb7205b46a4d6e668cb4630ebedc2c381] radeonsi: handle VGT_LS_HS_CONFIG like a tracker register
git bisect start '07597596588' '12b123fd'
# good: [8b888ead2f738fa24ccb0cb534a932eb45d67484] dzn: Remove dynamic check for block-compressed support
git bisect good 8b888ead2f738fa24ccb0cb534a932eb45d67484
# bad: [98c8d7b7cfbe7dc66a87bbe8fda56d855053d7cd] venus: Fix detection of push descriptor set
git bisect bad 98c8d7b7cfbe7dc66a87bbe8fda56d855053d7cd
# bad: [a6e6646d918a1110211cebfb634db0bccc69d40e] radeonsi: reorder compute code to prepare for packed SET_SH_REG packets
git bisect bad a6e6646d918a1110211cebfb634db0bccc69d40e
# good: [f9a4b8e6401a875db7886ad8baeefdd9d1461b21] docs/ci: fix command to disable/re-enable farms
git bisect good f9a4b8e6401a875db7886ad8baeefdd9d1461b21
# bad: [22f3bcfb5a3311a2c61ad26c943976e66b68b09c] radeonsi/gfx11: use SET_*_REG_PAIRS_PACKED packets for pm4 states
git bisect bad 22f3bcfb5a3311a2c61ad26c943976e66b68b09c
# good: [b4e2073f041174a4dd4de141823d7950ffb78819] zink/ci: remove 3 tests from the fails list
git bisect good b4e2073f041174a4dd4de141823d7950ffb78819
# good: [5632d8d1a777d39c7882dcb011aab4619bcff01a] radeonsi: replace tcs_out_lds_layout with nearly identical tes_offchip_addr
git bisect good 5632d8d1a777d39c7882dcb011aab4619bcff01a
# good: [1aa99437d3784cb1193120d8e069bd168ba9e749] radeonsi: eliminate redundant TCS user data and RSRC2 register changes
git bisect good 1aa99437d3784cb1193120d8e069bd168ba9e749
# first bad commit: [22f3bcfb5a3311a2c61ad26c943976e66b68b09c] radeonsi/gfx11: use SET_*_REG_PAIRS_PACKED packets for pm4 states
The hardware is a 2023 ASUS ROG Zephyrus G14 (GA402XZ) with a GPU that vulkaninfo/glxinfo report as "GFX1103_R1" (Windows calls it the "AMD Radeon 780M").