[amdgpu] Crash with RX 6900XT when writing to pp_table
Brief summary of the problem:
Since upgrading from kernel 5.17.15 to kernel 5.18.5, my whole system crashes when I write to pp_table.
Sometimes the first write succeeds without a crash, but a second write never does.
Hardware description:
- CPU: AMD Ryzen 9 5900X
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf]
- System Memory: 32GB DDR4
- Display(s): 1 (2560x1440)
- Type of Display Connection: DP
System information:
- Distro name and Version: Manjaro Linux
- Kernel version: 5.18.5-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Jun 16 12:28:47 UTC 2022 x86_64 GNU/Linux
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
It happens when I modify the power-play table with UPP (Uplift Power Play), for example:
```sh
upp set --write \
smc_pptable/SocketPowerLimitAc/0=300 \
smc_pptable/SocketPowerLimitDc/0=300
```
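One way to see exactly what the UPP write changes is to copy pp_table before and after running it and compare the two copies. A minimal sketch, assuming pp_table reads back the table that was last written; `changed_bytes` is a hypothetical helper, not part of UPP:

```sh
#!/bin/sh
# Sketch: count how many bytes differ between two pp_table snapshots.
# changed_bytes is a hypothetical helper name.
# cmp -l prints one line per differing byte (1-based offset, then both
# values in octal), so the line count is the number of changed bytes.
changed_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}
```

For example: `cp /sys/class/drm/card0/device/pp_table before.bin`, run the `upp set --write` command, `cp /sys/class/drm/card0/device/pp_table after.bin`, then `changed_bytes before.bin after.bin`. Running `cmp -l before.bin after.bin` directly lists the differing offsets and values.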
It also happens when I write a table to pp_table directly:
```sh
cat navi21_pp_table > /sys/class/drm/card0/device/pp_table
```
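Before overwriting pp_table it may help to keep a copy of the table currently in use, so it can be written back later with the same `cat ... >` command. A sketch, assuming it is run once before any modification; `write_pp_table` and its arguments are hypothetical:

```sh
#!/bin/sh
# Sketch: save the current pp_table, then write a new one.
# write_pp_table NEW_TABLE BACKUP [DEV] is a hypothetical helper;
# DEV defaults to card0 as in the report.
write_pp_table() {
    new_table=$1
    backup=$2
    dev=${3:-/sys/class/drm/card0/device}

    if [ ! -f "$new_table" ]; then
        echo "missing table: $new_table" >&2
        return 1
    fi
    cp "$dev/pp_table" "$backup" || return 1   # keep the table in use
    cat "$new_table" > "$dev/pp_table"         # same write as in the report
}
```

For example, `write_pp_table navi21_pp_table pp_table.orig` saves the current table to `pp_table.orig` before writing the custom one.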
This is the complete script I am using:
```sh
#!/bin/sh
#
# GPU Power
#
echo "auto" > /sys/class/drm/card0/device/power_dpm_force_performance_level
# modify power-play table
upp set --write \
smc_pptable/SocketPowerLimitAc/0=300 \
smc_pptable/SocketPowerLimitDc/0=300
# alternatively by writing full pp table
#cat navi21.pp_table.oc > /sys/class/drm/card0/device/pp_table
# set max wattage
cat /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap_max > /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
#
# GPU Core
#
echo "s 0 2450" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 1 2550" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "vo -12" > /sys/class/drm/card0/device/pp_od_clk_voltage
#
# GPU MEM
#
echo "m 1 1075" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "3" > /sys/class/drm/card0/device/pp_dpm_mclk
```
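The same write sequence can be wrapped in a function that takes the device directory as an argument, picks the first hwmon directory instead of assuming hwmon0, and prints each write to stderr before making it, so the last printed step shows where things stop if the system stays responsive long enough. This is a sketch: `apply_oc` and `sysfs_write` are hypothetical names, and the pp_table step is left as a comment.

```sh
#!/bin/sh
# Sketch: the write sequence of the script as a function.
# apply_oc and sysfs_write are hypothetical helper names.

# sysfs_write DIR ATTR VALUE: log the write to stderr, then perform it.
sysfs_write() {
    printf '%s/%s <- %s\n' "$1" "$2" "$3" >&2
    printf '%s\n' "$3" > "$1/$2"
}

# apply_oc [DEV]: DEV defaults to card0, as in the original script.
apply_oc() {
    dev=${1:-/sys/class/drm/card0/device}

    # Use the first hwmon directory under the device instead of hwmon0.
    hwmon=
    for d in "$dev"/hwmon/hwmon*; do
        if [ -d "$d" ]; then hwmon=$d; break; fi
    done
    if [ -z "$hwmon" ]; then
        echo "no hwmon directory under $dev" >&2
        return 1
    fi

    sysfs_write "$dev" power_dpm_force_performance_level auto || return 1
    # pp_table step goes here (upp set --write ... or cat ... > pp_table).
    cap=$(cat "$hwmon/power1_cap_max") || return 1
    sysfs_write "$hwmon" power1_cap "$cap" || return 1
    sysfs_write "$dev" power_dpm_force_performance_level manual || return 1

    # Core clock (s), voltage offset (vo), memory clock (m), then commit (c).
    for cmd in "s 0 2450" "s 1 2550" "vo -12" "m 1 1075" "c"; do
        sysfs_write "$dev" pp_od_clk_voltage "$cmd" || return 1
    done
    sysfs_write "$dev" pp_dpm_mclk 3
}
```

After sourcing the file as root, `apply_oc /sys/class/drm/card0/device` performs the same writes as the script, apart from the pp_table step.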
Attached files:
Log files (for system lockups / game freezes / crashes)
```
Jun 21 00:26:51 Undertaker kernel: amdgpu 0000:2d:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
Jun 21 00:26:51 Undertaker kernel: amdgpu 0000:2d:00.0: amdgpu: Failed to enable requested dpm features!
Jun 21 00:26:51 Undertaker kernel: amdgpu 0000:2d:00.0: amdgpu: Failed to setup smc hw!
Jun 21 00:26:51 Undertaker kernel: WARNING: CPU: 15 PID: 9271 at drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:575 amdgpu_gfx_off_ctrl+0xf0/0x120 [amdgpu]
Jun 21 00:26:51 Undertaker kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device f2fs crc32_generic lz4hc_compress hid_logitech_hidpp ucsi_ccg qrtr rfkill vfat fat mousedev hid_logitech_dj uas usb_storage joydev intel_rapl_msr intel_rapl_common snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic kvm_amd ledtrig_audio snd_hda_codec_hdmi usbhid wmi_bmof kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec irqbypass crct10dif_pclmul snd_hda_core crc32_pclmul ghash_clmulni_intel snd_hwdep aesni_intel nct6687(OE) snd_pcm typec_ucsi r8169 crypto_simd typec cryptd snd_timer roles realtek sp5100_tco rapl mdio_devres snd libphy ccp tpm_crb soundcore i2c_piix4 pcspkr zenpower(OE) tpm_tis tpm_tis_core wmi tpm gpio_amdpt mac_hid acpi_cpufreq pinctrl_amd gpio_generic rng_core uinput dm_multipath dm_mod ipmi_devintf ipmi_msghandler fuse crypto_user bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 nvme xhci_pci crc32c_intel nvme_core xhci_pci_renesas amdgpu
Jun 21 00:26:51 Undertaker kernel: drm_ttm_helper ttm gpu_sched drm_dp_helper
Jun 21 00:26:51 Undertaker kernel: CPU: 15 PID: 9271 Comm: overclock.sh Tainted: G OE 5.18.5-1-MANJARO #1 f3ee4d7d020bb789451220f23826e89d72fc8675
Jun 21 00:26:51 Undertaker kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C91/MAG B550 TOMAHAWK (MS-7C91), BIOS A.94 03/10/2022
Jun 21 00:26:51 Undertaker kernel: RIP: 0010:amdgpu_gfx_off_ctrl+0xf0/0x120 [amdgpu]
Jun 21 00:26:51 Undertaker kernel: Code: 66 14 00 85 c0 75 e5 48 8b 83 50 b6 00 00 c6 83 60 b6 00 00 00 48 8b 40 30 48 85 c0 74 ce 66 90 48 89 df e8 d2 1c 65 ec eb c2 <0f> 0b eb 84 48 8b 33 48 c7 c2 60 02 a6 c0 48 c7 c7 b8 16 c1 c0 e8
Jun 21 00:26:51 Undertaker kernel: RSP: 0018:ffffa23006e8bc80 EFLAGS: 00010246
Jun 21 00:26:51 Undertaker kernel: RAX: 0000000000000000 RBX: ffff8f4bd1880000 RCX: 00000000000a0300
Jun 21 00:26:51 Undertaker kernel: RDX: ffff8f4c4c18a080 RSI: 0000000000000001 RDI: ffff8f4bd188b668
Jun 21 00:26:51 Undertaker kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: ffff8f4bc5b48c38
Jun 21 00:26:51 Undertaker kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f4bd188b668
Jun 21 00:26:51 Undertaker kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000006
Jun 21 00:26:51 Undertaker kernel: FS: 00007fe718e52c40(0000) GS:ffff8f52bedc0000(0000) knlGS:0000000000000000
Jun 21 00:26:51 Undertaker kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 21 00:26:51 Undertaker kernel: CR2: 00007fe718dfe750 CR3: 00000001ad8e8000 CR4: 0000000000750ee0
Jun 21 00:26:51 Undertaker kernel: PKRU: 55555554
Jun 21 00:26:51 Undertaker kernel: Call Trace:
Jun 21 00:26:51 Undertaker kernel: <TASK>
Jun 21 00:26:51 Undertaker kernel: gfx_v10_0_set_powergating_state+0x57/0x200 [amdgpu 8b74f50f5850a2aa38c3b307209cec0a6b401159]
Jun 21 00:26:51 Undertaker kernel: amdgpu_device_ip_set_powergating_state+0x5f/0xc0 [amdgpu 8b74f50f5850a2aa38c3b307209cec0a6b401159]
Jun 21 00:26:51 Undertaker kernel: amdgpu_dpm_force_performance_level+0x101/0x1b0 [amdgpu 8b74f50f5850a2aa38c3b307209cec0a6b401159]
Jun 21 00:26:51 Undertaker kernel: amdgpu_set_power_dpm_force_performance_level+0x97/0x2a0 [amdgpu 8b74f50f5850a2aa38c3b307209cec0a6b401159]
Jun 21 00:26:51 Undertaker kernel: kernfs_fop_write_iter+0x11f/0x1f0
Jun 21 00:26:51 Undertaker kernel: new_sync_write+0x13d/0x1c0
Jun 21 00:26:51 Undertaker kernel: vfs_write+0x1ec/0x270
Jun 21 00:26:51 Undertaker kernel: ksys_write+0x6f/0xf0
Jun 21 00:26:51 Undertaker kernel: do_syscall_64+0x5f/0x90
Jun 21 00:26:51 Undertaker kernel: ? syscall_exit_to_user_mode+0x26/0x50
Jun 21 00:26:51 Undertaker kernel: ? __x64_sys_close+0x11/0x40
Jun 21 00:26:51 Undertaker kernel: ? do_syscall_64+0x6b/0x90
Jun 21 00:26:51 Undertaker kernel: ? exc_page_fault+0x74/0x170
Jun 21 00:26:51 Undertaker kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Jun 21 00:26:51 Undertaker kernel: RIP: 0033:0x7fe718d01c27
Jun 21 00:26:51 Undertaker kernel: Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Jun 21 00:26:51 Undertaker kernel: RSP: 002b:00007ffe27c1bb78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jun 21 00:26:51 Undertaker kernel: RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007fe718d01c27
Jun 21 00:26:51 Undertaker kernel: RDX: 0000000000000007 RSI: 000056088ea3ab60 RDI: 0000000000000001
Jun 21 00:26:51 Undertaker kernel: RBP: 000056088ea3ab60 R08: 000056088ea3aa80 R09: 0000000000000073
Jun 21 00:26:51 Undertaker kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jun 21 00:26:51 Undertaker kernel: R13: 00007fe718dfe6c0 R14: 0000000000000007 R15: 00007fe718df9940
Jun 21 00:26:51 Undertaker kernel: </TASK>
Jun 21 00:26:51 Undertaker kernel: ---[ end trace 0000000000000000 ]---
Jun 21 00:26:52 Undertaker kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 21 00:26:52 Undertaker kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Jun 21 00:26:52 Undertaker kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Jun 21 00:26:56 Undertaker kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=60034, emitted seq=60036
Jun 21 00:26:56 Undertaker kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1325 thread kwin_wayla:cs0 pid 1339
```
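When collecting logs like the one above, a rough filter can cut a full kernel log down to the SMU messages, driver errors and WARNING headers. A sketch only: `amdgpu_errors` is a hypothetical helper, and the pattern is a guess at which lines matter, based on the messages in this log:

```sh
#!/bin/sh
# Sketch: keep only likely-relevant lines from a kernel log on stdin.
# amdgpu_errors is a hypothetical helper; adjust the pattern as needed.
# Matches SMU messages, "Failed to ..." driver errors, DRM *ERROR* lines
# and kernel WARNING headers.
amdgpu_errors() {
    grep -E 'SMU:|Failed to|\*ERROR\*|WARNING:'
}
```

For example, `journalctl -k -b -1 | amdgpu_errors` prints the matching kernel messages from the previous boot; this requires the journal to be stored persistently.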
Edited by Christopher Hubmann