Freezes and crashes after a failed attempt to load Secure display TA ucode

Brief summary of the problem:

I sometimes experience a freeze followed by a graphics driver crash (black screen) after resuming from suspend (S3). Bisecting shows that it is triggered by the attempt to load the Secure display TA ucode.

Hardware description:

Lenovo ThinkPad E15 Gen 2

CPU: AMD Ryzen 7 4700U with Radeon Graphics
GPU: 04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Renoir [1002:1636] (rev c2)
System Memory: 16GB DDR4 3200 MHz
Display(s): Laptop screen 1920x1080
Type of Display Connection: eDP

System information:

Distro name and Version: Manjaro
Kernel version: Linux filip-e15 6.3.5-2-MANJARO #1 SMP PREEMPT_DYNAMIC Sun Jun 4 18:26:12 UTC 2023 x86_64 GNU/Linux
Custom kernel: N/A
AMD official driver version: N/A

How to reproduce the issue:

The issue sometimes occurs after resuming from suspend. It can be reproduced by repeatedly suspending and resuming. Usually it takes about 20 suspend cycles before the issue occurs. I used the following script to reproduce it:

repeated suspend script

#!/bin/bash
for i in {1..30}
do
	rtcwake -m mem -s 1
	sleep 4 # you can adjust this sleep as needed to give system enough time to resume properly
	        # more time is needed if you have a lot of drivers or userspace programs reacting to the suspension

	if dmesg | grep -q 'failed to load ucode' # detect the error
	then
		sleep 5
		systemctl reboot # bug was reproduced, reboot to recover
		exit
	fi
done

30 tries is enough to reproduce it pretty reliably on affected kernels.

The problem occurs regardless of what applications are running. It can be reproduced with just the Linux framebuffer console. I have also confirmed that it occurs when running GNOME Mutter (Wayland).

EDIT: A better reproduction script which only does GPU resets:

repeated GPU reset script

#!/bin/bash
for i in {1..30}; do
        echo "Resetting GPU ($i)"
        result="$(cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover)"
        if [[ "$result" != "0" ]]; then
                echo "GPU reset failed"
                exit 1
        fi
        sleep 2
done
echo "All resets succeeded"
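
The reset script above hard-codes dri/0, which may not be the amdgpu device on machines with more than one GPU. A minimal sketch for locating the right directory follows. It assumes that each /sys/kernel/debug/dri/<N>/name file begins with the driver name; the function name find_amdgpu_dri_dir is hypothetical and not part of any tool.

```bash
#!/bin/bash
# Sketch: locate the debugfs DRI directory that belongs to amdgpu instead of
# hard-coding dri/0. Assumption: each dri/<N>/name file starts with the
# driver name (for example "amdgpu dev=0000:04:00.0 ...").
find_amdgpu_dri_dir() {
        local root="${1:-/sys/kernel/debug/dri}"
        local name_file driver
        for name_file in "$root"/*/name; do
                # Skip unreadable entries (and the unexpanded glob if nothing matches).
                [[ -r "$name_file" ]] || continue
                # Take the first whitespace-separated field as the driver name.
                read -r driver _ < "$name_file"
                if [[ "$driver" == "amdgpu" ]]; then
                        dirname "$name_file"
                        return 0
                fi
        done
        return 1
}
```

Run as root, it could be used as dir="$(find_amdgpu_dri_dir)" and then "$dir/amdgpu_gpu_recover" read in place of the fixed path. Note that reading that file triggers a GPU reset, so the function itself only prints the directory.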

Attached files:

Log files (for system lockups / game freezes / crashes)

Here is an excerpt from dmesg when the error occurs:

[  629.384379] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  629.393307] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  631.564486] [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0xB)
[  631.564490] amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: Failed to initialize SECUREDISPLAY
[  631.564494] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[  631.564540] amdgpu 0000:04:00.0: amdgpu: dpm has been disabled
[  631.565576] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[  631.566179] [drm] DMUB hardware initialized: version=0x01010026
[  634.463835] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x0)
[  634.463843] [drm] Failed to enable ASSR
[  636.679691] [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x0)
[  636.708173] [drm] kiq ring mec 2 pipe 1 q 0
[  638.874481] [drm] failed to load ucode VCN0_RAM(0x3A)
[  638.874484] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[  639.058505] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec test failed (-110)
[  639.058749] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v2_0> failed -110
[  639.058971] amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
[  639.058972] amdgpu 0000:04:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xf0 returns -110
[  639.058978] amdgpu 0000:04:00.0: PM: failed to resume async: error -110
[  639.060000] OOM killer enabled.
[  639.060001] Restarting tasks ... 
[  639.060097] pci_bus 0000:01: Allocating resources
[  639.060109] pci_bus 0000:02: Allocating resources
[  639.060117] pci_bus 0000:03: Allocating resources
[  639.060628] done.
[  639.060640] random: crng reseeded on system resumption
[  639.061337] PM: suspend exit
[  639.061392] pci_bus 0000:04: Allocating resources
[  640.064601] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[  640.247735] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000010 != 0x00000000
[  640.430945] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002

Here is a full log of another session with drm.debug=0x1ff: dmesg.log
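
To tell a harmless Secure display load failure apart from one that escalates into a failed resume, a saved log can be checked for the messages from the excerpt above. This is only a sketch: the function name check_resume_log and the numeric return codes are made up for illustration, and the patterns are copied from the excerpt, so they may need adjusting for other kernel versions.

```bash
#!/bin/bash
# Sketch: report which stages of the failure signature appear in a saved
# kernel log (e.g. the output of "dmesg > dmesg.log").
# Return code is the highest stage found: 0 = none, 1 = Secure display TA
# load failed, 2 = a ucode load failed, 3 = amdgpu resume failed.
check_resume_log() {
        local log="$1"
        local status=0
        if grep -q 'SECUREDISPLAY: Failed to initialize SECUREDISPLAY' "$log"; then
                echo "secure display TA load failed"
                status=1
        fi
        if grep -q 'failed to load ucode' "$log"; then
                echo "ucode load failed"
                status=2
        fi
        if grep -q 'amdgpu_device_ip_resume failed' "$log"; then
                echo "amdgpu resume failed"
                status=3
        fi
        return "$status"
}
```

A log containing only the first message corresponds to the common case where the failed TA load has no visible effect. All three together match the freeze described in this report.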

Bisect

This is a log from bisecting: bisect.log

The bisect points to commit e42dfa66d59240afbdd8d4b47b87486db39504aa: drm/amdgpu: Add secure display TA load for Renoir. This commit added support for loading Secure display firmware for my hardware. However, as can be seen from the logs, loading this firmware fails. Most of the time this has no effect on the system, but sometimes it seems to corrupt some internal state of either the driver or the GPU, and is followed by many other errors.

Additional information

When the freeze happens, the system sometimes still "works" for a while, but is very sluggish (one frame per second or less). After a short time, it stops displaying anything at all and shows just a black screen.

The driver always reports the errors shown in the above log excerpt. Depending on what applications are doing, further errors are reported. For example, when running GNOME and Firefox, I get many "Fence fallback timer expired" errors. When trying to suspend again, I get many errors and warnings with backtraces (WARNING: CPU: 2 PID: 68 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0x46/0x70 [amdgpu]). However, these do not seem to be the root cause.

Edited 1 year ago
