radv: Horizon Zero Dawn page faults on Navi 22
Brief summary of the problem:
Horizon Zero Dawn frequently causes irrecoverable page faults. Page faults happen in other games too, but seemingly more rarely; in those games, other errors are more common.
Hardware description:
- CPU: AMD Ryzen 9 5980HX with Radeon Graphics (08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1638] (rev c7))
- GPU: 03:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] [1002:73df] (rev c3)
- System Memory: 64GB DDR4 JEDEC 3200MHz (Crucial CT2K32G4SFD832A)
- Display(s): ChiMei InnoLux 0x1540 res: 2560x1440
- Type of Display Connection: eDP
System information:
- Distro name and Version: Arch Linux
- Kernel version: Linux Hostname 6.4.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 06 Jul 2023 18:35:54 +0000 x86_64 GNU/Linux
- Custom kernel: N/A
- AMD official driver version: N/A
How to reproduce the issue:
I have no reliable/consistent reproducer.
The problem is seemingly random: sometimes a few minutes of play are fine, sometimes a few hours, but the fault usually happens eventually and the game freezes. Sometimes the whole system freezes irrecoverably with it, sometimes the system remains interactive for a while and then freezes, and sometimes it has appeared to remain interactive and usable until I rebooted. I don't wait long, though, as the system is in an utterly broken state by then. A graceful reboot did not work the times I tried it, so now I always do a hard reboot.
The game process remains hung, and if you try to kill it, amdgpu logs further errors and the process remains.
Attached files:
Log files (for system lockups / game freezes / crashes)
This log includes an attempt to manually kill -9 the HorizonZeroDawn process, which caused additional errors and still left the process running.
Note: I'm using cpufreq.off=1 to attempt to work around other issues in playing Horizon Zero Dawn based on this comment, and AMD_DEBUG=nodcc in /etc/environment in the (unsuccessful) hope of preventing this issue based on this comment on another similar issue, but this issue has also occurred without those set.
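For anyone comparing setups, here is a minimal sketch of how the two settings mentioned above could be checked on a running system. It assumes the kernel parameter shows up in /proc/cmdline and that AMD_DEBUG is set on a line of its own in /etc/environment; has_word is a hypothetical helper written for this example, not part of any tool.

```shell
#!/bin/sh
# Sketch only: report whether the two workarounds from this report are active.

# has_word WORD STRING: succeed if WORD is one of the
# whitespace-separated words in STRING (hypothetical helper).
has_word() {
    for w in $2; do
        [ "$w" = "$1" ] && return 0
    done
    return 1
}

# Kernel parameter: look for an exact match on the booted command line.
if [ -r /proc/cmdline ] && has_word "cpufreq.off=1" "$(cat /proc/cmdline)"; then
    echo "cpufreq.off=1 is on the kernel command line"
else
    echo "cpufreq.off=1 is NOT on the kernel command line"
fi

# Environment variable: assumes a line beginning with AMD_DEBUG= that
# contains nodcc.
if [ -r /etc/environment ] && grep -q '^AMD_DEBUG=.*nodcc' /etc/environment; then
    echo "AMD_DEBUG=nodcc is set in /etc/environment"
else
    echo "AMD_DEBUG=nodcc is NOT set in /etc/environment"
fi
```

This only reads the two files and prints what it finds; it does not change any setting.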