ring GFX timeout
Brief summary of the problem:
We are running render tasks on an Ampere Altra platform with an AMD Radeon PRO W6800, and the GPU hangs on some test cases.
Hardware description:
- CPU: Ampere Altra
- GPU: AMD Radeon PRO W6800
- System Memory: 512GB
- Display(s): None
- Type of Display Connection: None
System information:
- Distro name and Version: Ubuntu 20.04.1 LTS
- Kernel version: 5.10.27
- Custom kernel: merged Ampere patches, https://github.com/AmpereComputing/ampere-lts-kernel/tree/linux-5.10.y
- AMD official driver version: mesa-20.2.6, libdrm-2.4.110
How to reproduce the issue:
This bug can be reproduced by running certain Vulkan cases from the Khronos Vulkan Conformance Tests (see link below):
https://github.com/KhronosGroup/VK-GL-CTS
Steps for compiling and running the test suite can be found at
https://github.com/KhronosGroup/VK-GL-CTS/blob/main/external/vulkancts/README.md
The easiest way to reproduce it is to run the following case several times in a row (usually more than 5 runs are needed):
"dEQP-VK.binding_model.descriptorset_random.sets4.constant.ubolimitlow.sbolimithigh.sampledimglow.outimgtexlow.iublimithigh.uab.comp.noia.0"
When the bug occurs, "ring gfx_0.0.0 timeout" appears in dmesg and all render tasks running on this device die. The kernel driver then triggers a GPU reset to recover the device. This usually works, and most render tasks can be restarted normally after the reset. Detailed dmesg output can be found in the attached log files.
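The reproduction loop and dmesg check described above can be sketched roughly as follows. This is only an illustration, not the script we actually used: the binary path `./deqp-vk`, the run limit of 20, and the helper names are assumptions, and the regular expression simply matches the "ring gfx_0.0.0 timeout" text quoted above. It assumes the script is started from the directory containing the CTS binary and that the user is allowed to read dmesg.

```python
import re
import subprocess

# Assumed path to the CTS binary; adjust to your build directory.
DEQP_VK = "./deqp-vk"
CASE = ("dEQP-VK.binding_model.descriptorset_random.sets4.constant."
        "ubolimitlow.sbolimithigh.sampledimglow.outimgtexlow."
        "iublimithigh.uab.comp.noia.0")

# Matches the "ring gfx_0.0.0 timeout" message quoted in this report.
RING_TIMEOUT = re.compile(r"ring\s+gfx\S*\s+timeout")


def find_ring_timeouts(dmesg_text):
    """Return the dmesg lines that report a gfx ring timeout."""
    return [line for line in dmesg_text.splitlines()
            if RING_TIMEOUT.search(line)]


def read_dmesg():
    """Read the kernel log (usually needs root or dmesg_restrict=0)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True,
                            check=False)
    return result.stdout


def reproduce(max_runs=20):
    """Run the case repeatedly.

    Returns the run number at which a new timeout line first shows up
    in dmesg, or None if none appears within max_runs (hypothetical limit).
    """
    baseline = len(find_ring_timeouts(read_dmesg()))
    for run in range(1, max_runs + 1):
        # The test process may die during a hang, so do not raise on failure.
        subprocess.run([DEQP_VK, f"--deqp-case={CASE}"], check=False)
        if len(find_ring_timeouts(read_dmesg())) > baseline:
            return run
    return None


if __name__ == "__main__":
    hit = reproduce()
    print(f"timeout after run {hit}" if hit else "no timeout observed")
```

Comparing timeout counts before and after each run is a simplification; if the kernel log buffer wraps between runs, the count can be off, so the attached log files remain the authoritative record.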
We investigated the kernel and found that this problem might be caused by a PCIe erratum on the Ampere platform: writes from the CPU are corrupted under certain circumstances. We also found a workaround that fixes the issue by disabling write combining on the Ampere platform (see link below).
https://github.com/Tencent/TencentOS-kernel/commit/f454797b673c06c0eb1b77be20d8a475ad2fbf6f
However, this workaround causes an unacceptable performance loss, because write operations from the CPU become considerably more expensive without write combining. We wonder if there is a more graceful solution.