etnaviv: corrupted vertex data
System information
- OS: yocto dunfell
- GPU: Vivante GC400 chipMajorFeatures: 0xa0a9c60c
- Kernel version: Linux granite2v8 5.4.90-yocto-standard #1 SMP PREEMPT Sat Dec 18 02:09:53 UTC 2021 aarch64 GNU/Linux
- Mesa version: OpenGL version string: 1.3 Mesa 21.2.5
Describe the issue
On certain Marvell 6270 chips (but not all), the GPU seems to fetch incorrect data for vertex attributes. The attached test.c illustrates the problem. The vertex shader has two attributes, into which we feed identical data, and it compares the two. If they match, it outputs green. If they don't match, it outputs red. The astute reader will no doubt observe that the test program is a highly stripped-down version of es2gears.c.
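For readers without the attachment, the comparison described above can be sketched roughly as follows. This is not the shader from test.c: the names a_first, a_second, v_color and attributes_match are made up for illustration, and the C helper only mirrors the same check on the CPU.

```c
#include <stdbool.h>

/* Illustrative GLSL ES vertex shader (hypothetical names, not copied from
 * test.c). Both attributes are fed identical data, so they should compare
 * equal on working hardware. */
static const char *vertex_shader_src =
    "attribute vec3 a_first;\n"
    "attribute vec3 a_second;\n"
    "varying vec4 v_color;\n"
    "void main(void)\n"
    "{\n"
    "    if (a_first == a_second)\n"
    "        v_color = vec4(0.0, 1.0, 0.0, 1.0); /* green: data intact */\n"
    "    else\n"
    "        v_color = vec4(1.0, 0.0, 0.0, 1.0); /* red: data corrupted */\n"
    "    gl_Position = vec4(a_first, 1.0);\n"
    "}\n";

/* CPU-side mirror of the shader's check: true means "green". */
static bool attributes_match(const float a[3], const float b[3])
{
    return a[0] == b[0] && a[1] == b[1] && a[2] == b[2];
}
```

Any red vertex therefore means the two fetches of the same source data disagreed.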
Expected behavior
Rendering on hardware should match Mesa software rendering, which produces a green square.
Actual behavior
Actual behavior varies from run to run, but is never correct. One common failure mode is seeing only one triangle with at least one red vertex. This indicates that the inputs to the vertex shader do not match and are hence incorrect. It is somewhat interesting to note that inserting padding so that the second attribute starts at +4 instead of +3 seems to improve behavior. It also seems to improve behavior to change GEAR_VERTEX_STRIDE from 6 to 8, which results in the data for each vertex being more highly aligned.
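To make the layout variations concrete, here is a small sketch of the byte offsets involved. It assumes, as in es2gears.c, that offsets and GEAR_VERTEX_STRIDE are counted in 4-byte floats and that each attribute is 3 floats wide; the struct and function names are hypothetical. The report does not say which stride the padded (+4) variant used, so only the two fully specified layouts are worked through.

```c
#include <assert.h>
#include <stddef.h>

/* Byte layout of one interleaved vertex holding two 3-float attributes.
 * Assumption: the first attribute sits at offset 0. */
struct vertex_layout {
    size_t second_attr_offset; /* byte offset of the second attribute */
    size_t stride;             /* byte distance between consecutive vertices */
};

static struct vertex_layout make_layout(size_t second_offset_floats,
                                        size_t stride_floats)
{
    struct vertex_layout l;
    /* The second attribute needs room for 3 floats inside the vertex. */
    assert(second_offset_floats + 3 <= stride_floats);
    l.second_attr_offset = second_offset_floats * sizeof(float);
    l.stride = stride_floats * sizeof(float);
    return l;
}

/* Largest power of two (capped at 64) that divides both the offset and the
 * stride, i.e. the alignment the second attribute keeps in every vertex,
 * measured from the start of the buffer. */
static size_t guaranteed_alignment(struct vertex_layout l)
{
    size_t a = 64;
    while (a > 1 && (l.second_attr_offset % a != 0 || l.stride % a != 0))
        a /= 2;
    return a;
}
```

With the original layout, make_layout(3, 6) gives a 12-byte offset and a 24-byte stride, so the second attribute is only guaranteed 4-byte alignment. With make_layout(4, 8) the offset is 16 bytes and the stride 32 bytes, so every copy of the second attribute is 16-byte aligned. Whether alignment is the actual reason the behavior improves is not established here.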
Hypothesis
There is a known hardware bug that can cause vertex data corruption:
commit e158b7497103f145a9236a70183e07c37a9e13f7
Author: Lucas Stach <dev@lynxeye.de>
Date:   Mon Nov 21 11:54:25 2016 +0100

    etnaviv: force vertex buffers through the MMU

    This fixes a vertex data corruption issue if some of the vertex streams
    go through the MMU and some don't.
Indeed, if I change the test so that no vertex buffers go through the MMU, then the vertex data is fetched correctly. (This test was performed by modifying Mesa to remove the GEM_FORCE_MMU flag and to limit vertex object sizes to 4k, which causes the kernel driver to use direct addresses instead of the MMU.)
This led to a hypothesis that more vertex attributes are active than expected, and that those extra attributes are not going through the MMU, causing the bug described above to surface. The reset values of the GC400 registers do not seem to be documented, but it's reasonable to assume that most writable registers would reset to zero. If my hypothesis is correct, however, that is not happening. To test the hypothesis, I wrote a small program that works around the issue by writing zeroes to all FE_VERTEX_ELEMENT_CONFIG registers (0x600 - 0x630).
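As an illustration of what "writing zeroes to all FE_VERTEX_ELEMENT_CONFIG registers" could look like as a GPU command stream, here is a minimal sketch that builds a single LOAD_STATE command covering the quoted range. It is not workaround.c, and the report does not say how that program submits its writes. The macro and function names are hypothetical, and the header encoding (opcode in the top bits, word count at bit 16, register address divided by 4 in the low 16 bits) is my understanding of the Vivante front-end command format, so treat it as an assumption to verify against the etnaviv headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register range quoted in the report; one 32-bit register every 4 bytes. */
#define FE_VERTEX_ELEMENT_CONFIG_FIRST 0x600u
#define FE_VERTEX_ELEMENT_CONFIG_LAST  0x630u

/* Assumed LOAD_STATE header fields (hypothetical macro names). */
#define LOAD_STATE_OP          0x08000000u
#define LOAD_STATE_COUNT(n)    (((uint32_t)(n) & 0x3ffu) << 16)
#define LOAD_STATE_OFFSET(reg) (((uint32_t)(reg) >> 2) & 0xffffu)

/* Fills `words` with one LOAD_STATE header followed by a zero for each
 * register in the range. Returns the number of 32-bit words written. */
static size_t build_zeroing_commands(uint32_t *words, size_t capacity)
{
    size_t count = (FE_VERTEX_ELEMENT_CONFIG_LAST -
                    FE_VERTEX_ELEMENT_CONFIG_FIRST) / 4 + 1;
    size_t i;

    assert(capacity >= count + 1);
    words[0] = LOAD_STATE_OP | LOAD_STATE_COUNT(count) |
               LOAD_STATE_OFFSET(FE_VERTEX_ELEMENT_CONFIG_FIRST);
    for (i = 0; i < count; i++)
        words[1 + i] = 0; /* zero every element config register */
    /* Header plus 13 values is 14 words, an even number, so no padding word
     * is added to keep the stream 64-bit aligned. */
    return count + 1;
}
```

Submitting such a buffer to the kernel driver is a separate step that this sketch deliberately leaves out.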
Experiment
After booting, I can run test.c and observe incorrect behavior. I then run workaround.c. After that, I can run test.c, or any other GL program, and all observed behavior is correct. Interestingly, I can then reboot and the bug is still not observed. It does not return until a power-off/power-on cycle.
Open Questions
My initial implementation of a workaround was inside Mesa. I modified etnaviv_emit.c to always write to all registers of FE_VERTEX_ELEMENT_CONFIG. For registers beyond ctx->vertex_element->num_elements I wrote a 0; otherwise the behavior was unchanged. This resulted in a GPU hang after the next draw command. It is unclear to me why this fails when writing all zeroes from a stand-alone program works.
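The selection logic of that Mesa change, reduced to a stand-alone sketch, looks roughly like this. It is not the actual patch to etnaviv_emit.c: the function name and parameters are hypothetical, and it only shows which value each register would receive, not how the values are emitted.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the value for every FE_VERTEX_ELEMENT_CONFIG register:
 * active elements keep their configured value, all others get 0.
 *   out          - receives `total` register values
 *   active       - configured values for the elements in use
 *   num_elements - number of elements in use (at most `total`) */
static void fill_element_configs(uint32_t *out, unsigned total,
                                 const uint32_t *active,
                                 unsigned num_elements)
{
    unsigned i;

    assert(num_elements <= total);
    for (i = 0; i < total; i++)
        out[i] = (i < num_elements) ? active[i] : 0;
}
```

For example, with two active elements and four registers, the last two entries come out as zero while the first two are passed through unchanged.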
I'm hoping someone with a better understanding of the hardware can correct me if I've got anything above wrong, and hopefully provide a cleaner fix.