x1e: CP_SMMU_TABLE_UPDATE related faults
Not sure if this is x1e specific or a7xx specific. With glmark2 I can eventually trigger iova faults that seem to be related to CP_SMMU_TABLE_UPDATE not being properly synchronized. See the attached devcore, in particular:
kernel: 6.11.0-rc5-next-20240830+
module: msm
time: 1725917575.894242113
comm: glmark2
cmdline: glmark2 --run-forever
gpu-initialized: 1
revision: 0 (67.5.12.1)
Got chip_id=0x43050c01
fault-info:
- ttbr0=00000008dec82000
- iova=0000000100c09000
- dir=WRITE
- type=UNKNOWN
- source=CP
pgtable-fault-info:
- ttbr0: 00000008ab4e2000
- asid: 0
- ptes: 00000008da4c1003 00000008da4c2003 00000008da4c8003 000000094f28bf47
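
The two ttbr0 values in this excerpt differ, and that can be checked mechanically. A minimal sketch, assuming the section headers and field layout look exactly like the excerpt above (the helper name `parse_ttbr0` is made up for illustration and is not part of any tool):

```python
import re

# Lines copied from the devcore excerpt above; only the ttbr0 fields matter here.
EXCERPT = """\
fault-info:
- ttbr0=00000008dec82000
- iova=0000000100c09000
pgtable-fault-info:
- ttbr0: 00000008ab4e2000
- asid: 0
"""


def parse_ttbr0(devcore_text):
    """Return {section_name: ttbr0} for every section that lists a ttbr0 field."""
    result = {}
    section = None
    for line in devcore_text.splitlines():
        line = line.strip()
        # A section header is a bare "name:" with nothing after the colon.
        header = re.fullmatch(r"([\w-]+):", line)
        if header:
            section = header.group(1)
            continue
        # The excerpt uses both "ttbr0=..." and "ttbr0: ...", so accept either.
        field = re.fullmatch(r"-\s*ttbr0\s*[=:]\s*([0-9a-fA-F]+)", line)
        if field and section:
            result[section] = int(field.group(1), 16)
    return result


ttbr0 = parse_ttbr0(EXCERPT)
hw = ttbr0["fault-info"]
expected = ttbr0["pgtable-fault-info"]
print(f"fault-info ttbr0         = {hw:#018x}")
print(f"pgtable-fault-info ttbr0 = {expected:#018x}")
print("mismatch" if hw != expected else "match")
```
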
Notes:
- The pgtable-fault-info ttbr0 value is what we expect it to be based on the faulting submit (looked up based on the last completed fence read back from memptrs); the fault-info ttbr0 value is what is read back from hw. The sw pagetable walk using the pgtable-fault-info tables indicates the fault address is a valid translation.
- The faulting address is frequently, but not always, the userspace fence buffer, which is the last thing written in the submit.
- I think glmark2 is particularly good at reproducing this because of frequent context switches between the compositor and glmark2. Most glmark2 scenes are just a single draw or a small number of draws, so typically (if you look in nvtop or similar) you'll see the compositor and glmark2 each using approx 50% of the GPU time. I've not been able to reproduce this with vkmark, but I think that is down to presentation mode (i.e. I don't see the compositor using much GPU time, so not every frame is getting blitted to the screen, so there are vastly fewer context switches: on the order of 60/sec with vkmark vs 8000+/sec with glmark2).
- It looks like the CP_SMMU_TABLE_UPDATE for the switch to the next context happens before the GPU is finished writing back from the previous submit.
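
For anyone who wants to double check the "valid translation" claim by hand, here is a rough sketch that decodes the ptes line from the devcore. It assumes the ARMv8 long-descriptor (LPAE) stage-1 format with a 4 KiB granule and a 4-level walk, which is my assumption about how these tables are laid out and not something the dump states. The helper names are illustrative only.

```python
# Sketch only: assumes ARMv8 long-descriptor (LPAE) stage-1 page tables
# with a 4 KiB granule and 4 levels; the devcore does not state this.

VALID = 1 << 0          # descriptor is valid
TABLE_OR_PAGE = 1 << 1  # levels 0-2: next-level table; level 3: page
AP_RDONLY = 1 << 7      # AP[2]: set means read-only
ACCESS_FLAG = 1 << 10   # AF: access flag
ADDR_MASK = 0x0000_FFFF_FFFF_F000  # output address, bits [47:12]


def iova_indices(iova):
    """Split an iova into its four 9-bit table indices (level 0..3)."""
    return [(iova >> shift) & 0x1FF for shift in (39, 30, 21, 12)]


def decode_leaf(pte):
    """Decode a level-3 page descriptor into the fields relevant to a write fault."""
    return {
        "valid": (pte & (VALID | TABLE_OR_PAGE)) == (VALID | TABLE_OR_PAGE),
        "writable": not (pte & AP_RDONLY),
        "accessed": bool(pte & ACCESS_FLAG),
        "phys": pte & ADDR_MASK,
    }


# Values copied from the devcore excerpt.
iova = 0x0000000100C09000
ptes = [0x00000008DA4C1003, 0x00000008DA4C2003,
        0x00000008DA4C8003, 0x000000094F28BF47]

print("indices:", iova_indices(iova))
for level, pte in enumerate(ptes[:3]):
    is_table = (pte & 3) == 3
    print(f"L{level}: next table at {pte & ADDR_MASK:#x}, valid table={is_table}")

leaf = decode_leaf(ptes[3])
print(f"L3: phys={leaf['phys']:#x} valid={leaf['valid']} "
      f"writable={leaf['writable']} accessed={leaf['accessed']}")
```

Under those assumptions the leaf entry decodes as a valid, writable page with the access flag set, which matches the observation that a WRITE to this iova should not have faulted against the expected pagetable.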