NVK: mmu fault on GA104 when GPU goes to sleep

System information

OS: Gentoo
GPU: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] [10de:249c] (rev a1)
Kernel version: 6.8.1
Mesa version: main
Desktop manager and compositor: N/A, can trigger issue without one

Describe the issue

While testing several games on NVK, I discovered a crash on startup in Civilization IV, D3D12/vkd3d/NVK. I spent much of the week narrowing down the combination of actions that triggers it and reduced them to a minimal stress test that reproduces the same crash: find it here. The program cycles between triggering on-GPU writes to a buffer, adding memory allocations on the GPU, and performing host->GPU writes.

On my system, this test case always crashes on/near iteration 44 with the kernel log showing:

nouveau 0000:01:00.0: gsp: mmu fault queued
nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:40 type:31 scope:1 part:233
nouveau 0000:01:00.0: fifo:c00000:0005:0028:[test[111796]] errored - disabling channel
nouveau 0000:01:00.0: test[111796]: channel 40 killed!
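
For anyone comparing logs from other GPUs, a small helper can pull the fields out of these lines. This is only a sketch: the function names are made up for illustration, the values are kept as strings because the radix of each field is not stated here, and no attempt is made to interpret what `type`, `scope`, or `part` mean.

```python
import re

# Matches "key:value" tokens such as "chid:40" in a nouveau "gsp: rc" line.
_FIELD = re.compile(r"(\w+):(\S+)")


def parse_rc_line(line):
    """Return the key:value fields after 'gsp: rc' as a dict of strings.

    Returns None if the line is not a 'gsp: rc' line.
    """
    marker = "gsp: rc "
    pos = line.find(marker)
    if pos < 0:
        return None
    return dict(_FIELD.findall(line[pos + len(marker):]))


def killed_channel(line):
    """Return the channel number from a 'channel N killed!' line, else None."""
    match = re.search(r"channel (\d+) killed!", line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    log = [
        "nouveau 0000:01:00.0: gsp: mmu fault queued",
        "nouveau 0000:01:00.0: gsp: rc engn:00000001 chid:40 type:31 scope:1 part:233",
        "nouveau 0000:01:00.0: test[111796]: channel 40 killed!",
    ]
    for entry in log:
        fields = parse_rc_line(entry)
        if fields is not None:
            print("rc fields:", fields)
        channel = killed_channel(entry)
        if channel is not None:
            print("killed channel:", channel)
```

In the log above, the `chid` reported on the `rc` line and the killed channel are both 40, which is the kind of cross-check this makes easy to automate.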

This stress test crashes neither on my Intel GPU nor with the proprietary NVIDIA driver.

Note from Faith

The real issue here is that nouveau isn't properly restoring resources when it wakes the GPU up from sleep. I don't know if some things are just missing from the restore or if it's somehow not waiting to submit new work until the restore is complete. It may also be that we do evict everything properly but we don't set the maps back up so the fault handler can access them even with the GPU powered off. In any case, this is a "nouveau is being dumb around power management" issue.

Discussion

I am interested in getting more involved in nouveau/NVK development. I have some experience with 3D game engine programming (but not using Vulkan) and driver development (but not for GPUs). So unless the solution is obvious to somebody, I would prefer advice on how to continue chasing this so that I can get more hands-on experience and eventually find the fix myself, rather than have somebody else take this issue away from here. :)

Increasing my stress test's 'k' loop from 500 iterations to a greater value makes it crash sooner. This makes me doubtful that the fault is with NVK itself, because the only variable I'm changing is how many CPU->GPU mapped writes occur. I have also been poring over NVK's memory management for the past few days and could not find anything obviously wrong.

My current working hypothesis is that the fault lies in the newish GPUVA stuff: that the kernel-mode driver is somehow stomping on PDEs/PTEs for older VA bindings when adding new ones, but the GPU's MMU keeps the intact page table entries cached, masking the problem until they are evicted. Supporting this hypothesis, the problem goes away entirely if I dispatch the vkCmdFillBuffer command buffer from within the 'k' loop: I believe this stops the PDEs/PTEs for that buffer's mapping from being evicted from the cache, which makes sense if the cache follows an LFU replacement policy. But I would like to restate that I have only a conceptual grasp of the GPU architecture here: I have zero concrete idea how the hardware actually works.
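
To make the replacement-policy part of that reasoning concrete, here is a toy LFU model. It is purely an illustration of the argument, not a model of the real MMU: the cache capacity, the key names, and the tie-breaking rule (oldest entry goes first) are all assumptions, and nothing here is evidence of how the hardware behaves.

```python
def lfu_access(cache, capacity, key):
    """Touch `key` in a toy LFU cache (dict of key -> hit count).

    Returns the evicted key, or None if nothing was evicted.
    """
    if key in cache:
        cache[key] += 1
        return None
    evicted = None
    if len(cache) >= capacity:
        # min() returns the first minimum in insertion order, so ties
        # between equally cold entries evict the oldest one (assumption).
        evicted = min(cache, key=cache.get)
        del cache[evicted]
    cache[key] = 1
    return evicted


def buffer_entry_survives(refresh_in_loop, capacity=4, iterations=500):
    """Report whether the buffer's entry is still cached after the loop."""
    cache = {}
    lfu_access(cache, capacity, "buffer")  # the initial fill touches it once
    for k in range(iterations):
        if refresh_in_loop:
            # Stands in for dispatching the fill inside the 'k' loop.
            lfu_access(cache, capacity, "buffer")
        # Stands in for unrelated translations competing for cache space.
        lfu_access(cache, capacity, f"alloc{k}")
    return "buffer" in cache


if __name__ == "__main__":
    print(buffer_entry_survives(refresh_in_loop=False))  # False
    print(buffer_entry_survives(refresh_in_loop=True))   # True
```

In this model the entry that is touched only once gets evicted as soon as the cache fills, while the one touched every iteration accumulates hits and stays resident, matching the behaviour described above.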

Since I'm at the edge of my knowledge, I would like to ask if anyone has any thoughts about what's going on and/or how I might further troubleshoot. Does my stress test trigger faults for anyone else? On other Turing/Ampere/Ada GPUs or only GA104? Can I disable the MMU's cache? Can I dump the raw PDE/PTE structure so that I may inspect it? What are some links (and/or source filenames) for recommended reading? ;)

Edited Apr 04, 2024 by Faith Ekstrand
