Race with shared PDEs across multiple VM bind engines
Let me give a super simple example of a corruption and how it could occur.
- In bind engine 1, a 4k bind (A) is done at 0x1000 and some time later it is unbound (B)
- In bind engine 2, after B has been processed by the driver, a 4k bind (C) to 0x2000 is submitted, which creates a new set of PDEs.
- C's job completes before B's job is run, as B's job is waiting on fences while C's is not. At this point we are pointing to C's set of PDEs.
- B's job is run and all of C's PDEs are set to NULL, but C's PTE is still present
- At this point we are broken
The key here is that the PDEs are shared between B & C, and B's & C's jobs can execute out of order relative to the order in which XE received these operations.
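To make the sharing concrete, here is a tiny user-space sketch (nothing Xe-specific; it just assumes a 4-level layout with 4k pages and 512 entries per level) that prints the per-level indices for 0x1000 and 0x2000. Only the leaf PTE index differs, so both VAs hang off the exact same chain of PDEs, which is why B zapping "its" PDEs takes C's mapping with it.

/*
 * Illustrative only: per-level page table indices for two nearby VAs,
 * assuming 4k pages and 512 entries per level.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t vas[] = { 0x1000, 0x2000 };

        for (int i = 0; i < 2; i++) {
                uint64_t va = vas[i];

                printf("va=0x%llx pte=%llu pde=%llu pdpe=%llu pml4e=%llu\n",
                       (unsigned long long)va,
                       (unsigned long long)((va >> 12) & 0x1ff),   /* leaf PTE index  */
                       (unsigned long long)((va >> 21) & 0x1ff),   /* PDE index       */
                       (unsigned long long)((va >> 30) & 0x1ff),   /* PDPE index      */
                       (unsigned long long)((va >> 39) & 0x1ff));  /* top-level index */
        }
        return 0;
}

Both VAs print 0 for everything above the PTE level; only pte=1 vs pte=2 differs.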
Another race that could possibly occur is with back-to-back binds. Let me give another example:
- In bind engine 1, a 4k bind A is done to 0x1000
- In bind engine 2, a 4k bind B is done to 0x2000
- B's job is run before A's job as A is waiting on fences while B is not
- B doesn't program the PDEs as it expects those entries to already be populated (by A)
- Another job comes along and tries to use address 0x2000; now we fault
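Here is a minimal user-space model of that race (purely illustrative, all names hypothetical, not actual Xe code): the decision to skip the PDE writes is made from CPU-side bookkeeping at submit time, but the entries only exist once the earlier job has actually run, so executing B before A leaves a NULL PDE behind.

/* Toy model: CPU-side "populated" tracking vs. when the job actually runs. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PT_ENTRIES 512

static uint64_t pd[PT_ENTRIES];         /* "hardware" page directory */
static uint64_t pt[PT_ENTRIES];         /* "hardware" page table     */
static bool pde_populated;              /* CPU-side bookkeeping      */

struct job {
        bool write_pde;                 /* emit PDE write in this job? */
        uint64_t va;
};

static struct job submit_bind(uint64_t va)
{
        struct job job = { .write_pde = !pde_populated, .va = va };

        pde_populated = true;           /* decided at submit time */
        return job;
}

static void run_job(struct job *job)
{
        if (job->write_pde)
                pd[(job->va >> 21) & 0x1ff] = (uint64_t)(uintptr_t)pt;
        pt[(job->va >> 12) & 0x1ff] = 1;        /* "valid" PTE */
}

static void access_va(uint64_t va)
{
        if (!pd[(va >> 21) & 0x1ff])
                printf("fault at 0x%lx: PDE not programmed\n", (unsigned long)va);
        else
                printf("0x%lx resolves fine\n", (unsigned long)va);
}

int main(void)
{
        struct job a = submit_bind(0x1000);     /* bind A, waits on fences */
        struct job b = submit_bind(0x2000);     /* bind B, no fences       */

        run_job(&b);            /* B runs first: skips the PDE write */
        access_va(0x2000);      /* faults, PDE still NULL            */
        run_job(&a);            /* A eventually runs and fixes it up */
        access_va(0x2000);
        return 0;
}

Running this prints a fault for 0x2000 until A's job finally executes.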
To fix this, roughly what I'm thinking:
- We program all page table entries from the root to the PTEs on each bind, regardless of whether we think the entries are populated (seals the 2nd race; see the sketch after this list)
- Unbind jobs prune VAs at the PTE level only (avoids clobbering shared PDEs while still effectively removing the VA)
- Unbind jobs have a completion callback which drops refs to the page table BOs and creates a new job (if necessary) to prune these BOs from the page tables (it is safe to write shared PDEs at this point). All of this is done under a lock so it doesn't race with new binds; these 'prune jobs' have no in-fences so they run immediately
- All new binds are ordered with any pending 'prune jobs'
- Alternatively, for the 'prune jobs' we just do the pruning on the CPU under the lock (maybe this doesn't work because of caching in the device and other GPU bind ops touching the same cacheline???)
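Rough toy model of the first two rules (again purely illustrative, names hypothetical, not actual Xe code): binds write the PDE chain unconditionally, unbinds only zap the PTE, and clearing the now-unused PDE is left to the later, serialized prune step. Replaying the first example in this model leaves C's mapping intact regardless of the order B and C execute in.

/* Toy model of "always program all levels" + "unbind prunes PTEs only". */
#include <stdio.h>
#include <stdint.h>

#define PT_ENTRIES 512

static uint64_t pd[PT_ENTRIES];
static uint64_t pt[PT_ENTRIES];

static void bind_job(uint64_t va)
{
        /*
         * Always (re)program the shared PDE, even if we believe it is
         * already populated: redundant writes are harmless and make the
         * result independent of inter-engine ordering.
         */
        pd[(va >> 21) & 0x1ff] = (uint64_t)(uintptr_t)pt;
        pt[(va >> 12) & 0x1ff] = 1;             /* "valid" PTE */
}

static void unbind_job(uint64_t va)
{
        /*
         * Prune at the PTE level only; never touch the shared PDEs here.
         * A later prune job, run under the VM lock and ordered against
         * new binds, would clear the PDE once the page table is empty.
         */
        pt[(va >> 12) & 0x1ff] = 0;
}

static void access_va(uint64_t va)
{
        if (!pd[(va >> 21) & 0x1ff] || !pt[(va >> 12) & 0x1ff])
                printf("0x%lx not mapped\n", (unsigned long)va);
        else
                printf("0x%lx resolves fine\n", (unsigned long)va);
}

int main(void)
{
        /* Replay the first race: bind A, bind C, then the unbind B runs last. */
        bind_job(0x1000);       /* A */
        bind_job(0x2000);       /* C */
        unbind_job(0x1000);     /* B, executes after C */
        access_va(0x2000);      /* C's mapping survives */
        return 0;
}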