This code snippet shouldn't be needed, but without it xe_exec_threads.rebind faults on TGL.
/*
 * We can't do an unbind until all in syncs are signalled as we destroy
 * the PTEs immediately in the unbind code. If doing an async VM unbind,
 * no penalty for sleeping here.
 */
if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_UNMAP) {
	int i;

	for (i = 0; i < num_syncs; i++) {
		err = xe_sync_entry_wait(&syncs[i]);
		if (err)
			return err;
	}
}
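For readers not familiar with the xe_sync internals, here is a rough equivalent of the wait above in terms of the generic dma_fence API. This assumes xe_sync_entry_wait() boils down to an interruptible dma_fence_wait() on the in-sync's fence; the helper below is only an illustration, not actual xe code.

#include <linux/dma-fence.h>

/* Illustrative only: block until every in-fence has signalled. */
static int wait_in_fences(struct dma_fence **fences, int num_fences)
{
	long err;
	int i;

	for (i = 0; i < num_fences; i++) {
		/* Interruptible wait; returns 0 or a negative error code */
		err = dma_fence_wait(fences[i], true);
		if (err)
			return err;
	}

	return 0;
}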
Matthew Brost changed title from "Remove blocking on unbinds in-fences" to "Remove blocking unbinds on in-fences"
Interesting data point: xe_exec_threads --r threads_rebind passes with the below patch.
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index dd0cf3ea32e5..38a3de22720a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2051,7 +2051,7 @@ static int vm_bind_ioctl(struct xe_vm *vm, struct xe_vma *vma, struct xe_bo *bo,
 	 * the PTEs immediately in the unbind code. If doing an async VM unbind,
 	 * no penalty for sleeping here.
 	 */
-	if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_UNMAP) {
+	if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_MAP) {
 		int i;

 		for (i = 0; i < num_syncs; i++) {
With the above MRs, xe_evict.evict-threads-small sometimes fails with 0x6000 errors, engine resets, or engine reset failures. xe_evict.evict-mixed-many-threads-small fails every time with one of the aforementioned errors.
If CPU binds + waiting on moves are used, the previous tests pass.
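For illustration, a rough sketch of what "waiting on moves" before a CPU bind could look like in terms of the generic dma_resv API. The helper name and where it would be called from are assumptions, not actual xe code.

#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: block until all kernel-internal fences (e.g. the
 * TTM move/clear fences) on the BO's reservation object have signalled
 * before the CPU touches the PTEs backed by that BO.
 */
static int wait_for_bo_moves(struct dma_resv *resv)
{
	long ret;

	ret = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_KERNEL, true,
				    MAX_SCHEDULE_TIMEOUT);

	return ret < 0 ? ret : 0;
}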
Or if the following patch is applied, the tests also pass:
I've also examined an ftrace of a failure of xe_evict.evict-threads-small in detail, and all the fencing is working correctly (e.g. the move completes before the rebind is submitted to the GuC, and the rebind completes before dependent execs are submitted). This points to xe_vm_bind_vma messing with page tables that the GPU is still using.
Attaching a zip file with an annotated ftrace and dmesg showing a failure of xe_exec_threads.threads-rebind on TGL after the code in the subject is removed. Search for 'MB -' for comments in the ftrace. Basically it shows the fencing working exactly as expected (the unbind / bind waits for existing jobs using the VMAs to complete, the bind runs and completes, jobs dependent on the bind run, the unbind stays blocked behind those jobs, and then the running jobs fault).
Again this all points to one of two possible things:
1. The KMD code that preps unbind / bind jobs corrupts existing page tables (a debug sketch for checking this follows below)
2. We are programming the ring or BBs wrong somewhere (i.e. we are missing TLB invalidates or something like that)
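To help separate the two, one cheap trick is to drop trace markers around the PTE teardown so they land in the same ftrace timeline as the scheduler / GuC events. Sketch only; the function name and call site are placeholders, not actual xe code.

#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Placeholder debug hook: call right before the KMD clears / rewrites the
 * PTEs for a VMA range. The markers show up in the ftrace buffer next to
 * the scheduler events, so the timeline makes it obvious whether page
 * tables are touched while the GPU may still be using them.
 */
static void debug_trace_pte_teardown(u64 start, u64 end)
{
	trace_printk("tearing down PTEs for VMA range [0x%llx, 0x%llx)\n",
		     start, end);
}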
Can produce a similar trace for the eviction issue too.
Update - I also captured a GuC log which clearly shows the PPGTT address of a batch causing a fault.
ftrace of xe_evict.evict-small-threads failing with an engine reset + job timeout on DG1. It clearly shows the fencing working (move to LMEM triggered, a rebind issued after move completion, a job which uses the VMA issued after move completion, then a hang). I captured a GuC log for a couple of different failures on DG1 and the fault registers read -1. I don't think these registers work correctly on DG1, but I'd almost guarantee this test is faulting on a PPGTT address too.
Again search for "MB -" for comments explaining the trace.