This code snippet shouldn't be needed, but without it xe_exec_threads.rebind faults on TGL.
/*
 * We can't do an unbind until all in syncs are signalled as we destroy
 * the PTEs immediately in the unbind code. If doing an async VM unbind,
 * no penalty for sleeping here.
 */
if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_UNMAP) {
	int i;

	for (i = 0; i < num_syncs; i++) {
		err = xe_sync_entry_wait(&syncs[i]);
		if (err)
			return err;
	}
}
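For readers not familiar with the xe_sync internals, here is a rough equivalent of the wait above in terms of the generic dma_fence API. This assumes xe_sync_entry_wait() boils down to an interruptible dma_fence_wait() on the in-sync's fence; the helper below is only an illustration, not actual xe code.

#include <linux/dma-fence.h>

/* Illustrative only: block until every in-fence has signalled. */
static int wait_in_fences(struct dma_fence **fences, int num_fences)
{
	long err;
	int i;

	for (i = 0; i < num_fences; i++) {
		/* Interruptible wait; returns 0 or a negative error code */
		err = dma_fence_wait(fences[i], true);
		if (err)
			return err;
	}

	return 0;
}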
Matthew Brost changed title from "Remove blocking on unbinds in-fences" to "Remove blocking unbinds on in-fences"
Interesting data point: xe_exec_threads --r threads_rebind passes with the below patch.
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index dd0cf3ea32e5..38a3de22720a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2051,7 +2051,7 @@ static int vm_bind_ioctl(struct xe_vm *vm, struct xe_vma *vma, struct xe_bo *bo,
 	 * the PTEs immediately in the unbind code. If doing an async VM unbind,
 	 * no penalty for sleeping here.
 	 */
-	if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_UNMAP) {
+	if (VM_BIND_OP(args->op) == XE_VM_BIND_OP_MAP) {
 		int i;

 		for (i = 0; i < num_syncs; i++) {
With the above MRs, xe_evict.evict-threads-small sometimes fails with 0x6000 errors, engine resets, or engine reset failures. xe_evict.evict-mixed-many-threads-small fails every time with one of the aforementioned errors.
If CPU binds + waiting on moves are used, the previous tests pass.
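For illustration, a rough sketch of what "waiting on moves" before a CPU bind could look like in terms of the generic dma_resv API. The helper name and where it would be called from are assumptions, not actual xe code.

#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: block until all kernel-internal fences (e.g. the
 * TTM move/clear fences) on the BO's reservation object have signalled
 * before the CPU touches the PTEs backed by that BO.
 */
static int wait_for_bo_moves(struct dma_resv *resv)
{
	long ret;

	ret = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_KERNEL, true,
				    MAX_SCHEDULE_TIMEOUT);

	return ret < 0 ? ret : 0;
}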
Or if the following patch is applied, the tests also pass:
I've also examined an ftrace of a failure of xe_evict.evict-threads-small in detail, and all the fencing is working correctly (e.g. the move completes before the rebind is submitted to the GuC, and the rebind completes before dependent execs are submitted). This points to xe_vm_bind_vma messing with page tables that the GPU is still using.
Attaching a zip file with an annotated ftrace and dmesg showing a failure of xe_exec_threads.threads-rebind on TGL after the code in the subject is removed. Search for 'MB -' for comments in the ftrace. Basically it shows the fencing working exactly as expected (the unbind / bind waits for existing jobs using the VMAs to complete, the bind runs and completes, jobs dependent on the bind run, the unbind stays blocked behind those jobs, and then the running jobs fault).
Again this all points to one of two possible things:
1. The KMD code that preps unbind / bind jobs corrupts existing page tables (a debug sketch for checking this follows below)
2. We are programming the ring or BBs wrong somewhere (i.e. we are missing TLB invalidates or something like that)
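To help separate the two, one cheap trick is to drop trace markers around the PTE teardown so they land in the same ftrace timeline as the scheduler / GuC events. Sketch only; the function name and call site are placeholders, not actual xe code.

#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Placeholder debug hook: call right before the KMD clears / rewrites the
 * PTEs for a VMA range. The markers show up in the ftrace buffer next to
 * the scheduler events, so the timeline makes it obvious whether page
 * tables are touched while the GPU may still be using them.
 */
static void debug_trace_pte_teardown(u64 start, u64 end)
{
	trace_printk("tearing down PTEs for VMA range [0x%llx, 0x%llx)\n",
		     start, end);
}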
Can produce a similar trace for the eviction issue too.
Update - I also captured a GuC log which clearly shows the PPGTT address of a batch causing a fault.
ftrace of xe_evict.evict-small-threads failing with an engine reset + job timeout on DG1. It clearly shows the fencing working (move to LMEM triggered, a rebind issued after move completion, a job which uses the VMA issued after move completion, then a hang). I captured a GuC log for a couple of different failures on DG1 and the fault registers read -1. I don't think these registers work correctly on DG1, but I'd almost guarantee this test is faulting on a PPGTT address too.
Again search for "MB -" for comments explaining the trace.