RFC: Bind changes
Add a document with my planned changes to the bind uAPI. Not meant to be merged, just for review.
- bind_changes.txt 0 → 100644
- If a bind fails before the 'execute' phase (fault of the user, or a kmalloc error), punt to user
- If a bind fails in the 'execute' phase (no fault of the user), put the VM in error mode; out fences are not published yet
- Define 2 error codes, -ERECLAIM and -ERECLAIMINFLIGHT, the former returned by the bind that triggered the error and the latter when attempting another bind while in error mode
- In error mode, 'RECLAIM' operations are allowed (unbinds w/ the RECLAIM bit set, only in sync mode)
- In error mode, execs are allowed; the user is expected to order these correctly behind any pending binds via fences, or to wait until error mode exit before calling exec
- To exit error mode, a 'RESTART' operation picks up where the bind that triggered -ERECLAIM left off; it can also fail again with -ERECLAIM
- If the VM is closed in error mode, tear down the pending bind operation
- Out fences are published on error mode exit

Here, I think the error handling is based upon what is easy for the kernel to achieve. Also, provided it doesn't result in excessively complex KMD code, we want an uAPI that is easy for UMD to handle. And it's not fully clear from UMD's point of view what the execute phase is:
Actions on various error codes (the exact error code doesn't really matter); a UMD-side sketch follows this list:
- -EINTR, -EAGAIN: Just relaunch, same as for most ioctls.
- -ENOSPC: Reduce the amount of VM-bound memory (typically caused by lack of local memory space), then relaunch the ioctl.
- -ENOMEM: Try to reclaim memory, or kill something. Then relaunch.
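A minimal UMD-side sketch of the retry policy above. The ioctl request, argument struct and the reclaim helpers are hypothetical placeholders, not actual Xe uAPI:

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Hypothetical helpers standing in for real UMD memory management. */
static void umd_reduce_bound_memory(void) { /* unbind / free VM-bound memory */ }
static void umd_reclaim_or_kill(void)     { /* reclaim memory or kill a context */ }

/* Relaunch the bind ioctl according to the error-code policy above. */
static int umd_vm_bind(int fd, unsigned long request, void *args)
{
	for (;;) {
		if (ioctl(fd, request, args) == 0)
			return 0;

		switch (errno) {
		case EINTR:
		case EAGAIN:
			continue;			/* just relaunch */
		case ENOSPC:
			umd_reduce_bound_memory();	/* then relaunch */
			continue;
		case ENOMEM:
			umd_reclaim_or_kill();		/* then relaunch */
			continue;
		default:
			return -errno;			/* unrecoverable */
		}
	}
}
```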
Error states: To be clear about what state the VM is put in, we need to define the various (parts of) operations, and which of them need rollbacks on error:
- uAPI ops: The operations that are listed in the ioctl.
- Atomic gpuVA ops: The operations that are initiated by gpuVA. These are typically the same as the uAPI ops, but may, for example on munmap-type unbinds, be multiple ops originating from a single uAPI op.
- nonAtomic gpuVA ops: Operations initiated by gpuVA that do not need to complete atomically. One example is multiple prefetch operations originating from a single uAPI prefetch op, or multiple unmap operations originating from a single "unmap all" uAPI op.
- Atomic KMD actions: Actions, subject to failure, that may originate from a single op. For example, binds on multiple GTs originating from a single uAPI or gpuVA op.
Now I'd like to postulate that if a failure occurs in the middle of a sequence of Atomic KMD actions, or a sequence of Atomic gpuVA ops, we need to do a complete rollback of the single uAPI op that was being processed. Only then will the error state be manageable. Example: A bind that succeeded on one GT, but not the other, will need to be rolled back. Similarly, a munmap rebind that succeeded in unbinding, but not in binding the newly added vmas, will need to be rolled back. Finally, a prefetch that succeeded in prefetching only half of the range doesn't need rollback.
With that in place, we can simply return the number of uAPI operations that succeeded to user-space and do not need to put the VM in an error state. If we want to make the ioctl easily restartable, like simply restarting the ioctl on -EINTR, we need the kernel to be able to pick up that information on relaunch, and to mangle the fences correctly.
This makes the error handling clear: a failed uAPI op is either completely rolled back, or it's partially complete and it's fully safe to rerun it.
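As a strawman, a restartable ioctl argument could carry a completion cursor. A minimal sketch with entirely hypothetical names, not the actual Xe uAPI:

```c
#include <linux/types.h>

struct hyp_vm_bind_args {
	__u64 ops_ptr;	     /* in: userptr to the array of uAPI ops */
	__u32 num_ops;	     /* in: number of ops in the array */
	__u32 num_completed; /* in/out: ops fully done; the kernel resumes
			      * from here on relaunch, and updates it when
			      * returning a partial failure */
	/* out-fences would be published only once all ops have completed */
};
```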
KMD rollback implementation details: The key to being able to do rollbacks is to be able, at any time, to unmap a single vma completely fail-safe, if needed clearing the page-table sync-on-CPU. We need to allocate the resources needed for that at ioctl start. Then the only case we can't easily cover is the failed bind of new vmas on munmap-type unbinds. But on such failures we can, at least as a first implementation, just mark them for rebind and then leave them alone. They will then subsequently be bound by exec / the rebind worker / the fault handler. That way, we initially don't need a fail-safe bind operation.
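A sketch of that scheme, under the assumption that unmap resources are preallocated at ioctl start; all names are hypothetical, not existing Xe or gpuva API:

```c
/* Hypothetical types and helpers, declared only to make the sketch complete. */
struct hyp_vm;
struct hyp_op;

int hyp_op_execute(struct hyp_vm *vm, struct hyp_op *op);
/* Fail-safe by construction: resources preallocated at ioctl start,
 * page tables may be cleared synchronously on the CPU if needed. */
void hyp_unmap_vmas(struct hyp_vm *vm, struct hyp_op *op);
void hyp_mark_new_vmas_for_rebind(struct hyp_vm *vm, struct hyp_op *op);

/* Process one uAPI op; on failure, roll it back completely. */
static int hyp_process_op(struct hyp_vm *vm, struct hyp_op *op)
{
	int err = hyp_op_execute(vm, op);

	if (err) {
		/* Undo whatever partially completed; cannot fail. */
		hyp_unmap_vmas(vm, op);
		/* New vmas from a munmap-type unbind whose bind failed are
		 * just flagged; exec, the rebind worker or the fault handler
		 * will bind them later. */
		hyp_mark_new_vmas_for_rebind(vm, op);
	}

	return err;
}
```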
Error path coverage: Injecting -EINTRs will be helpful in testing most of the error paths. User-space recovery from -ENOMEM and -ENOSPC will not, however, be covered by that.
Edited by Thomas Hellström
IMO this is more complex than having the error state + restart IOCTL; the rollback of a failed operation is going to be quite a bit more complicated than you think, and the rollback operation can certainly fail too. For example:
- A MUNMAP that requires an UNMAP (done first, succeeds) and then a REMAP (fails): the rollback from this is a MAP operation, which can also certainly fail. Also, the number of UNMAP operations can be more than 1 in this case. (A contrived sketch of this failure mode follows below.)
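A contrived, self-contained illustration of that failure mode; all ops are stubs, not real gpuva calls. The rollback of the failed REMAP is itself a MAP that can fail, leaving the VM neither rolled forward nor back:

```c
#include <errno.h>
#include <stdio.h>

static int unmap_op(void) { return 0; }       /* assume the UNMAP(s) succeed */
static int remap_op(void) { return -ENOMEM; } /* the REMAP fails */
static int map_op(void)   { return -ENOMEM; } /* the rollback MAP can fail too */

static int munmap_bind(void)
{
	int err = unmap_op();

	if (err)
		return err;	/* nothing to roll back yet */

	err = remap_op();
	if (err && map_op())
		/* Rollback itself failed: the VM is now in a state that is
		 * neither fully forward nor fully rolled back. */
		return -EIO;	/* arbitrary "stuck" code for this sketch */

	return err;
}

int main(void)
{
	printf("munmap bind: %d\n", munmap_bind());
	return 0;
}
```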
Also, how do we deal with out-fences on a mid-uAPI-op failure? We certainly can't signal the out-fences, so does the user in this case have to pass in the out-fences in the next IOCTL? Doable, but seems goofy.
The restart implementation is actually really, really easy. We have a list of operations which has everything we need to execute them; we literally just need to save the list in the VM, and the restart IOCTL continues walking the list. WRT multi-GT operations failing on the 2nd GT, just clear the bits in the gt_mask of the operation so that the 1st GT is skipped when the operation is executed again. See the sketch below.
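A minimal kernel-side sketch of that restart scheme, with hypothetical names (only the list and bitops helpers are real kernel API): the op list is saved in the VM and the restart IOCTL continues walking it, clearing gt_mask bits as individual GTs complete:

```c
#include <linux/bitops.h>
#include <linux/list.h>

#define HYP_MAX_GT 2	/* hypothetical GT count */

struct hyp_bind_op {
	struct list_head link;
	unsigned long gt_mask;	/* GTs this op must still execute on */
};

struct hyp_vm {
	struct list_head pending_ops;	/* saved across a failed bind */
};

int hyp_run_op_on_gt(struct hyp_vm *vm, struct hyp_bind_op *op, unsigned int gt);

/* Continue walking the saved op list; completed GTs and ops are skipped. */
static int hyp_vm_bind_restart(struct hyp_vm *vm)
{
	struct hyp_bind_op *op, *next;
	unsigned int gt;
	int err;

	list_for_each_entry_safe(op, next, &vm->pending_ops, link) {
		for_each_set_bit(gt, &op->gt_mask, HYP_MAX_GT) {
			err = hyp_run_op_on_gt(vm, op, gt);
			if (err)
				return err;	/* op and bit stay for rerun */
			clear_bit(gt, &op->gt_mask); /* skip this GT next time */
		}
		list_del(&op->link);	/* fully done, drop from pending list */
	}

	return 0;	/* all pending ops completed; publish out-fences */
}
```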
In the end the solution is the same: on failure, report it to the user, the user frees some memory, and the user signals restart. The question is which is the simpler implementation for both the UMD and KMD; IMO the restart IOCTL pretty clearly is. FWIW, I could likely code my proposed solution in about a day since, as I've stated, it is actually really easy to do.
Edited by Matthew Brost

So I think the rollbacks are actually covered there. In the example you give, due to the initial resource allocation we can always make the unbinds succeed. If the REMAPs fail, as mentioned, just mark them for rebinding by the async worker or the next exec. (Those might of course fail as well, but we've cleaned up the uAPI op.) Prefetch doesn't need rollback, and unbinds are fail-safe once we commit to doing them.
@dakr has a similar early WIP version of the GPUVA changes at the end of this MR: https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/wip-gpuva?ref_type=heads
<dakr> Hi! Unfortunately, I was sick for a while, but I managed to sketch up what we discussed real quick: https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/wip-gpuva?ref_type=heads
<dakr> FYI, I'll be on vacation for three weeks, but I'll be around till Friday.
<mbrost> i have something similar too
<mbrost> https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/339
<mbrost> drm/xe: Userptr refactor (f5885cca)
<mbrost> drm/gpuva: Add support for extobj (dc791d7c)
<mbrost> drm/gpuva: Move dma-resv to GPUVA manager (cafb376a)
<mbrost> drm/gpuva: Add drm device to GPUVA manager (caee7839)
<mbrost> those are the patches in the MR link that touch what we talked about
<mbrost> I suspect our changes should be similar
<dakr> ah, you're tracking whole drm_gpuvas as extobjs.
<dakr> that solves the problem with GEMs stuffed in multiple VMs.
<mbrost> yea
<dakr> however, I used a maple_tree for that as well. that omits duplicates.
<mbrost> hmm
<dakr> not that this is a real problem though
<mbrost> one thing I like about yours is a common evict list, I didn't do that because of how Xe does evict + uses multiple lists / locks
<mbrost> if, which is a big if, we can use a simple common list + locking I'm all for it
<dakr> yeah, I also tried to come up with some common functions for validation and adding fences
<mbrost> so for adding fences, we need 2 usages
<dakr> as mentioned this is just sketched up and maybe it needs some more complexity to work for all drivers.
<mbrost> one for the manager, one for extobjs
<mbrost> easy change, I have that
<dakr> oh, I see. should be easy to fix though.
<mbrost> https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/339/diffs?commit_id=dc791d7c7f8051cea77146d67753735ad401a483#d1cb7cb43ed844f26c6dcee68c245a1fffec9d9b_874_899
<mbrost> i need to wrap my head around extobjs in a maple tree... give me a minute to look at that
<dakr> sure.
<mbrost> also your version of adding fences is correct in the sense that it takes exec as an arg and iterates the locked objects
<mbrost> as we shouldn't call that function without locking the entire mgr via drm_exec
<mbrost> I also didn't do some of this as gpuva and exec were different modules, but those probably should be one? I think you do that, right?
<mbrost> drm_gpuva_insert_state_ext doesn't work, at least I think; we have the same problem in the tip of Xe too
<mbrost> let me give an example
<mbrost> MAP(1, 0x0000-0x1000) extobj(A) this is inserted, MAP(2, 0x1000-0x2000) extobj(A) this is not inserted, UNMAP(1, 0x0000-0x1000) extobj(A) is removed but we still have a mapping (2, 0x1000-0x2000)
<mbrost> either we keep a list per GPUVA like I do, or ref count in the GEM + MT like you do
<mbrost> on insert ref++ == 0 update MT, on delete --ref == 0 remove from MT
<dakr> oh yeah.
<dakr> absolutely. we can just refcount that.
<dakr> we can use the maple tree's "entry" for that, rather than saving the same pointer again.
<dakr> and yeah, I think drm_exec should just be always builtin, like the gpuva manager.
<mbrost> i haven't looked at the mt api but built-in ref counting would make tons of sense
<dakr> mt saves an index and a pointer.
<dakr> I'm using the GEM's address as index and save the GEM's address also as the pointer value.
<mbrost> oh, i missed that
<dakr> we could just abuse the pointer value to store the ref count.
<dakr> since we already have it implicitly through the index
<mbrost> oh so it is still 1to1 from gem
<dakr> yeah
<mbrost> in that case a MT doesn't make sense, a list of GPUVAs makes more
<mbrost> if we think an extobj GEM will commonly have multiple GPUVAs, then a MT + ref makes more sense
<mbrost> it is possible sparse bindings will do this
<dakr> ok, maybe I should ask what you mean exactly with "1to1 from gem".
<mbrost> a gem gets 1 entry in the MT
<mbrost> regardless of the number of GPUVAs per gem
<dakr> yes, one entry with a ref count.
<dakr> the advantage of not using a list of gpuvas would be memory consumption for a large amount of gpuvas, where 99% of them are not backed by an extobj
<dakr> which should be a common case
<mbrost> yea a list head is 8b (or is it 16b)
<mbrost> so per gpuva the impact is minimal
<mbrost> but I guess that does add up over time...
<mbrost> open to either, no strong opinion here
<dakr> I remember that Christian said he was fighting for every byte for mapping entry structures in amdgpu
<mbrost> regardless, after review of your code and mine, we are basically on the same track
<mbrost> so the evicted MT has the same issue as gpuva, assuming a 1 to 1 between gpuva & gem; if anything drm_gpuva_evict should accept a gem then loop over the gpuvas to set the evicted flag
<mbrost> but I think this falls apart with userptr as we don't have a gem
<mbrost> maybe hold off on the evicted MT until we unify userptr
<mbrost> also the malloc in drm_gpuva_evict is suspicious
<mbrost> typically if you need to evict / invalidate something we are not allowed to allocate more memory
<dakr> agreed, to all of the above.
<dakr> the memory allocation is suspicious, but I just wanted to get it out real quick and get some feedback.
<dakr> conceptually.
<dakr> I think the details need to all be re-worked.
<dakr> for the evicted MT though, wouldn't this work? I mean, it actually is a 1:1, isn't it?
<dakr> A GEM can either be evicted or not.
<dakr> Duplicate additions or removals are just ignored.
<dakr> In order to avoid the duplicate calls we can also just split drm_gpuva_evict() and, let's say, drm_gpuva_manager_gem_evicted() or something.
<mbrost> yea we just use a GPUVA list as userptr invalidation / evicts are similar; both need to be rebound in exec while the latter also needs a revalidation first
<mbrost> with userptr there is no gem, so we need a per-gpuva tracker
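A minimal sketch of the "ref count in the GEM + MT" idea from the log above. The mtree_* calls are the real maple tree API from <linux/maple_tree.h>; the entry struct and function names are hypothetical, and locking is omitted for brevity. (The log also discusses encoding the count directly in the stored pointer value instead of a separate allocation.)

```c
#include <linux/maple_tree.h>
#include <linux/slab.h>

struct hyp_extobj_entry {
	unsigned long refcount;
};

/* On GPUVA insert: the first reference adds the GEM to the tree. */
static int hyp_extobj_get(struct maple_tree *mt, unsigned long gem_addr)
{
	struct hyp_extobj_entry *e = mtree_load(mt, gem_addr);

	if (e) {
		e->refcount++;	/* already tracked, just take a reference */
		return 0;
	}

	e = kzalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return -ENOMEM;
	e->refcount = 1;

	return mtree_store(mt, gem_addr, e, GFP_KERNEL);
}

/* On GPUVA remove: the last reference drops the GEM from the tree. */
static void hyp_extobj_put(struct maple_tree *mt, unsigned long gem_addr)
{
	struct hyp_extobj_entry *e = mtree_load(mt, gem_addr);

	if (e && --e->refcount == 0) {
		mtree_erase(mt, gem_addr);
		kfree(e);
	}
}
```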