Use doorbells rather than the H2G channel as a fast path for submission
Chat with @jekstrand on #xe
TL;DR:
Be greedy and allocate a doorbell for the first 256 guc_ids and use those for submission (this avoids the CT lock). Fall back to H2G submission for all others.
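A minimal sketch of the greedy scheme, assuming a fixed pool of 256 doorbells and a one-to-one doorbell-to-guc_id mapping as described in the chat. Every name here (`sketch_engine`, `sketch_doorbell_pool`, `sketch_submit`, and so on) is hypothetical and is not the xe driver's or the GuC's actual API; the real MMIO write and H2G send are only marked by comments. It hands doorbells out in creation order, which is one possible reading of "the first 256 guc_ids".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_DOORBELLS 256  /* from the chat: 256 doorbells, one per guc_id */
#define DOORBELL_NONE (-1) /* engine has no doorbell and must use H2G */

/* Hypothetical per-engine state; not the real xe_engine layout. */
struct sketch_engine {
	uint32_t guc_id;
	int doorbell; /* doorbell index, or DOORBELL_NONE */
};

/* Hypothetical pool: one "in use" flag per doorbell. */
struct sketch_doorbell_pool {
	bool used[NUM_DOORBELLS];
};

enum sketch_submit_path { SUBMIT_DOORBELL, SUBMIT_H2G };

/* Greedy: hand out the lowest free doorbell, or DOORBELL_NONE if all are taken. */
static int sketch_doorbell_alloc(struct sketch_doorbell_pool *pool)
{
	for (int i = 0; i < NUM_DOORBELLS; i++) {
		if (!pool->used[i]) {
			pool->used[i] = true;
			return i;
		}
	}
	return DOORBELL_NONE;
}

/* Return a doorbell to the pool when its engine is destroyed. */
static void sketch_doorbell_free(struct sketch_doorbell_pool *pool, int id)
{
	assert(id >= 0 && id < NUM_DOORBELLS);
	pool->used[id] = false;
}

/* At engine creation, try to grab a doorbell; no stealing, no "hot" tracking. */
static void sketch_engine_init(struct sketch_doorbell_pool *pool,
			       struct sketch_engine *e, uint32_t guc_id)
{
	e->guc_id = guc_id;
	e->doorbell = sketch_doorbell_alloc(pool);
	/* A real driver would send one H2G here to tie the doorbell to the context. */
}

/* Pick the submission path; reports which one a real driver would take. */
static enum sketch_submit_path sketch_submit(const struct sketch_engine *e)
{
	if (e->doorbell != DOORBELL_NONE) {
		/* Fast path: MMIO doorbell write saying the LRC tail moved, no CT lock. */
		return SUBMIT_DOORBELL;
	}
	/* Slow path: take the CT lock and send an H2G submit message. */
	return SUBMIT_H2G;
}
```

With this shape the 257th engine simply lands on the H2G path, and a doorbell freed by a destroyed engine goes to the next engine created. Whether the GuC allows mixing the two paths still needs to be confirmed, as noted in the chat.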
<jekstrand> mbrost: I don't remember. Where is i915 at in terms of submits/sec?
<jekstrand> I know I wrote a benchmark once upon a time
<mbrost> jekstrand: no idea... I could try to find out
<mbrost> gem_exec_parallel is pretty similar to xe_exec_threads
<jekstrand> Looks like about 16us/submit with 8 BOs
<jekstrand> But by the time you're up to 128 BOs, it's 136us each
<mbrost> xe_exec_threads is submitting at 29us/submit
<jekstrand> Yeah, so once you hit 16 buffers or so, i915 is worse due to locking hell.
<jekstrand> Hrm...
<jekstrand> The benchmark I'm looking at isn't measuring actual submits, though. Just the time to queue them.
<jekstrand> Not actual time in the back-end scheduler.
<jekstrand> Specifically, CPU time required for execbuffer2
<jekstrand> It stops the timer before it waits for stuff to complete.
<jekstrand> In any case, 33k/sec is easily enough for 3D workloads.
<mbrost> I would think so and I honestly haven't dug into this at all but I bet the bottleneck is the GuC itself
<mbrost> it has to deal with processing the H2G channel plus a CSB IRQ from each physical engine every time something is switched on / off the hardware
<mbrost> it is a tiny uC so with 33k/sec submits that is like 100k IRQ routines it is running
<jekstrand> yeah
<mbrost> I think the limitations of the GuC are part of the reason we have 2 GuCs on MTL+, one for render / copy and another for media
<mbrost> also future designs are looking for more powerful uC too
<jekstrand> Sure
<mbrost> wrt doorbells that you mentioned on the list, the current limitation is basically a 1 to 1 mapping between doorbell and guc_id, so that means we could have max 256 xe_engines open without having to implement stealing
<jekstrand> Ok, 256 is a bit limiting
<jekstrand> What do the doorbells do?
<mbrost> definitely can use from the kernel but never looked into it, def need to fix the doorbells on future products to support a many guc_id to doorbell mapping
<mbrost> it is an MMIO write associated with a single guc_id (xe_engine) that says the LRC tail has moved
<jekstrand> Ok...
<mbrost> for user space there is a queue involved as they can't directly write the LRC tail but in the kernel I don't think that is needed
<jekstrand> So it's effectively a limit on active contexts?
<mbrost> there is some way to multiplex the doorbells too with userspace / software magic
<mbrost> but in the end I just think we need to fix them to have a many to 1 relationship, think it is all possible with software / GuC changes
<mbrost> had discussion today about this actually as I think we are going to be forced to use doorbells in the future
<mbrost> *GuC firmware changes (e.g. no hardware changes required)
<jekstrand> k
<jekstrand> So are you just limited to 256 contexts now?
<mbrost> No, 64k
<mbrost> using the H2G channel to submit
<jekstrand> Ok, so if you have 64k contexts and only 256 doorbells, how do you map doorbells to contexts?
<jekstrand> Or is the doorbell thing a way around H2G?
<mbrost> an H2G command ties a single doorbell to a single context
<mbrost> Yes, once the doorbell is setup you don't need a H2G to submit, rather just a MMIO write that says the LRC tail moved
<jekstrand> Ok. So you H2G to set up the door bell, MMIO, and off you go.
<mbrost> so we could submit 256 contexts without a lock
<mbrost> yea
<jekstrand> Ok. And doorbells exist today?
<mbrost> yea
<mbrost> I guess we could be greedy and use doorbells until they're gone then fall back to H2G
<jekstrand> Just give the first 256 a doorbell and call it a day. :D
<jekstrand> Yup
<jekstrand> And in the 99.999% of the time case, everyone gets a doorbell.
<mbrost> let me look at that
<mbrost> Also check with the GuC team that H2G submission and doorbells can be mixed
<mbrost> I'm sure they can but you never know
<jekstrand> Yeah, that'd be good to verify
<jekstrand> But, yeah, you could just greedy allocate doorbells. We should support > 256 contexts but at that point you're getting crazy.
<mbrost> apparently greater than 256 is a media use case
<jekstrand> One could imagine trying to do something clever like handing doorbells out to "hot" contexts but I suspect whatever algorithm you had for sorting all that out would be more expensive than the H2G locking you're doing today.
<mbrost> yea I don't want to do anything complicated unless we have a very good reason
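For reference, the per-submit latencies quoted in the chat can be converted to submits per second with a one-line helper. This is only a rough sanity check under the assumption of one submit at a time with no overlap; `submits_per_sec` is a hypothetical name. Note also that the i915 figures measure CPU time to queue in execbuffer2, not time in the back-end scheduler, so the numbers are not directly comparable.

```c
#include <assert.h>
#include <stdint.h>

/* Submits per second for a given per-submit latency in microseconds (integer division). */
static uint32_t submits_per_sec(uint32_t us_per_submit)
{
	assert(us_per_submit != 0);
	return 1000000u / us_per_submit;
}
```

At 29us/submit this gives roughly 34k submits/sec, in line with the "33k/sec" ballpark in the chat; 16us/submit is 62.5k/sec and 136us/submit is about 7.4k/sec.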
Edited by Matthew Brost