MTL: Initial LRC submission to Media GT times out
Initialization of the media GT on MTL currently fails. The first failure is due to kernel_bb_pool never being initialized for the media GT, which prevents emit_wa_job() from allocating a batch buffer. The code and comment below will need to be adjusted to address this:
/*
* FIXME: This should be ok as SA should only be used by gt->migrate and
* vm->gt->migrate and both should be pointing to a non-media GT. But to be
* really safe, convert gt->kernel_bb_pool to a pointer and point a media
* GT to the kernel_bb_pool on a real tile.
*/
if (!xe_gt_is_media_type(gt)) {
	err = xe_sa_bo_manager_init(gt, &gt->kernel_bb_pool, SZ_1M, 16);
However, even after redirecting emit_wa_job() and emit_nop_job() to use the primary GT's kernel_bb_pool for the media GT, initialization still fails with an -ETIME return from emit_wa_job(). The media GT's GuC appears to have loaded and initialized successfully, so it isn't clear why the initial submissions that set up the golden context are timing out. Manually increasing the timeout values doesn't help either; the GT just seems to be unresponsive.
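For reference, the redirection suggested by the FIXME (make kernel_bb_pool a pointer so the media GT can share the pool of a real tile) could look roughly like the sketch below. This is a standalone mock, not driver code: struct sa_manager, struct gt, struct tile, sa_manager_init(), tile_setup() and gt_init_bb_pool() are hypothetical stand-ins for the xe types and helpers, and keeping the pool storage on the tile is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the suballocator manager; not the real xe type. */
struct sa_manager {
	size_t size;
	bool initialized;
};

enum gt_type { GT_TYPE_MAIN, GT_TYPE_MEDIA };

struct tile;

/* Hypothetical GT: kernel_bb_pool is a pointer so that GTs can share a pool. */
struct gt {
	enum gt_type type;
	struct tile *tile;
	struct sa_manager *kernel_bb_pool;
};

/* Assumption: the pool storage lives on the tile that both GTs belong to. */
struct tile {
	struct sa_manager bb_pool_storage;
	struct gt primary_gt;
	struct gt media_gt;
};

static int sa_manager_init(struct sa_manager *m, size_t size)
{
	if (!size)
		return -EINVAL;
	m->size = size;
	m->initialized = true;
	return 0;
}

/* Zero the tile and wire up the GT types and back-pointers. */
static void tile_setup(struct tile *t)
{
	memset(t, 0, sizeof(*t));
	t->primary_gt.type = GT_TYPE_MAIN;
	t->primary_gt.tile = t;
	t->media_gt.type = GT_TYPE_MEDIA;
	t->media_gt.tile = t;
}

/*
 * Only a non-media GT initializes the pool. A media GT reuses the pool of
 * its tile and fails if the primary GT has not initialized it yet.
 */
static int gt_init_bb_pool(struct gt *gt)
{
	struct sa_manager *pool = &gt->tile->bb_pool_storage;

	if (gt->type != GT_TYPE_MEDIA) {
		int err = sa_manager_init(pool, (size_t)1 << 20);

		if (err)
			return err;
	} else if (!pool->initialized) {
		return -EINVAL;
	}

	gt->kernel_bb_pool = pool;
	return 0;
}
```

With this layout the primary GT has to be initialized before the media GT. Note that this only addresses the first failure (no batch buffer pool for the media GT); it does not explain the -ETIME timeout described above.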
As a side effect of this failure, some additional bugs on the device teardown path are encountered:
[Mon Mar 20 23:27:35 2023] Memory manager not clean during takedown.
followed by some kernel BUGs due to attempts to access POISON_INUSE pointers and/or linked list corruption in the MM code.