TGL/DG2: crash during driver load

Are you saying this is bisected to this commit or it was just the head you were on?

added BUG label

assigned to @demarchi

It was just the head I was on.

Not able to reproduce it anymore on 3 TGLs; maybe something was faulty on my end.

closed

I reproduced it once on a DG2. But when it happened, there were other issues before that and the device was dead.

reopened

changed title from TGL: crash during driver load to TGL/DG2: crash during driver load

changed the description

Able to work around it by removing all CCS, VECS and VCS engines from .platform_engine_mask.
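
For anyone unfamiliar with the mask, the workaround amounts to clearing bits in a bitmask. A minimal Python sketch of the arithmetic, assuming a made-up bit layout (the engine list and bit positions below are hypothetical and are not the driver's actual enum values):

```python
# Hypothetical engine bit positions, for illustration only; the real
# values come from the driver's own engine ID definitions.
ENGINE_BITS = {
    "RCS0": 0,
    "BCS0": 1,
    "VCS0": 2,
    "VCS2": 3,
    "VECS0": 4,
    "CCS0": 5,
}

def engine_mask(names):
    """OR together the bits for the given engine names."""
    mask = 0
    for name in names:
        mask |= 1 << ENGINE_BITS[name]
    return mask

def drop_classes(mask, prefixes):
    """Clear every engine whose name starts with one of the class prefixes."""
    for name, bit in ENGINE_BITS.items():
        if name.startswith(prefixes):
            mask &= ~(1 << bit)
    return mask

full = engine_mask(ENGINE_BITS)
reduced = drop_classes(full, ("CCS", "VECS", "VCS"))
remaining = [n for n, b in ENGINE_BITS.items() if reduced & (1 << b)]
print(remaining)  # ['RCS0', 'BCS0']
```

With this layout only the render and copy engines are left in the mask, which is the effect of the workaround described above.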

mentioned in issue #236

mentioned in issue #237

Just to follow up here, it seems like this issue can happen on pretty much any platform; it's been seen on at least TGL, DG1, DG2, ADL-S, ADL-P, and MTL so far. On some platforms it is very hard to reproduce without constantly loading/unloading the driver but on other platforms it shows up on nearly every driver load.

It seems to be some kind of subtle race condition, caching problem, etc. because seemingly unrelated changes to the code (e.g., changing the size of GuC regset allocations, loading vs not loading the HuC, etc.) can have a significant impact on the ease of reproduction for a platform. Adding an artificial 500ms sleep at the end of the 'for_each_hw_engine' loop in xe_gt_record_default_lrcs also seems to make the failures go away.

When reproduced, it usually happens on one of the later engines we initialize. For example, on the ADL-P I was using today I was seeing ~50% clean driver load, ~45% failure while submitting the LRC workarounds for VECS0 (the last engine in the platform's engine list), and ~5% failure while submitting the LRC workarounds for VCS2 (the second to last engine).
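
To get percentages like these on another machine, a load/unload loop can tally the outcomes. A rough Python sketch, assuming the module is loaded with `modprobe xe` and that a failed load shows up as a non-zero exit status. That is an assumption: a hard hang, or a crash that leaves the device dead, would not be caught this way, and the outcome labels are made up for this example.

```python
import subprocess
from collections import Counter

def load_unload_once():
    """One load/unload cycle; needs root. Returns an outcome label."""
    load = subprocess.run(["modprobe", "xe"])
    if load.returncode != 0:
        return "load-failed"
    unload = subprocess.run(["modprobe", "-r", "xe"])
    return "clean" if unload.returncode == 0 else "unload-failed"

def tally(cycle, iterations):
    """Run `cycle` repeatedly and return {outcome: percentage}."""
    counts = Counter(cycle() for _ in range(iterations))
    return {k: 100.0 * v / iterations for k, v in counts.items()}
```

Running `tally(load_unload_once, 100)` as root would give a per-outcome breakdown; telling apart which engine failed would still require reading dmesg.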

unassigned @demarchi

@mattrope are you working on fixing this? Otherwise we need to find someone else...

It would be pretty bad if someone in management tries the Xe kmd and gets driver load failures 50% of the time.

@zehortigoza, I spent a couple days looking at it last week, but still haven't figured out what's causing it. I'll keep looking at it when I have time, but it would be good if someone more familiar with Xe's submission model took a look as well.

Looking into this now, have a series with a bit of extra debug that would get merged: https://patchwork.freedesktop.org/series/115744/

So I have quite a few data points:

On TGL (DUT025) with GuC firmware 70.5.2 with CONFIG_DRM_XE_LARGE_GUC_BUFFER clear it boots fine
On TGL (DUT025) with GuC firmware 70.5.2 with CONFIG_DRM_XE_LARGE_GUC_BUFFER set the boot hangs
On TGL (DUT025) with GuC firmware 70.6.2 it always boots
On ADL in CI with CONFIG_DRM_XE_LARGE_GUC_BUFFER hacked out it boots: https://patchwork.freedesktop.org/series/116004/

I'm unsure what version the GuC firmware in CI is, but I'm thinking we update the GuC firmware to 70.6.2 in the kernel and in CI and see if this issue resolves itself.
