Just to follow up here, it seems like this issue can happen on pretty much any platform; it's been seen on at least TGL, DG1, DG2, ADL-S, ADL-P, and MTL so far. On some platforms it is very hard to reproduce without constantly loading/unloading the driver but on other platforms it shows up on nearly every driver load.
It seems to be some kind of subtle race condition or caching problem, because seemingly unrelated changes to the code (e.g., changing the size of GuC regset allocations, loading vs. not loading the HuC, etc.) can have a significant impact on how easy it is to reproduce on a given platform. Adding an artificial 500 ms sleep at the end of the 'for_each_hw_engine' loop in xe_gt_record_default_lrcs also seems to make the failures go away.
When it does reproduce, it usually happens on one of the later engines we initialize. For example, on the ADL-P I was using today, I was seeing ~50% clean driver loads, ~45% failures while submitting the LRC workarounds for VECS0 (the last engine in the platform's engine list), and ~5% failures while submitting the LRC workarounds for VCS2 (the second-to-last engine).
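For anyone who wants to collect similar numbers on another platform, here is a rough sketch of a load/unload loop that tallies outcomes. It is only an illustration, not what was used for the figures above: the module name `xe`, the helper names (`classify`, `cycle_driver`, `tally`), and especially the substrings used to recognize a failure are assumptions, and the patterns need to be replaced with whatever the failing load actually prints in the kernel log on your machine.

```python
import subprocess
from collections import Counter
from typing import Callable


def classify(log_text: str, patterns: dict[str, str]) -> str:
    """Return the label of the first pattern found in the log, else 'clean'.

    `patterns` maps a label to a plain substring. The substrings are
    placeholders (hypothetical), not real driver messages.
    """
    for label, needle in patterns.items():
        if needle in log_text:
            return label
    return "clean"


def cycle_driver(module: str = "xe") -> str:
    """Load and unload the module once and return the kernel log it produced.

    Needs root. Assumes the module can be unloaded cleanly on this system.
    """
    subprocess.run(["dmesg", "--clear"], check=True)
    # A failed load is an expected outcome here, so do not raise on it.
    subprocess.run(["modprobe", module], check=False)
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    subprocess.run(["modprobe", "-r", module], check=False)
    return log


def tally(
    n: int, get_log: Callable[[], str], patterns: dict[str, str]
) -> Counter:
    """Run `get_log` n times and count how each run was classified."""
    return Counter(classify(get_log(), patterns) for _ in range(n))
```

Run as root, something like `tally(100, cycle_driver, {"vecs0": "<VECS0 failure text>", "vcs2": "<VCS2 failure text>"})` would give per-engine failure counts. Passing the log source in as a callable also makes it possible to feed in saved logs instead of reloading the driver.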
@zehortigoza, I spent a couple days looking at it last week, but still haven't figured out what's causing it. I'll keep looking at it when I have time, but it would be good if someone more familiar with Xe's submission model took a look as well.
I'm unsure what version of the GuC firmware is in CI, but I'm thinking we update the GuC firmware to 70.6.2 in the kernel and in CI and see if this issue resolves itself.