Mesa is reporting the following: if OA metrics are collected for an exec_queue, then after the OA stream is closed, future batch buffers submitted on that exec_queue do not complete. KMD sees this hang and resets the GPU. This is being seen on Xe2+ platforms. The hangs are not seen if the OA stream is not closed.
To illustrate what @zehortigoza's patch above does, it is equivalent to the patch below (which is also an earlier patch from @zehortigoza, who was investigating this issue):
The code above refers to bit 8 in the CTXT_SR_CTL register (Bspec: 60314). When the OA stream is opened, we write 1 to bit 8, and when the OA stream is closed, we write 0 to bit 8 (not directly to the register but to the context image). So it seems that if we skip writing 0 when the OA stream is closed, we don't see these hangs.
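To make the open/close behavior concrete, here is a minimal standalone model of it. This is only a sketch and not the actual driver code: `ctx_image_model`, `oa_stream_open`, `oa_stream_close`, `bit8_is_set` and the `skip_clear` flag are hypothetical names made up for illustration, and it assumes a plain read-modify-write of one dword, which may not match how the real context image update is done. The only thing taken from the discussion is that bit 8 is set on open and cleared on close.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; only the bit position (bit 8) comes from the discussion. */
#define CTXT_SR_CTL_BIT8 (UINT32_C(1) << 8)

/* Stand-in for the CTXT_SR_CTL dword kept in a context image (assumption). */
struct ctx_image_model {
	uint32_t ctxt_sr_ctl;
};

/* OA stream open: write 1 to bit 8 in the context image. */
static void oa_stream_open(struct ctx_image_model *img)
{
	img->ctxt_sr_ctl |= CTXT_SR_CTL_BIT8;
}

/*
 * OA stream close: current behavior writes 0 to bit 8.
 * With skip_clear set, the bit is left at 1, which models the
 * workaround being discussed (hypothetical flag, not a real parameter).
 */
static void oa_stream_close(struct ctx_image_model *img, bool skip_clear)
{
	if (skip_clear)
		return;
	img->ctxt_sr_ctl &= ~CTXT_SR_CTL_BIT8;
}

/* Helper to inspect the modeled bit. */
static bool bit8_is_set(const struct ctx_image_model *img)
{
	return (img->ctxt_sr_ctl & CTXT_SR_CTL_BIT8) != 0;
}
```

With `skip_clear` false the bit goes 1 -> 0 on close, which is the case where the hangs are reported. With `skip_clear` true the bit stays at 1 after close, which is the case where the hangs are not seen.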
The specification mentions enabling this bit in the golden image, so maybe it should never be moved from 1 -> 0, but I don't have any documentation backing that up.