A CI Bug Log filter associated with this bug has been updated by rveesamx.
Description: BMG LNL: igt@xe_drm_fdinfo@drm-busy-exec-queue-destroy-idle - fail - Test assertion failure function check_results, Failed assertion: 95.0 < percent
Equivalent query: runconfig_tag IS IN ["xe"] AND machine_tag IS IN ["LNL", "BMG"] AND ((testsuite_name = "IGT" AND test_name IS IN ["igt@xe_drm_fdinfo@utilization-single-full-load-destroy-queue", "igt@xe_drm_fdinfo@drm-busy-exec-queue-destroy-idle"])) AND ((testsuite_name = "IGT" AND status_name IS IN ["fail"])) AND stderr ~= 'Test assertion failure function check_results.*\n.*Failed assertion: 95.0 < percent'
Ravi V changed title from igt@xe_drm_fdinfo@drm-busy-exec-queue-destroy-idle - fail - Test assertion failure function check_results, Failed assertion: 95.0 < percent to igt@xe_drm_fdinfo@subtests - fail - Test assertion failure function check_results, Failed assertion: 95.0 < percent
The CI Bug Log issue associated with this bug has been updated by Vinay.
New filters associated
ADL_P BMG: igt@xe_drm_fdinfo@drm* - fail - Test assertion failure function check_results, Failed assertion: percent < 105.0
(No new failures associated)
@rveesam please fix the filter and keep this bug about the exec-queue-destroy failure only. Other possible bugs are not the same thing.
A CI Bug Log filter associated with this bug has been updated by Vinay.
Description:DG2 BMG LNL: igt@xe_drm_fdinfo@subtests - fail - Test assertion failure function check_results, Failed assertion: 95.0 < percent
Equivalent query: runconfig_tag IS IN ["xe"] AND machine_tag IS IN ["DG2", "BMG", "LNL"] AND ((testsuite_name = "IGT" AND test_name IS IN ["igt@xe_drm_fdinfo@utilization-others-idle", "igt@xe_drm_fdinfo@utilization-single-full-load-destroy-queue", "igt@xe_drm_fdinfo@utilization-single-full-load-isolation", "igt@xe_drm_fdinfo@utilization-others-full-load", "igt@xe_drm_fdinfo@drm-busy-exec-queue-destroy-idle"])) AND ((testsuite_name = "IGT" AND status_name IS IN ["fail"])) AND stderr ~= 'Test assertion failure function check_results.*\n.*Failed assertion: 95.0 < percent'
It looks like the exec queue destroy ioctl erases the exec queue from the xef's xarray. Later, when we try to dump the run ticks, the exec queue is no longer in the array, so only the GPU timestamp is updated. The correct run-ticks value is written at a later point, when the job is freed, but that is too late: IGT has already sampled the ticks.
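For illustration, here is a minimal sketch of the sampling path being described. The names are modeled loosely on the xe driver (show_run_ticks() in xe_drm_client.c and xe_exec_queue_update_run_ticks()), and the per-engine/class iteration is elided, so read it as pseudocode for the flow rather than the actual source:

static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
{
	struct xe_file *xef = file->driver_priv;
	struct xe_exec_queue *q;
	unsigned long i;
	u64 gpu_timestamp;

	/*
	 * Fold the pending run ticks of every *live* exec queue into
	 * xef->run_ticks[]. A queue that the destroy ioctl has already
	 * erased from the xarray is invisible here, so its final delta
	 * never makes it into the totals ...
	 */
	mutex_lock(&xef->exec_queue.lock);
	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);
	mutex_unlock(&xef->exec_queue.lock);

	/*
	 * ... while the GPU timestamp below still advances, so the
	 * utilization IGT computes (delta run ticks / delta GPU
	 * timestamp) drops below the expected 95%.
	 */
	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

	drm_printf(p, "drm-cycles-%s:\t%llu\n", class_name,
		   xef->run_ticks[class]);
	drm_printf(p, "drm-total-cycles-%s:\t%llu\n", class_name,
		   gpu_timestamp);
}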
I think we may need to add one more point in the kernel where we update the run ticks. Failing that, we should add a retry policy to this test in IGT. I would look into the former first.
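A hedged sketch of what such an extra update point might look like: fold the queue's final run ticks into the xef totals in the destroy ioctl itself, before the queue disappears from the xarray. Names again only loosely mirror the driver, and error paths are trimmed; this is the idea, not the actual fix:

int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
				struct drm_file *file)
{
	struct drm_xe_exec_queue_destroy *args = data;
	struct xe_file *xef = file->driver_priv;
	struct xe_exec_queue *q;

	mutex_lock(&xef->exec_queue.lock);
	q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
	mutex_unlock(&xef->exec_queue.lock);
	if (!q)
		return -ENOENT;

	/*
	 * Hypothetical extra update point: account the ticks accumulated
	 * so far into xef->run_ticks[] so that a later fdinfo read still
	 * sees them. The late update on job free would then have to skip
	 * what is already accounted here to avoid double counting.
	 */
	xe_exec_queue_update_run_ticks(q);

	xe_exec_queue_kill(q);
	xe_exec_queue_put(q);

	return 0;
}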
Additional notes: when this test does pass, it just means the job was freed as soon as it completed (or when the queue was destroyed) and the stats were updated on free. In that case, even though the fdinfo dump itself still only captures the GPU timestamp, the test works because the run-ticks update (into the xef object) and the GPU timestamp query happen close together and in the right order.
https://patchwork.freedesktop.org/series/140538/ should fix these issues and also reduce the number of times we need to update the timestamp. The latter should make another race much less probable: updating the delta on the xef is not protected by any lock, so the update from an fdinfo query could race with the one from the workqueue. Since the update now only happens when the exec queue is going away or when someone is querying it, I think it should be safe.
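To spell out that race (an illustration, not driver code): the per-class accumulation is a plain read-modify-write, so two unsynchronized updaters can lose a delta:

	/* fdinfo query (CPU A) */           /* job-free workqueue (CPU B) */
	old = xef->run_ticks[class];
	                                     old = xef->run_ticks[class];
	xef->run_ticks[class] = old + d1;
	                                     xef->run_ticks[class] = old + d2;  /* d1 is lost */

Restricting the update to queue teardown and to the query path does not add locking, but it shrinks the window in which two writers can overlap.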