Changes in DRM scheduler introduced by commit 2fe205008e9b70c67a9f3502831074ff36b00093 cause system lock-up
Hardware: MSI Alpha 15 laptop (Ryzen 7 5800H, Radeon RX660M)
The changes introduced by commit 2fe205008e9b70c67a9f3502831074ff36b00093 lead to a system lockup, and possibly a kernel panic, after 15-45 minutes of using Firefox. The system hangs with the Caps Lock LED flashing and leaves no trace in the logs. Monitoring dmesg -w over ssh from another machine shows no error either, and a serial console is not available.
The easiest way to trigger the error is Civilization VI, where one or two turns are usually enough. The error is still present in linux-next-20220930. The following patch appears to be sufficient to fix it, at least so far:
diff --git b/drivers/gpu/drm/scheduler/sched_main.c a/drivers/gpu/drm/scheduler/sched_main.c
index 4f2395d1a791..e5a4ecde0063 100644
--- b/drivers/gpu/drm/scheduler/sched_main.c
+++ a/drivers/gpu/drm/scheduler/sched_main.c
@@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
 
-	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
+	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
@@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
 
 		if (next) {
 			next->s_fence->scheduled.timestamp =
-				job->s_fence->parent->timestamp;
+				job->s_fence->finished.timestamp;
 			/* start TO timer for next job */
 			drm_sched_start_timeout(sched);
 		}
This was also the cause of the instability I mentioned in issue #2170 (closed).