drm/sched: add optional errno to drm_sched_start()
The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL. This
results in drm_sched_job_done being called with -ECANCELED for each job
with a NULL parent in the pending list, making it difficult to
distinguish between recovery methods, whether a queue reset or a full
GPU reset was used.

To improve this, we first try a soft recovery for timeout jobs and use
the error code -ENODATA. If soft recovery fails, we proceed with a queue
reset, where the error code remains -ENODATA for the job. Finally, for a
full GPU reset, we use error codes -ECANCELED or -ETIME.

This patch adds an error code parameter to drm_sched_start, allowing us
to differentiate between queue reset and GPU reset failures. This
enables user mode and test applications to validate the expected
correctness of the requested operation. After a successful queue reset,
the only way to continue normal operation is to call drm_sched_job_done
with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized
    amdgpu_device_lock_reset_domain and
    amdgpu_device_unlock_reset_domain to allow user mode to track the
    queue reset status and distinguish between queue reset and GPU
    reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error codes
    appropriately and dispose of bad jobs with the selected error code
    depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name,
    drm_sched_start_with_recovery_error, which more accurately
    describes the function's purpose. Additionally, it was recommended
    to add documentation details about the new method.
v5: Fixed declaration of new function
    drm_sched_start_with_recovery_error. (Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message, drop
    the new function again and update all callers, apply the errno also
    to scheduler fences with hw fences
v7 (chk): rebased

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
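
The effect of the optional errno on jobs with a NULL parent fence can be illustrated with a small user-space model. This is only a sketch, not the kernel code: struct model_job and model_sched_start are hypothetical names invented for illustration, and the sketch assumes that passing 0 keeps the previous -ECANCELED default. Jobs that still have a hw fence are deliberately left out of the model.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a job on the scheduler's pending list. */
struct model_job {
	int has_hw_fence; /* 0 models a NULL parent/hw fence */
	int result;       /* error code the job is disposed of with */
};

/*
 * Simplified model of restarting the scheduler after a reset.
 * err is the optional errno chosen by the caller; the assumption here
 * is that 0 means "no specific code" and falls back to -ECANCELED.
 */
static void model_sched_start(struct model_job *jobs, size_t n, int err)
{
	for (size_t i = 0; i < n; i++) {
		/* Only jobs without a hw fence are disposed of here. */
		if (!jobs[i].has_hw_fence)
			jobs[i].result = err ? err : -ECANCELED;
	}
}
```

Under these assumptions, a caller recovering through a queue reset would pass -ENODATA, a caller performing a full GPU reset would pass -ECANCELED or -ETIME, and a caller with no preference would pass 0.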
- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c: 1 addition, 1 deletion
- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: 2 additions, 2 deletions
- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c: 1 addition, 1 deletion
- drivers/gpu/drm/etnaviv/etnaviv_sched.c: 1 addition, 1 deletion
- drivers/gpu/drm/imagination/pvr_queue.c: 2 additions, 2 deletions
- drivers/gpu/drm/lima/lima_sched.c: 1 addition, 1 deletion
- drivers/gpu/drm/nouveau/nouveau_sched.c: 1 addition, 1 deletion
- drivers/gpu/drm/panfrost/panfrost_job.c: 1 addition, 1 deletion
- drivers/gpu/drm/panthor/panthor_mmu.c: 1 addition, 1 deletion
- drivers/gpu/drm/panthor/panthor_sched.c: 1 addition, 1 deletion
- drivers/gpu/drm/scheduler/sched_main.c: 4 additions, 3 deletions
- drivers/gpu/drm/v3d/v3d_sched.c: 1 addition, 1 deletion
- include/drm/gpu_scheduler.h: 1 addition, 1 deletion