  1. Mar 21, 2025
  2. Feb 25, 2025
    • drm/amdgpu: Improve SDMA reset logic with guilty queue tracking · fdbfaaaa
      Jie1zhang authored
      
      This patch includes the remaining improvements to the SDMA reset logic:
      - Added `gfx_guilty` and `page_guilty` flags to track guilty queues.
      - Updated the reset and resume functions to handle the guilty state.
      - Cached the `rptr` before reset.
      
      v2:
         1. Replace the caller with a guilty bool.
            If the queue is the guilty one, set the rptr and wptr to the saved wptr value;
            otherwise, set the rptr and wptr to the saved rptr value. (Alex)
         2. Cache the rptr before the reset. (Alex)
      
      v3: Keeping intermediate variables like u64 rwptr simplifies restoring rptr/wptr. (Lijo)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
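
The pointer handling described in v2 and v3 can be sketched in plain C as follows. This is a minimal illustration, not the driver code: `struct sketch_ring` and both functions are hypothetical stand-ins, and caching the wptr alongside the rptr is an assumption drawn from the "saved wptr value" wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ring state; not the kernel's struct amdgpu_ring. */
struct sketch_ring {
    uint64_t rptr;        /* live read pointer */
    uint64_t wptr;        /* live write pointer */
    uint64_t cached_rptr; /* read pointer saved before the reset */
    uint64_t cached_wptr; /* write pointer saved before the reset (assumed) */
    bool guilty;          /* true if this queue is the guilty one */
};

/* Save the pointers before the reset clobbers them. */
void sketch_ring_cache_ptrs(struct sketch_ring *ring)
{
    assert(ring != NULL);
    ring->cached_rptr = ring->rptr;
    ring->cached_wptr = ring->wptr;
}

/*
 * Restore the pointers after the reset. A guilty queue resumes with both
 * pointers at the saved wptr; any other queue resumes with both at the
 * saved rptr. The intermediate rwptr keeps the two assignments identical.
 */
uint64_t sketch_ring_restore_ptrs(struct sketch_ring *ring)
{
    uint64_t rwptr;

    assert(ring != NULL);
    rwptr = ring->guilty ? ring->cached_wptr : ring->cached_rptr;
    ring->rptr = rwptr;
    ring->wptr = rwptr;
    return rwptr;
}
```

A caller would invoke `sketch_ring_cache_ptrs()` before triggering the reset and `sketch_ring_restore_ptrs()` while resuming the queue.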
    • drm/amdgpu: Introduce conditional user queue suspension for SDMA resets · 4c02f730
      Jie1zhang authored
      
      - Modify the `amdgpu_sdma_reset_engine` function to accept a `suspend_user_queues` parameter.
      - This parameter allows the function to conditionally suspend and resume user queues during SDMA resets.
      - Ensure that user queues are suspended only when necessary to avoid unnecessary overhead and potential deadlocks.
      - Restart the scheduler's work queue for the GFX and page rings after the reset to allow new tasks to be submitted.
      
      This change improves synchronization between the KGD and the KFD during SDMA resets,
      ensuring proper handling of user queues and avoiding race conditions.
      
      V2: replace the ring_lock with the existing scheduler
          locks for the queues (ring->sched) on the sdma engine. (Alex)
      
      v3: call drm_sched_wqueue_stop() rather than job_list_lock.
          If a GPU ring reset was already initiated for one ring at amdgpu_job_timedout,
          skip resetting that ring and call drm_sched_wqueue_stop()
          for the other rings (Alex)
      
         replace the common lock (sdma_reset_lock) with the DQM lock
         to resolve reset races between the two driver sections during KFD eviction. (Jon)
      
         Rename the caller to Reset_src and
         Change AMDGPU_RESET_SRC_SDMA_KGD/KFD to AMDGPU_RESET_SRC_SDMA_HWS/RING (Jon)
      
      v4: restart the wqueue if the reset was successful,
          or fall back to a full adapter reset. (Alex)
      
         move definition of reset source to enumeration AMDGPU_RESET_SRCS, and
         check reset src in amdgpu_sdma_reset_instance (Jon)
      
      v5: Call amdgpu_amdkfd_suspend/resume at the start/end of reset function respectively under !SRC_HWS
          conditions only (Jon)
      
      v6: replace the parameter src with a bool suspend_user_queues,
          remove the parameter src in pre/post func. (Jon)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Suggested-by: Jonathan Kim <Jonathan.Kim@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Acked-by: Jonathan Kim <jonathan.kim@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
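
The control flow this commit describes (optional user queue suspension, a stopped scheduler work queue during the reset, and a restart only on success) might look roughly like the sketch below. Every name here is hypothetical; the struct fields merely simulate what the real scheduler and KFD suspend/resume calls would do, and the exact ordering of the steps is an assumption based on the v3 to v5 notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical engine model; the fields simulate driver side effects. */
struct sketch_engine {
    bool hw_reset_ok;        /* knob: whether the simulated reset succeeds */
    bool wqueue_running;     /* scheduler work queue state */
    int user_queue_suspends; /* how often user queues were suspended */
    int user_queue_resumes;  /* how often user queues were resumed */
    bool needs_full_reset;   /* set when falling back to an adapter reset */
};

int sketch_sdma_reset_engine(struct sketch_engine *e, bool suspend_user_queues)
{
    int ret;

    assert(e != NULL);

    /* Suspend user queues only when the caller asks for it. */
    if (suspend_user_queues)
        e->user_queue_suspends++;

    /* Stand-in for stopping the scheduler work queue of the rings. */
    e->wqueue_running = false;

    /* Stand-in for the actual engine reset. */
    ret = e->hw_reset_ok ? 0 : -1;

    /* Restart the work queue on success, else request a full reset. */
    if (ret == 0)
        e->wqueue_running = true;
    else
        e->needs_full_reset = true;

    /* Resume user queues at the end, mirroring the suspend above. */
    if (suspend_user_queues)
        e->user_queue_resumes++;

    return ret;
}
```

Passing `false` models a caller that already manages its own queues, so the function skips the suspend and resume steps entirely.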
    • drm/amdgpu/kfd: Add shared SDMA reset functionality with callback support · f3304495
      Jie1zhang authored
      
      This patch introduces shared SDMA reset functionality between AMDGPU and KFD.
      The implementation includes the following key changes:
      
      1. Added `amdgpu_sdma_reset_queue`:
         - Resets a specific SDMA queue by instance ID.
         - Invokes registered pre-reset and post-reset callbacks to allow KFD and AMDGPU
           to save/restore their state during the reset process.
      
      2. Added `amdgpu_set_on_reset_callbacks`:
         - Allows KFD and AMDGPU to register callback functions for pre-reset and
           post-reset operations.
         - Callbacks are stored in a global linked list and invoked in the correct order
           during SDMA reset.
      
      This patch ensures that both AMDGPU and KFD can handle SDMA reset events
      gracefully, with proper state saving and restoration. It also provides a flexible
      callback mechanism for future extensions.
      
      v2: fix CamelCase and put the SDMA helper into amdgpu_sdma.c (Alex)
      
      v3: rename the `amdgpu_register_on_reset_callbacks` function to
            `amdgpu_sdma_register_on_reset_callbacks`
          move global reset_callback_list to struct amdgpu_sdma (Alex)
      
      v4: Update the reset callback function description and
         rename the reset function to amdgpu_sdma_reset_engine (Alex)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
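
As a rough illustration of the callback mechanism described in this commit, the sketch below keeps a linked list of pre-reset and post-reset callbacks and runs them around a placeholder reset step. All types and function names are hypothetical, the list is singly linked for brevity, and the two tracing callbacks exist only to make the call order observable.

```c
#include <assert.h>
#include <stddef.h>

/* Records callback order: 'P' for pre-reset, 'Q' for post-reset. */
struct sketch_trace {
    char buf[16];
    size_t len;
};

/* Hypothetical callback entry; a client fills in the hooks it needs. */
struct sketch_reset_callback {
    int (*pre_reset)(void *ctx, unsigned int instance_id);
    int (*post_reset)(void *ctx, unsigned int instance_id);
    void *ctx;
    struct sketch_reset_callback *next;
};

/* Hypothetical per-device SDMA state that owns the callback list. */
struct sketch_sdma {
    struct sketch_reset_callback *head;
    struct sketch_reset_callback *tail;
};

/* Append a callback entry so that callbacks run in registration order. */
void sketch_register_on_reset_callbacks(struct sketch_sdma *sdma,
                                        struct sketch_reset_callback *cb)
{
    assert(sdma != NULL && cb != NULL);
    cb->next = NULL;
    if (sdma->tail)
        sdma->tail->next = cb;
    else
        sdma->head = cb;
    sdma->tail = cb;
}

/* Run all pre-reset hooks, the reset placeholder, then all post-reset hooks. */
int sketch_reset_with_callbacks(struct sketch_sdma *sdma, unsigned int instance_id)
{
    struct sketch_reset_callback *cb;
    int ret;

    assert(sdma != NULL);
    for (cb = sdma->head; cb; cb = cb->next) {
        if (cb->pre_reset) {
            ret = cb->pre_reset(cb->ctx, instance_id);
            if (ret)
                return ret; /* abort before touching the engine */
        }
    }

    /* The engine reset for instance_id would happen here. */

    for (cb = sdma->head; cb; cb = cb->next) {
        if (cb->post_reset) {
            ret = cb->post_reset(cb->ctx, instance_id);
            if (ret)
                return ret;
        }
    }
    return 0;
}

/* Example hooks that only record that they were called. */
int sketch_trace_pre(void *ctx, unsigned int instance_id)
{
    struct sketch_trace *t = ctx;

    (void)instance_id;
    if (t->len < sizeof(t->buf))
        t->buf[t->len++] = 'P';
    return 0;
}

int sketch_trace_post(void *ctx, unsigned int instance_id)
{
    struct sketch_trace *t = ctx;

    (void)instance_id;
    if (t->len < sizeof(t->buf))
        t->buf[t->len++] = 'Q';
    return 0;
}
```

With two registered clients, both pre-reset hooks run before the reset step and both post-reset hooks run after it, which is the save/restore bracketing the commit message describes.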
  3. Jan 24, 2025
  4. Nov 08, 2024
  5. Nov 04, 2024
  6. Jul 23, 2024
  7. Apr 30, 2024
  8. Apr 26, 2024
  9. Oct 26, 2023
  10. Jun 09, 2023
  11. Jan 19, 2023
  12. Jan 09, 2023
  13. Oct 10, 2022
  14. Sep 29, 2022
  15. Mar 02, 2022
  16. Feb 17, 2022
  17. Jan 14, 2022
    • drm/amdgpu: Modify sdma block to fit for the unified ras block data and ops · bdc4292b
      yipechai authored
      
      1. Modify the sdma block to fit the unified ras block data and ops.
      2. Rename amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and drop the _funcs suffix from the corresponding variable names.
      3. Remove the const qualifier from the sdma ras variable so that the sdma ras block can be inserted into the amdgpu device ras block linked list.
      4. Invoke amdgpu_ras_register_ras_block to register the sdma ras block in the amdgpu device ras block linked list.
      5. Remove the redundant sdma code in amdgpu_ras.c now that the unified ras block is used.
      6. Fill in the unified ras block fields .name, .block, .ras_late_init and .ras_fini for all sdma versions. If the selected sdma version defines .ras_late_init and .ras_fini, those functions take effect; otherwise they default to amdgpu_sdma_ras_late_init and amdgpu_sdma_ras_fini.
      
      v2: squash in warning fix (Alex)
      
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Reviewed-by: John Clements <john.clements@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
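
Points 3, 4 and 6 above can be illustrated with a small sketch: a block descriptor that is not const (so its link field can be written), a registration helper that inserts it into a device list, and default handlers filled in only where the block left them unset. The types and functions are hypothetical simplifications, not the amdgpu RAS structures, and filling the defaults inside the registration helper is a choice made here for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified RAS block descriptor. */
struct sketch_ras_block {
    const char *name;
    int (*ras_late_init)(void);
    void (*ras_fini)(void);
    struct sketch_ras_block *next; /* written on registration, so not const */
};

/* Hypothetical device that owns the list of registered blocks. */
struct sketch_ras_dev {
    struct sketch_ras_block *ras_list;
};

/* Defaults used when a block does not provide its own handlers. */
int sketch_default_late_init(void)
{
    return 0;
}

void sketch_default_fini(void)
{
}

/* Example of a version-specific handler that must not be overridden. */
int sketch_custom_late_init(void)
{
    return 1;
}

/* Fill in missing handlers, then link the block into the device list. */
void sketch_register_ras_block(struct sketch_ras_dev *dev,
                               struct sketch_ras_block *blk)
{
    assert(dev != NULL && blk != NULL);
    if (!blk->ras_late_init)
        blk->ras_late_init = sketch_default_late_init;
    if (!blk->ras_fini)
        blk->ras_fini = sketch_default_fini;
    blk->next = dev->ras_list;
    dev->ras_list = blk;
}
```

Because registration writes to `next` and possibly to the handler fields, a const descriptor could not be registered this way, which matches the reason given in point 3.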
  18. Mar 05, 2021
  19. Apr 28, 2020
  20. Apr 09, 2020
    • drm/amdgpu: rework sched_list generation · 1c6d567b
      Nirmoy Das authored
      
      Generate HW IP's sched_list in amdgpu_ring_init() instead of
      amdgpu_ctx.c. This makes amdgpu_ctx_init_compute_sched(),
      ring.has_high_prio and amdgpu_ctx_init_sched() unnecessary.
      This patch also stores sched_list for all HW IPs in one big
      array in struct amdgpu_device, which makes amdgpu_ctx_init_entity()
      much leaner.
      
      v2:
      fix a coding style issue
      do not use drm hw_ip const to populate amdgpu_ring_type enum
      
      v3:
      remove ctx reference and move sched array and num_sched to a struct
      use num_scheds to detect uninitialized scheduler list
      
      v4:
      use array_index_nospec for user space controlled variables
      fix possible checkpatch.pl warnings
      
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
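
The idea of building the per-IP scheduler lists at ring initialization time, and using `num_scheds` to detect an uninitialized list, can be sketched as below. The types, sizes and function names are hypothetical; in the kernel, a user-controlled index would additionally be clamped with `array_index_nospec()` after the bounds check, which this userspace sketch only notes in a comment.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HW_IP_NUM 3 /* hypothetical number of HW IP types */
#define SKETCH_MAX_RINGS 4 /* hypothetical per-IP ring limit */

/* Hypothetical scheduler handle. */
struct sketch_sched {
    int id;
};

/* Scheduler array and its fill level, grouped in one struct. */
struct sketch_sched_list {
    struct sketch_sched *sched[SKETCH_MAX_RINGS];
    unsigned int num_scheds;
};

/* Hypothetical device holding one list per HW IP in a single array. */
struct sketch_sched_dev {
    struct sketch_sched_list lists[SKETCH_HW_IP_NUM];
};

/* Called once per ring: record the ring's scheduler under its HW IP. */
int sketch_ring_init(struct sketch_sched_dev *dev, unsigned int hw_ip,
                     struct sketch_sched *sched)
{
    struct sketch_sched_list *list;

    assert(dev != NULL && sched != NULL);
    if (hw_ip >= SKETCH_HW_IP_NUM)
        return -1;
    list = &dev->lists[hw_ip];
    if (list->num_scheds >= SKETCH_MAX_RINGS)
        return -1;
    list->sched[list->num_scheds++] = sched;
    return 0;
}

/* Context setup side: look up the list for a possibly untrusted hw_ip. */
const struct sketch_sched_list *
sketch_get_sched_list(const struct sketch_sched_dev *dev, unsigned int hw_ip)
{
    assert(dev != NULL);
    if (hw_ip >= SKETCH_HW_IP_NUM)
        return NULL;
    /* Kernel code would apply array_index_nospec() to hw_ip here. */
    if (dev->lists[hw_ip].num_scheds == 0)
        return NULL; /* no ring of this type was initialized */
    return &dev->lists[hw_ip];
}
```

With the lists populated during ring setup, the context code reduces to a single lookup instead of rebuilding the arrays itself.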
  21. Mar 05, 2020
  22. Jan 14, 2020
  23. Dec 18, 2019
  24. Oct 03, 2019
  25. Sep 13, 2019
  26. Jul 18, 2019
  27. Jun 21, 2019
  28. Apr 03, 2019
  29. Mar 19, 2019
  30. Nov 05, 2018