- Mar 21, 2025
-
-
Jie1zhang authored
- Modify the VM invalidation engine allocation logic to handle SDMA page rings. SDMA page rings now share the VM invalidation engine with SDMA gfx rings instead of allocating a separate engine. This change ensures efficient resource management and avoids the issue of insufficient VM invalidation engines. - Add synchronization for GPU TLB flush operations in gmc_v9_0.c. Use spin_lock and spin_unlock to ensure thread safety and prevent race conditions during TLB flush operations. This improves the stability and reliability of the driver, especially in multi-threaded environments. v2: replace the sdma ring check with a function `amdgpu_sdma_is_page_queue` to check if a ring is an SDMA page queue.(Lijo) v3: Add GC version check, only enabled on GC9.4.3/9.4.4/9.5.0 v4: Fix code style and add more detailed description (Christian) v5: Remove dependency on vm_inv_eng loop order, explicitly lookup shared inv_eng(Christian/Lijo) v6: Added search shared ring function amdgpu_sdma_get_shared_ring (Lijo) Suggested-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Jesse Zhang <jesse.zhang@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Alex Deucher authored
The gfx and page queues are per instance, so track them per instance. v2: drop extra parameter (Lijo) Fixes: fdbfaaaa ("drm/amdgpu: Improve SDMA reset logic with guilty queue tracking") Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Alex Deucher authored
Move the kfd suspend/resume code into the caller. That is where the KFD is likely to detect a reset so on the KFD side there is no need to call them. Also add a mutex to lock the actual reset sequence. v2: make the locking per instance Fixes: bac38ca8 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+") Reviewed-by:
Jesse Zhang <jesse.zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 25, 2025
-
-
Jie1zhang authored
This patch includes the remaining improvements to the SDMA reset logic: - Added `gfx_guilty` and `page_guilty` flags to track guilty queues. - Updated the reset and resume functions to handle the guilty state. - Cached the `rptr` before reset. v2: 1.replace the caller with a guilty bool. If the queue is the guilty one, set the rptr and wptr to the saved wptr value, else, set the rptr and wptr to the saved rptr value. (Alex) 2. cache the rptr before the reset. (Alex) v3: Keeping intermediate variables like u64 rwptr simplifies resotre rptr/wptr.(Lijo) Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Suggested-by:
Jiadong Zhu <Jiadong.Zhu@amd.com> Signed-off-by:
Jesse Zhang <jesse.zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Jie1zhang authored
- Modify the `amdgpu_sdma_reset_engine` function to accept a `suspend_user_queues` parameter. - This parameter allows the function to conditionally suspend and resume user queues during SDMA resets. - Ensure that user queues are suspended only when necessary to avoid unnecessary overhead and potential deadlocks. - Restart the scheduler's work queue for the GFX and page rings after the reset to allow new tasks to be submitted. This change improves synchronization between the KGD and the KFD during SDMA resets, ensuring proper handling of user queues and avoiding race conditions. V2: replace the ring_lock with the existed the scheduler locks for the queues (ring->sched) on the sdma engine.(Alex) v3: call drm_sched_wqueue_stop() rather than job_list_lock. If a GPU ring reset was already initiated for one ring at amdgpu_job_timedout, skip resetting that ring and call drm_sched_wqueue_stop() for the other rings (Alex) replace the common lock (sdma_reset_lock) with DQM lock to to resolve reset races between the two driver sections during KFD eviction.(Jon) Rename the caller to Reset_src and Change AMDGPU_RESET_SRC_SDMA_KGD/KFD to AMDGPU_RESET_SRC_SDMA_HWS/RING (Jon) v4: restart the wqueue if the reset was successful, or fall back to a full adapter reset. (Alex) move definition of reset source to enumeration AMDGPU_RESET_SRCS, and check reset src in amdgpu_sdma_reset_instance (Jon) v5: Call amdgpu_amdkfd_suspend/resume at the start/end of reset function respectively under !SRC_HWS conditions only (Jon) v6: replace the paramter src with a bool suspend_user_queues, remove the paramter src in pre/post func. (Jon) Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Suggested-by:
Jiadong Zhu <Jiadong.Zhu@amd.com> Suggested-by:
Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by:
Jesse Zhang <jesse.zhang@amd.com> Acked-by:
Jonathan Kim <jonathan.kim@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Jie1zhang authored
This patch introduces shared SDMA reset functionality between AMDGPU and KFD. The implementation includes the following key changes: 1. Added `amdgpu_sdma_reset_queue`: - Resets a specific SDMA queue by instance ID. - Invokes registered pre-reset and post-reset callbacks to allow KFD and AMDGPU to save/restore their state during the reset process. 2. Added `amdgpu_set_on_reset_callbacks`: - Allows KFD and AMDGPU to register callback functions for pre-reset and post-reset operations. - Callbacks are stored in a global linked list and invoked in the correct order during SDMA reset. This patch ensures that both AMDGPU and KFD can handle SDMA reset events gracefully, with proper state saving and restoration. It also provides a flexible callback mechanism for future extensions. v2: fix CamelCase and put the SDMA helper into amdgpu_sdma.c (Alex) v3: rename the `amdgpu_register_on_reset_callbacks` function to `amdgpu_sdma_register_on_reset_callbacks` move global reset_callback_list to struct amdgpu_sdma (Alex) v4: Update the reset callback function description and rename the reset function to amdgpu_sdma_reset_engine (Alex) Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Suggested-by:
Jiadong Zhu <Jiadong.Zhu@amd.com> Signed-off-by:
Jesse Zhang <jesse.zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 24, 2025
-
-
lijo lazar authored
Context empty interrupt is enabled for SDMA 4.4.2. Add a handler for context empty interrupt so that it is disposed of fast, and not propagated to KFD layer. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 08, 2024
-
-
Jie1zhang authored
Add the sysfs interface for sdma: sdma_reset_mask The interface is read-only and show the resets supported by the IP. For example, full adapter reset (mode1/mode2/BACO/etc), soft reset, queue reset, and pipe reset. V2: the sysfs node returns a text string instead of some flags (Christian) v3: add a generic helper which takes the ring as parameter and print the strings in the order they are applied (Christian) check amdgpu_gpu_recovery before creating sysfs file itself, and initialize supported_reset_types in IP version files (Lijo) Signed-off-by:
Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tim Huang <tim.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 04, 2024
-
-
Jie1zhang authored
Userspace wants to run jobs on a specific sdma ring for verification purposes. This debugfs entry helps to disable or enable submitting jobs to a specific ring. This entry is populated only if there are at least two or more cores in the sdma ip. Signed-off-by:
Jesse Zhang <jesse.zhang@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tim Huang <tim.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jul 23, 2024
-
-
Sunil khatri authored
Add ip dump for sdma_v5_2 for devcoredump for all instances of sdma. Signed-off-by:
Sunil Khatri <sunil.khatri@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Apr 30, 2024
-
-
Likun Gao authored
Add new members in sdma instance structure for sdma v7_0 firmware. Signed-off-by:
Likun Gao <Likun.Gao@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Apr 26, 2024
-
-
frank.min authored
Replace tmz flag into buffer flag to make it easier to understand and extend Signed-off-by:
Likun Gao <Likun.Gao@amd.com> Signed-off-by:
Frank Min <Frank.Min@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Oct 26, 2023
-
-
Alex Deucher authored
Rather than doing this in the IP code for the SDMA paging engine, move it up to the core device level init level. This should fix the scheduler init ordering. v2: drop extra parens v3: drop SDMA helpers v4: Added a Fixes tag because amdgpu dereferences an uninitialized scheduler without this patch, and this patch fixes this. (Luben) Tested-by:
Luben Tuikov <luben.tuikov@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20231025171928.3318505-1-alexander.deucher@amd.com Acked-by:
Christian König <christian.koenig@amd.com> Fixes: 56e44960 ("drm/sched: Convert the GPU scheduler to variable number of run-queues") Signed-off-by:
Luben Tuikov <ltuikov89@gmail.com>
-
- Jun 09, 2023
-
-
Hawking Zhang authored
Add query_ras_error_count callback for sdma v4_4_2. It will be used to query and log sdma uncorrectable error count and memory block. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Add a mask of SDMA instances available for use. On certain ASIC configs, not all SDMA instances are available for software use. v2: Change sdma mask type to uint32_t (Le) Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Le Ma <Le.Ma@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lema1 authored
Initialize SDMA instances on each AID. v2: revise coding fault in hw_fini Signed-off-by:
Le Ma <le.ma@amd.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lema1 authored
add some elements below: - num_aid - aid_id for each sdma instance - num_inst_per_aid for sdma and extend macro size below: - SDMA_MAX_INSTANCES to 16 - AMDGPU_MAX_RINGS to 96 - AMDGPU_MAX_HWIP_RINGS to 32 v2: move aid_id from amdgpu_ring to amdgpu_sdma_instance. (Lijo) Signed-off-by:
Le Ma <le.ma@amd.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 19, 2023
-
-
YiPeng Chai authored
Add sdma ras function on sdma v6_0_3. Signed-off-by:
YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 09, 2023
-
-
Mario Limonciello authored
Simplifies the code so that all SDMA versions will get the firmware name from `amdgpu_ucode_ip_version_decode`. v2: squash in fix from Srinivasan Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Oct 10, 2022
-
-
Alex Deucher authored
Switch all of the SDMA implementations to use the helper to tear down the ttm buffer manager. Tested-by:
Bokun Zhang <Bokun.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 29, 2022
-
-
Likun Gao authored
Add an common function to init SDMA related microcode. Signed-off-by:
Likun Gao <Likun.Gao@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Mar 02, 2022
-
-
yipechai authored
Remove redundant calls of amdgpu_ras_block_late_fini in sdma ras block. Signed-off-by:
yipechai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
yipechai authored
Modify .ras_fini function pointer parameter so that we can remove redundant intermediate calls in some ras blocks. Signed-off-by:
yipechai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 17, 2022
-
-
yipechai authored
Modify .ras_late_init function pointer parameter so that it can remove redundant intermediate calls in some ras blocks. Signed-off-by:
yipechai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 14, 2022
-
-
yipechai authored
1.Modify sdma block to fit for the unified ras block data and ops. 2.Change amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of sdma ras variable so that sdma ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register sdma ras block into amdgpu device ras block link list. 5.Remove the redundant code about sdma in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of sdma versions. If .ras_late_init and .ras_fini had been defined by the selected sdma version, the defined functions will take effect; if not defined, default fill them with amdgpu_sdma_ras_late_init and amdgpu_sdma_ras_fini. v2: squash in warning fix (Alex) Signed-off-by:
yipechai <YiPeng.Chai@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
John Clements <john.clements@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Mar 05, 2021
-
-
Feifei Xu authored
Add VM_HOLE/DOORBELL_INVALID_BE/POLL_TIMEOUT/SRBMWRITE interrupt info printing. Signed-off-by:
Feifei Xu <Feifei.Xu@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Apr 28, 2020
-
-
Aaron Liu authored
This patch expands sdma copy_buffer interface with tmz parameter. Signed-off-by:
Aaron Liu <aaron.liu@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Huang Rui <ray.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Apr 09, 2020
-
-
Nirmoy Das authored
Generate HW IP's sched_list in amdgpu_ring_init() instead of amdgpu_ctx.c. This makes amdgpu_ctx_init_compute_sched(), ring.has_high_prio and amdgpu_ctx_init_sched() unnecessary. This patch also stores sched_list for all HW IPs in one big array in struct amdgpu_device which makes amdgpu_ctx_init_entity() much more leaner. v2: fix a coding style issue do not use drm hw_ip const to populate amdgpu_ring_type enum v3: remove ctx reference and move sched array and num_sched to a struct use num_scheds to detect uninitialized scheduler list v4: use array_index_nospec for user space controlled variables fix possible checkpatch.pl warnings Signed-off-by:
Nirmoy Das <nirmoy.das@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Mar 05, 2020
-
-
Hawking Zhang authored
SDMA ras error counters are dirty ones after cold reboot Read operation is needed to reset them to 0 Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 14, 2020
-
-
Hawking Zhang authored
move ras_late_init and ras_fini to sdma_ras_funcs table Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
query_ras_error_count function will be invoked to query single bit error count detected in sdma ip block Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Dec 18, 2019
-
-
Nirmoy Das authored
This sched array can be passed on to entity creation routine instead of manually creating such sched array on every context creation. v2: squash in missing break fix Signed-off-by:
Nirmoy Das <nirmoy.das@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Oct 03, 2019
-
-
Tao Zhou authored
sdma_ras_fini can be shared among all generations of sdma Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
sdma ras ecc functions can be reused among all sdma generations Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 13, 2019
-
-
Hawking Zhang authored
amdgpu_sdma_ras_late_init is used to init sdma specfic ras debugfs/sysfs node and sdma specific interrupt handler. It can be shared among sdma generations Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jul 18, 2019
-
-
lema1 authored
This change is needed for Arcturus which has 8 sdma instances. The CG/PG part is not covered for now. Signed-off-by:
Le Ma <le.ma@amd.com> Acked-by:
Snow Zhang < <Snow.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jun 21, 2019
-
-
Jack Xiao authored
Allocate CSA for the given sdma ring. Acked-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Jack Xiao <Jack.Xiao@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Apr 03, 2019
-
-
Emily Deng authored
Fix the issue about TDR-2 will have "fallback timer expired on ring sdma1". It is because the wrong number of irq types setting. Signed-off-by:
Emily Deng <Emily.Deng@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Mar 19, 2019
-
-
xinhui pan authored
register IH, enable ras features on sdma. create sysfs debugfs file for sdma. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Signed-off-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Eric Huang <JinhuiEric.Huang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 05, 2018
-
-
Rex Zhu authored
Get the sdma index from ring v2: refine function name Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Flora Cui <flora.cui@amd.com> Signed-off-by:
Rex Zhu <Rex.Zhu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-