- Mar 21, 2025
-
-
Christian König authored
Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require multiple VMIDs to work correctly. So instead start to track all submissions to the relevant engines in a per partition data structure and use the dma_fences of the submissions to enforce isolation similar to what a VMID limit does. v2: use ~0l for jobs without isolation to distinguish them from kernel submissions, which use NULL for the owner. Add some warning when we are OOM. Signed-off-by:
Christian König <christian.koenig@amd.com> Acked-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 27, 2025
-
-
André Almeida authored
Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we have ring reset, create a debug option to disable them in amdgpu, forcing the driver to go with the full device reset path again when both options are combined. This option is useful for testing and debugging purposes when one wants to test the full reset from userspace. Signed-off-by:
André Almeida <andrealmeid@igalia.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 17, 2025
-
-
Candice Li authored
Enable GECC only when the default memory ECC mode or the module parameter amdgpu_ras_enable is activated. v2: Add kernel message to remind users to explicitly set amdgpu_ras_enable=1 before driver loading to enable GECC, and to set amdgpu_ras_enable=0 to disable GECC when GECC is currently enabled, if needed. Signed-off-by:
Candice Li <candice.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
Introduce utility functions designed to assist in populating CPER records. v2: call cper_init/fini in device_ip_init/fini. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 13, 2025
-
-
Alex Deucher authored
On big and small APUs we send KFD VRAM allocations to GTT since the carve out is either non-existent or relatively small. However, if someone sets the carve out size to be relatively large, we may end up using GTT rather than VRAM. No change of logic with this patch, but it allows the driver to determine which logic to use based on the carve out size in the future. Reviewed-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Use bios_release wrapper to release memory allocated for vbios image and reset the variables. v2: Use the same wrapper for clean up in sw_fini (Alex Deucher) Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Dec 18, 2024
-
-
Christian König authored
This partially reverts commit 194eb174. This commit introduced a new state variable into adev without even remotely worrying about CPU barriers. Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that. Signed-off-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Dec 10, 2024
-
-
Mario Limonciello authored
As part of the suspend sequence VRAM needs to be evicted on dGPUs. In order to make suspend/resume more reliable we moved this into the pmops prepare() callback so that the suspend sequence would fail but the system could remain operational under high memory usage suspend. Another class of issues exists, though, where due to memory fragmentation there isn't a large enough contiguous space and swap isn't accessible. Add support for a suspend/hibernate notification callback that could evict VRAM before tasks are frozen. This should allow paging out to swap if necessary. Link: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 Link: drm/amd#3476 Closes: drm/amd#2362 Closes: drm/amd#3781 Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Link: https://lore.kernel.org/r/20241128032656.2090059-2-superm1@kernel.org Signed-off-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 20, 2024
-
-
lijo lazar authored
When device needs to be reset before initialization, it's not required for all IPs to be initialized before a reset. In such cases, it needs to identify whether the IP/feature is initialized for the first time or whether it's reinitialized after a reset. Add RESET_RECOVERY init level to identify post reset reinitialization phase. This only provides a device level identification, IP/features may choose to track their state independently also. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Acked-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 08, 2024
-
-
Jie1zhang authored
Add two sysfs interfaces for gfx and compute: gfx_reset_mask and compute_reset_mask. These interfaces are read-only and show the resets supported by the IP, for example full adapter reset (mode1/mode2/BACO/etc), soft reset, queue reset, and pipe reset. v2: the sysfs node returns a text string instead of some flags (Christian) v3: add a generic helper which takes the ring as parameter and prints the strings in the order they are applied (Christian); check amdgpu_gpu_recovery before creating the sysfs file itself, and initialize supported_reset_types in IP version files (Lijo) v4: fix uninitialized variables (Tim) Signed-off-by:
Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tim Huang <tim.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
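
The "text string instead of some flags" behaviour described above can be illustrated with a small userspace sketch. Everything here is hypothetical: the `SKETCH_RESET_*` bits, the helper name, the order of the names, and the space separator are assumptions made for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits; the driver's real names and values may differ. */
enum {
    SKETCH_RESET_SOFT  = 1u << 0,
    SKETCH_RESET_QUEUE = 1u << 1,
    SKETCH_RESET_PIPE  = 1u << 2,
    SKETCH_RESET_FULL  = 1u << 3,
};

/*
 * Render a reset mask as space-separated names in a fixed order.
 * Returns the length of the resulting string; an empty mask gives "".
 */
static size_t sketch_show_reset_mask(unsigned int mask, char *buf, size_t len)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } names[] = {
        { SKETCH_RESET_SOFT,  "soft"  },
        { SKETCH_RESET_QUEUE, "queue" },
        { SKETCH_RESET_PIPE,  "pipe"  },
        { SKETCH_RESET_FULL,  "full"  },
    };
    size_t used = 0;
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        int n;

        if (!(mask & names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated: stop appending */
        used += (size_t)n;
    }
    return strlen(buf);
}
```

A reader of such a node then parses names rather than bit positions, which keeps the interface stable if the flag values change.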
-
- Nov 04, 2024
-
-
Alex Deucher authored
Make sure KFD gets a turn when serializing access to the GC IP. Currently non-KFD jobs can starve KFD if they submit often enough. This patch prevents that by stalling non-KFD if its time period has elapsed. v2: fix units v3: check enablement properly Acked-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Oct 28, 2024
-
-
prike Liang authored
To check the status of S3 suspend completion, use the PM core pm_suspend_global_flags bit(1) to detect S3 abort events. Therefore, clean up the AMDGPU driver's private flag suspend_complete. Signed-off-by:
Prike Liang <Prike.Liang@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Oct 22, 2024
-
-
Sunil khatri authored
Before making a function call to resume, validate the function pointer like we do in sw_init. Use the helper function amdgpu_ip_block_resume where the same checks and calls are repeated. Signed-off-by:
Sunil Khatri <sunil.khatri@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Sunil khatri authored
Before making a function call to suspend, validate the function pointer like we do in sw_init. Use the helper function amdgpu_ip_block_suspend where the same checks and calls are repeated. Signed-off-by:
Sunil Khatri <sunil.khatri@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
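
The pattern behind the two commits above, checking an optional callback before calling it and centralizing that check in one helper, might look roughly like the following userspace sketch. The `sketch_*` types and functions are hypothetical stand-ins, not the driver's structures, and treating a missing callback as success is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for an IP block and its callbacks. */
struct sketch_ip_funcs {
    int (*suspend)(void *handle);
    int (*resume)(void *handle);
};

struct sketch_ip_block {
    const struct sketch_ip_funcs *funcs;
    void *handle;
};

/* Call suspend only if the block provides it; a missing callback is a no-op. */
static int sketch_ip_block_suspend(struct sketch_ip_block *blk)
{
    assert(blk != NULL && blk->funcs != NULL);
    if (!blk->funcs->suspend)
        return 0;
    return blk->funcs->suspend(blk->handle);
}

/* Same check for resume, so callers never dereference a NULL pointer. */
static int sketch_ip_block_resume(struct sketch_ip_block *blk)
{
    assert(blk != NULL && blk->funcs != NULL);
    if (!blk->funcs->resume)
        return 0;
    return blk->funcs->resume(blk->handle);
}

/* Example callback for experimentation: counts how often it was invoked. */
static int sketch_counting_cb(void *handle)
{
    ++*(int *)handle;
    return 0;
}
```

With the check in one place, each call site reduces to a single helper call instead of repeating the NULL test.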
-
- Sep 26, 2024
-
-
lijo lazar authored
Drop delayed reset work handler as it is no longer used. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Dr. David Alan Gilbert authored
amdgpu_atpx_dgpu_req_power_for_displays has been unused since commit bdb1ccb0 ("drm/amdgpu: remove ATPX_DGPU_REQ_POWER_FOR_DISPLAYS check when hotplug-in") amdgpu_atpx_get_dhandle has been unused since commit f9b7f370 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") Remove them. Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Dr. David Alan Gilbert authored
amdgpu_device_ip_is_idle is unused. It was renamed from 'amdgpu_is_idle' which was originally added in commit 5dbbb60b ("drm/amdgpu: add IP helpers for wait_for_idle and is_idle") but hasn't been used. Remove it. Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
In some cases, device needs to be reset before first use. Add handlers for doing device reset during driver init sequence. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Feifei Xu <feifxu@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Sunil khatri authored
To handle the amdgpu_device reference for different GPUs we add its reference in each IP block, which can be used to differentiate between different GPU devices. Signed-off-by:
Sunil Khatri <sunil.khatri@amd.com> Suggested-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Move the reinitialization part after a reset to another function. No functional changes. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Add init levels to define the level to which device needs to be initialized. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Asad Kamal authored
Add helper function to check if ip block is enabled Signed-off-by:
Asad Kamal <asad.kamal@amd.com> Reviewed-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 18, 2024
-
-
Christian König authored
This was only used as a workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_state_machine() started to do the same. Compute never used shadows either, so the only problematic case left is SVM and that is most likely not recoverable in any way when VRAM is lost. Signed-off-by:
Christian König <christian.koenig@amd.com> Acked-by:
Lijo Lazar <lijo.lazar@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 06, 2024
-
-
Ramesh Errabolu authored
Enables users to update SVM's default granularity, used in buffer migration and handling of recoverable page faults. Param value is set in terms of log2(numPages(buffer)), e.g. 9 for a 2 MiB buffer. Signed-off-by:
Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-by:
Philip Yang <Philip.Yang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
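
As a worked check of the "9 for a 2 MiB buffer" example: with 4 KiB pages, 2^9 pages is 512 pages, and 512 × 4 KiB = 2 MiB. The sketch below does that arithmetic; the helper name and the 4 KiB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, which is what makes the value 9 map to 2 MiB. */
#define SKETCH_PAGE_SHIFT 12u

/* Hypothetical helper: buffer size in bytes for a given granularity value. */
static uint64_t sketch_svm_granularity_to_bytes(unsigned int granularity)
{
    /* Keep the shift well-defined for a 64-bit result. */
    assert(granularity + SKETCH_PAGE_SHIFT < 64u);
    return (uint64_t)1 << (granularity + SKETCH_PAGE_SHIFT);
}
```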
-
- Aug 29, 2024
-
-
Alex Deucher authored
Add this flag to enable experimental resets for testing before they are fully validated. Reviewed-and-tested-by:
Jiadong Zhu <Jiadong.Zhu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Aug 21, 2024
-
-
SRINIVASAN SHANMUGAM authored
This commit introduces the Enforce Isolation Handler designed to enforce shader isolation on AMD GPUs, which helps to prevent data leakage between different processes. The handler counts the number of emitted fences for each GFX and compute ring. If there are any fences, it schedules the `enforce_isolation_work` to be run after a delay of `GFX_SLICE_PERIOD`. If there are no fences, it signals the Kernel Fusion Driver (KFD) to resume the runqueue. The function is synchronized using the `enforce_isolation_mutex`. This commit also introduces a reference count mechanism (kfd_sch_req_count) to keep track of the number of requests to enable the KFD scheduler. When a request to enable the KFD scheduler is made, the reference count is decremented. When the reference count reaches zero, a delayed work is scheduled to enforce isolation after a delay of GFX_SLICE_PERIOD. When a request to disable the KFD scheduler is made, the function first checks if the reference count is zero. If it is, it cancels the delayed work for enforcing isolation and checks if the KFD scheduler is active. If the KFD scheduler is active, it sends a request to stop the KFD scheduler and sets the KFD scheduler state to inactive. Then, it increments the reference count. The function is synchronized using the kfd_sch_mutex to ensure that the KFD scheduler state and reference count are updated atomically. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by:
Christian König <christian.koenig@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com>
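
The reference-count flow described in the commit message above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the struct and function names are hypothetical, locking is omitted (the commit uses a mutex), and the delayed work is reduced to a `work_pending` flag rather than a real timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the scheduler state tracked per device. */
struct sketch_isolation {
    unsigned int req_count; /* outstanding "disable scheduler" requests */
    bool sched_active;      /* is the KFD scheduler currently running? */
    bool work_pending;      /* stands in for the delayed work item */
};

/* Request to disable the scheduler. */
static void sketch_sched_disable(struct sketch_isolation *s)
{
    if (s->req_count == 0) {
        s->work_pending = false; /* cancel the delayed work */
        if (s->sched_active)
            s->sched_active = false; /* stop the scheduler */
    }
    s->req_count++;
}

/* Request to enable the scheduler: the last one out arms the delayed work. */
static void sketch_sched_enable(struct sketch_isolation *s)
{
    assert(s->req_count > 0);
    if (--s->req_count == 0)
        s->work_pending = true;
}

/* Delayed work body: resume only when no fences are outstanding. */
static void sketch_work_fired(struct sketch_isolation *s, unsigned int fences)
{
    if (!s->work_pending)
        return;
    if (fences)
        return; /* still busy: the work stays pending, i.e. is rescheduled */
    s->work_pending = false;
    s->sched_active = true;
}
```

In this model the scheduler stays stopped while any disable request is outstanding, and it resumes only after the delayed work finds no emitted fences.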
-
SRINIVASAN SHANMUGAM authored
This commit adds a new sysfs attribute 'enforce_isolation' to control the 'enforce_isolation' setting per GPU. The attribute can be read and written, and accepts values 0 (disabled) and 1 (enabled). When 'enforce_isolation' is enabled, reserved VMIDs are allocated for each ring. When it's disabled, the reserved VMIDs are freed. The set function locks a mutex before changing the 'enforce_isolation' flag and the VMIDs, and unlocks it afterwards. This ensures that these operations are atomic and prevents race conditions and other concurrency issues. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Aug 16, 2024
-
-
SRINIVASAN SHANMUGAM authored
This commit makes the enforce_isolation setting per GPU and per partition by adding the enforce_isolation array to the adev structure. The adev variable is set based on the global enforce_isolation module parameter during device initialization. In amdgpu_ids.c, the adev->enforce_isolation value for the current GPU and partition is used to determine whether to enforce isolation between graphics and compute processes on that GPU and partition. This allows the enforce_isolation setting to be controlled individually for each GPU and each partition, which is useful in a system with multiple GPUs and partitions where different isolation settings might be desired for different GPUs and partitions. v2: fix loop in amdgpu_vmid_mgr_init() (Alex) Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Suggested-by:
Christian König <christian.koenig@amd.com>
-
Zhang Zekun authored
amdgpu_gart_table_vram_pin() and amdgpu_gart_table_vram_unpin() have been removed since commit 575e55ee ("drm/amdgpu: recover gart table at resume"), but their declarations were left untouched in the header files. Besides, amdgpu_dm_display_resume() has also been removed since commit a80aa93d ("drm/amd/display: Unify dm resume sequence into a single call"). So, let's remove these unused declarations. Signed-off-by:
Zhang Zekun <zhangzekun11@huawei.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Aug 13, 2024
-
-
Victor Skvortsov authored
KIQ timeouts no longer seen. This reverts commit 3a19a8af. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by:
Zhigang Luo <zhigang.luo@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Thomas Zimmermann authored
Remove the implementation of struct drm_driver.lastclose. The hook was only necessary before in-kernel DRM clients existed, but is now obsolete. The code in amdgpu_driver_lastclose_kms() is performed by drm_lastclose(). v2: - update commit message Signed-off-by:
Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by:
Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240812083000.337744-3-tzimmermann@suse.de
-
- Aug 06, 2024
-
-
Sunil khatri authored
The debugfs register list for dump was removed because it had some issues related to the proper power state of the IP before register read. Since that list is removed, we no longer want it to be dumped as part of the devcoredump, so remove that as well. Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Sunil Khatri <sunil.khatri@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jun 27, 2024
-
-
Alex Deucher authored
Add new config option and set proper dependencies for ISP. v2: add missed guards, drop separate Kconfig Reviewed-by:
Pratap Nirujogi <pratap.nirujogi@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Cc: Pratap Nirujogi <pratap.nirujogi@amd.com>
-
Pratap Nirujogi authored
Add the ISP driver in amdgpu to support the ISP device on the APUs that support the ISP IP block. The ISP hw block is used for camera front-end, pre- and post-processing operations. Reviewed-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Pratap Nirujogi <pratap.nirujogi@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Pratap Nirujogi authored
ISP hw block is supported in some of the AMD GPU versions, add support to discover ISP IP in amdgpu_discovery. v2: squash in documentation update (Alex) Reviewed-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Pratap Nirujogi <pratap.nirujogi@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Kenneth Feng authored
This reverts commit d3620eea. Revert this due to a final solution: commit ed3165d6 ("drm/amdgpu/jpeg5: reprogram doorbell setting after power up for each playback") Signed-off-by:
Kenneth Feng <kenneth.feng@amd.com> Reviewed-by:
Sonny Jiang <sonjiang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jun 19, 2024
-
-
Christian König authored
We need to ensure that even when using a reserved VMID that the gang members can still run in parallel. Signed-off-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jun 14, 2024
-
-
Mario Limonciello authored
Currently, amdgpu will always set up the brightness at 100% when it loads. However this is jarring when the BIOS has it previously programmed to a much lower value. The ACPI ATIF method includes two members for "ac_level" and "dc_level". These represent the default values that should be used if the system is brought up in AC and DC respectively. Use these values to set up the default brightness when the backlight device is registered. v2: squash in ACPI fix Reviewed-by:
Leo Li <sunpeng.li@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Mario Limonciello <mario.limonciello@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- May 20, 2024
-
-
Victor Skvortsov authored
Runtime KIQ interface to read/write registers in VF may take longer than expected for BM environment. Extend the timeout. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by:
Zhigang Luo <zhigang.luo@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- May 17, 2024
-
-
Jiapeng Chong authored
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once. Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063 Signed-off-by:
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-