- 23 Sep, 2021 4 commits
-
-
Philip Yang authored
The device manager releases device-specific resources when a driver disconnects from a device, so the devm_memunmap_pages and devm_release_mem_region calls in svm_migrate_fini are redundant. They cause the warning trace below after the patch "drm/amdgpu: Split amdgpu_device_fini into early and late", so remove the function svm_migrate_fini.
BUG: drm/amd#1718
WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795 devm_release_action+0x51/0x60
Call Trace:
  ? memunmap_pages+0x360/0x360
  svm_migrate_fini+0x2d/0x60 [amdgpu]
  kgd2kfd_device_exit+0x23/0xa0 [amdgpu]
  amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu]
  amdgpu_device_fini_sw+0x45/0x290 [amdgpu]
  amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
  drm_dev_release+0x20/0x40 [drm]
  release_nodes+0x196/0x1e0
  device_release_driver_internal+0x104/0x1d0
  driver_detach+0x47/0x90
  bus_remove_driver+0x7a/0xd0
  pci_unregister_driver+0x3d/0x90
  amdgpu_exit+0x11/0x20 [amdgpu]
Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Philip Yang authored
If SVM migration init fails to create the pgmap for device memory, set the pgmap type to 0 to disable the device SVM support capability. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Philip Yang authored
With XNACK off, the restore work DMA-unmaps the previous system memory page and DMA-maps the updated system memory page to update the GPU mapping. This is not a DMA mapping leak, so remove the WARN_ONCE for DMA mapping leaks. prange->dma_addr stores the VRAM page pfn after the range has migrated to VRAM, so VRAM pages must not be DMA-unmapped when updating the GPU mapping or removing the prange. Add the helper svm_is_valid_dma_mapping_addr to check for VRAM pages and error cases. Mask out the SVM_RANGE_VRAM_DOMAIN flag in dma_addr before calling the amdgpu VM update to avoid BUG_ON(*addr & 0xFFFF00000000003FULL), and set it again immediately afterwards. The flag is used later to identify the page type when DMA-unmapping system memory pages. Fixes: 1d5dbfe6 ("drm/amdkfd: classify and map mixed svm range pages in GPU") Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
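The mask-and-restore step described in this commit can be illustrated with a minimal user-space sketch. It is not the driver code: the flag position (bit 0), the helper names, and the stand-in for the VM update are assumptions made for illustration; only the reserved-bit mask comes from the BUG_ON quoted in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag position, for illustration only; the message names the flag
 * (SVM_RANGE_VRAM_DOMAIN) but not its value. */
#define VRAM_DOMAIN_FLAG   ((uint64_t)1 << 0)
/* Bits that must be clear in an address handed to the VM update,
 * taken from the BUG_ON quoted in the commit message. */
#define ADDR_RESERVED_MASK 0xFFFF00000000003FULL

/* True if the tagged entry refers to a VRAM page (so it must not be DMA-unmapped). */
static bool is_vram_entry(uint64_t tagged)
{
    return (tagged & VRAM_DOMAIN_FLAG) != 0;
}

/* Hypothetical stand-in for the VM update: it only validates the address bits. */
static bool vm_update_accepts(uint64_t addr)
{
    return (addr & ADDR_RESERVED_MASK) == 0;
}

/* Clear the tag before the update and restore it immediately afterwards. */
static bool update_mapping(uint64_t *entry)
{
    uint64_t flag = *entry & VRAM_DOMAIN_FLAG;
    bool ok;

    *entry &= ~VRAM_DOMAIN_FLAG;
    ok = vm_update_accepts(*entry);
    *entry |= flag;
    return ok;
}
```

Restoring the tag right after the update keeps the page type available for the later unmap path, which is the reason the message gives for setting the flag again.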
-
Philip Yang authored
An SVM range may include multiple VMAs with different vm_flags. If the prange page index is the last page of the VMA offset + npages, update the GPU mapping so that the GPU page table is created with the same access permission as the VMA. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 16 Sep, 2021 2 commits
-
-
James Zhu authored
Separate kfd_iommu_resume from kfd_resume for fine-tuning of amdgpu device init/resume/reset/recovery sequence. v2: squash in fix for !CONFIG_HSA_AMD Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211277 Signed-off-by:
James Zhu <James.Zhu@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
-
Felix Kuehling authored
On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The minimum firmware version that works without atomics can be updated in the device_info structure for each GPU type. Move PCIe atomic detection from kgd2kfd_probe into kgd2kfd_device_init because the MEC firmware is not loaded yet at the probe stage. Signed-off-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
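As a rough sketch of the check this commit describes, the decision can be written as a single predicate. The struct and field names below are hypothetical stand-ins rather than the driver's definitions, and the convention that a minimum version of 0 means "no firmware works without atomics" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-GPU-type info; field names are illustrative. */
struct device_info_sketch {
    bool needs_pci_atomics;        /* KFD normally requires PCIe atomics */
    uint32_t no_atomic_fw_version; /* minimum MEC firmware that works without
                                    * atomics; 0 = none (assumed convention) */
};

/* Decide at device-init time, once the MEC firmware version is known. */
static bool kfd_atomics_requirement_met(const struct device_info_sketch *info,
                                        bool have_pci_atomics,
                                        uint32_t mec_fw_version)
{
    if (!info->needs_pci_atomics || have_pci_atomics)
        return true;
    if (info->no_atomic_fw_version == 0)
        return false;
    return mec_fw_version >= info->no_atomic_fw_version;
}
```

The predicate needs the firmware version as an input, which is why the check cannot run at probe time when the firmware is not yet loaded.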
-
- 01 Sep, 2021 1 commit
-
-
Alex Sierra authored
In the SVM restore-pages interrupt handler, the kfd_process reference count was never dropped when XNACK was disabled, so the object was never released. Fixes: 2383f56b ("drm/amdkfd: page table restore through svm API") Signed-off-by:
Alex Sierra <alex.sierra@amd.com> Reviewed-by:
Philip Yang <philip.yang@amd.com> Reviewed-by:
Jonathan Kim <jonathan.kim@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
-
- 26 Aug, 2021 1 commit
-
-
Sean Keely authored
On systems with multiple SH per SE compute_static_thread_mgmt_se# is split into independent masks, one for each SH, in the upper and lower 16 bits. We need to detect this and apply cu masking to each SH. The cu mask bits are assigned first to each SE, then to alternate SHs, then finally to higher CU id. This ensures that the maximum number of SPIs are engaged as early as possible while balancing CU assignment to each SH. v2: Use max SH/SE rather than max SH in cu_per_sh. v3: Fix comment blocks, ensure se_mask is initially zero filled, and correctly assign se.sh.cu positions to unset bits in cu_mask. Signed-off-by:
Sean Keely <Sean.Keely@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
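The assignment order described in this commit (SE first, then alternating SHs, then higher CU id) can be sketched as three nested loops. This is a simplified illustration, not the driver function: it assumes every SH has the same number of CUs, at most two SHs per SE (one per 16-bit half), and at most MAX_SE engines; all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_SE 4 /* illustrative upper bound on shader engines */

/* Spread the first cu_mask_count bits of cu_mask over per-SE masks.
 * SH 0 uses the lower 16 bits of each SE mask, SH 1 the upper 16 bits. */
static void map_cu_mask(const uint32_t *cu_mask, unsigned cu_mask_count,
                        unsigned num_se, unsigned num_sh, unsigned cu_per_sh,
                        uint32_t se_mask[MAX_SE])
{
    unsigned i = 0;

    memset(se_mask, 0, MAX_SE * sizeof(se_mask[0]));

    for (unsigned cu = 0; cu < cu_per_sh && cu < 16; cu++) {
        for (unsigned sh = 0; sh < num_sh; sh++) {
            for (unsigned se = 0; se < num_se; se++) {
                if (i >= cu_mask_count)
                    return;
                if (cu_mask[i / 32] & (1u << (i % 32)))
                    se_mask[se] |= 1u << (cu + sh * 16);
                i++;
            }
        }
    }
}
```

Because the SE index varies fastest, consecutive mask bits land on different engines before any engine receives a second CU, which matches the stated goal of engaging as many SPIs as early as possible.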
-
- 24 Aug, 2021 2 commits
-
-
Philip Yang authored
When restoring a retry fault, prefetching a range, or restoring an SVM range after eviction, map the range to the GPU with the correct read or write access permission. A range may include multiple VMAs, so update the GPU page table per VMA, using the offset into the prange and the number of pages for each VMA, according to that VMA's access permission. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
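The per-VMA update described here amounts to splitting the range into chunks, each with an offset into the range, a page count, and the permission of the VMA it overlaps. The sketch below shows only that splitting step in user space; the types and the function are hypothetical, addresses are expressed in pages, and the VMAs are assumed to be sorted and non-overlapping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMA description: [start, end) in pages. */
struct vma_sketch {
    uint64_t start;
    uint64_t end;
    bool readonly;
};

/* One GPU page table update: offset into the range, page count, permission. */
struct map_chunk {
    uint64_t offset;
    uint64_t npages;
    bool readonly;
};

/* Split [start, start + npages) by VMA; returns the number of chunks written. */
static size_t split_range_by_vma(uint64_t start, uint64_t npages,
                                 const struct vma_sketch *vmas, size_t nvmas,
                                 struct map_chunk *out, size_t max_out)
{
    uint64_t end = start + npages;
    size_t n = 0;

    for (size_t i = 0; i < nvmas && n < max_out; i++) {
        uint64_t s = vmas[i].start > start ? vmas[i].start : start;
        uint64_t e = vmas[i].end < end ? vmas[i].end : end;

        if (s >= e)
            continue; /* this VMA does not overlap the range */

        out[n].offset = s - start;
        out[n].npages = e - s;
        out[n].readonly = vmas[i].readonly;
        n++;
    }
    return n;
}
```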
-
Philip Yang authored
Check the range access permission when restoring a GPU retry fault. If the GPU retry fault is on an address that belongs to a VMA, and the VMA lacks the read or write permission requested by the GPU, fail to restore the address. The VM fault event is then passed back to user space. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
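A minimal sketch of such a permission check follows. The flag values and the function name are hypothetical, and the rule that a write fault requires both read and write permission is an assumption of this sketch rather than something the message states.

```c
#include <stdbool.h>

/* Illustrative permission bits; not the kernel's vm_flags values. */
#define PERM_READ  0x1u
#define PERM_WRITE 0x2u

/* Return true if the VMA grants every permission the GPU fault requested. */
static bool fault_allowed(unsigned vma_perms, bool write_fault)
{
    unsigned requested = PERM_READ;

    if (write_fault)
        requested |= PERM_WRITE; /* assumption: a write fault also needs read */

    return (vma_perms & requested) == requested;
}
```

When the check fails, the fault is not restored and, as the message says, the VM fault event is passed back to user space.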
-
- 16 Aug, 2021 2 commits
-
-
zhang yifan authored
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN ] KFDSVMRangeTest.BasicSystemMemTest
[ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN ] KFDSVMRangeTest.SetGetAttributesTest
[ ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[ ] Setting/Getting atrributes
[ FAILED ]
The root cause is that the SVM work queue has not finished when svm_range_get_attr is called, so garbage data in the SVM interval tree makes svm_range_get_attr return wrong results. Flush the work queue before iterating over the SVM interval tree. Signed-off-by:
Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
zhang yifan authored
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN ] KFDSVMRangeTest.BasicSystemMemTest
[ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN ] KFDSVMRangeTest.SetGetAttributesTest
[ ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[ ] Setting/Getting atrributes
[ FAILED ]
The root cause is that the SVM work queue has not finished when svm_range_get_attr is called, so garbage data in the SVM interval tree makes svm_range_get_attr return wrong results. Flush the work queue before iterating over the SVM interval tree. Signed-off-by:
Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 11 Aug, 2021 2 commits
-
-
Mukul Joshi authored
This patch adds support for programming trap handler settings when the driver is loaded with the software scheduler (sched_policy=2). Signed-off-by:
Mukul Joshi <mukul.joshi@amd.com> Suggested-by:
Jay Cornwall <Jay.Cornwall@amd.com> Reviewed-by:
Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Philip Yang authored
With XNACK on, if a range is ACCESS or ACCESS_IN_PLACE (AIP) by a single GPU, or the range is ACCESS_IN_PLACE by multiple GPUs (mGPUs) that are all connected in the same XGMI hive, the best prefetch location is the prefetch_loc GPU. Otherwise, the best prefetch location is always the CPU, because a GPU does not have a coherent mapping of other GPUs' VRAM even with a large-BAR PCIe connection. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 06 Aug, 2021 2 commits
-
-
Felix Kuehling authored
Currently the SVM get_attr call allows querying which flags are set in the entire address range. Add the opposite query: which flags are clear in the entire address range. Both queries can be combined in a single get_attr call, which allows answering questions such as "is this address range coherent, non-coherent, or a mix of both?" Proposed userspace for UAPI: https://github.com/RadeonOpenCompute/ROCR-Runtime/tree/memory_model_queries Signed-off-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Philip Yang <philip.yang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
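One way to picture the combined query is as an AND and an OR over the flags of every sub-range: a flag is set everywhere if it survives the AND, and clear everywhere if it never appears in the OR. The sketch below is a user-space illustration under that reading; the struct, the function, and the flag value are hypothetical and not the UAPI definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit used only for this illustration. */
#define FLAG_COHERENT_SKETCH 0x2u

struct flag_query {
    uint32_t set_everywhere;   /* flags set in every sub-range */
    uint32_t clear_everywhere; /* flags clear in every sub-range */
};

/* Combine the flags of n sub-ranges that together cover the queried range. */
static struct flag_query query_flags(const uint32_t *range_flags, size_t n)
{
    uint32_t all_and = 0xFFFFFFFFu;
    uint32_t all_or = 0;
    struct flag_query q;

    for (size_t i = 0; i < n; i++) {
        all_and &= range_flags[i];
        all_or |= range_flags[i];
    }

    q.set_everywhere = n ? all_and : 0;
    q.clear_everywhere = ~all_or;
    return q;
}
```

A flag that appears in neither result is set in some sub-ranges and clear in others, which is the "mix of both" case from the message.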
-
Graham Sider authored
Add u32 gfx_target_version field to kfd_node_properties and kfd_device_info. Populate <asic>_device_info structs accordingly and expose to sysfs. This allows eliminating device-ID-based lookup tables in user mode for future ASICs. Signed-off-by:
Graham Sider <Graham.Sider@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 02 Aug, 2021 4 commits
-
-
Eric Huang authored
This works around a HW bug on other ASICs and is based on reverting two commits back:
  drm/amdkfd: Add heavy-weight TLB flush after unmapping
  drm/amdkfd: Add memory sync before TLB flush on unmap
Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 4bba567c. Revert reason: The issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 7ed9876c. Revert reason: The issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 430f8e6e. Revert reason: Issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 28 Jul, 2021 3 commits
-
-
Eric Huang authored
This reverts commit 4bba567c. Revert reason: The issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 7ed9876c. Revert reason: The issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 430f8e6e. Revert reason: Issue has been resolved. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 23 Jul, 2021 7 commits
-
-
Tao Zhou authored
Add KFD support for cyan_skillfish. v2: whitespace fixes (Alex) Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Graham Sider authored
Update Arcturus/Aldebaran thermal throttle SMI event path to use ASIC-independent throttler bits when logging. Signed-off-by:
Graham Sider <Graham.Sider@amd.com> Reviewed-by:
Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Oak Zeng authored
start_cpsch and stop_cpsch can be called during KFD device initialization or during GPU reset/recovery, so they can run concurrently. Currently, pm_init and pm_uninit in start_cpsch and stop_cpsch are not protected by the dpm lock. Consider a case where a user uses a packet manager function to submit a PM4 packet to hang the HWS (i.e. through the command cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee /sys/kernel/debug/kfd/hang_hws) while the KFD device is under reset/recovery, so the packet manager may not be initialized. This can cause an unpredictable protection fault. This patch moves pm_init/pm_uninit inside the dpm lock and checks that the packet manager is initialized before using packet manager functions. Signed-off-by:
Oak Zeng <Oak.Zeng@amd.com> Acked-by:
Christian Konig <christian.koenig@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Oak Zeng authored
This variable will be used to determine whether packet manager is initialized or not, in a future patch. Signed-off-by:
Oak Zeng <Oak.Zeng@amd.com> Acked-by:
Christian Konig <christian.koenig@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Oak Zeng authored
Rename packets to packet_mgr to reflect the real meaning of this variable. Signed-off-by:
Oak Zeng <Oak.Zeng@amd.com> Acked-by:
Christian Konig <christian.koenig@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Jonathan Kim authored
Similar to xGMI reporting the min/max bandwidth between direct peers, PCIe will report the min/max bandwidth to the KFD. Signed-off-by:
Jonathan Kim <jonathan.kim@amd.com> Reviewed-by:
Felix Kuehling <felix.kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Jonathan Kim authored
Report the min/max bandwidth in megabytes to the KFD for direct xGMI connections only. Indirect peers will report 0 since the indirect route is unknown. Signed-off-by:
Jonathan Kim <jonathan.kim@amd.com> Reviewed-by:
Felix Kuehling <felix.kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 13 Jul, 2021 7 commits
-
-
Eric Huang authored
This reverts commit 1098d658. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 31f33243. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 3be4dca1. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Philip Yang authored
prange is NULL if the VM fault retry is on an invalid address. In this case prange cannot be used to get the pdd, so use adev to get gpuidx and then the pdd instead, then increment the pdd VM fault counter. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 1098d658. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 31f33243. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Eric Huang authored
This reverts commit 3be4dca1. Reason for revert: it causes regressions on several ASICs. Signed-off-by:
Eric Huang <jinhuieric.huang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 08 Jul, 2021 1 commit
-
-
Philip Yang authored
prange is NULL if the VM fault retry is on an invalid address. In this case prange cannot be used to get the pdd, so use adev to get gpuidx and then the pdd instead, then increment the pdd VM fault counter. Signed-off-by:
Philip Yang <Philip.Yang@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 01 Jul, 2021 2 commits
-
-
Alex Sierra authored
Each zone-device page holds a reference to the SVM BO that manages its backing storage. This is necessary to correctly hold on to the BO in case zone_device pages are shared with a child-process. Signed-off-by:
Alex Sierra <alex.sierra@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Alex Sierra authored
This is for debug purposes only. It conditionally generates partial migrations to test mixed CPU/GPU memory domain pages in a prange easily. Signed-off-by:
Alex Sierra <alex.sierra@amd.com> Reviewed-by:
Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-