  1. Mar 18, 2025
  2. Mar 07, 2025
  3. Mar 05, 2025
    • drm/amdgpu: Update SDMA scheduler mask handling to include page queue · 77bd621d
      Jie1zhang authored
      
      This patch updates the SDMA scheduler mask handling to include the page queue
      if it exists. The scheduler mask is calculated based on the number of SDMA
      instances and the presence of the page queue. The mask is updated to reflect
      the state of both the SDMA gfx ring and the page queue.
      
      Changes:
      - Add handling for the SDMA page queue in `amdgpu_debugfs_sdma_sched_mask_set`.
      - Update scheduler mask calculations to include the page queue.
      - Modify `amdgpu_debugfs_sdma_sched_mask_get` to return the correct mask value.
      
      This change is necessary to verify multiple queues (SDMA gfx queue + page queue)
      and ensure proper scheduling and state management for SDMA instances.
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
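The mask handling described above can be sketched in standalone C. This is only an illustration, not the driver code: it assumes each SDMA instance owns consecutive bits in the mask (gfx ring first, page ring second when a page queue exists), and every name here (`sdma_instance_state`, `sdma_valid_mask`, `sdma_sched_mask_set`, `sdma_sched_mask_get`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-instance state: whether each ring's scheduler is enabled. */
struct sdma_instance_state {
	bool gfx_ready;
	bool page_ready;
};

/* Assumed layout: one bit per ring, two rings per instance with a page queue. */
static unsigned int rings_per_instance(bool has_page_queue)
{
	return has_page_queue ? 2u : 1u;
}

/* Mask with one bit set for every ring that exists. */
static uint64_t sdma_valid_mask(unsigned int num_instances, bool has_page_queue)
{
	unsigned int bits = num_instances * rings_per_instance(has_page_queue);

	assert(bits > 0 && bits < 64);
	return (UINT64_C(1) << bits) - 1;
}

/* Apply a requested mask to the gfx ring and, if present, the page ring. */
static void sdma_sched_mask_set(struct sdma_instance_state *inst,
				unsigned int num_instances,
				bool has_page_queue, uint64_t val)
{
	unsigned int step = rings_per_instance(has_page_queue);
	unsigned int i;

	for (i = 0; i < num_instances; i++) {
		inst[i].gfx_ready = (val >> (i * step)) & 1;
		if (has_page_queue)
			inst[i].page_ready = (val >> (i * step + 1)) & 1;
	}
}

/* Rebuild the mask from the current state of both rings. */
static uint64_t sdma_sched_mask_get(const struct sdma_instance_state *inst,
				    unsigned int num_instances,
				    bool has_page_queue)
{
	unsigned int step = rings_per_instance(has_page_queue);
	uint64_t mask = 0;
	unsigned int i;

	for (i = 0; i < num_instances; i++) {
		if (inst[i].gfx_ready)
			mask |= UINT64_C(1) << (i * step);
		if (has_page_queue && inst[i].page_ready)
			mask |= UINT64_C(1) << (i * step + 1);
	}
	return mask;
}
```

Under this assumed layout, two instances with a page queue give a four-bit mask, and writing 0x6 enables only the page ring of instance 0 and the gfx ring of instance 1.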
  4. Feb 25, 2025
    • drm/amdgpu: Improve SDMA reset logic with guilty queue tracking · fdbfaaaa
      Jie1zhang authored
      
      This patch includes the remaining improvements to the SDMA reset logic:
      - Added `gfx_guilty` and `page_guilty` flags to track guilty queues.
      - Updated the reset and resume functions to handle the guilty state.
      - Cached the `rptr` before reset.
      
      v2:
         1. Replace the caller with a guilty bool.
            If the queue is the guilty one, set the rptr and wptr to the saved wptr value;
            otherwise, set the rptr and wptr to the saved rptr value. (Alex)
         2. Cache the rptr before the reset. (Alex)
      
      v3: Keeping intermediate variables like u64 rwptr simplifies restoring rptr/wptr. (Lijo)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
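The v2 rule for restoring the ring pointers can be written as a small standalone sketch. It follows only what the commit message states (guilty queue: both pointers take the saved wptr; otherwise both take the saved rptr) and uses the intermediate `rwptr` variable mentioned in v3. The struct and function names are hypothetical, and the real driver programs hardware registers rather than a plain struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring's read and write pointers. */
struct ring_ptrs {
	uint64_t rptr;
	uint64_t wptr;
};

/*
 * Restore both pointers after a reset from the values cached beforehand.
 * A guilty queue takes the cached wptr, any other queue the cached rptr.
 */
static void sdma_restore_ring_ptrs(struct ring_ptrs *ring, bool guilty,
				   uint64_t cached_rptr, uint64_t cached_wptr)
{
	uint64_t rwptr = guilty ? cached_wptr : cached_rptr;

	assert(ring != NULL);
	ring->rptr = rwptr;
	ring->wptr = rwptr;
}
```

In both cases the two pointers end up equal; the guilty flag only decides which cached value they share.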
    • drm/amdgpu: Introduce conditional user queue suspension for SDMA resets · 4c02f730
      Jie1zhang authored
      
      - Modify the `amdgpu_sdma_reset_engine` function to accept a `suspend_user_queues` parameter.
      - This parameter allows the function to conditionally suspend and resume user queues during SDMA resets.
      - Ensure that user queues are suspended only when necessary to avoid unnecessary overhead and potential deadlocks.
      - Restart the scheduler's work queue for the GFX and page rings after the reset to allow new tasks to be submitted.
      
      This change improves synchronization between the KGD and the KFD during SDMA resets,
      ensuring proper handling of user queues and avoiding race conditions.
      
      v2: Replace the ring_lock with the existing scheduler
          locks for the queues (ring->sched) on the SDMA engine. (Alex)
      
      v3: call drm_sched_wqueue_stop() rather than job_list_lock.
          If a GPU ring reset was already initiated for one ring at amdgpu_job_timedout,
          skip resetting that ring and call drm_sched_wqueue_stop()
          for the other rings (Alex)
      
         Replace the common lock (sdma_reset_lock) with the DQM lock
         to resolve reset races between the two driver sections during KFD eviction. (Jon)
      
         Rename the caller to Reset_src and
         change AMDGPU_RESET_SRC_SDMA_KGD/KFD to AMDGPU_RESET_SRC_SDMA_HWS/RING. (Jon)
      
      v4: restart the wqueue if the reset was successful,
          or fall back to a full adapter reset. (Alex)
      
         move definition of reset source to enumeration AMDGPU_RESET_SRCS, and
         check reset src in amdgpu_sdma_reset_instance (Jon)
      
      v5: Call amdgpu_amdkfd_suspend/resume at the start/end of reset function respectively under !SRC_HWS
          conditions only (Jon)
      
      v6: Replace the parameter src with a bool suspend_user_queues,
          and remove the parameter src in the pre/post functions. (Jon)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Suggested-by: Jonathan Kim <Jonathan.Kim@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Acked-by: Jonathan Kim <jonathan.kim@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
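The control flow described in this commit can be sketched as follows. This is an assumption-laden illustration, not the driver function: the ordering is inferred from the v4 and v5 notes (suspend at the start, resume at the end, restart the work queues only on success), and the `sdma_reset_ops` table, the `demo_*` stubs, and the single-character trace are all hypothetical stand-ins for calls such as `amdgpu_amdkfd_suspend` and `drm_sched_wqueue_stop`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks standing in for the driver calls named in the commit. */
struct sdma_reset_ops {
	void (*suspend_user_queues)(void *ctx);
	void (*resume_user_queues)(void *ctx);
	void (*stop_wqueues)(void *ctx);
	void (*start_wqueues)(void *ctx);
	int (*hw_reset)(void *ctx);
};

/* Returns 0 on success; on failure the caller would fall back to a full reset. */
static int sdma_reset_engine_sketch(const struct sdma_reset_ops *ops, void *ctx,
				    bool suspend_user_queues)
{
	int ret;

	assert(ops != NULL);
	if (suspend_user_queues)
		ops->suspend_user_queues(ctx);
	ops->stop_wqueues(ctx);

	ret = ops->hw_reset(ctx);

	/* Restart the gfx and page ring work queues only if the reset worked. */
	if (ret == 0)
		ops->start_wqueues(ctx);
	if (suspend_user_queues)
		ops->resume_user_queues(ctx);
	return ret;
}

/* Demo stubs that record the order of calls as single characters. */
struct reset_trace {
	char log[16];
	size_t len;
	int reset_result;
};

static void trace_add(void *ctx, char c)
{
	struct reset_trace *t = ctx;

	if (t->len < sizeof(t->log) - 1)
		t->log[t->len++] = c;
}

static void demo_suspend(void *ctx) { trace_add(ctx, 'u'); }
static void demo_resume(void *ctx) { trace_add(ctx, 'U'); }
static void demo_stop(void *ctx) { trace_add(ctx, 's'); }
static void demo_start(void *ctx) { trace_add(ctx, 'w'); }

static int demo_hw_reset(void *ctx)
{
	trace_add(ctx, 'r');
	return ((struct reset_trace *)ctx)->reset_result;
}

static const struct sdma_reset_ops demo_ops = {
	.suspend_user_queues = demo_suspend,
	.resume_user_queues = demo_resume,
	.stop_wqueues = demo_stop,
	.start_wqueues = demo_start,
	.hw_reset = demo_hw_reset,
};
```

Passing `false` for `suspend_user_queues` models a caller that already manages the user queues itself, so the sketch skips both the suspend and the resume step.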
    • drm/amdgpu/kfd: Add shared SDMA reset functionality with callback support · f3304495
      Jie1zhang authored
      
      This patch introduces shared SDMA reset functionality between AMDGPU and KFD.
      The implementation includes the following key changes:
      
      1. Added `amdgpu_sdma_reset_queue`:
         - Resets a specific SDMA queue by instance ID.
         - Invokes registered pre-reset and post-reset callbacks to allow KFD and AMDGPU
           to save/restore their state during the reset process.
      
      2. Added `amdgpu_set_on_reset_callbacks`:
         - Allows KFD and AMDGPU to register callback functions for pre-reset and
           post-reset operations.
         - Callbacks are stored in a global linked list and invoked in the correct order
           during SDMA reset.
      
      This patch ensures that both AMDGPU and KFD can handle SDMA reset events
      gracefully, with proper state saving and restoration. It also provides a flexible
      callback mechanism for future extensions.
      
      v2: fix CamelCase and put the SDMA helper into amdgpu_sdma.c (Alex)
      
      v3: rename the `amdgpu_register_on_reset_callbacks` function to
            `amdgpu_sdma_register_on_reset_callbacks`
          move global reset_callback_list to struct amdgpu_sdma (Alex)
      
      v4: Update the reset callback function description and
         rename the reset function to amdgpu_sdma_reset_engine (Alex)
      
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
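The callback mechanism described above might look roughly like the standalone sketch below. It assumes callbacks run in registration order, that a failing pre-reset callback aborts the reset, and that the list lives in a per-device SDMA structure as the v3 note suggests. The types and names (`sdma_on_reset_funcs`, `sdma_sketch`, `sdma_reset_with_callbacks`, the `demo_*` helpers) are hypothetical and do not reproduce the kernel's definitions, which use the kernel's own list helpers.

```c
#include <assert.h>
#include <stddef.h>

/* One registered client (for example KFD or AMDGPU) with its two hooks. */
struct sdma_on_reset_funcs {
	int (*pre_reset)(void *ctx, unsigned int instance_id);
	int (*post_reset)(void *ctx, unsigned int instance_id);
	void *ctx;
	struct sdma_on_reset_funcs *next;
};

/* Hypothetical per-device SDMA state holding the callback list. */
struct sdma_sketch {
	struct sdma_on_reset_funcs *head;
	struct sdma_on_reset_funcs *tail;
};

/* Append to the tail so callbacks later run in registration order. */
static void sdma_register_on_reset_callbacks(struct sdma_sketch *sdma,
					     struct sdma_on_reset_funcs *funcs)
{
	assert(sdma != NULL && funcs != NULL);
	funcs->next = NULL;
	if (sdma->tail)
		sdma->tail->next = funcs;
	else
		sdma->head = funcs;
	sdma->tail = funcs;
}

/* Run all pre-reset hooks, reset the instance, then run all post-reset hooks. */
static int sdma_reset_with_callbacks(struct sdma_sketch *sdma,
				     unsigned int instance_id,
				     int (*hw_reset)(void *hw_ctx, unsigned int id),
				     void *hw_ctx)
{
	struct sdma_on_reset_funcs *f;
	int ret;

	for (f = sdma->head; f; f = f->next) {
		if (!f->pre_reset)
			continue;
		ret = f->pre_reset(f->ctx, instance_id);
		if (ret)
			return ret;
	}

	ret = hw_reset(hw_ctx, instance_id);
	if (ret)
		return ret;

	for (f = sdma->head; f; f = f->next) {
		if (!f->post_reset)
			continue;
		ret = f->post_reset(f->ctx, instance_id);
		if (ret)
			return ret;
	}
	return 0;
}

/* Demo helpers that record the call order as single characters. */
struct cb_trace {
	char log[16];
	size_t len;
};

struct demo_client {
	struct cb_trace *trace;
	char pre_tag;
	char post_tag;
};

static void cb_trace_add(struct cb_trace *t, char c)
{
	if (t->len < sizeof(t->log) - 1)
		t->log[t->len++] = c;
}

static int demo_pre(void *ctx, unsigned int id)
{
	struct demo_client *c = ctx;

	(void)id;
	cb_trace_add(c->trace, c->pre_tag);
	return 0;
}

static int demo_post(void *ctx, unsigned int id)
{
	struct demo_client *c = ctx;

	(void)id;
	cb_trace_add(c->trace, c->post_tag);
	return 0;
}

static int demo_hw_reset(void *hw_ctx, unsigned int id)
{
	(void)id;
	cb_trace_add(hw_ctx, 'r');
	return 0;
}
```

Keeping the `next` pointer inside the registered structure means registration needs no allocation, at the cost of requiring each client to keep its structure alive for as long as it is registered.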
  5. Jan 09, 2025
  6. Dec 10, 2024
  7. Nov 21, 2024
    • drm/amdgpu: Fix sysfs warning when hotplugging · 2f1b1352
      Jie1zhang authored
      
      Fix a similar warning when hotplugging:
      
      [  155.585721] kernfs: can not remove 'enforce_isolation', no directory
      [  155.592201] WARNING: CPU: 3 PID: 6960 at fs/kernfs/dir.c:1683 kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.601145] Modules linked in: xt_MASQUERADE xt_comment nft_compat veth bridge stp llc overlay nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd amdgpu kvm_amd kvm ipmi_ssif amdxcp rapl drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm pcspkr drm_display_helper acpi_cpufreq drm_kms_helper video wmi k10temp i2c_piix4 acpi_ipmi ipmi_si drm zram ip_tables loop squashfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco ixgbe rfkill ccp dca sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler fuse
      [  155.685224] systemd-journald[1354]: Compressed data object 957 -> 524 using ZSTD
      [  155.685687] CPU: 3 PID: 6960 Comm: amd_pci_unplug Not tainted 6.10.0-1148853.1.zuul.164395107d6642bdb451071313e9378d #1
      [  155.704149] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019
      [  155.712383] RIP: 0010:kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.717805] Code: a0 00 48 89 ef e8 37 96 c7 ff 5b b8 fe ff ff ff 5d 41 5c 41 5d e9 f7 96 a0 00 0f 0b eb ab 48 c7 c7 48 ba 7e 8f e8 f7 66 bf ff <0f> 0b eb dc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
      [  155.736766] RSP: 0018:ffffb1685d7a3e20 EFLAGS: 00010296
      [  155.742108] RAX: 0000000000000038 RBX: ffff929e94c80000 RCX: 0000000000000000
      [  155.749363] RDX: ffff928e1efaf200 RSI: ffff928e1efa18c0 RDI: ffff928e1efa18c0
      [  155.756612] RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000003
      [  155.763855] R10: ffffb1685d7a3cd8 R11: ffffffff8fb3e1c8 R12: ffffffffc1ef5341
      [  155.771104] R13: ffff929e94cc5530 R14: 0000000000000000 R15: 0000000000000000
      [  155.778357] FS:  00007fd9dd8d9c40(0000) GS:ffff928e1ef80000(0000) knlGS:0000000000000000
      [  155.786594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.792450] CR2: 0000561245ceee38 CR3: 0000000113018000 CR4: 00000000003506f0
      [  155.799702] Call Trace:
      [  155.802254]  <TASK>
      [  155.804460]  ? __warn+0x80/0x120
      [  155.807798]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.812617]  ? report_bug+0x164/0x190
      [  155.816393]  ? handle_bug+0x3c/0x80
      [  155.819994]  ? exc_invalid_op+0x17/0x70
      [  155.823939]  ? asm_exc_invalid_op+0x1a/0x20
      [  155.828235]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.833058]  amdgpu_gfx_sysfs_fini+0x59/0xd0 [amdgpu]
      [  155.838637]  gfx_v9_0_sw_fini+0x123/0x1c0 [amdgpu]
      [  155.843887]  amdgpu_device_fini_sw+0xbc/0x3e0 [amdgpu]
      [  155.849432]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [  155.855235]  drm_dev_put.part.0+0x3c/0x60 [drm]
      [  155.859914]  drm_release+0x8b/0xc0 [drm]
      [  155.863978]  __fput+0xf1/0x2c0
      [  155.867141]  __x64_sys_close+0x3c/0x80
      [  155.870998]  do_syscall_64+0x64/0x170
      
      v2: Add details in comments (Tim)
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reported-by: Andy Dong <andy.dong@amd.com>
      Reviewed-by: Tim Huang <tim.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  8. Nov 08, 2024
  9. Nov 04, 2024
  10. Jun 14, 2024
  11. May 02, 2024
  12. Apr 30, 2024
  13. Oct 26, 2023
  14. Sep 20, 2023
  15. Aug 30, 2023
    • drm/amd/amdgpu/amdgpu_sdma: Increase buffer size to account for all possible values · ac84d99a
      Lee Jones authored
      
      Fixes the following W=1 kernel build warning(s):
      
       drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c: In function ‘amdgpu_sdma_init_microcode’:
       drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c:217:64: warning: ‘.bin’ directive output may be truncated writing 4 bytes into a region of size between 0 and 32 [-Wformat-truncation=]
       drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c:217:17: note: ‘snprintf’ output between 13 and 52 bytes into a destination of size 40
       drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c:215:66: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
       drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c:215:17: note: ‘snprintf’ output between 12 and 41 bytes into a destination of size 40
      
      Signed-off-by: Lee Jones <lee@kernel.org>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
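The warning arithmetic can be reproduced with a short standalone sketch. The format strings `"amdgpu/%s.bin"` and `"amdgpu/%s%u.bin"` are an assumption inferred from the byte counts in the warning (7 bytes of prefix, up to 29 bytes of name, up to 11 digits, 4 bytes of suffix and a terminating NUL give 52, more than the 40-byte destination). The function names are hypothetical, and the example relies only on the standard behavior of `snprintf`, which returns the length the full output would have had.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Format a firmware file name; instance 0 carries no numeric suffix. */
static int fw_name_print(char *buf, size_t size, const char *ucode_prefix,
			 unsigned int instance)
{
	assert(ucode_prefix != NULL);
	if (instance == 0)
		return snprintf(buf, size, "amdgpu/%s.bin", ucode_prefix);
	return snprintf(buf, size, "amdgpu/%s%u.bin", ucode_prefix, instance);
}

/* Bytes needed for the name including the terminating NUL, or 0 on error. */
static size_t fw_name_required_size(const char *ucode_prefix,
				    unsigned int instance)
{
	int n = fw_name_print(NULL, 0, ucode_prefix, instance);

	return n < 0 ? 0 : (size_t)n + 1;
}

/* Returns true only if the whole name fit into buf without truncation. */
static bool fw_name_format(char *buf, size_t size, const char *ucode_prefix,
			   unsigned int instance)
{
	int n = fw_name_print(buf, size, ucode_prefix, instance);

	return n >= 0 && (size_t)n < size;
}
```

Checking the return value against the buffer size catches truncation at run time, while sizing the buffer for the worst case, as this commit does, removes the compile-time warning.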
  16. Jul 25, 2023
  17. Jun 30, 2023
  18. Jun 09, 2023
  19. Jan 19, 2023
  20. Jan 09, 2023
  21. Oct 10, 2022
  22. Oct 06, 2022
  23. Sep 29, 2022
  24. May 26, 2022
  25. May 04, 2022
  26. Mar 02, 2022
  27. Feb 17, 2022
  28. Feb 14, 2022
  29. Aug 16, 2021