  1. Mar 18, 2025
  2. Mar 14, 2025
  3. Mar 10, 2025
  4. Feb 21, 2025
  5. Feb 13, 2025
  6. Feb 06, 2025
  7. Jan 13, 2025
  8. Jan 11, 2025
    • drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation · 02a96a8e
      SRINIVASAN SHANMUGAM authored and SRINIVASAN SHANMUGAM committed
      
      This commit addresses a circular locking dependency issue within the GFX
      isolation mechanism. The problem was identified by a warning indicating
      a potential deadlock due to inconsistent lock acquisition order.
      
      - The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
        `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
        acquired `enforce_isolation_mutex` and then called
        `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. That is,
        if `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex`
        is held, and `amdgpu_gfx_enforce_isolation_handler` is called while
        `kfd_sch_mutex` is held, a circular dependency can result.
      
      This fix ensures consistent lock usage and resolves the following warning:
      
      [  606.297333] ======================================================
      [  606.297343] WARNING: possible circular locking dependency detected
      [  606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G           OE
      [  606.297365] ------------------------------------------------------
      [  606.297375] kworker/u96:3/3825 is trying to acquire lock:
      [  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
      [  606.297413]
                     but task is already holding lock:
      [  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
      [  606.297725]
                     which lock already depends on the new lock.
      
      [  606.297738]
                     the existing dependency chain (in reverse order) is:
      [  606.297749]
                     -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
      [  606.297765]        __mutex_lock+0x85/0x930
      [  606.297776]        mutex_lock_nested+0x1b/0x30
      [  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
      [  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
      [  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
      [  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
      [  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
      [  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
      [  606.298880]        process_one_work+0x21e/0x680
      [  606.298890]        worker_thread+0x190/0x350
      [  606.298899]        kthread+0xe7/0x120
      [  606.298908]        ret_from_fork+0x3c/0x60
      [  606.298919]        ret_from_fork_asm+0x1a/0x30
      [  606.298929]
                     -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
      [  606.298947]        __mutex_lock+0x85/0x930
      [  606.298956]        mutex_lock_nested+0x1b/0x30
      [  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
      [  606.299190]        process_one_work+0x21e/0x680
      [  606.299199]        worker_thread+0x190/0x350
      [  606.299208]        kthread+0xe7/0x120
      [  606.299217]        ret_from_fork+0x3c/0x60
      [  606.299227]        ret_from_fork_asm+0x1a/0x30
      [  606.299236]
                     -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
      [  606.299257]        __lock_acquire+0x16f9/0x2810
      [  606.299267]        lock_acquire+0xd1/0x300
      [  606.299276]        __flush_work+0x250/0x610
      [  606.299286]        cancel_delayed_work_sync+0x71/0x80
      [  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
      [  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
      [  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
      [  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
      [  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
      [  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
      [  606.300369]        process_one_work+0x21e/0x680
      [  606.300378]        worker_thread+0x190/0x350
      [  606.300387]        kthread+0xe7/0x120
      [  606.300396]        ret_from_fork+0x3c/0x60
      [  606.300406]        ret_from_fork_asm+0x1a/0x30
      [  606.300416]
                     other info that might help us debug this:
      
      [  606.300428] Chain exists of:
                       (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex
      
      [  606.300458]  Possible unsafe locking scenario:
      
      [  606.300468]        CPU0                    CPU1
      [  606.300476]        ----                    ----
      [  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
      [  606.300494]                                lock(&adev->enforce_isolation_mutex);
      [  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
      [  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
      [  606.300536]
                      *** DEADLOCK ***
      
      [  606.300546] 5 locks held by kworker/u96:3/3825:
      [  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
      [  606.300577]  #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
      [  606.300600]  #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
      [  606.300837]  #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
      [  606.301062]  #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
      [  606.301083]
                     stack backtrace:
      [  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof #19
      [  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
      [  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
      [  606.301140] Call Trace:
      [  606.301146]  <TASK>
      [  606.301154]  dump_stack_lvl+0x9b/0xf0
      [  606.301166]  dump_stack+0x10/0x20
      [  606.301175]  print_circular_bug+0x26c/0x340
      [  606.301187]  check_noncircular+0x157/0x170
      [  606.301197]  ? register_lock_class+0x48/0x490
      [  606.301213]  __lock_acquire+0x16f9/0x2810
      [  606.301230]  lock_acquire+0xd1/0x300
      [  606.301239]  ? __flush_work+0x232/0x610
      [  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
      [  606.301261]  ? mark_held_locks+0x54/0x90
      [  606.301274]  ? __flush_work+0x232/0x610
      [  606.301284]  __flush_work+0x250/0x610
      [  606.301293]  ? __flush_work+0x232/0x610
      [  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
      [  606.301318]  ? mark_held_locks+0x54/0x90
      [  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
      [  606.301345]  cancel_delayed_work_sync+0x71/0x80
      [  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
      [  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
      [  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
      [  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
      [  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
      [  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
      [  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
      [  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
      [  606.303388]  process_one_work+0x21e/0x680
      [  606.303409]  worker_thread+0x190/0x350
      [  606.303424]  ? __pfx_worker_thread+0x10/0x10
      [  606.303437]  kthread+0xe7/0x120
      [  606.303449]  ? __pfx_kthread+0x10/0x10
      [  606.303463]  ret_from_fork+0x3c/0x60
      [  606.303476]  ? __pfx_kthread+0x10/0x10
      [  606.303489]  ret_from_fork_asm+0x1a/0x30
      [  606.303512]  </TASK>
      
      v2: Refactor lock handling to resolve circular dependency (Alex)
      
      - Introduced a `sched_work` flag to defer the call to
        `amdgpu_gfx_kfd_sch_ctrl` until after releasing
        `enforce_isolation_mutex`.
      - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
        the critical section, preventing the circular dependency and deadlock.
      - The `sched_work` flag is set within the mutex-protected section if
        conditions are met, and the actual function call is made afterward.
      - This approach ensures consistent lock acquisition order.
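
The deferred-call pattern described in the v2 notes can be sketched in user-space C with pthread mutexes. This is only an illustration under stated assumptions: `struct isolation_state`, `sketch_kfd_sch_ctrl`, `sketch_ring_begin_use`, and the `needs_sched_ctrl` and `outer_lock_held` fields are hypothetical stand-ins, not the driver's actual code. Only the `sched_work` flag and the two mutex names come from the commit text.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real amdgpu structures. */
struct isolation_state {
    pthread_mutex_t enforce_isolation_mutex;
    pthread_mutex_t kfd_sch_mutex;
    bool needs_sched_ctrl;   /* assumed condition, evaluated under the mutex */
    bool outer_lock_held;    /* bookkeeping so the sketch can check itself */
    int sch_ctrl_calls;      /* counts calls, for illustration only */
};

/* Stand-in for amdgpu_gfx_kfd_sch_ctrl(): it takes kfd_sch_mutex itself. */
static void sketch_kfd_sch_ctrl(struct isolation_state *s)
{
    /* The point of the fix: the outer mutex must not be held here. */
    assert(!s->outer_lock_held);

    pthread_mutex_lock(&s->kfd_sch_mutex);
    s->sch_ctrl_calls++;
    pthread_mutex_unlock(&s->kfd_sch_mutex);
}

/* Stand-in for the ring begin_use/end_use paths. */
static void sketch_ring_begin_use(struct isolation_state *s)
{
    bool sched_work = false;

    pthread_mutex_lock(&s->enforce_isolation_mutex);
    s->outer_lock_held = true;
    if (s->needs_sched_ctrl) {
        s->needs_sched_ctrl = false;
        sched_work = true;   /* record the decision only; do not call yet */
    }
    s->outer_lock_held = false;
    pthread_mutex_unlock(&s->enforce_isolation_mutex);

    /* Deferred call, made after the mutex has been released. */
    if (sched_work)
        sketch_kfd_sch_ctrl(s);
}
```

Because the decision is captured in a local flag, the two mutexes are never held at the same time on this path, so no ordering between them is established here.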
      
      Fixes: afefd6f2 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
  9. Dec 17, 2024
  10. Dec 16, 2024
  11. Dec 10, 2024
    • drm/amd/amdgpu: Add Annotations to Process Isolation functions · 1fcd5b97
      SRINIVASAN SHANMUGAM authored
      
      This update adds explanations to key functions that manage how the
      Kernel Fusion Driver (KFD) and Kernel Graphics Driver (KGD) share the
      GPU.
      
      amdgpu_gfx_enforce_isolation_wait_for_kfd: Controls the waiting period
      for KFD to ensure it takes turns with KGD in using the GPU. It uses a
      mutex to safely manage shared data, like timing and state, and tracks
      when KFD starts and stops waiting.
      
      amdgpu_gfx_enforce_isolation_ring_begin_use: Ensures KFD has enough time
      to run before new tasks are submitted to the GPU ring. It uses a mutex
      to synchronize access and may adjust the KFD scheduler.
      
      amdgpu_gfx_enforce_isolation_ring_end_use: Handles cleanup and state
      updates when finishing the use of a GPU ring. It may also adjust the KFD
      scheduler, using a mutex to manage shared data access.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amd/amdgpu: Add Descriptions to Process Isolation and Cleaner Shader Sysfs Functions · cdf5faee
      SRINIVASAN SHANMUGAM authored
      
      This update adds explanations to key functions related to process
      isolation and cleaner shader execution sysfs interfaces.
      
      - `amdgpu_gfx_set_run_cleaner_shader`: Describes how to manually run a
        cleaner shader, which clears the Local Data Store (LDS) and General
        Purpose Registers (GPRs) to ensure data isolation between GPU workloads.
      
      - `amdgpu_gfx_get_enforce_isolation`: Describes how to query the current
        settings of the 'enforce_isolation' feature for each GPU partition.
      
      - `amdgpu_gfx_set_enforce_isolation`: Describes how to enable or disable
        process isolation for GPU partitions through the sysfs interface.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
  12. Nov 21, 2024
    • drm/amdgpu: Fix sysfs warning when hotplugging · 8265ec4c
      Jie1zhang authored
      
      Fix the following warning when hotplugging:
      
      [  155.585721] kernfs: can not remove 'enforce_isolation', no directory
      [  155.592201] WARNING: CPU: 3 PID: 6960 at fs/kernfs/dir.c:1683 kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.601145] Modules linked in: xt_MASQUERADE xt_comment nft_compat veth bridge stp llc overlay nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd amdgpu kvm_amd kvm ipmi_ssif amdxcp rapl drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm pcspkr drm_display_helper acpi_cpufreq drm_kms_helper video wmi k10temp i2c_piix4 acpi_ipmi ipmi_si drm zram ip_tables loop squashfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco ixgbe rfkill ccp dca sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler fuse
      [  155.685224] systemd-journald[1354]: Compressed data object 957 -> 524 using ZSTD
      [  155.685687] CPU: 3 PID: 6960 Comm: amd_pci_unplug Not tainted 6.10.0-1148853.1.zuul.164395107d6642bdb451071313e9378d #1
      [  155.704149] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019
      [  155.712383] RIP: 0010:kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.717805] Code: a0 00 48 89 ef e8 37 96 c7 ff 5b b8 fe ff ff ff 5d 41 5c 41 5d e9 f7 96 a0 00 0f 0b eb ab 48 c7 c7 48 ba 7e 8f e8 f7 66 bf ff <0f> 0b eb dc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
      [  155.736766] RSP: 0018:ffffb1685d7a3e20 EFLAGS: 00010296
      [  155.742108] RAX: 0000000000000038 RBX: ffff929e94c80000 RCX: 0000000000000000
      [  155.749363] RDX: ffff928e1efaf200 RSI: ffff928e1efa18c0 RDI: ffff928e1efa18c0
      [  155.756612] RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000003
      [  155.763855] R10: ffffb1685d7a3cd8 R11: ffffffff8fb3e1c8 R12: ffffffffc1ef5341
      [  155.771104] R13: ffff929e94cc5530 R14: 0000000000000000 R15: 0000000000000000
      [  155.778357] FS:  00007fd9dd8d9c40(0000) GS:ffff928e1ef80000(0000) knlGS:0000000000000000
      [  155.786594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.792450] CR2: 0000561245ceee38 CR3: 0000000113018000 CR4: 00000000003506f0
      [  155.799702] Call Trace:
      [  155.802254]  <TASK>
      [  155.804460]  ? __warn+0x80/0x120
      [  155.807798]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.812617]  ? report_bug+0x164/0x190
      [  155.816393]  ? handle_bug+0x3c/0x80
      [  155.819994]  ? exc_invalid_op+0x17/0x70
      [  155.823939]  ? asm_exc_invalid_op+0x1a/0x20
      [  155.828235]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.833058]  amdgpu_gfx_sysfs_fini+0x59/0xd0 [amdgpu]
      [  155.838637]  gfx_v9_0_sw_fini+0x123/0x1c0 [amdgpu]
      [  155.843887]  amdgpu_device_fini_sw+0xbc/0x3e0 [amdgpu]
      [  155.849432]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [  155.855235]  drm_dev_put.part.0+0x3c/0x60 [drm]
      [  155.859914]  drm_release+0x8b/0xc0 [drm]
      [  155.863978]  __fput+0xf1/0x2c0
      [  155.867141]  __x64_sys_close+0x3c/0x80
      [  155.870998]  do_syscall_64+0x64/0x170
      
      V2: Add details in comments (Tim)
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reported-by: Andy Dong <andy.dong@amd.com>
      Reviewed-by: Tim Huang <tim.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: partially revert VCN IP block instancing support · fba4761c
      Alex Deucher authored
      
      This partially reverts the VCN IP block rework.  There
      are too many corner cases and chances for regressions.
      
      While this aligned better with the original design, years
      of hardware has used the old pattern.  Best to stick with
      it at this point.
      
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  13. Nov 20, 2024
    • drm/amdgpu: Fix sysfs warning when hotplugging · 8d852937
      Jie1zhang authored
      
      Fix the following warning when hotplugging:
      
      [  155.585721] kernfs: can not remove 'enforce_isolation', no directory
      [  155.592201] WARNING: CPU: 3 PID: 6960 at fs/kernfs/dir.c:1683 kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.601145] Modules linked in: xt_MASQUERADE xt_comment nft_compat veth bridge stp llc overlay nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd amdgpu kvm_amd kvm ipmi_ssif amdxcp rapl drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm pcspkr drm_display_helper acpi_cpufreq drm_kms_helper video wmi k10temp i2c_piix4 acpi_ipmi ipmi_si drm zram ip_tables loop squashfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco ixgbe rfkill ccp dca sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler fuse
      [  155.685224] systemd-journald[1354]: Compressed data object 957 -> 524 using ZSTD
      [  155.685687] CPU: 3 PID: 6960 Comm: amd_pci_unplug Not tainted 6.10.0-1148853.1.zuul.164395107d6642bdb451071313e9378d #1
      [  155.704149] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019
      [  155.712383] RIP: 0010:kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.717805] Code: a0 00 48 89 ef e8 37 96 c7 ff 5b b8 fe ff ff ff 5d 41 5c 41 5d e9 f7 96 a0 00 0f 0b eb ab 48 c7 c7 48 ba 7e 8f e8 f7 66 bf ff <0f> 0b eb dc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
      [  155.736766] RSP: 0018:ffffb1685d7a3e20 EFLAGS: 00010296
      [  155.742108] RAX: 0000000000000038 RBX: ffff929e94c80000 RCX: 0000000000000000
      [  155.749363] RDX: ffff928e1efaf200 RSI: ffff928e1efa18c0 RDI: ffff928e1efa18c0
      [  155.756612] RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000003
      [  155.763855] R10: ffffb1685d7a3cd8 R11: ffffffff8fb3e1c8 R12: ffffffffc1ef5341
      [  155.771104] R13: ffff929e94cc5530 R14: 0000000000000000 R15: 0000000000000000
      [  155.778357] FS:  00007fd9dd8d9c40(0000) GS:ffff928e1ef80000(0000) knlGS:0000000000000000
      [  155.786594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.792450] CR2: 0000561245ceee38 CR3: 0000000113018000 CR4: 00000000003506f0
      [  155.799702] Call Trace:
      [  155.802254]  <TASK>
      [  155.804460]  ? __warn+0x80/0x120
      [  155.807798]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.812617]  ? report_bug+0x164/0x190
      [  155.816393]  ? handle_bug+0x3c/0x80
      [  155.819994]  ? exc_invalid_op+0x17/0x70
      [  155.823939]  ? asm_exc_invalid_op+0x1a/0x20
      [  155.828235]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.833058]  amdgpu_gfx_sysfs_fini+0x59/0xd0 [amdgpu]
      [  155.838637]  gfx_v9_0_sw_fini+0x123/0x1c0 [amdgpu]
      [  155.843887]  amdgpu_device_fini_sw+0xbc/0x3e0 [amdgpu]
      [  155.849432]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [  155.855235]  drm_dev_put.part.0+0x3c/0x60 [drm]
      [  155.859914]  drm_release+0x8b/0xc0 [drm]
      [  155.863978]  __fput+0xf1/0x2c0
      [  155.867141]  __x64_sys_close+0x3c/0x80
      [  155.870998]  do_syscall_64+0x64/0x170
      
      V2: Add details in comments (Tim)
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reported-by: Andy Dong <andy.dong@amd.com>
      Reviewed-by: Tim Huang <tim.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: revert fix warning when removing sysfs · 29d549d3
      Jie1zhang authored
      
      This reverts commit 330d97e9.
      The dev->unplugged flag is also set to true when the driver is only
      uninstalled through amdgpu_exit and the device is not actually unplugged,
      which causes a new issue.
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Reviewed-by: Tim Huang <tim.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  14. Nov 12, 2024
  15. Nov 11, 2024
    • drm/amdgpu: fix warning when removing sysfs · 58ae9611
      Jie1zhang authored
      
      Fix the following warning:
      
      [  155.585721] kernfs: can not remove 'enforce_isolation', no directory
      [  155.592201] WARNING: CPU: 3 PID: 6960 at fs/kernfs/dir.c:1683 kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.601145] Modules linked in: xt_MASQUERADE xt_comment nft_compat veth bridge stp llc overlay nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd amdgpu kvm_amd kvm ipmi_ssif amdxcp rapl drm_exec gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper drm_ttm_helper ttm pcspkr drm_display_helper acpi_cpufreq drm_kms_helper video wmi k10temp i2c_piix4 acpi_ipmi ipmi_si drm zram ip_tables loop squashfs dm_multipath crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco ixgbe rfkill ccp dca sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf ipmi_msghandler fuse
      [  155.685224] systemd-journald[1354]: Compressed data object 957 -> 524 using ZSTD
      [  155.685687] CPU: 3 PID: 6960 Comm: amd_pci_unplug Not tainted 6.10.0-1148853.1.zuul.164395107d6642bdb451071313e9378d #1
      [  155.704149] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019
      [  155.712383] RIP: 0010:kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.717805] Code: a0 00 48 89 ef e8 37 96 c7 ff 5b b8 fe ff ff ff 5d 41 5c 41 5d e9 f7 96 a0 00 0f 0b eb ab 48 c7 c7 48 ba 7e 8f e8 f7 66 bf ff <0f> 0b eb dc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
      [  155.736766] RSP: 0018:ffffb1685d7a3e20 EFLAGS: 00010296
      [  155.742108] RAX: 0000000000000038 RBX: ffff929e94c80000 RCX: 0000000000000000
      [  155.749363] RDX: ffff928e1efaf200 RSI: ffff928e1efa18c0 RDI: ffff928e1efa18c0
      [  155.756612] RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000003
      [  155.763855] R10: ffffb1685d7a3cd8 R11: ffffffff8fb3e1c8 R12: ffffffffc1ef5341
      [  155.771104] R13: ffff929e94cc5530 R14: 0000000000000000 R15: 0000000000000000
      [  155.778357] FS:  00007fd9dd8d9c40(0000) GS:ffff928e1ef80000(0000) knlGS:0000000000000000
      [  155.786594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  155.792450] CR2: 0000561245ceee38 CR3: 0000000113018000 CR4: 00000000003506f0
      [  155.799702] Call Trace:
      [  155.802254]  <TASK>
      [  155.804460]  ? __warn+0x80/0x120
      [  155.807798]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.812617]  ? report_bug+0x164/0x190
      [  155.816393]  ? handle_bug+0x3c/0x80
      [  155.819994]  ? exc_invalid_op+0x17/0x70
      [  155.823939]  ? asm_exc_invalid_op+0x1a/0x20
      [  155.828235]  ? kernfs_remove_by_name_ns+0xb9/0xc0
      [  155.833058]  amdgpu_gfx_sysfs_fini+0x59/0xd0 [amdgpu]
      [  155.838637]  gfx_v9_0_sw_fini+0x123/0x1c0 [amdgpu]
      [  155.843887]  amdgpu_device_fini_sw+0xbc/0x3e0 [amdgpu]
      [  155.849432]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [  155.855235]  drm_dev_put.part.0+0x3c/0x60 [drm]
      [  155.859914]  drm_release+0x8b/0xc0 [drm]
      [  155.863978]  __fput+0xf1/0x2c0
      [  155.867141]  __x64_sys_close+0x3c/0x80
      [  155.870998]  do_syscall_64+0x64/0x170
      
      Check if the device is unplugged before deleting sysfs files.
      
      Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
      Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
      Reviewed-by: Tim Huang <tim.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amd/pm: add inst to dpm_set_powergating_by_smu · 26d0b9eb
      Boyuan Zhang authored
      
      Add an instance parameter to amdgpu_dpm_set_powergating_by_smu() function,
      and use the instance to call set_powergating_by_smu().
      
      v2: Remove duplicated functions.
      
      Remove the for-loop in amdgpu_dpm_set_powergating_by_smu() and temporarily
      move it to amdgpu_dpm_enable_vcn(), to keep exactly the same logic
      as before, until further separation in the next patch.
      
      v3: Drop SI logic in amdgpu_dpm_enable_vcn().
      
      Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Implement virt req_ras_err_count · 84a2947e
      Victor Skvortsov authored
      
      Enable RAS late init if VF RAS Telemetry is supported.
      
      When enabled, the VF can use this interface to query total
      RAS error counts from the host.
      
      The VF FB access may abruptly end due to a fatal error,
      therefore the VF must cache and sanitize the input.
      
      The host allows 15 telemetry messages every 60 seconds, after which
      it ignores any further incoming telemetry messages. The VF
      rate limits its messages to once every 5 seconds (12 times in 60 seconds).
      While the VF is rate limited, it will continue to report the last
      good cached data.
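
The rate-limit-and-cache behaviour described above can be sketched as follows. This is a minimal illustration, not the driver's implementation: `struct ras_cache`, `ras_get_err_count`, the `host_query_fn` callback type, and `fake_host_query` are hypothetical names, and time is passed in as plain seconds. Only the 5-second interval and the "report the last good cached data" rule come from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VF_QUERY_INTERVAL_SEC 5   /* from the commit text: one message per 5 s */

/* Hypothetical cache of the last good reply from the host. */
struct ras_cache {
    bool     queried;         /* has any message been sent yet? */
    uint64_t last_query_sec;  /* time the last message was sent */
    uint64_t err_count;       /* last good error count */
};

/* Hypothetical host query: returns true and fills *count on success. */
typedef bool (*host_query_fn)(uint64_t *count);

static uint64_t ras_get_err_count(struct ras_cache *c, uint64_t now_sec,
                                  host_query_fn query)
{
    uint64_t fresh;

    assert(c && query);

    /* Rate limited: answer from the cache without messaging the host. */
    if (c->queried && now_sec - c->last_query_sec < VF_QUERY_INTERVAL_SEC)
        return c->err_count;

    c->queried = true;
    c->last_query_sec = now_sec;

    /* On failure, keep the previous good value. */
    if (query(&fresh))
        c->err_count = fresh;

    return c->err_count;
}

/* Fake host used only to exercise the sketch. */
static uint64_t fake_host_value = 3;
static int fake_host_calls;

static bool fake_host_query(uint64_t *count)
{
    fake_host_calls++;
    *count = fake_host_value;
    return true;
}
```

With one message every 5 seconds the VF sends at most 12 messages per 60 seconds, which stays under the host's limit of 15.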
      
      v2: Flip generate report & update statistics order for VF
      
      Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
      Acked-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  16. Nov 08, 2024
  17. Nov 04, 2024
  18. Oct 22, 2024
  19. Oct 15, 2024
  20. Sep 26, 2024
  21. Sep 06, 2024
  22. Sep 02, 2024
  23. Aug 29, 2024
  24. Aug 21, 2024
    • drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization · afefd6f2
      SRINIVASAN SHANMUGAM authored
      
      This commit introduces the Enforce Isolation Handler designed to enforce
      shader isolation on AMD GPUs, which helps to prevent data leakage
      between different processes.
      
      The handler counts the number of emitted fences for each GFX and compute
      ring. If there are any fences, it schedules the `enforce_isolation_work`
      to be run after a delay of `GFX_SLICE_PERIOD`. If there are no fences,
      it signals the Kernel Fusion Driver (KFD) to resume the runqueue.
      
      The function is synchronized using the `enforce_isolation_mutex`.
      
      This commit also introduces a reference count mechanism
      (kfd_sch_req_count) to keep track of the number of requests to enable
      the KFD scheduler. When a request to enable the KFD scheduler is made,
      the reference count is decremented. When the reference count reaches
      zero, a delayed work is scheduled to enforce isolation after a delay of
      GFX_SLICE_PERIOD.
      
      When a request to disable the KFD scheduler is made, the function first
      checks if the reference count is zero. If it is, it cancels the delayed
      work for enforcing isolation and checks if the KFD scheduler is active.
      If the KFD scheduler is active, it sends a request to stop the KFD
      scheduler and sets the KFD scheduler state to inactive. Then, it
      increments the reference count.
      
      The function is synchronized using the kfd_sch_mutex to ensure that the
      KFD scheduler state and reference count are updated atomically.
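
The reference-count rules in the paragraphs above can be modelled in a few lines of user-space C. This is a sketch under assumptions: `struct sched_state` and `sketch_sched_ctrl` are hypothetical names, a boolean stands in for the delayed work item, and the guard against an unbalanced enable request is an assumption of this sketch rather than something the commit text states.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the scheduler bookkeeping. */
struct sched_state {
    pthread_mutex_t kfd_sch_mutex;
    unsigned int req_count;   /* outstanding "disable" requests */
    bool sched_active;        /* is the KFD scheduler running? */
    bool work_scheduled;      /* stands in for the delayed work item */
};

static void sketch_sched_ctrl(struct sched_state *s, bool enable)
{
    assert(s);

    pthread_mutex_lock(&s->kfd_sch_mutex);
    if (enable) {
        /* Assumption: ignore an unbalanced enable instead of underflowing. */
        if (s->req_count > 0 && --s->req_count == 0)
            s->work_scheduled = true;    /* would run after GFX_SLICE_PERIOD */
    } else {
        if (s->req_count == 0) {
            s->work_scheduled = false;   /* cancel the pending delayed work */
            if (s->sched_active)
                s->sched_active = false; /* stop request sent to the scheduler */
        }
        s->req_count++;
    }
    pthread_mutex_unlock(&s->kfd_sch_mutex);
}
```

Only the first disable request stops the scheduler, and only the enable request that brings the count back to zero schedules the delayed work. Nested disable/enable pairs in between change nothing but the count.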
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Add sysfs interface for running cleaner shader · d361ad5d
      SRINIVASAN SHANMUGAM authored
      
      This patch adds a new sysfs interface for running the cleaner shader on
      AMD GPUs. The cleaner shader is used to clear GPU memory before it's
      reused, which can help prevent data leakage between different processes.
      
      The new sysfs file is write-only and is named `run_cleaner_shader`.
      Write the number of the partition to this file to trigger the cleaner shader
      on that partition. There is only one partition on GPUs which do not
      support partitioning.
      
      Changes made in this patch:
      
      - Added `amdgpu_set_run_cleaner_shader` function to handle writes to the
        `run_cleaner_shader` sysfs file.
      - Added `run_cleaner_shader` to the list of device attributes in
        `amdgpu_device_attrs`.
      - Updated `default_attr_update` to handle `run_cleaner_shader`.
      - Added `AMDGPU_DEVICE_ATTR_WO` macro to create write-only device
        attributes.
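
From user space, triggering the cleaner shader amounts to writing a partition number to the attribute file. The helper below is a minimal sketch: `write_partition` is a hypothetical name, and the caller supplies the attribute path because the exact sysfs location depends on the system (a path such as `/sys/class/drm/card0/device/run_cleaner_shader` is an assumption here, not something this commit message states).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a partition number to a write-only attribute file.
 * Returns 0 on success and -1 on failure.
 */
static int write_partition(const char *attr_path, unsigned int partition)
{
    FILE *f;
    int ok;

    assert(attr_path != NULL);

    f = fopen(attr_path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%u\n", partition) > 0;

    /* Buffered data is flushed on close, so a write error can surface here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

On a GPU without partitioning support there is only one partition, so a caller would typically pass `0`.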
      
      v2: fix error handling (Alex)
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
    • drm/amdgpu: Add enforce_isolation sysfs attribute · e189be9b
      SRINIVASAN SHANMUGAM authored
      
      This commit adds a new sysfs attribute 'enforce_isolation' to control
      the 'enforce_isolation' setting per GPU. The attribute can be read and
      written, and accepts values 0 (disabled) and 1 (enabled).
      
      When 'enforce_isolation' is enabled, reserved VMIDs are allocated for
      each ring. When it's disabled, the reserved VMIDs are freed.
      
      The set function locks a mutex before changing the 'enforce_isolation'
      flag and the VMIDs, and unlocks it afterwards. This ensures that these
      operations are atomic and prevents race conditions and other concurrency
      issues.
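
The set path described above (validate the input, then update the flag and the reserved VMIDs under one lock) can be sketched in user-space C. All names here are hypothetical: `struct gpu_state` and `set_enforce_isolation` are stand-ins, and the reserved VMIDs are modelled as a simple per-ring count rather than real allocations.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the per-GPU state touched by the attribute. */
struct gpu_state {
    pthread_mutex_t lock;
    bool enforce_isolation;
    int reserved_vmids;   /* modelled as a count, one per ring */
    int num_rings;
};

/* Accepts "0" or "1", optionally followed by a newline. */
static int set_enforce_isolation(struct gpu_state *g, const char *buf)
{
    char *end;
    long v;
    bool enable;

    assert(g && buf);

    v = strtol(buf, &end, 10);
    if (end == buf || (*end != '\0' && *end != '\n') || (v != 0 && v != 1))
        return -EINVAL;
    enable = (v == 1);

    /* Flag and VMID bookkeeping change together under the same lock. */
    pthread_mutex_lock(&g->lock);
    if (enable != g->enforce_isolation) {
        g->reserved_vmids = enable ? g->num_rings : 0;
        g->enforce_isolation = enable;
    }
    pthread_mutex_unlock(&g->lock);

    return 0;
}
```

Rejecting invalid input before taking the lock keeps the critical section limited to the state change itself.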
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>