  1. Jun 13, 2023
  2. Jun 07, 2023
  3. Mar 13, 2023
  4. Mar 10, 2023
      drm/amdgpu: Fix call trace warning and hang when removing amdgpu device · 93bb18d2
      lyndonli authored
      
      On GPUs with RAS enabled, the call trace and hang below are observed
      when shutting down the device.
      
      v2: use the DRM device unplugged flag instead of the shutdown flag as the
      check to prevent the memory wipe in the shutdown stage.
      
      [ +0.000000] RIP: 0010:amdgpu_vram_mgr_fini+0x18d/0x1c0 [amdgpu]
      [ +0.000001] PKRU: 55555554
      [ +0.000001] Call Trace:
      [ +0.000001] <TASK>
      [ +0.000002] amdgpu_ttm_fini+0x140/0x1c0 [amdgpu]
      [ +0.000183] amdgpu_bo_fini+0x27/0xa0 [amdgpu]
      [ +0.000184] gmc_v11_0_sw_fini+0x2b/0x40 [amdgpu]
      [ +0.000163] amdgpu_device_fini_sw+0xb6/0x510 [amdgpu]
      [ +0.000152] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [ +0.000090] drm_dev_release+0x28/0x50 [drm]
      [ +0.000016] devm_drm_dev_init_release+0x38/0x60 [drm]
      [ +0.000011] devm_action_release+0x15/0x20
      [ +0.000003] release_nodes+0x40/0xc0
      [ +0.000001] devres_release_all+0x9e/0xe0
      [ +0.000001] device_unbind_cleanup+0x12/0x80
      [ +0.000003] device_release_driver_internal+0xff/0x160
      [ +0.000001] driver_detach+0x4a/0x90
      [ +0.000001] bus_remove_driver+0x6c/0xf0
      [ +0.000001] driver_unregister+0x31/0x50
      [ +0.000001] pci_unregister_driver+0x40/0x90
      [ +0.000003] amdgpu_exit+0x15/0x120 [amdgpu]
      
      Signed-off-by: lyndonli <Lyndon.Li@amd.com>
      Reviewed-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      93bb18d2
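The v2 note in the commit above describes checking the DRM device unplugged flag before wiping memory during teardown. A minimal userspace sketch of that kind of guard follows. It assumes hypothetical stand-in types (`fake_device`, `fake_buffer`) and a hypothetical helper `fake_buffer_release_notify`; none of these are amdgpu or DRM APIs, and the actual driver change may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device; not the real DRM/amdgpu structure. */
struct fake_device {
    bool unplugged;            /* set once the device has been removed */
};

/* Hypothetical stand-in for a buffer object owned by that device. */
struct fake_buffer {
    struct fake_device *dev;
    bool wipe_on_release;      /* buffer asked to be cleared when released */
    unsigned char data[16];
};

/*
 * Release hook: clear the buffer contents unless the device is already
 * unplugged. Returns true if the wipe ran, false if it was skipped.
 */
bool fake_buffer_release_notify(struct fake_buffer *bo)
{
    assert(bo != NULL && bo->dev != NULL);

    if (!bo->wipe_on_release)
        return false;

    /* Guard: do not touch device-backed memory after unplug. */
    if (bo->dev->unplugged)
        return false;

    memset(bo->data, 0, sizeof(bo->data));
    return true;
}
```

The point of the sketch is only the ordering: the unplugged check comes before any work that would need the device.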
  5. Mar 07, 2023
      drm/amdgpu: Fix call trace warning and hang when removing amdgpu device · f999adb7
      lyndonli authored
      
      On GPUs with RAS enabled, the call trace and hang below are observed
      when shutting down the device.
      
      v2: use the DRM device unplugged flag instead of the shutdown flag as the
      check to prevent the memory wipe in the shutdown stage.
      
      [ +0.000000] RIP: 0010:amdgpu_vram_mgr_fini+0x18d/0x1c0 [amdgpu]
      [ +0.000001] PKRU: 55555554
      [ +0.000001] Call Trace:
      [ +0.000001] <TASK>
      [ +0.000002] amdgpu_ttm_fini+0x140/0x1c0 [amdgpu]
      [ +0.000183] amdgpu_bo_fini+0x27/0xa0 [amdgpu]
      [ +0.000184] gmc_v11_0_sw_fini+0x2b/0x40 [amdgpu]
      [ +0.000163] amdgpu_device_fini_sw+0xb6/0x510 [amdgpu]
      [ +0.000152] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
      [ +0.000090] drm_dev_release+0x28/0x50 [drm]
      [ +0.000016] devm_drm_dev_init_release+0x38/0x60 [drm]
      [ +0.000011] devm_action_release+0x15/0x20
      [ +0.000003] release_nodes+0x40/0xc0
      [ +0.000001] devres_release_all+0x9e/0xe0
      [ +0.000001] device_unbind_cleanup+0x12/0x80
      [ +0.000003] device_release_driver_internal+0xff/0x160
      [ +0.000001] driver_detach+0x4a/0x90
      [ +0.000001] bus_remove_driver+0x6c/0xf0
      [ +0.000001] driver_unregister+0x31/0x50
      [ +0.000001] pci_unregister_driver+0x40/0x90
      [ +0.000003] amdgpu_exit+0x15/0x120 [amdgpu]
      
      Signed-off-by: lyndonli <Lyndon.Li@amd.com>
      Reviewed-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      f999adb7
  6. Feb 23, 2023
  7. Feb 09, 2023
  8. Jan 19, 2023
  9. Jan 09, 2023
  10. Dec 14, 2022
  11. Dec 09, 2022
  12. Dec 06, 2022
  13. Oct 27, 2022
  14. Oct 06, 2022
      drm/amdgpu: Set vmbo destroy after pt bo is created · 9a3c6067
      PhilipY authored
      
      Under VRAM usage pressure, mapping to the GPU may fail to create the pt
      bo, leaving vmbo->shadow_list uninitialized. ttm_bo_release then calls
      amdgpu_bo_vm_destroy, which accesses vmbo->shadow_list and produces the
      dmesg output and NULL pointer access backtrace shown below.
      
      Set the vmbo destroy callback to amdgpu_bo_vm_destroy only after the pt
      bo is created successfully; otherwise use the default callback
      amdgpu_bo_destroy.
      
      amdgpu: amdgpu_vm_bo_update failed
      amdgpu: update_gpuvm_pte() failed
      amdgpu: Failed to map bo to gpuvm
      amdgpu 0000:43:00.0: amdgpu: Failed to map peer:0000:43:00.0 mem_domain:2
      BUG: kernel NULL pointer dereference, address:
       RIP: 0010:amdgpu_bo_vm_destroy+0x4d/0x80 [amdgpu]
       Call Trace:
        <TASK>
        ttm_bo_release+0x207/0x320 [amdttm]
        amdttm_bo_init_reserved+0x1d6/0x210 [amdttm]
        amdgpu_bo_create+0x1ba/0x520 [amdgpu]
        amdgpu_bo_create_vm+0x3a/0x80 [amdgpu]
        amdgpu_vm_pt_create+0xde/0x270 [amdgpu]
        amdgpu_vm_ptes_update+0x63b/0x710 [amdgpu]
        amdgpu_vm_update_range+0x2e7/0x6e0 [amdgpu]
        amdgpu_vm_bo_update+0x2bd/0x600 [amdgpu]
        update_gpuvm_pte+0x160/0x420 [amdgpu]
        amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x313/0x1130 [amdgpu]
        kfd_ioctl_map_memory_to_gpu+0x115/0x390 [amdgpu]
        kfd_ioctl+0x24a/0x5b0 [amdgpu]
      
      Signed-off-by: Philip Yang <Philip.Yang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      9a3c6067
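The commit above installs the specialized destroy callback only once creation has succeeded, so that a failed creation is cleaned up by a callback that does not touch uninitialized state. A minimal userspace sketch of that pattern follows. The names (`fake_vmbo`, `fake_bo_destroy`, `fake_bo_vm_destroy`, `fake_bo_create_vm`) and the `simulate_failure` parameter are hypothetical and only mirror the roles described in the commit message; this is not the amdgpu or TTM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical doubly linked list node, standing in for a kernel list head. */
struct fake_list {
    struct fake_list *prev, *next;
};

struct fake_vmbo {
    void (*destroy)(struct fake_vmbo *bo);
    struct fake_list shadow_list;   /* valid only after successful creation */
};

/* Call counters so a caller can observe which callback ran. */
int fake_default_destroy_calls;
int fake_vm_destroy_calls;

/* Default callback: frees the object and never touches shadow_list. */
void fake_bo_destroy(struct fake_vmbo *bo)
{
    fake_default_destroy_calls++;
    free(bo);
}

/* Specialized callback: unlinks shadow_list, so the list must be initialized. */
void fake_bo_vm_destroy(struct fake_vmbo *bo)
{
    assert(bo->shadow_list.prev != NULL && bo->shadow_list.next != NULL);
    bo->shadow_list.prev->next = bo->shadow_list.next;
    bo->shadow_list.next->prev = bo->shadow_list.prev;
    fake_vm_destroy_calls++;
    free(bo);
}

/* Creates an object; simulate_failure models creation failing part-way. */
struct fake_vmbo *fake_bo_create_vm(bool simulate_failure)
{
    struct fake_vmbo *bo = calloc(1, sizeof(*bo));

    if (bo == NULL)
        return NULL;

    /* Start with the callback that is safe for a half-built object. */
    bo->destroy = fake_bo_destroy;

    if (simulate_failure) {
        bo->destroy(bo);            /* released through the default callback */
        return NULL;
    }

    /* Creation succeeded: initialize the list, then switch callbacks. */
    bo->shadow_list.prev = &bo->shadow_list;
    bo->shadow_list.next = &bo->shadow_list;
    bo->destroy = fake_bo_vm_destroy;
    return bo;
}
```

In this sketch, installing `fake_bo_vm_destroy` before the list is initialized would trip the assertion on the failure path, which corresponds to the NULL pointer dereference reported in the log.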
  15. Jul 14, 2022
  16. Jul 11, 2022
  17. May 26, 2022
  18. Apr 07, 2022
  19. Apr 06, 2022
  20. Apr 01, 2022
  21. Mar 25, 2022
  22. Feb 14, 2022
  23. Feb 07, 2022
      drm/amdgpu: Fix recursive locking warning · 447c7997
      Rajneesh Bhardwaj authored
      
      The warning below was observed while running a PyTorch workload on
      Vega10 GPUs. Switch to trylock to avoid conflicts with reservation
      locks that are already held.
      
      [  +0.000003] WARNING: possible recursive locking detected
      [  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
      [  +0.000004] --------------------------------------------
      [  +0.000002] python/4822 is trying to acquire lock:
      [  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000203]
                    but task is already holding lock:
      [  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000017]
                    other info that might help us debug this:
      [  +0.000002]  Possible unsafe locking scenario:
      
      [  +0.000003]        CPU0
      [  +0.000002]        ----
      [  +0.000002]   lock(reservation_ww_class_mutex);
      [  +0.000004]   lock(reservation_ww_class_mutex);
      [  +0.000003]
                     *** DEADLOCK ***
      
      [  +0.000002]  May be due to missing lock nesting notation
      
      [  +0.000003] 7 locks held by python/4822:
      [  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
      kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
      [  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
      [  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
      [  +0.000236]  #3: ffffb2b35606fd28
      (reservation_ww_class_acquire){+.+.}-{0:0}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
      [  +0.000235]  #4: ffff932cbb7181f8
      (reservation_ww_class_mutex){+.+.}-{3:3}, at:
      ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
      drm_dev_enter+0x5/0xa0 [drm]
      [  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
      at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
      [  +0.000195]
                    stack backtrace:
      [  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
      5.13.0-kfd-rajneesh #1030
      [  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
      08/29/2018
      [  +0.000003] Call Trace:
      [  +0.000003]  dump_stack+0x6d/0x89
      [  +0.000010]  __lock_acquire+0xb93/0x1a90
      [  +0.000009]  lock_acquire+0x25d/0x2d0
      [  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
      [  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  ? lock_release+0x13f/0x270
      [  +0.000005]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
      [  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
      [  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
      [  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
      [  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
      [  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
      [  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
      [  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
      [amdgpu]
      [  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
      [  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
      [  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
      [  +0.000216]  ? lock_release+0x13f/0x270
      [  +0.000006]  ? __fget_files+0x107/0x1e0
      [  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
      [  +0.000007]  do_syscall_64+0x36/0x70
      [  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  +0.000007] RIP: 0033:0x7fbff90a7317
      [  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
      48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
      05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
      [  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
      0000000000000010
      [  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
      00007fbff90a7317
      [  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
      0000000000000004
      [  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
      00007fbcc402d880
      [  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
      00000000c0184b18
      [  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
      00007fbcc402d820
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      447c7997
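The commit above replaces a blocking lock acquisition in a release path with a trylock, so the path backs off instead of waiting on a lock that may already be held further up the call chain. A minimal userspace sketch of that idea follows, using a C11 `atomic_flag` as a stand-in lock. All names (`fake_resv`, `fake_bo`, `fake_bo_release_notify`) are hypothetical. The sketch also simplifies the situation in the log: lockdep there reports two locks of the same class, while this example uses a single lock to make the conflict concrete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reservation lock. */
struct fake_resv {
    atomic_flag locked;
};

/* Non-blocking acquire: returns true only if the lock was free. */
bool fake_resv_trylock(struct fake_resv *r)
{
    return !atomic_flag_test_and_set(&r->locked);
}

void fake_resv_unlock(struct fake_resv *r)
{
    atomic_flag_clear(&r->locked);
}

/* Hypothetical buffer object guarded by the lock above. */
struct fake_bo {
    struct fake_resv resv;
    int cleanup_runs;          /* how many times the locked cleanup ran */
};

void fake_bo_init(struct fake_bo *bo)
{
    assert(bo != NULL);
    atomic_flag_clear(&bo->resv.locked);
    bo->cleanup_runs = 0;
}

/*
 * Release hook: run the cleanup only if the lock can be taken without
 * waiting. Returns true if the cleanup ran, false if it was skipped.
 */
bool fake_bo_release_notify(struct fake_bo *bo)
{
    if (!fake_resv_trylock(&bo->resv))
        return false;          /* lock busy: skip instead of blocking */

    bo->cleanup_runs++;
    fake_resv_unlock(&bo->resv);
    return true;
}
```

The trade-off in this sketch is that the cleanup is skipped whenever the lock is busy; whether skipping is acceptable depends on what the guarded work does.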
  24. Jan 27, 2022
  25. Jan 11, 2022
  26. Nov 17, 2021
  27. Nov 05, 2021
  28. Oct 07, 2021
  29. Sep 07, 2021
  30. Aug 25, 2021