amd issues
https://gitlab.freedesktop.org/drm/amd/-/issues
2023-10-17T14:54:08Z
https://gitlab.freedesktop.org/drm/amd/-/issues/2456
Whole system freeze from isolate_migratepages_block / ttm_pool_alloc
2023-10-17T14:54:08Z
nihui
After playing Minecraft 1.19 for nearly 6 hours, I tried to open the journalmap map mod and the whole system froze. It stopped responding, and I had to force a restart.
The last logs I could find point to amdgpu, so I am reporting it here.
This is the first time I have encountered this.
- r9-5950x
- rx7900xtx
- Fedora 37
- linux 6.1.14
- xorg-x11-server-Xorg-1.20.14-18.fc37.x86_64
- mesa 22.3.6
- Plasma X11 desktop environment
The related log is attached below.
## Hardware description:
- CPU: AMD Ryzen 9 5950X (32) @ 3.400GHz
- GPU: 2d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)
- System Memory: 32 GB
- Display(s): 4K 60 Hz
- Type of Display Connection: DP
## System information:
- Distro name and Version: Fedora 37
- Kernel version: Linux nihui-pc 6.1.14-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Feb 26 00:13:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- AMD official driver version: N/A
### Log files (for system lockups / game freezes / crashes)
```
3月 11 17:21:29 nihui-pc kernel: ------------[ cut here ]------------
3月 11 17:21:29 nihui-pc kernel: kernel BUG at mm/zsmalloc.c:1793!
3月 11 17:21:29 nihui-pc kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
3月 11 17:21:29 nihui-pc kernel: CPU: 29 PID: 1645 Comm: Xorg Tainted: G OE 6.1.14-200.fc37.x86_64 #1
3月 11 17:21:29 nihui-pc kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C94/MAG B550M MORTAR WIFI (MS-7C94), BIOS 1.80 07/01/2021
3月 11 17:21:29 nihui-pc kernel: RIP: 0010:zs_page_putback+0x87/0x90
3月 11 17:21:29 nihui-pc kernel: Code: 5d e9 7d 4d a4 00 48 c7 c6 a8 d5 70 9b 48 89 df e8 9e 2d f6 ff 0f 0b 48 c7 c6 80 47 71 9b 48 89 df e8 8d 2d f6 ff 0f 0b 0f 0b <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 53 48 89 fb e8 e1 c3
3月 11 17:21:29 nihui-pc kernel: RSP: 0018:ffffb3edc7fab4b8 EFLAGS: 00010246
3月 11 17:21:29 nihui-pc kernel: RAX: 0000000000000000 RBX: ffff90966e88cc08 RCX: ffffb3edc7fab690
3月 11 17:21:29 nihui-pc kernel: RDX: 00000000000000ff RSI: fffff6908c7b8108 RDI: 0000000000000000
3月 11 17:21:29 nihui-pc kernel: RBP: ffff90966e88cc38 R08: 000000000000006c R09: ffffffffffffffff
3月 11 17:21:29 nihui-pc kernel: R10: 00000000000389c0 R11: ffff909b1f2d5000 R12: ffffb3edc7fab690
3月 11 17:21:29 nihui-pc kernel: R13: dead000000000122 R14: dead000000000100 R15: fffff6908c7b8108
3月 11 17:21:29 nihui-pc kernel: FS: 00007fcd991ffa80(0000) GS:ffff909aff140000(0000) knlGS:0000000000000000
3月 11 17:21:29 nihui-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
3月 11 17:21:29 nihui-pc kernel: CR2: 00007fcd8dc2c0b2 CR3: 0000000110204000 CR4: 0000000000750ee0
3月 11 17:21:29 nihui-pc kernel: PKRU: 55555554
3月 11 17:21:29 nihui-pc kernel: Call Trace:
3月 11 17:21:29 nihui-pc kernel: <TASK>
3月 11 17:21:29 nihui-pc kernel: putback_movable_pages+0x2b1/0x310
3月 11 17:21:29 nihui-pc kernel: isolate_migratepages_block+0x32f/0x1840
3月 11 17:21:29 nihui-pc kernel: ? __compaction_suitable+0x74/0xb0
3月 11 17:21:29 nihui-pc kernel: compact_zone+0x378/0xdd0
3月 11 17:21:29 nihui-pc kernel: compact_zone_order+0xaa/0x100
3月 11 17:21:29 nihui-pc kernel: try_to_compact_pages+0xf0/0x2f0
3月 11 17:21:29 nihui-pc kernel: __alloc_pages_direct_compact+0x85/0x270
3月 11 17:21:29 nihui-pc kernel: __alloc_pages_slowpath.constprop.0+0x6c3/0xe20
3月 11 17:21:29 nihui-pc kernel: ? prepare_alloc_pages.constprop.0+0xf6/0x1a0
3月 11 17:21:29 nihui-pc kernel: __alloc_pages+0x209/0x230
3月 11 17:21:29 nihui-pc kernel: ttm_pool_alloc+0x2af/0x5a0 [ttm]
3月 11 17:21:29 nihui-pc kernel: amdgpu_ttm_tt_populate+0x35/0x90 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: ttm_tt_populate+0x9d/0x140 [ttm]
3月 11 17:21:29 nihui-pc kernel: ttm_bo_handle_move_mem+0x15f/0x170 [ttm]
3月 11 17:21:29 nihui-pc kernel: ttm_mem_evict_first+0x204/0x490 [ttm]
3月 11 17:21:29 nihui-pc kernel: ttm_bo_mem_space+0x1c9/0x220 [ttm]
3月 11 17:21:29 nihui-pc kernel: ttm_bo_validate+0x97/0x120 [ttm]
3月 11 17:21:29 nihui-pc kernel: ? drm_vma_offset_add+0x59/0x60
3月 11 17:21:29 nihui-pc kernel: ttm_bo_init_reserved+0x15f/0x1d0 [ttm]
3月 11 17:21:29 nihui-pc kernel: amdgpu_bo_create+0x1c0/0x480 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: ? amdgpu_bo_vm_destroy+0x80/0x80 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: amdgpu_bo_create_user+0x2c/0x50 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: amdgpu_gem_create_ioctl+0x138/0x370 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: ? amdgpu_bo_vm_destroy+0x80/0x80 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: ? amdgpu_gem_force_release+0x140/0x140 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: drm_ioctl_kernel+0xa9/0x150
3月 11 17:21:29 nihui-pc kernel: drm_ioctl+0x22f/0x410
3月 11 17:21:29 nihui-pc kernel: ? amdgpu_gem_force_release+0x140/0x140 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
3月 11 17:21:29 nihui-pc kernel: __x64_sys_ioctl+0x90/0xd0
3月 11 17:21:29 nihui-pc kernel: do_syscall_64+0x5b/0x80
3月 11 17:21:29 nihui-pc kernel: ? __rseq_handle_notify_resume+0x96/0x460
3月 11 17:21:29 nihui-pc kernel: ? fpregs_restore_userregs+0x12/0xe0
3月 11 17:21:29 nihui-pc kernel: ? exit_to_user_mode_prepare+0x18f/0x1f0
3月 11 17:21:29 nihui-pc kernel: ? syscall_exit_to_user_mode+0x17/0x40
3月 11 17:21:29 nihui-pc kernel: ? do_syscall_64+0x67/0x80
3月 11 17:21:29 nihui-pc kernel: ? exit_to_user_mode_prepare+0x180/0x1f0
3月 11 17:21:29 nihui-pc kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
3月 11 17:21:29 nihui-pc kernel: RIP: 0033:0x7fcd9988bd6f
3月 11 17:21:29 nihui-pc kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
3月 11 17:21:29 nihui-pc kernel: RSP: 002b:00007fffef91c6b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
3月 11 17:21:29 nihui-pc kernel: RAX: ffffffffffffffda RBX: 0000555c28e9e670 RCX: 00007fcd9988bd6f
3月 11 17:21:29 nihui-pc kernel: RDX: 00007fffef91c750 RSI: 00000000c0206440 RDI: 0000000000000012
3月 11 17:21:29 nihui-pc kernel: RBP: 00007fffef91c750 R08: 0000000000000007 R09: 0000000000000010
3月 11 17:21:29 nihui-pc kernel: R10: 0000555c276b0010 R11: 0000000000000246 R12: 00000000c0206440
3月 11 17:21:29 nihui-pc kernel: R13: 0000000000000012 R14: 0000555c2776e360 R15: 0000000000000211
3月 11 17:21:29 nihui-pc kernel: </TASK>
3月 11 17:21:29 nihui-pc kernel: Modules linked in: tls uinput exfat rfcomm snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntra>
3月 11 17:21:29 nihui-pc kernel: videodev bluetooth snd_timer cfg80211 mc joydev rapl snd wmi_bmof pcspkr i2c_piix4 k10temp soundcore rfkill gpio_amdpt gpio_generic acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 video crct10dif_pclmul crc32_pclmul crc32c_intel gpu_sched polyval_clmulni polyval_generic nvme d>
3月 11 17:21:29 nihui-pc kernel: ---[ end trace 0000000000000000 ]---
3月 11 17:21:29 nihui-pc kernel: RIP: 0010:zs_page_putback+0x87/0x90
3月 11 17:21:29 nihui-pc kernel: Code: 5d e9 7d 4d a4 00 48 c7 c6 a8 d5 70 9b 48 89 df e8 9e 2d f6 ff 0f 0b 48 c7 c6 80 47 71 9b 48 89 df e8 8d 2d f6 ff 0f 0b 0f 0b <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 53 48 89 fb e8 e1 c3
3月 11 17:21:29 nihui-pc kernel: RSP: 0018:ffffb3edc7fab4b8 EFLAGS: 00010246
3月 11 17:21:29 nihui-pc kernel: RAX: 0000000000000000 RBX: ffff90966e88cc08 RCX: ffffb3edc7fab690
3月 11 17:21:29 nihui-pc kernel: RDX: 00000000000000ff RSI: fffff6908c7b8108 RDI: 0000000000000000
3月 11 17:21:29 nihui-pc kernel: RBP: ffff90966e88cc38 R08: 000000000000006c R09: ffffffffffffffff
3月 11 17:21:29 nihui-pc kernel: R10: 00000000000389c0 R11: ffff909b1f2d5000 R12: ffffb3edc7fab690
3月 11 17:21:29 nihui-pc kernel: R13: dead000000000122 R14: dead000000000100 R15: fffff6908c7b8108
3月 11 17:21:29 nihui-pc kernel: FS: 00007fcd991ffa80(0000) GS:ffff909aff140000(0000) knlGS:0000000000000000
3月 11 17:21:29 nihui-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
3月 11 17:21:29 nihui-pc kernel: CR2: 00007fcd8dc2c0b2 CR3: 0000000110204000 CR4: 0000000000750ee0
3月 11 17:21:29 nihui-pc kernel: PKRU: 55555554
3月 11 17:21:29 nihui-pc kernel: note: Xorg[1645] exited with preempt_count 1
3月 11 17:21:30 nihui-pc abrt-dump-journal-oops[1511]: abrt-dump-journal-oops: Found oopses: 1
3月 11 17:21:30 nihui-pc abrt-dump-journal-oops[1511]: abrt-dump-journal-oops: Creating problem directories
3月 11 17:21:30 nihui-pc abrt-server[41279]: Can't find a meaningful backtrace for hashing in '.'
3月 11 17:21:30 nihui-pc abrt-server[41279]: Preserving oops '.' because DropNotReportableOopses is 'no'
3月 11 17:21:31 nihui-pc abrt-notification[41296]: [🡕] System encountered a non-fatal error in ??()
3月 11 17:21:31 nihui-pc abrt-dump-journal-oops[1511]: Reported 1 kernel oopses to Abrt
3月 11 17:21:33 nihui-pc kernel: sched: RT throttling activated
3月 11 17:21:56 nihui-pc kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [Port0:3055]
```
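When reading a trace like the one above, it can help to drop the speculative frames (those the kernel prefixes with `?`) and keep only the reliable ones, together with the module each belongs to. The following is a minimal Python sketch, not part of the original report. The helper name `reliable_frames` and the regular expression are illustrative assumptions about the journal line layout shown here.

```python
import re

# Matches one call-trace frame, e.g. "ttm_pool_alloc+0x2af/0x5a0 [ttm]".
# A leading "? " marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"(?P<unreliable>\? )?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)[^\]]*\])?"
)


def reliable_frames(lines):
    """Return (symbol, module) pairs for confirmed frames inside <TASK> blocks.

    Frames without a "[module]" suffix are labeled "kernel" (built-in code).
    """
    frames = []
    in_trace = False
    for line in lines:
        # Keep only the message that follows the "kernel: " journal prefix.
        _, sep, message = line.partition("kernel: ")
        if not sep:
            continue
        message = message.strip()
        if message == "<TASK>":
            in_trace = True
        elif message == "</TASK>":
            in_trace = False
        elif in_trace:
            match = FRAME_RE.fullmatch(message)
            if match and not match.group("unreliable"):
                frames.append(
                    (match.group("symbol"), match.group("module") or "kernel")
                )
    return frames
```

Applied to the log above, the confirmed frames run from `amdgpu_gem_create_ioctl` through `ttm_pool_alloc` into the page allocator's compaction path, which is where `zs_page_putback` hit the BUG.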
https://gitlab.freedesktop.org/drm/amd/-/issues/3225
WARNING: CPU: 2 PID: 17784 at drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x299/0x2f0 [ttm]
2024-02-29T14:14:45Z
Paul Menzel
On a Dell OptiPlex 5055, I found the trace below:
```
$ dmesg | grep -e "DMI:" -e "Linux version" -e microcode
[ 0.000000] Linux version 6.6.12.mx64.461 (root@theinternet.molgen.mpg.de) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT_DYNAMIC Thu Jan 18 10:47:08 CET 2024
[ 0.000000] DMI: Dell Inc. OptiPlex 5055 Ryzen CPU/0P03DX, BIOS 1.1.20 05/31/2019
[ 3.416106] microcode: CPU0: patch_level=0x08001137
[ 3.416110] microcode: CPU5: patch_level=0x08001137
[ 3.416110] microcode: CPU4: patch_level=0x08001137
[ 3.416111] microcode: CPU6: patch_level=0x08001137
[ 3.416112] microcode: CPU2: patch_level=0x08001137
[ 3.416113] microcode: CPU7: patch_level=0x08001137
[ 3.416169] microcode: CPU3: patch_level=0x08001137
[ 3.421199] microcode: CPU1: patch_level=0x08001137
[ 3.447181] microcode: Microcode Update Driver: v2.2.
[…]
[104735.198116] ------------[ cut here ]------------
[104735.202876] WARNING: CPU: 2 PID: 17784 at drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x299/0x2f0 [ttm]
[104735.212812] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs i915 iosf_mbi intel_gtt 8021q garp stp mrp llc input_leds tpm_crb amdgpu hid_microsoft hid_generic usbhid i2c_algo_bit drm_exec drm_suballoc_helper amdxcp snd_ctl_led drm_buddy snd_hda_codec_realtek gpu_sched snd_hda_codec_generic drm_display_helper ledtrig_audio led_class drm_ttm_helper ttm kvm_amd video tg3 snd_hda_codec_hdmi kvm irqbypass tpm_tis tpm_tis_core tpm acpi_cpufreq rng_core snd_hda_intel i2c_piix4 k10temp efi_pstore libphy wmi_bmof wmi snd_intel_dspcfg snd_hda_codec snd_hda_core snd_pcm snd_timer snd soundcore crc32c_intel pstore nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc efivarfs ip_tables x_tables ipv6 autofs4
[104735.274787] CPU: 2 PID: 17784 Comm: Xorg Tainted: G W 6.6.12.mx64.461 #1
[104735.283029] Hardware name: Dell Inc. OptiPlex 5055 Ryzen CPU/0P03DX, BIOS 1.1.20 05/31/2019
[104735.291507] RIP: 0010:ttm_bo_release+0x299/0x2f0 [ttm]
[104735.296778] Code: 49 8b b4 24 40 08 00 00 48 83 c4 38 48 8d 53 30 bf 00 02 00 00 5b 5d 41 5c 41 5d 41 5e e9 bf b2 a7 e0 4c 89 e7 e9 53 fe ff ff <0f> 0b 48 83 7b 20 00 0f 84 9f fd ff ff 0f 0b e9 98 fd ff ff c7 43
[104735.315678] RSP: 0018:ffffc90000a9fdf0 EFLAGS: 00010202
[104735.321038] RAX: 0000000000000000 RBX: ffff88822b79edd0 RCX: 0000000000000000
[104735.328306] RDX: 0000000000000001 RSI: ffffffff8282ba88 RDI: ffff88822b79edd0
[104735.335573] RBP: ffff88822b79ec58 R08: 0000000000000064 R09: ffff888297a3cc08
[104735.342837] R10: ffffc90000a9fd30 R11: ffffc90000a9fd38 R12: ffff88811268eeb0
[104735.350104] R13: ffff888100216a20 R14: ffff888297ca8540 R15: 0000000000000000
[104735.357369] FS: 00007efc89c08940(0000) GS:ffff88840ea80000(0000) knlGS:0000000000000000
[104735.365586] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[104735.371458] CR2: 00007fdcc8968000 CR3: 00000001c8354000 CR4: 00000000003506e0
[104735.378741] Call Trace:
[104735.381303] <TASK>
[104735.383532] ? __warn+0x81/0x140
[104735.386884] ? ttm_bo_release+0x299/0x2f0 [ttm]
[104735.391556] ? report_bug+0x171/0x1a0
[104735.395351] ? handle_bug+0x3c/0x70
[104735.398956] ? exc_invalid_op+0x17/0x70
[104735.402925] ? asm_exc_invalid_op+0x1a/0x20
[104735.407236] ? ttm_bo_release+0x299/0x2f0 [ttm]
[104735.411899] ? srso_return_thunk+0x5/0x10
[104735.416032] ? fsnotify_grab_connector+0x43/0x80
[104735.420780] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[104735.425590] amdgpu_gem_object_free+0x34/0x60 [amdgpu]
[104735.431002] drm_gem_dmabuf_release+0x37/0x50
[104735.435491] dma_buf_release+0x3e/0x90
[104735.439371] __dentry_kill+0xf5/0x170
[104735.443155] __fput+0x13d/0x280
[104735.446424] task_work_run+0x5d/0x90
[104735.450124] exit_to_user_mode_prepare+0x12a/0x130
[104735.455041] syscall_exit_to_user_mode+0x21/0x50
[104735.459788] ? srso_return_thunk+0x5/0x10
[104735.463930] do_syscall_64+0x52/0x90
[104735.467635] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[104735.472822] RIP: 0033:0x7efc8951b29b
[104735.476526] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
[104735.495441] RSP: 002b:00007ffe46f85c10 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[104735.503131] RAX: 0000000000000000 RBX: 00007ffe46f85ca8 RCX: 00007efc8951b29b
[104735.510380] RDX: 00007ffe46f85ca8 RSI: 0000000040086409 RDI: 000000000000000f
[104735.517634] RBP: 0000000040086409 R08: 0000000000000007 R09: 0000000000000000
[104735.524875] R10: de613b6fecf41175 R11: 0000000000000246 R12: 00000000007453d8
[104735.532126] R13: 000000000000000f R14: 0000000000746254 R15: 00007ffe46f85cf8
[104735.539385] </TASK>
[104735.541676] ---[ end trace 0000000000000000 ]---
```
```
$ lspci -nn -s 4:00
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R5 430 OEM / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 87)
04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series] [1002:aab0]
```
[Output of `dmesg`](/uploads/6b4261f0296b0792db5bf8830bfce240/linux-6.6.12--dell-optiplex-5055--system-firmware-1.1.20.txt)
https://gitlab.freedesktop.org/drm/amd/-/issues/3094
Possible memory leaks in `amdgpu` driver and/or `drm` kernel subsystem
2024-03-23T02:51:55Z
slabity
## Brief summary of the problem:
For the past few months I have noticed my system's memory usage continually climbing over the course of a few hours/days to unreasonable amounts. The memory does not get freed, even when I close all applications, log out, and shut down the Wayland compositor (leaving nothing but a TTY to log in to).
None of my monitoring programs show any process taking up the used memory either, so I've essentially ruled out any userspace process.
Running `sync` and `sysctl vm.drop_caches=3` several times in a row does not reduce the used memory, so I'm ruling out the page cache (which `free` reports as a negligible amount anyway).
I have *not* ruled out the TTM cache, based on information I found in [this kernel report](https://bugzilla.kernel.org/show_bug.cgi?id=214425#c4). However, running `cat /sys/kernel/debug/dri/0/amdgpu_evict_vram`, `cat /sys/kernel/debug/dri/0/amdgpu_evict_gtt`, and the "horrible incantation" `for i in {1..1000}; do cat /sys/kernel/debug/ttm/page_pool_shrink; done` does not seem to affect memory usage either. **If anyone has a better method of checking this, please let me know and I'll give it a shot.**
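One way to put a number on "memory that no process or cache accounts for" is to subtract the well-known consumers from `MemTotal` in `/proc/meminfo`. The following is a rough Python sketch, not part of the original report. The choice of fields is an assumption and the result is only an estimate: allocations that bypass these counters, which is how a driver-side leak would likely appear, end up in the remainder.

```python
# Fields treated as "accounted for"; this selection is an assumption and
# deliberately coarse (for example, Shmem is already part of Cached).
ACCOUNTED_FIELDS = (
    "MemFree",
    "Buffers",
    "Cached",
    "AnonPages",
    "Slab",
    "PageTables",
    "KernelStack",
)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def unaccounted_kb(meminfo):
    """Estimate the kB not covered by the ACCOUNTED_FIELDS counters."""
    accounted = sum(meminfo.get(field, 0) for field in ACCOUNTED_FIELDS)
    return meminfo["MemTotal"] - accounted


def read_unaccounted_kb(path="/proc/meminfo"):
    """Convenience wrapper that reads the live counters from `path`."""
    with open(path) as handle:
        return unaccounted_kb(parse_meminfo(handle.read()))
```

Sampling `read_unaccounted_kb()` periodically, with all applications closed, would show whether the remainder keeps growing over hours or days.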
In an attempt to debug, I have compiled a kernel with `kmemleak` enabled and have found a very large number of reported memory leaks. These *might be false positives*, but they all seem to be coming from the `amdgpu` driver and `drm` subsystems. The primary one seems to be located in `dcn32_*` functions called from `amdgpu_dm_atomic_commit_tail`. Here's an example:
```
unreferenced object 0xffff8881699f0000 (size 24624):
comm "kworker/u66:1", pid 198, jiffies 4294694718 (age 25021.159s)
hex dump (first 32 bytes):
01 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff9a1040b5>] __kmalloc_large_node+0xc5/0x150
[<ffffffff9a1048d6>] __kmalloc_node+0xc6/0x150
[<ffffffff9a0f6373>] kvmalloc_node+0x43/0xd0
[<ffffffffc09be00d>] dc_create_transfer_func+0x1d/0x30 [amdgpu]
[<ffffffffc09bee23>] dc_create_stream_for_sink+0x233/0x2a0 [amdgpu]
[<ffffffffc091d85e>] dcn32_add_phantom_pipes+0x4e/0x470 [amdgpu]
[<ffffffffc0826238>] dcn32_internal_validate_bw+0x1288/0x1ca0 [amdgpu]
[<ffffffffc0826f24>] dcn32_calculate_wm_and_dlg_fpu+0x144/0x14d0 [amdgpu]
[<ffffffffc091df05>] dcn32_calculate_wm_and_dlg+0x45/0x60 [amdgpu]
[<ffffffffc092a7a5>] dml1_validate+0x135/0x350 [amdgpu]
[<ffffffffc09b2718>] dc_update_planes_and_stream+0x7e8/0x1230 [amdgpu]
[<ffffffffc073d0ec>] amdgpu_dm_atomic_commit_tail+0x198c/0x3a90 [amdgpu]
[<ffffffffc0326ca4>] commit_tail+0x94/0x130 [drm_kms_helper]
[<ffffffff99ecc006>] process_one_work+0x176/0x340
[<ffffffff99ecc45b>] worker_thread+0x27b/0x3a0
[<ffffffff99ed5ff7>] kthread+0xd7/0x100
```
I will attach a log of a handful of these messages further down.
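With many kmemleak reports, grouping them by the first `amdgpu` frame in each backtrace shows which allocation sites dominate. The following is a minimal Python sketch, assuming the report layout shown in the example (an "unreferenced object" header followed by `[<address>] symbol+offset/length [module]` frames). `summarize_kmemleak` is a hypothetical helper, not a kernel tool.

```python
import re
from collections import Counter

# Header line of one kmemleak report; captures the object size in bytes.
HEADER_RE = re.compile(r"unreferenced object 0x[0-9a-f]+ \(size (\d+)\)")
# One backtrace frame; captures the symbol and, if present, the module name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def summarize_kmemleak(text, module="amdgpu"):
    """Map the first frame inside `module` to the total bytes reported there."""
    totals = Counter()
    size = None  # size of the report currently being read, if not yet counted
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            size = int(header.group(1))
            continue
        frame = FRAME_RE.search(line)
        if frame and size is not None and frame.group(2) == module:
            totals[frame.group(1)] += size
            size = None  # count each object once, at its first matching frame
    return totals
```

For the example report above, this attributes the 24624 bytes to `dc_create_transfer_func`. Calling `summarize_kmemleak(text).most_common(10)` on a full log lists the largest sites first.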
## Hardware description:
- CPU: AMD Ryzen 7 3800X 8-Core Processor
- GPU: `0d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)`
- System Memory: 64GB DDR4
- Display(s):
  - QX2710LED, 2560x1440, DP->DVI-D converter
  - Alienware AW3423DWF, 3440x1440, DP
## System information:
- Distro name and Version: I am using NixOS, and I'm running on the `nixpkgs-unstable` channel.
- Kernel version: 6.7.0-rc7 (I have had this issue on 6.6 and 6.5 as well, not sure before that)
- Custom kernel: N/A
- AMD official driver version:
- Mainline `amdgpu` driver
- Mesa 23.1.9
## How to reproduce the issue:
Unfortunately I'm not sure how to reproduce other than having the same hardware/software setup indicated above.
## Attached files:
### Log files (for system lockups / game freezes / crashes)
[kmemleak.log](/uploads/8457fb56a57e31e32ae3433837e7e258/kmemleak.log)
https://gitlab.freedesktop.org/drm/amd/-/issues/2835
Playing a DX12 game leads to Warning in drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1347
2023-09-22T16:56:54Z
ms178
With kernel 6.4.15, I saw the following trace for the first time while playing a DX12 game:
```
[ 9898.765200] ------------[ cut here ]------------
[ 9898.765201] WARNING: CPU: 5 PID: 21561 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1347 amdgpu_bo_release_notify+0x1b1/0x1e0 [amdgpu]
[ 9898.765321] Modules linked in: snd_hda_codec_realtek snd_hda_codec_generic vfat intel_rapl_msr fat intel_rapl_common sb_edac ledtrig_audio snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_intel snd_intel_dspcfg coretemp snd_hda_codec crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic snd_hwdep gf128mul ghash_clmulni_intel sha512_ssse3 snd_hda_core aesni_intel crypto_simd snd_pcm cryptd snd_timer acpi_cpufreq i2c_i801 i2c_smbus igb snd lpc_ich mei_wdt soundcore razerkbd(O) mousedev sch_fq_codel usbip_host usbip_core pkcs8_key_parser crypto_user fuse loop zram bpf_preload ip_tables x_tables ext4 crc32c_generic mbcache crc16 jbd2 usbhid amdgpu mfd_core drm_buddy drm_suballoc_helper video drm_ttm_helper crc32c_intel ttm i2c_algo_bit drm_display_helper cec xhci_pci xhci_pci_renesas gpu_sched wmi
[ 9898.765342] CPU: 5 PID: 21561 Comm: winepulse_mainl Tainted: G W O 6.4.15-4.1-cachyos-lto #1 6906424ba8419a309775a57bd0e84f63fd1761bc
[ 9898.765344] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[ 9898.765345] RIP: 0010:amdgpu_bo_release_notify+0x1b1/0x1e0 [amdgpu]
[ 9898.765464] Code: e8 74 3b 62 c3 eb 02 7c 25 48 8b bb f8 00 00 00 e8 44 3d 31 c4 48 83 c4 10 5b 41 5e 41 5f c3 0f 0b e9 d5 fe ff ff 0f 0b eb eb <0f> 0b eb db be 03 00 00 00 e8 81 eb dd c3 eb cf 00 00 00 00 00 00
[ 9898.765465] RSP: 0018:ffff8b45ad0679b8 EFLAGS: 00010286
[ 9898.765466] RAX: 00000000fffffe00 RBX: ffff8b47e3678c58 RCX: 0000000000000000
[ 9898.765467] RDX: 00000000006d5245 RSI: fffffb9053010b01 RDI: ffff8b45ca7065f8
[ 9898.765468] RBP: ffff8b47e3678c58 R08: 0000000000000000 R09: ffffffff00000000
[ 9898.765468] R10: 0000000000000000 R11: ffff8b4200252300 R12: ffff8b47e3678c58
[ 9898.765469] R13: ffff8b48f73b0520 R14: ffff8b47e3678c00 R15: ffff8b45ca700010
[ 9898.765470] FS: 0000000000000000(0000) GS:ffff8b495f740000(0000) knlGS:0000000000000000
[ 9898.765471] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9898.765472] CR2: 00007f5cbf510010 CR3: 0000000575b3e004 CR4: 00000000001706e0
[ 9898.765472] Call Trace:
[ 9898.765473] <TASK>
[ 9898.765474] ? __warn+0x9e/0x160
[ 9898.765477] ? amdgpu_bo_release_notify+0x1b1/0x1e0 [amdgpu 88dc1ef9468b93fe48205cd5f8a3ff9e760dd224]
[ 9898.765597] ? report_bug+0x14c/0x180
[ 9898.765599] ? handle_bug+0x41/0x80
[ 9898.765601] ? exc_invalid_op+0x16/0x40
[ 9898.765602] ? asm_exc_invalid_op+0x16/0x20
[ 9898.765605] ? amdgpu_bo_release_notify+0x1b1/0x1e0 [amdgpu 88dc1ef9468b93fe48205cd5f8a3ff9e760dd224]
[ 9898.765724] ? amdgpu_bo_release_notify+0x121/0x1e0 [amdgpu 88dc1ef9468b93fe48205cd5f8a3ff9e760dd224]
[ 9898.765844] ttm_bo_put+0x165/0x440 [ttm c9ceab27450a9459ebd219665552c85b65a4754f]
[ 9898.765850] drm_gem_release+0x170/0x360
[ 9898.765852] drm_release+0x22a/0x4e0
[ 9898.765853] ____fput+0x149/0x2900
[ 9898.765856] ? exit_files+0x25e/0x6a0
[ 9898.765858] ? exit_sem+0x4b8/0xca0
[ 9898.765860] ? kfree+0x2ea/0x980
[ 9898.765862] ? exit_sem+0x4b8/0xca0
[ 9898.765863] ? do_exit+0x695/0x1320
[ 9898.765865] do_exit+0x743/0x1320
[ 9898.765867] do_group_exit+0x7f/0xa0
[ 9898.765869] get_signal+0x32a/0xc20
[ 9898.765871] arch_do_signal_or_restart+0x1a/0x200
[ 9898.765873] exit_to_user_mode_prepare+0x1856/0x1b00
[ 9898.765875] ? syscall_exit_to_user_mode+0x28/0x1a0
[ 9898.765878] syscall_exit_to_user_mode+0x28/0x1a0
[ 9898.765880] do_syscall_64+0x6b/0xa0
[ 9898.765894] ? do_syscall_64+0x6b/0xa0
[ 9898.765896] ? do_syscall_64+0x6b/0xa0
[ 9898.765897] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 9898.765900] RIP: 0033:0x7f9e6593ad0f
[ 9898.765904] Code: Unable to access opcode bytes at 0x7f9e6593ace5.
[ 9898.765904] RSP: 002b:000000013e3ae530 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[ 9898.765906] RAX: fffffffffffffdfc RBX: 00007f9d445fc960 RCX: 00007f9e6593ad0f
[ 9898.765907] RDX: 00000000ffffffff RSI: 0000000000000002 RDI: 00007f9d445fc960
[ 9898.765907] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[ 9898.765908] R10: 000d000d00000000 R11: 0000000000000293 R12: 00000000ffffffff
[ 9898.765909] R13: 00007f9e56887380 R14: 0000000000000001 R15: 0000000000000000
[ 9898.765910] </TASK>
[ 9898.765915] ---[ end trace 0000000000000000 ]---
```
System info:
```
System:
Host: klx99 Kernel: 6.4.15-4.1-cachyos-lto arch: x86_64 bits: 64
Desktop: KDE Plasma v: 5.27.7 Distro: CachyOS
Machine:
Type: Desktop System: LENOVO product: GAMING TF v: N/A
serial: <superuser required>
Mobo: Lenovo model: X99-TF Gaming v: G368J V1.1, NALEX
CPU:
Info: 18-core model: Intel Xeon E5-2696 v3 bits: 64 type: MT MCP cache:
L2: 4.5 MiB
Graphics:
Device-1: AMD Navi 21 [Radeon RX 6950 XT] driver: amdgpu v: kernel
Display: x11 server: X.Org v: 21.1.99 with: Xwayland v: 23.2.0 driver: X:
loaded: amdgpu unloaded: modesetting dri: radeonsi gpu: amdgpu
resolution: 2560x1440
API: OpenGL v: 4.6 Mesa 23.3.0-devel (git-bb91e0306c) renderer: AMD
Radeon RX 6950 XT (navi21 LLVM 18.0.0 DRM 3.52 6.4.15-4.1-cachyos-lto)
```
Here is the full dmesg output:
[amdgpu_dmesg_log.txt](/uploads/ac45c05ed270b688d3c3b52382812acd/amdgpu_dmesg_log.txt)
https://gitlab.freedesktop.org/drm/amd/-/issues/3041
page faults when unplugging rx 6800 XT
2023-12-11T14:30:10Z
Xaver Hugl
The RX 6800 XT is in a Razer Core X, and when I unplugged it from my Framework Laptop 13, I got a lot of page fault warnings in my log.
This is the dmesg output after unplugging, with kernel 6.7-rc4: [hotunplug.log](/uploads/35a67517078cfbd616b8f89c0a7c2c8a/hotunplug.log)
Other than that, the last atomic commit with `DRM_MODE_PAGE_FLIP_EVENT` didn't result in a pageflip event, which caused KWin to hang for a bit, but after that everything seemed to work fine again.
https://gitlab.freedesktop.org/drm/amd/-/issues/3011
list_add corruption with rusticl on amdgpu (Phoenix APU)
2023-12-04T15:02:57Z
Niccolò Belli
```
list_add corruption. next->prev should be prev (ffffffffc0d474f0), but was ffff92ebfe2fbfd8. (next=ffffe87e44be5608)
```
I tried to run the [setubal](https://math.dartmouth.edu/~sarunas/darktable_bench.html) darktable benchmark (`darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf -d opencl`) with the latest Mesa git, and it spammed my journal with 27MB of list_add corruption messages: [journal.gz](https://gitlab.freedesktop.org/drm/amd/uploads/0049704943cbc149d2214d137fd3cfa3/journal.gz)
(the log is gzipped)
I have 64GB of RAM, and I filled a tmpfs with 40GB of data so that only 24GB of RAM remained available, and it failed to allocate a buffer:
![image](/uploads/2dc35faa5b2938145b00d721572d77ee/image.png)
I tried to open Chromium afterwards, but I guess it didn't manage to reclaim the memory, because it slowed down to a crawl and I had to reboot using the magic SysRq keys.
> [What seems to happen is that the OOM killer runs and our error handling in the TTM pool is buggy and corrupts the free list.](https://gitlab.freedesktop.org/drm/amd/-/issues/2912#note_2180747)
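The wording `next->prev should be prev` comes from a consistency check on the kernel's doubly linked lists: before a node is inserted between `prev` and `next`, the two neighbours must still point at each other. The following Python sketch only illustrates that invariant. It is not kernel code, and the class and function names are made up for illustration.

```python
class Node:
    """A node in a circular doubly linked list; a fresh node points at itself."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add_valid(new, prev, nxt):
    """Check that `new` can be linked between `prev` and `nxt`."""
    if nxt.prev is not prev:  # "next->prev should be prev"
        return False
    if prev.next is not nxt:  # "prev->next should be next"
        return False
    if new is prev or new is nxt:  # inserting a node next to itself
        return False
    return True


def list_add(new, prev, nxt):
    """Insert `new` between `prev` and `nxt`, refusing on a broken list."""
    if not list_add_valid(new, prev, nxt):
        raise RuntimeError("list_add corruption")
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
```

If an error path frees or reuses a node while it is still linked, a neighbour's `prev` pointer ends up referring to something else, and the next insertion at that position trips the check, which matches the message in the journal.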
https://gitlab.freedesktop.org/drm/amd/-/issues/2983
GTT memory not being freed, resulting in high system RAM being used, and often OOMs
2023-11-16T15:29:24Z
QwertyChouskie
Right now my system is sitting at 92% RAM usage (of 16GB) and 100% swap usage (1GB). Radeontop lists an absolutely massive GTT size of 7663MB, of which only around 4000MB is currently being used (and that number could certainly go down further once I close out more browser tabs, since they seem to be the main user of VRAM right now). As far as I can tell, there's no way to force the kernel driver to reduce the GTT size, and it certainly doesn't appear to be happening automatically.
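The GTT figures that radeontop shows can also be read directly from sysfs, which makes it easier to log them over time. The following is a minimal Python sketch, assuming the amdgpu device exposes `mem_info_gtt_total` and `mem_info_gtt_used` (byte counts) under `/sys/class/drm/card0/device`. The card index varies between systems, and `read_gtt_usage` and `format_mib` are hypothetical helpers.

```python
from pathlib import Path


def read_gtt_usage(device_dir="/sys/class/drm/card0/device"):
    """Return (used_bytes, total_bytes) for GTT as reported by amdgpu."""
    base = Path(device_dir)
    total = int((base / "mem_info_gtt_total").read_text())
    used = int((base / "mem_info_gtt_used").read_text())
    return used, total


def format_mib(num_bytes):
    """Format a byte count as whole mebibytes."""
    return f"{num_bytes / (1024 * 1024):.0f} MiB"
```

Printing `format_mib(used)` and `format_mib(total)` before and after closing applications would show whether the used portion actually shrinks.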
https://gitlab.freedesktop.org/drm/amd/-/issues/3171
BUG: KFENCE: use-after-free read in amdgpu_bo_move+0x1ce/0x710 [amdgpu]
2024-03-27T06:01:13Z
Martin Wolf
### System information
```
System:
Host: el-ryzerino Kernel: 6.7.4-200.fc39.x86_64 arch: x86_64 bits: 64
compiler: gcc v: 2.40-14.fc39 Desktop: GNOME v: 45.3 tk: GTK v: 3.24.41
wm: gnome-shell dm: GDM Distro: Fedora release 39 (Thirty Nine)
CPU:
Info: 16-core model: AMD Ryzen 9 5950X bits: 64 type: MT MCP arch: Zen 3+
rev: 2 cache: L1: 1024 KiB L2: 8 MiB L3: 64 MiB
Speed (MHz): avg: 3400 min/max: 2200/5083 boost: enabled cores: 1: 3400
2: 3400 3: 3400 4: 3400 5: 3400 6: 3400 7: 3400 8: 3400 9: 3400 10: 3400
11: 3400 12: 3400 13: 3400 14: 3400 15: 3400 16: 3400 17: 3400 18: 3400
19: 3400 20: 3400 21: 3400 22: 3400 23: 3400 24: 3400 25: 3400 26: 3400
27: 3400 28: 3400 29: 3400 30: 3400 31: 3400 32: 3400 bogomips: 217189
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: ASRock
driver: amdgpu v: kernel arch: RDNA-3 pcie: speed: 16 GT/s lanes: 16 ports:
active: DP-4,HDMI-A-1 empty: DP-1, DP-2, DP-3, DP-5 bus-ID: 0e:00.0
chip-ID: 1002:744c
Display: server: X.Org v: 1.20.14 with: Xwayland v: 23.2.4
compositor: gnome-shell driver: X: loaded: amdgpu
unloaded: fbdev,modesetting,radeon,vesa dri: radeonsi gpu: amdgpu
display-ID: :0 screens: 1
Screen-1: 0 s-res: 4480x1440 s-dpi: 96
Monitor-1: DP-4 mapped: DisplayPort-3 pos: right model: HP Z24n G2
res: 1920x1200 dpi: 94 diag: 611mm (24.1")
Monitor-2: HDMI-A-1 mapped: HDMI-A-0 pos: primary,left model: XG27WQ
res: 2560x1440 dpi: 109 diag: 703mm (27.7")
API: EGL v: 1.5 platforms: device: 0 drv: radeonsi device: 1 drv: swrast
surfaceless: drv: radeonsi x11: drv: radeonsi inactive: gbm,wayland
API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 23.3.5 glx-v: 1.4
direct-render: yes renderer: AMD Radeon RX 7900 GRE (radeonsi navi31 LLVM
17.0.6 DRM 3.57 6.7.4-200.fc39.x86_64) device-ID: 1002:744c
API: Vulkan v: 1.3.268 surfaces: xcb,xlib device: 0 type: discrete-gpu
driver: mesa radv device-ID: 1002:744c device: 1 type: cpu
driver: mesa llvmpipe device-ID: 10005:0000
```
- OS: `Fedora Linux 39 (Workstation Edition)`
- GPU: `0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev ce)` (it is an RX 7900 GRE)
- Kernel version: `Linux el-ryzerino 6.7.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 5 22:21:14 UTC 2024 x86_64 GNU/Linux`
- Mesa version: `OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.3.5`
- Desktop manager and compositor: `Gnome 45`
### Describe the issue
I noticed an error in dmesg after waking up from suspend. I am not entirely sure what caused it.
### Log files as attachment
```
[34258.413097] ==================================================================
[34258.413099] BUG: KFENCE: use-after-free read in amdgpu_bo_move+0x1ce/0x710 [amdgpu]
[34258.413269] Use-after-free read at 0x000000008d0cefe0 (in kfence-#98):
[34258.413270] amdgpu_bo_move+0x1ce/0x710 [amdgpu]
[34258.413413] ttm_bo_handle_move_mem+0xbb/0x170 [ttm]
[34258.413417] ttm_bo_validate+0xe5/0x180 [ttm]
[34258.413422] amdgpu_cs_bo_validate+0x9c/0x2e0 [amdgpu]
[34258.413565] amdgpu_vm_validate_pt_bos+0xbd/0x380 [amdgpu]
[34258.413709] amdgpu_cs_parser_bos.isra.0+0x490/0x820 [amdgpu]
[34258.413845] amdgpu_cs_ioctl+0xa2d/0x1a30 [amdgpu]
[34258.413975] drm_ioctl_kernel+0xd6/0x180
[34258.413978] drm_ioctl+0x26d/0x4b0
[34258.413979] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[34258.414104] __x64_sys_ioctl+0x97/0xd0
[34258.414107] do_syscall_64+0x64/0xe0
[34258.414109] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[34258.414112] kfence-#98: 0x00000000dfd76b32-0x00000000f369dda2, size=240, cache=kmalloc-256
[34258.414113] allocated by task 193187 on cpu 17 at 34251.126096s:
[34258.414265] __kmem_cache_alloc_node+0x2a7/0x2e0
[34258.414267] kmalloc_trace+0x2a/0xa0
[34258.414269] amdgpu_gtt_mgr_new+0x40/0x140 [amdgpu]
[34258.414403] ttm_resource_alloc+0x3b/0x80 [ttm]
[34258.414407] ttm_bo_mem_space+0x88/0x230 [ttm]
[34258.414411] ttm_mem_evict_first+0x1c6/0x530 [ttm]
[34258.414415] ttm_resource_manager_evict_all+0xa7/0x1d0 [ttm]
[34258.414419] amdgpu_device_prepare+0x4e/0xd0 [amdgpu]
[34258.414546] pci_pm_prepare+0x34/0x70
[34258.414547] dpm_prepare+0x269/0x440
[34258.414549] dpm_suspend_start+0x1e/0x90
[34258.414551] suspend_devices_and_enter+0x16a/0x970
[34258.414552] pm_suspend+0x25e/0x590
[34258.414553] state_store+0x6c/0xd0
[34258.414555] kernfs_fop_write_iter+0x136/0x1d0
[34258.414556] vfs_write+0x23d/0x400
[34258.414558] ksys_write+0x6f/0xf0
[34258.414559] do_syscall_64+0x64/0xe0
[34258.414560] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[34258.414562] freed by task 53793 on cpu 27 at 34258.413092s:
[34258.414961] ttm_resource_free+0x6b/0x80 [ttm]
[34258.414965] ttm_bo_move_accel_cleanup+0xc8/0x2a0 [ttm]
[34258.414969] amdgpu_bo_move+0x5d0/0x710 [amdgpu]
[34258.415099] ttm_bo_handle_move_mem+0xbb/0x170 [ttm]
[34258.415103] ttm_bo_validate+0xe5/0x180 [ttm]
[34258.415107] amdgpu_cs_bo_validate+0x9c/0x2e0 [amdgpu]
[34258.415239] amdgpu_vm_validate_pt_bos+0xbd/0x380 [amdgpu]
[34258.415374] amdgpu_cs_parser_bos.isra.0+0x490/0x820 [amdgpu]
[34258.415505] amdgpu_cs_ioctl+0xa2d/0x1a30 [amdgpu]
[34258.415637] drm_ioctl_kernel+0xd6/0x180
[34258.415638] drm_ioctl+0x26d/0x4b0
[34258.415639] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[34258.415766] __x64_sys_ioctl+0x97/0xd0
[34258.415768] do_syscall_64+0x64/0xe0
[34258.415769] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[34258.415771] CPU: 27 PID: 53793 Comm: firefox:cs0 Not tainted 6.7.4-200.fc39.x86_64 #1
[34258.415773] Hardware name: To Be Filled By O.E.M. B550 Taichi/B550 Taichi, BIOS P3.40 01/18/2024
[34258.415774] ==================================================================
```
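For triage, the key fields of a KFENCE report like the one above can be pulled out with a few regular expressions. This is only a minimal sketch: the function name `summarize_kfence` and the returned dictionary layout are made up for illustration, and the sample contains just three lines copied from the log above. In this report the object was allocated during the suspend-time eviction and freed roughly seven seconds later by the same task (`firefox:cs0`, PID 53793) that then triggered the read.

```python
import re

# Header line of a KFENCE report: error kind plus the faulting symbol.
HEADER = re.compile(
    r"BUG: KFENCE: (?P<kind>.+?) in (?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# "allocated by task ..." and "freed by task ..." lines.
EVENT = re.compile(
    r"(?P<event>allocated|freed) by task (?P<task>\d+) "
    r"on cpu (?P<cpu>\d+) at (?P<time>\d+\.\d+)s"
)


def summarize_kfence(report: str) -> dict:
    """Hypothetical helper: collect error kind, symbol, and alloc/free events."""
    summary = {}
    for line in report.splitlines():
        match = HEADER.search(line)
        if match:
            summary["kind"] = match["kind"]
            summary["symbol"] = match["symbol"]
            continue
        match = EVENT.search(line)
        if match:
            summary[match["event"]] = {
                "task": int(match["task"]),
                "cpu": int(match["cpu"]),
                "time": float(match["time"]),
            }
    return summary


# Three lines copied from the report above.
SAMPLE = """\
[34258.413099] BUG: KFENCE: use-after-free read in amdgpu_bo_move+0x1ce/0x710 [amdgpu]
[34258.414113] allocated by task 193187 on cpu 17 at 34251.126096s:
[34258.414562] freed by task 53793 on cpu 27 at 34258.413092s:
"""

summary = summarize_kfence(SAMPLE)
lifetime = summary["freed"]["time"] - summary["allocated"]["time"]
print(summary["kind"], "in", summary["symbol"])
print(f"object lived for about {lifetime:.2f} s")
```

Feeding it the full `dmesg` output instead of `SAMPLE` should work the same way, assuming the report format matches the one shown here.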
https://gitlab.freedesktop.org/drm/amd/-/issues/3091
AMD Radeon PRO W7700 fail to display after several suspend/resume
2024-03-26T09:49:28Z
Chris Chiu
## Brief summary of the problem:
Sometimes the GPU fails to display anything after resuming from S3. The system is still alive and can be accessed over the network, but the screen stays blank and does not come back until reboot. To reproduce it easily, run `fwts s3 --s3-multiple=100`; it happens on every run.
## Hardware description:
- CPU: lshw -C display -numeric
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] [1002:7470]
- Display(s): Dell U2720Q
## System information:
- Distro name and Version: Ubuntu 22.04
- Kernel version: Linux ubuntu 6.7.0-060700rc8drmtip20240104-generic
## How to reproduce the issue:
1. Connect to monitor via DP
2. Boot into OS
3. suspend/resume with `fwts s3 --s3-multiple=100`
dmesg shows the following when the screen goes blank:
```
[ 9909.393371] ubuntu kernel: Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 9909.393413] ubuntu kernel: Call Trace:
[ 9909.393417] ubuntu kernel: <TASK>
[ 9909.393424] ubuntu kernel: __schedule+0x2cb/0x760
[ 9909.393437] ubuntu kernel: schedule+0x33/0x110
[ 9909.393443] ubuntu kernel: schedule_timeout+0x157/0x170
[ 9909.393454] ubuntu kernel: dma_fence_default_wait+0x1e1/0x220
[ 9909.393462] ubuntu kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 9909.393469] ubuntu kernel: dma_fence_wait_timeout+0x116/0x140
[ 9909.393476] ubuntu kernel: dma_resv_wait_timeout+0x7f/0xf0
[ 9909.393485] ubuntu kernel: ttm_bo_delayed_delete+0x2a/0xc0 [ttm]
```
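When running many suspend/resume cycles, it may help to check each captured log for this specific signature automatically. The sketch below assumes the trace format shown above; the helper names `reliable_frames` and `is_blocked_delayed_delete` are hypothetical. It ignores frames prefixed with `?` (which the kernel marks as unreliable) and reports whether `ttm_bo_delayed_delete` appears to be waiting in `dma_fence_default_wait`.

```python
import re

# One stack frame as printed via journald, e.g.
# "ubuntu kernel: ? symbol+0x10/0x10"; a leading "?" marks an unreliable frame.
FRAME = re.compile(
    r"kernel:\s+(?P<guess>\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace: str) -> list:
    """Return symbol names of the frames not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match and not match["guess"]:
            frames.append(match["sym"])
    return frames


def is_blocked_delayed_delete(trace: str) -> bool:
    """True if the trace shows ttm_bo_delayed_delete waiting on a fence."""
    frames = reliable_frames(trace)
    return "ttm_bo_delayed_delete" in frames and "dma_fence_default_wait" in frames


# Lines copied from the trace above.
SAMPLE = """\
[ 9909.393371] ubuntu kernel: Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 9909.393454] ubuntu kernel: dma_fence_default_wait+0x1e1/0x220
[ 9909.393462] ubuntu kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[ 9909.393485] ubuntu kernel: ttm_bo_delayed_delete+0x2a/0xc0 [ttm]
"""

frames = reliable_frames(SAMPLE)
blocked = is_blocked_delayed_delete(SAMPLE)
print(frames, blocked)
```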
## Attached files:
[drmtip_dmesg.log](/uploads/4c9bf6121fa061f3333627c7e8698f05/drmtip_dmesg.log)
https://gitlab.freedesktop.org/drm/amd/-/issues/1869
5.16.x kernel occasionally lock up system when turning DPMS on/off
2024-01-24T15:43:52Z
LaserEyess
## Brief summary of the problem:
Since upgrading from 5.15 to 5.16, my entire system occasionally locks up when my monitors go to sleep via DPMS, to the point where the machine cannot even be pinged. The monitors wake back up on mouse movement, but there is no other indication that the machine is on.
## Hardware description:
- CPU: AMD Ryzen 9 3900X
- GPU: `Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]`
- System Memory: 32 GB
- Display(s): 2x 1440p144 Hz monitors, running at 120 Hz
- Type of Display Connection: DP 1.4
## System information:
- Distro name and Version: Arch Linux
- Kernel version: 5.16.1.arch1-1
- Custom kernel: No
- AMD official driver version: N/A
Other relevant information: I'm using wayland, sway 1.6.1.
## How to reproduce the issue:
1. Boot into 5.16
2. Turn off DPMS on your monitors
3. ????
4. Try to wake monitors back up
5. Your machine is hardlocked
## Attached files:
### Screenshots/video files
N/A
### Log files (for system lockups / game freezes / crashes)
I should mention that these logs are *not* from the time of the crash; they are from minutes before it. The logs end very abruptly in my journal, and I would guess that the lockup caused writes to journald to fail, so no information about the actual crash is available. That being said, amdgpu did give some feedback before the crash:
<details><summary>Log before crash</summary>
```
Jan 17 18:16:00 mami kernel: ------------[ cut here ]------------
Jan 17 18:16:00 mami kernel: amdgpu 0000:0a:00.0: drm_WARN_ON(atomic_read(&vblank->refcount) == 0)
Jan 17 18:16:00 mami kernel: WARNING: CPU: 5 PID: 257 at drivers/gpu/drm/drm_vblank.c:1210 drm_vblank_put+0xee/0x100
Jan 17 18:16:00 mami kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nct6775 hwmon_vid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c md_mod sunrpc nls_iso8859_1 vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr wireguard snd_hda_intel intel_rapl_common curve25519_x86_64 snd_intel_dspcfg libchacha20poly1305 snd_intel_sdw_acpi chacha_x86_64 snd_hda_codec poly1305_x86_64 libblake2s edac_mce_amd snd_hda_core blake2s_x86_64 eeepc_wmi asus_wmi libcurve25519_generic snd_hwdep amdgpu sparse_keymap snd_pcm libchacha kvm_amd libblake2s_generic platform_profile snd_timer ip6_udp_tunnel gpu_sched kvm snd drm_ttm_helper rapl udp_tunnel video pcspkr mxm_wmi wmi_bmof k10temp i2c_piix4 ttm soundcore mousedev joydev cfg80211 tpm_crb tpm_tis rfkill tpm_tis_core mac_hid acpi_cpufreq fuse ip_tables x_tables ext4 crc32c_generic
Jan 17 18:16:00 mami kernel: crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm hid_logitech_hidpp hid_logitech_dj usbhid uas usb_storage bridge crct10dif_pclmul crc32_pclmul stp crc32c_intel llc ghash_clmulni_intel aesni_intel crypto_simd cryptd ccp sp5100_tco rng_core sr_mod igb cdrom xhci_pci dca xhci_pci_renesas wmi pinctrl_amd vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod xpad ff_memless ipmi_devintf ipmi_msghandler sg bonding tls
Jan 17 18:16:00 mami kernel: CPU: 5 PID: 257 Comm: kworker/u64:5 Not tainted 5.16.1-arch1-1 #1 49bbb8d20d0329f70e47963ef5feb4a66c3cd442
Jan 17 18:16:00 mami kernel: Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 1201 11/18/2019
Jan 17 18:16:00 mami kernel: Workqueue: events_unbound commit_work
Jan 17 18:16:00 mami kernel: RIP: 0010:drm_vblank_put+0xee/0x100
Jan 17 18:16:00 mami kernel: Code: 8b 7f 08 4c 8b 67 50 4d 85 e4 74 22 e8 6b 86 01 00 48 c7 c1 e0 1d 51 93 4c 89 e2 48 c7 c7 9a 87 50 93 48 89 c6 e8 32 9c 3f 00 <0f> 0b eb b9 4c 8b 27 eb d9 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00
Jan 17 18:16:00 mami kernel: RSP: 0018:ffffbe0cc0d5faa0 EFLAGS: 00010246
Jan 17 18:16:00 mami kernel: RAX: 0000000000000000 RBX: ffff99f5c8dec800 RCX: 0000000000000000
Jan 17 18:16:00 mami kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 17 18:16:00 mami kernel: RBP: ffffbe0cc0d5fe58 R08: 0000000000000000 R09: 0000000000000000
Jan 17 18:16:00 mami kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff99f541f3f8d0
Jan 17 18:16:00 mami kernel: R13: ffff99f546050f80 R14: 0000000000000000 R15: ffff99f5c8defe00
Jan 17 18:16:00 mami kernel: FS: 0000000000000000(0000) GS:ffff99fc5eb40000(0000) knlGS:0000000000000000
Jan 17 18:16:00 mami kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 17 18:16:00 mami kernel: CR2: 00007f578840d000 CR3: 00000001e6ecc000 CR4: 0000000000350ee0
Jan 17 18:16:00 mami kernel: Call Trace:
Jan 17 18:16:00 mami kernel: <TASK>
Jan 17 18:16:00 mami kernel: amdgpu_dm_atomic_commit_tail+0x1793/0x2690 [amdgpu f85b8a8caf867a5d5ba40878af31ffe87241aba2]
Jan 17 18:16:00 mami kernel: ? 0xffffffff92000000
Jan 17 18:16:00 mami kernel: commit_tail+0x94/0x130
Jan 17 18:16:00 mami kernel: process_one_work+0x1e8/0x3c0
Jan 17 18:16:00 mami kernel: worker_thread+0x50/0x3c0
Jan 17 18:16:00 mami kernel: ? rescuer_thread+0x380/0x380
Jan 17 18:16:00 mami kernel: kthread+0x15c/0x180
Jan 17 18:16:00 mami kernel: ? set_kthread_struct+0x50/0x50
Jan 17 18:16:00 mami kernel: ret_from_fork+0x22/0x30
Jan 17 18:16:00 mami kernel: </TASK>
Jan 17 18:16:00 mami kernel: ---[ end trace b92f0f6d1b0ff057 ]---
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
```
</details>
These errors seem very similar to ones I've had in the past, such as #1247, but those never led to catastrophic lockups like this, only to monitors sometimes not waking up. I'm not sure whether they're related.
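To compare a log like this against earlier occurrences, the repeated `dc_stream_state is NULL` errors can be tallied per CRTC and reporting function. This is a rough sketch, assuming the message format shown in the log above; the function name `count_null_stream_errors` is made up for illustration, and the sample holds only three lines copied from that log.

```python
import re
from collections import Counter

# Matches e.g. "[drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state
# is NULL for crtc '5'!" and captures the reporting function and CRTC index.
ERROR = re.compile(
    r"\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* "
    r"dc_stream_state is NULL for crtc '(?P<crtc>\d+)'"
)


def count_null_stream_errors(log: str) -> Counter:
    """Count matching errors keyed by (crtc index, reporting function)."""
    counts = Counter()
    for line in log.splitlines():
        match = ERROR.search(line)
        if match:
            counts[(int(match["crtc"]), match["func"])] += 1
    return counts


# Three lines copied from the log above.
SAMPLE = """\
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_crtc_get_scanoutpos [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '5'!
Jan 17 18:16:00 mami kernel: [drm:dm_vblank_get_counter [amdgpu]] *ERROR* dc_stream_state is NULL for crtc '4'!
"""

counts = count_null_stream_errors(SAMPLE)
for (crtc, func), n in sorted(counts.items()):
    print(f"crtc {crtc}: {func} x{n}")
```

Piping `journalctl -k` output into it instead of `SAMPLE` would give per-CRTC totals for a whole boot.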