gpu shuts off about 10 seconds after boot

Same effect, but a different cause:

[    8.962903] amdgpu 0000:0a:00.0: amdgpu: Failed to apply umc cdr workaround!
[    8.962906] amdgpu 0000:0a:00.0: amdgpu: Failed to post smu init!
[    8.962908] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <smu> failed -62
[    8.963146] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_late_init failed
[    8.963148] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
[    8.963150] amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.
[    8.964546] Adding 102399996k swap on /dev/mapper/swap.  Priority:-2 extents:1 across:102399996k SSFS
[    9.031506] Console: switching to colour dummy device 80x25
[    9.041957] amdgpu 0000:0a:00.0: amdgpu: Fail to disable thermal alert!
[    9.049800] [drm] free PSP TMR buffer
[    9.083444] amdgpu: probe of 0000:0a:00.0 failed with error -62
[    9.154195] BUG: unable to handle page fault for address: ffffbb8d604fd000
[    9.154202] #PF: supervisor write access in kernel mode
[    9.154205] #PF: error_code(0x0002) - not-present page
[    9.154207] PGD 100000067 P4D 100000067 PUD 1001ba067 PMD 0 
[    9.154211] Oops: 0002 [#1] PREEMPT SMP NOPTI
[    9.154215] CPU: 0 PID: 849 Comm: systemd-udevd Tainted: P           OE     5.15.11-zen1-1-zen #1 13f0b9d562b5f0be00a88794d5d8613971de00b7
[    9.154219] Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 4007 12/08/2020
[    9.154222] RIP: 0010:vcn_v2_0_sw_fini+0x78/0x90 [amdgpu]
[    9.154474] Code: ff 85 c0 75 08 48 89 ef e8 75 11 ff ff 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 1e 48 83 c4 10 5b 5d 31 d2 89 d6 89 d7 c3 <c7> 03 00 00 00 00 8b 7c 24 04 e8 b9 16 0b dd eb b6 e8 22 04 4f dd
[    9.154478] RSP: 0018:ffffbb8d41a2b8e8 EFLAGS: 00010202
[    9.154481] RAX: 0000000000000001 RBX: ffffbb8d604fd000 RCX: 0000000000000000
[    9.154484] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    9.154486] RBP: ffffa0b146520000 R08: 0000000000000000 R09: 0000000000000000
[    9.154488] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0b14651fff0
[    9.154491] R13: ffffa0b146536918 R14: 0000000000000010 R15: ffffa0b141e6a380
[    9.154493] FS:  00007f337fa01a40(0000) GS:ffffa0bc4ea00000(0000) knlGS:0000000000000000
[    9.154496] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.154498] CR2: ffffbb8d604fd000 CR3: 000000010f56a000 CR4: 00000000003506f0
[    9.154501] Call Trace:
[    9.154503]  <TASK>
[    9.154505]  amdgpu_device_fini_sw+0xb6/0x2f0 [amdgpu 7a12e66e6d8729ae95c614bc60e86d9c04b88427]
[    9.154706]  amdgpu_driver_release_kms+0x12/0x30 [amdgpu 7a12e66e6d8729ae95c614bc60e86d9c04b88427]
[    9.154908]  devm_drm_dev_init_release+0x49/0x70
[    9.154913]  devres_release_all+0xb2/0x170
[    9.154918]  really_probe.part.0+0x19f/0x3a0
[    9.154922]  __driver_probe_device+0x151/0x230
[    9.154925]  driver_probe_device+0x1e/0x120
[    9.154928]  __driver_attach+0x94/0x1f0
[    9.154931]  ? __device_attach_driver+0x130/0x130
[    9.154934]  ? __device_attach_driver+0x130/0x130
[    9.154936]  bus_for_each_dev+0x8d/0xe0
[    9.154939]  bus_add_driver+0x158/0x200
[    9.154942]  driver_register+0x8f/0xf0
[    9.154945]  ? 0xffffffffc143f000
[    9.154947]  do_one_initcall+0x11a/0x2f0
[    9.154953]  do_init_module+0x5c/0x270
[    9.154957]  load_module+0x2432/0x2620
[    9.154961]  ? __x64_sys_init_module+0x79/0xe0
[    9.154964]  __x64_sys_init_module+0x79/0xe0
[    9.154967]  do_syscall_64+0x5c/0x90
[    9.154972]  ? syscall_exit_to_user_mode+0x23/0x50
[    9.154974]  ? do_syscall_64+0x69/0x90
[    9.154977]  ? exc_page_fault+0x72/0x180
[    9.154980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    9.154983] RIP: 0033:0x7f338042c32e
[    9.154986] Code: 48 8b 0d 45 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 12 0b 0c 00 f7 d8 64 89 01 48
[    9.154990] RSP: 002b:00007ffc3c272118 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    9.154994] RAX: ffffffffffffffda RBX: 00005557f2402c30 RCX: 00007f338042c32e
[    9.154996] RDX: 00007f3380580a9d RSI: 0000000001100b57 RDI: 00007f337d05e010
[    9.154998] RBP: 00007f337d05e010 R08: 00007f337e845000 R09: 0000000000000000
[    9.155000] R10: 00005557f24000a0 R11: 0000000000000246 R12: 00007f3380580a9d
[    9.155002] R13: 0000000000000001 R14: 00005557f24f00d0 R15: 00005557f2402c30
[    9.155005]  </TASK>
[    9.155006] Modules linked in: ext4 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm loop cfg80211 btusb btrtl btbcm btintel vboxnetflt(OE) uas bluetooth vboxnetadp(OE) usb_storage ecdh_generic crc16 vboxdrv(OE) zfs(POE) joydev mousedev hid_lenovo uvcvideo videobuf2_vmalloc videobuf2_memops snd_usb_audio videobuf2_v4l2 videobuf2_common zunicode(POE) snd_usbmidi_lib videodev snd_rawmidi zzstd(OE) snd_seq_device mc usbhid zlua(OE) igb zavl(POE) icp(POE) amdgpu(+) dca snd_hda_codec_realtek nls_iso8859_1 snd_hda_codec_generic zcommon(POE) ledtrig_audio vfat snd_hda_codec_hdmi znvpair(POE) fat spl(OE) intel_rapl_msr snd_hda_intel mac_hid intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi eeepc_wmi asus_wmi snd_hda_codec acpi_cpufreq sparse_keymap platform_profile snd_hda_core video snd_hwdep rfkill gpu_sched snd_pcm wmi_bmof drm_ttm_helper edac_mce_amd snd_timer snd mxm_wmi soundcore ttm zenpower(OE) sp5100_tco i2c_piix4 kvm_amd pinctrl_amd wmi ccp pcspkr gpio_amdpt
[    9.155051]  gpio_generic rng_core kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl dm_mod usbip_host usbip_core sg crypto_user fuse bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq xhci_pci crc32c_intel xhci_pci_renesas
[    9.155082] CR2: ffffbb8d604fd000
[    9.155084] ---[ end trace 2a1d994e80879d7a ]---
[    9.155086] RIP: 0010:vcn_v2_0_sw_fini+0x78/0x90 [amdgpu]
[    9.155308] Code: ff 85 c0 75 08 48 89 ef e8 75 11 ff ff 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 1e 48 83 c4 10 5b 5d 31 d2 89 d6 89 d7 c3 <c7> 03 00 00 00 00 8b 7c 24 04 e8 b9 16 0b dd eb b6 e8 22 04 4f dd
[    9.155312] RSP: 0018:ffffbb8d41a2b8e8 EFLAGS: 00010202
[    9.155314] RAX: 0000000000000001 RBX: ffffbb8d604fd000 RCX: 0000000000000000
[    9.155317] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    9.155319] RBP: ffffa0b146520000 R08: 0000000000000000 R09: 0000000000000000
[    9.155321] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0b14651fff0
[    9.155323] R13: ffffa0b146536918 R14: 0000000000000010 R15: ffffa0b141e6a380
[    9.155325] FS:  00007f337fa01a40(0000) GS:ffffa0bc4ea00000(0000) knlGS:0000000000000000
[    9.155328] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.155330] CR2: ffffbb8d604fd000 CR3: 000000010f56a000 CR4: 00000000003506f0
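
For triaging logs like the one above, here is a minimal sketch that pulls the failing IP block, its error code, and the faulting kernel function out of dmesg text. The helper name `summarize_amdgpu_failure` and both regexes are assumptions made for illustration, modelled only on the lines quoted in this report; they are not part of amdgpu or any existing tool. On Linux, errno 62 is `ETIME` ("Timer expired").

```python
import re

# Patterns modelled on the log lines quoted in this report (illustrative, not exhaustive).
LATE_INIT_RE = re.compile(r"late_init of IP block <(\w+)> failed (-?\d+)")
# Only kernel-mode RIP lines (code segment 0010) carry a symbol name.
RIP_RE = re.compile(r"RIP: 0010:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def summarize_amdgpu_failure(dmesg_text):
    """Return the first failing IP block, its error code and the oops location."""
    summary = {"ip_block": None, "error": None, "oops_function": None}
    for line in dmesg_text.splitlines():
        m = LATE_INIT_RE.search(line)
        if m and summary["ip_block"] is None:
            summary["ip_block"] = m.group(1)
            summary["error"] = int(m.group(2))
        m = RIP_RE.search(line)
        if m and summary["oops_function"] is None:
            summary["oops_function"] = m.group(1)
    return summary


sample = """\
[    8.962908] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <smu> failed -62
[    9.083444] amdgpu: probe of 0000:0a:00.0 failed with error -62
[    9.154222] RIP: 0010:vcn_v2_0_sw_fini+0x78/0x90 [amdgpu]
"""
result = summarize_amdgpu_failure(sample)
print(result)
```

Feeding it the output of `dmesg` or `journalctl -k` from a failed boot should make it easier to compare reports: here the SMU late init times out first, and the page fault in `vcn_v2_0_sw_fini` only happens afterwards, during teardown.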

changed title from gpu shuts off about 20 seconds after boot to gpu shuts off about 10 seconds after boot

Now it booted successfully on 5.15.5

Does adding amdgpu.runpm=0 on the kernel command line in grub avoid the issue?

Nope, same result as in #1877 (comment 1227773)

Another try, getting

[    8.938921] amdgpu 0000:0a:00.0: amdgpu: Failed to get overdrive table!
[    8.938925] amdgpu 0000:0a:00.0: amdgpu: Failed to setup default OD settings!
[    8.938927] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <smu> failed -62
[    8.939161] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_late_init failed
[    8.939163] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
[    8.939165] amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.
[    9.991965] lenovo 0003:17EF:6047.0006: Sensitivity setting failed: -110
[   13.122824] random: crng init done
[   13.321934] amdgpu 0000:0a:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000012 SMN_C2PMSG_82:0x00000009
[   13.330894] amdgpu 0000:0a:00.0: amdgpu: Failed to power gate JPEG!
[   13.330896] [drm:jpeg_v2_0_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.

again.

Third boot, now it worked !?!

Good boot, dmesg | grep amdgpu

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux-zen root=UUID=889bfbb5-f502-4691-bb1c-88fc53c2745a rw loglevel=3 quiet amdgpu.runpm=0
[    0.126540] Kernel command line: BOOT_IMAGE=/vmlinuz-linux-zen root=UUID=889bfbb5-f502-4691-bb1c-88fc53c2745a rw loglevel=3 quiet amdgpu.runpm=0
[    3.999514] [drm] amdgpu kernel modesetting enabled.
[    3.999609] amdgpu: Ignoring ACPI CRAT on non-APU system
[    3.999615] amdgpu: Virtual CRAT table created for CPU
[    3.999636] amdgpu: Topology: Add CPU node
[    3.999753] fb0: switching to amdgpu from EFI VGA
[    3.999850] amdgpu 0000:0a:00.0: vgaarb: deactivate vga console
[    3.999972] amdgpu 0000:0a:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    4.001162] amdgpu 0000:0a:00.0: No more image in the PCI ROM
[    4.001190] amdgpu 0000:0a:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    4.001192] amdgpu: ATOM BIOS: xxx-xxx-xxx
[    4.001260] amdgpu 0000:0a:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    4.001264] amdgpu 0000:0a:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    4.001266] amdgpu 0000:0a:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    4.001325] [drm] amdgpu: 8176M of VRAM memory ready
[    4.001326] [drm] amdgpu: 8176M of GTT memory ready.
[    4.019391] amdgpu 0000:0a:00.0: amdgpu: PSP runtime database doesn't exist
[    4.075896] amdgpu 0000:0a:00.0: amdgpu: Will use PSP to load VCN firmware
[    4.290783] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    4.295682] amdgpu 0000:0a:00.0: amdgpu: RAP: optional rap ta ucode is not available
[    4.295693] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    4.295815] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[    4.295819] amdgpu 0000:0a:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[    4.332060] amdgpu 0000:0a:00.0: amdgpu: SMU is initialized successfully!
[    4.367934] snd_hda_intel 0000:0a:00.1: bound 0000:0a:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    4.399384] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    4.446542] amdgpu: HMM registered 8176MB device memory
[    4.446603] amdgpu: SRAT table not found
[    4.446604] amdgpu: Virtual CRAT table created for GPU
[    4.446866] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[    4.446870] kfd kfd: amdgpu: added device 1002:731f
[    4.446888] amdgpu 0000:0a:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 40
[    4.448612] fbcon: amdgpudrmfb (fb0) is primary device
[    4.616708] amdgpu 0000:0a:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    4.623550] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    4.623555] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    4.623558] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    4.623561] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    4.623563] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    4.623565] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    4.623567] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    4.623569] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    4.623571] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    4.623572] amdgpu 0000:0a:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    4.623575] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    4.623577] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    4.623579] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[    4.623580] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[    4.623582] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[    4.623584] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[    4.625154] [drm] Initialized amdgpu 3.44.0 20150101 for 0000:0a:00.0 on minor 0
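
To confirm that a parameter such as amdgpu.runpm=0 really reached the kernel, the "Kernel command line" entry in the log above can be split into name/value pairs. This is a minimal sketch; the helper name `parse_kernel_cmdline` is hypothetical, and it deliberately ignores quoting and repeated parameters (a later occurrence overwrites an earlier one).

```python
def parse_kernel_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so 'root=UUID=...' keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Command line copied from the good-boot log in this report.
line = ("BOOT_IMAGE=/vmlinuz-linux-zen "
        "root=UUID=889bfbb5-f502-4691-bb1c-88fc53c2745a "
        "rw loglevel=3 quiet amdgpu.runpm=0")
params = parse_kernel_cmdline(line)
print(params.get("amdgpu.runpm"))
```

On a running system the same string can be read from /proc/cmdline instead of being pasted in.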

I believe I'm seeing the same issue since updating to 5.16.2 (5.16.3 behaves the same), but the problem might have started earlier, since I skipped a few versions:

Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu 0000:0a:00.0: amdgpu: Failed to get overdrive table!
Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu 0000:0a:00.0: amdgpu: Failed to setup default OD settings!
Jan 29 10:55:05 DESKTOP-719H620 kernel: [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <smu> failed -62
Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_late_init failed
Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.
Jan 29 10:55:05 DESKTOP-719H620 kernel: Console: switching to colour dummy device 80x25
Jan 29 10:55:05 DESKTOP-719H620 kernel: [drm] free PSP TMR buffer
Jan 29 10:55:05 DESKTOP-719H620 kernel: amdgpu: probe of 0000:0a:00.0 failed with error -62
Jan 29 10:55:05 DESKTOP-719H620 kernel: BUG: unable to handle page fault for address: ffffbe4ba0950000
Jan 29 10:55:05 DESKTOP-719H620 kernel: #PF: supervisor write access in kernel mode
Jan 29 10:55:05 DESKTOP-719H620 kernel: #PF: error_code(0x0002) - not-present page
Jan 29 10:55:05 DESKTOP-719H620 kernel: PGD 100000067 P4D 100000067 PUD 1001bc067 PMD 0 
Jan 29 10:55:05 DESKTOP-719H620 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Jan 29 10:55:05 DESKTOP-719H620 kernel: CPU: 16 PID: 492 Comm: systemd-udevd Not tainted 5.16.3-zen1-1-zen #1 1800e62fff88b0692cd9ea37476135509bb0850a
Jan 29 10:55:05 DESKTOP-719H620 kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F35d 10/13/2021
Jan 29 10:55:05 DESKTOP-719H620 kernel: RIP: 0010:vcn_v2_0_sw_fini+0x78/0x90 [amdgpu]
Jan 29 10:55:05 DESKTOP-719H620 kernel: Code: ff 85 c0 75 08 48 89 ef e8 d5 0f ff ff 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 1e 48 83 c4 10 5b 5d 31 d2 89 d6 89 d7 c3 <c7> 03 00 00 00 00 8b 7c 24 04 e8 b9 bf 55 ed eb b6 e8 22 90 9a ed
Jan 29 10:55:05 DESKTOP-719H620 kernel: RSP: 0018:ffffbe4b8215f8e8 EFLAGS: 00010202
Jan 29 10:55:05 DESKTOP-719H620 kernel: RAX: 0000000000000001 RBX: ffffbe4ba0950000 RCX: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: RBP: ffff9cceeba80000 R08: 0000000000000000 R09: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cceeba95a20
Jan 29 10:55:05 DESKTOP-719H620 kernel: R13: ffff9cceeba96980 R14: 0000000000000010 R15: ffff9ccec1fa3380
Jan 29 10:55:05 DESKTOP-719H620 kernel: FS:  00007fc85ed08a40(0000) GS:ffff9cd5dee00000(0000) knlGS:0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 29 10:55:05 DESKTOP-719H620 kernel: CR2: ffffbe4ba0950000 CR3: 000000010ec28000 CR4: 0000000000350ee0
Jan 29 10:55:05 DESKTOP-719H620 kernel: Call Trace:
Jan 29 10:55:05 DESKTOP-719H620 kernel:  <TASK>
Jan 29 10:55:05 DESKTOP-719H620 kernel:  amdgpu_device_fini_sw+0xb9/0x2d0 [amdgpu 66a3a19c8a2a53c416831e135d99ce6687bf77be]
Jan 29 10:55:05 DESKTOP-719H620 kernel:  amdgpu_driver_release_kms+0x12/0x30 [amdgpu 66a3a19c8a2a53c416831e135d99ce6687bf77be]
Jan 29 10:55:05 DESKTOP-719H620 kernel:  devm_drm_dev_init_release+0x49/0x70
Jan 29 10:55:05 DESKTOP-719H620 kernel:  devres_release_all+0xb1/0x170
Jan 29 10:55:05 DESKTOP-719H620 kernel:  really_probe.part.0+0x19f/0x3a0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  __driver_probe_device+0x151/0x230
Jan 29 10:55:05 DESKTOP-719H620 kernel:  driver_probe_device+0x1e/0x120
Jan 29 10:55:05 DESKTOP-719H620 kernel:  __driver_attach+0x94/0x1f0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? __device_attach_driver+0x130/0x130
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? __device_attach_driver+0x130/0x130
Jan 29 10:55:05 DESKTOP-719H620 kernel:  bus_for_each_dev+0x8d/0xe0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  bus_add_driver+0x158/0x200
Jan 29 10:55:05 DESKTOP-719H620 kernel:  driver_register+0x8f/0xf0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? 0xffffffffc0bb8000
Jan 29 10:55:05 DESKTOP-719H620 kernel:  do_one_initcall+0x11a/0x2f0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  do_init_module+0x5c/0x270
Jan 29 10:55:05 DESKTOP-719H620 kernel:  load_module+0x24bc/0x2550
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? __x64_sys_init_module+0x79/0xe0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  __x64_sys_init_module+0x79/0xe0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  do_syscall_64+0x5c/0x90
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? syscall_exit_to_user_mode+0x23/0x50
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? do_syscall_64+0x69/0x90
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? syscall_exit_to_user_mode+0x23/0x50
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? do_syscall_64+0x69/0x90
Jan 29 10:55:05 DESKTOP-719H620 kernel:  ? exc_page_fault+0x72/0x180
Jan 29 10:55:05 DESKTOP-719H620 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 29 10:55:05 DESKTOP-719H620 kernel: RIP: 0033:0x7fc85f70832e
Jan 29 10:55:05 DESKTOP-719H620 kernel: Code: 48 8b 0d 45 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 12 0b 0c 00 f7 d8 64 89 01 48
Jan 29 10:55:05 DESKTOP-719H620 kernel: RSP: 002b:00007ffcd5e3e9a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Jan 29 10:55:05 DESKTOP-719H620 kernel: RAX: ffffffffffffffda RBX: 0000559c236baff0 RCX: 00007fc85f70832e
Jan 29 10:55:05 DESKTOP-719H620 kernel: RDX: 00007fc85f85ca9d RSI: 000000000110ffee RDI: 00007fc85c73c010
Jan 29 10:55:05 DESKTOP-719H620 kernel: RBP: 00007fc85c73c010 R08: 00007fc85db22000 R09: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: R10: 0000559c23766c70 R11: 0000000000000246 R12: 00007fc85f85ca9d
Jan 29 10:55:05 DESKTOP-719H620 kernel: R13: 0000000000000002 R14: 0000559c235f9d20 R15: 0000559c236baff0
Jan 29 10:55:05 DESKTOP-719H620 kernel:  </TASK>
Jan 29 10:55:05 DESKTOP-719H620 kernel: Modules linked in: cmac algif_hash algif_skcipher af_alg bnep btusb btrtl btbcm btintel bluetooth hid_steam ecdh_generic crc16 cdc_ether usbnet cdc_acm mii joydev xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter rfkill bridge stp llc it87 hwmon_vid snd_usb_audio snd_usbmidi_lib snd_rawmidi uas snd_seq_device usb_storage mousedev mc nls_iso8859_1 vfat fat snd_hda_codec_realtek snd_hda_codec_generic amdgpu(+) ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common usbhid snd_hda_core snd_hwdep edac_mce_amd snd_pcm crct10dif_pclmul gigabyte_wmi wmi_bmof mxm_wmi ghash_clmulni_intel btrfs blake2b_generic aesni_intel xor snd_timer raid6_pq libcrc32c crc32c_generic crc32c_intel crypto_simd gpu_sched igb snd sp5100_tco cryptd tpm_crb
Jan 29 10:55:05 DESKTOP-719H620 kernel:  drm_ttm_helper rapl ttm soundcore dca pcspkr tpm_tis k10temp i2c_piix4 tpm_tis_core tpm pinctrl_amd mac_hid wmi acpi_cpufreq sg crypto_user fuse bpf_preload ip_tables x_tables f2fs crc32_generic lz4hc_compress crc32_pclmul xhci_pci xhci_pci_renesas kvm_amd ccp rng_core kvm irqbypass
Jan 29 10:55:05 DESKTOP-719H620 kernel: CR2: ffffbe4ba0950000
Jan 29 10:55:05 DESKTOP-719H620 kernel: ---[ end trace 10271fb10cccf90d ]---
Jan 29 10:55:05 DESKTOP-719H620 kernel: RIP: 0010:vcn_v2_0_sw_fini+0x78/0x90 [amdgpu]
Jan 29 10:55:05 DESKTOP-719H620 kernel: Code: ff 85 c0 75 08 48 89 ef e8 d5 0f ff ff 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 1e 48 83 c4 10 5b 5d 31 d2 89 d6 89 d7 c3 <c7> 03 00 00 00 00 8b 7c 24 04 e8 b9 bf 55 ed eb b6 e8 22 90 9a ed
Jan 29 10:55:05 DESKTOP-719H620 kernel: RSP: 0018:ffffbe4b8215f8e8 EFLAGS: 00010202
Jan 29 10:55:05 DESKTOP-719H620 kernel: RAX: 0000000000000001 RBX: ffffbe4ba0950000 RCX: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: RBP: ffff9cceeba80000 R08: 0000000000000000 R09: 0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9cceeba95a20
Jan 29 10:55:05 DESKTOP-719H620 kernel: R13: ffff9cceeba96980 R14: 0000000000000010 R15: ffff9ccec1fa3380
Jan 29 10:55:05 DESKTOP-719H620 kernel: FS:  00007fc85ed08a40(0000) GS:ffff9cd5dee00000(0000) knlGS:0000000000000000
Jan 29 10:55:05 DESKTOP-719H620 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 29 10:55:05 DESKTOP-719H620 kernel: CR2: ffffbe4ba0950000 CR3: 000000010ec28000 CR4: 0000000000350ee0

Hardware description:

CPU: AMD Ryzen 9 3900X 12-Core Processor
GPU: RX 5700XT
System Memory: 32G
Display(s): 2 * 1440p@144Hz
Type of Display Connection: DP

System information:

Distro name and Version: Arch
Kernel version: 5.16.2 and 5.16.3
Custom kernel: arch's zen kernel behaves the same

In my case the issue resembles #1237 (closed): disconnecting one of the displays lets the boot succeed, and I can safely reconnect the screen once I'm in SDDM.

full-dmesg.log

Patch provided here #1886 (comment 1248613) fixes it for me

mentioned in commit agd5f/linux@45d8532d

mentioned in commit agd5f/linux@bcfab8e3

mentioned in commit agd5f/linux@6e7545dd

mentioned in commit nouveau@c1af5944

Hi @fardragon and @reactormonk, I tried to reproduce this issue with the latest code from amd-staging-drm-next, but I could not. Could you help me with the following:

- Which display resolution are you using? Does this happen independently of the display resolution?
- Is it possible for you to try the latest code from amd-staging-drm-next? If so, could you make this tiny change in your kernel:

diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
index dfe2e1c25a26..b55868a0e0df 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
@@ -1069,7 +1069,7 @@ static const struct dc_debug_options debug_defaults_drv = {
               .timing_trace = false,
               .clock_trace = true,
               .disable_pplib_clock_request = true,
-               .pipe_split_policy = MPC_SPLIT_AVOID_MULT_DISP,
+               .pipe_split_policy = MPC_SPLIT_DYNAMIC,
               .force_single_disp_pipe_split = false,
               .disable_dcc = DCC_ENABLE,
               .vsr_support = true,

Thanks.

~~I believe it's the same patch as here #1886 (comment 1248613); if so, I've already tested it and it fixed the issue for me back then.~~ Never mind, I see that you're asking me to test amd-staging-drm-next with the patch reverted. I'll try to do it.

I believe the issue must be somehow tied to the display resolution or even refresh rate (I run my 2 monitors at 1440p@144Hz) because booting with these kernel parameters video=DP-1:1920x1080@60 video=DP-2:1920x1080@60 also makes the system boot successfully. Disconnecting one of the screens before booting and reconnecting it after is also a workaround for that matter.
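
For anyone scripting boot tests with different forced modes, here is a small sketch that parses the `video=` parameters mentioned above. It only handles the simple `video=<connector>:<width>x<height>@<refresh>` form used in this thread, not the full mode syntax the kernel accepts; the helper name `parse_video_param` is hypothetical.

```python
import re

# Matches only the simple form used in this thread, e.g. video=DP-1:1920x1080@60.
VIDEO_RE = re.compile(r"^video=([^:]+):(\d+)x(\d+)@(\d+)$")


def parse_video_param(token):
    """Parse one video= token into connector name, resolution and refresh rate."""
    m = VIDEO_RE.match(token)
    if m is None:
        raise ValueError(f"unsupported video= syntax: {token!r}")
    connector, width, height, refresh = m.groups()
    return {
        "connector": connector,
        "width": int(width),
        "height": int(height),
        "refresh_hz": int(refresh),
    }


cmdline = "video=DP-1:1920x1080@60 video=DP-2:1920x1080@60"
modes = [parse_video_param(token) for token in cmdline.split()]
for mode in modes:
    print(mode)
```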

Hi @fardragon ,

Could you also test with different display resolutions? I am asking because I want to find a condition that I can reproduce with my setup. Right now, I have two 4k@60Hz displays and one 4k widescreen that supports 120Hz.

Ok, I did some testing:

Current amd-staging-drm-next boots without issues
Applying your patch makes it immediately fail:
- Changing desktop resolution in KDE system settings doesn't seem to matter, which I guess is to be expected since the driver fails before even getting to the DM stage. They all fail in the same way as far as I can tell: 2x1440p_60.log 2x1080p_60.log 2x1080p_144.log 2x1440p_144.log
- Disconnecting one of the screens still causes the system to boot successfully
- Setting the resolution in kernel parameters also fixes it, I've tried these three sets and surprisingly they all worked just fine (maybe it doesn't matter what is set as long as anything is set here):
  - video=DP-1:1920x1080@60 video=DP-2:1920x1080@60
  - video=DP-1:1920x1080@144 video=DP-2:1920x1080@144
  - video=DP-1:2560x1440@144 video=DP-2:2560x1440@144
- The only difference (compared to the mainline kernel 3 weeks ago) that I've noticed is that with amd-staging-drm-next the screen doesn't die after ~10 seconds and instead the SMU error message keeps repeating seemingly forever

> Applying your patch makes it immediately fail:
>
> Changing desktop resolution in KDE system settings doesn't seem to matter, which I guess is to be expected since the driver fails before even getting to the DM stage. They all fail in the same way as far as I can tell: 2x1440p_60.log 2x1080p_60.log 2x1080p_144.log 2x1440p_144.log

Correct me if I'm wrong, but you were able to reproduce the hang issue with all of the above configurations, right? I don't know why I'm not able to reproduce this issue...

Could you share the output of the below command?

cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

Also, could you share the edid from your display? You should be able to get it by adapting the below command:

find /sys/devices/ | grep edid    # lists the paths where the edid files are
cat /sys/devices/<...>/drm/card0/card0-<connector name>/edid > newedid.bin
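
Before attaching an EDID dump, it can help to check that the file is not empty or corrupt. The sketch below assumes the connectors are also exposed under /sys/class/drm (the usual layout, but treat the path as an assumption) and relies on two well-known properties of an EDID base block: it starts with the fixed 8-byte header 00 FF FF FF FF FF FF 00, and its 128 bytes sum to 0 modulo 256. The names `check_edid` and `scan_connectors` are hypothetical.

```python
from pathlib import Path

# Fixed 8-byte header that starts every EDID base block.
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def check_edid(blob):
    """Return 'empty', 'invalid' or 'ok' for the raw bytes of an edid file."""
    if len(blob) == 0:
        # Typically what sysfs returns when no display is detected on the connector.
        return "empty"
    if len(blob) < 128 or blob[:8] != EDID_HEADER:
        return "invalid"
    # The 128-byte base block must sum to 0 modulo 256 (last byte is the checksum).
    if sum(blob[:128]) % 256 != 0:
        return "invalid"
    return "ok"


def scan_connectors(root="/sys/class/drm"):
    """Map each connector's edid path to its check result (empty dict if none found)."""
    results = {}
    for path in sorted(Path(root).glob("card*-*/edid")):
        try:
            results[str(path)] = check_edid(path.read_bytes())
        except OSError:
            results[str(path)] = "unreadable"
    return results
```

Calling `scan_connectors()` on the affected machine and printing the result should show at a glance which connector has a usable EDID.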

Yes, after applying the patch, with both displays connected and no kernel params set, the issue is 100% reproducible. The logs are the output of journalctl -k -b -1 after rebooting into a working configuration.

➜ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 33, firmware version: 0x00000064
PFP feature version: 33, firmware version: 0x00000097
CE feature version: 33, firmware version: 0x00000025
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 33, firmware version: 0x00000091
MEC2 feature version: 33, firmware version: 0x00000091
SOS feature version: 0, firmware version: 0x00100550
ASD feature version: 0, firmware version: 0x21000064
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000025
TA DTM feature version: 0x00000000, firmware version: 0x1200000b
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x002a4000 (42.64.0)
SDMA0 feature version: 50, firmware version: 0x00000023
SDMA1 feature version: 50, firmware version: 0x00000023
VCN feature version: 0, firmware version: 0x0510e014
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: 111

EDIDs: display1.bin display2.bin

mentioned in issue #1237 (closed)

Hi @fardragon ,

To try to reproduce this issue, I:

- Updated my firmware to the latest version available at git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
- Installed ArchLinux
- Tried 3 different displays (4k@60Hz, 2k@60Hz, etc.)
- Emulated your EDID

Nevertheless, I can't reproduce this issue no matter what I try. I'm running out of ideas...

Is it possible for you to try with a different display, with .pipe_split_policy = MPC_SPLIT_DYNAMIC set? Maybe we can also collect more logs when the issue happens. Could you set the log level to 0x4? You can use:

echo 0x4 > /sys/module/drm/parameters/debug

Or you can set this value in the grub menu by adding this parameter:

drm.debug=0x4
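
The drm.debug value is a bitmask, so 0x4 selects a single message category. As a rough sketch, the table below lists the category bits as I understand them from the drm module parameter description (it may be incomplete for newer kernels, so unknown bits are reported rather than dropped); the helper name `decode_drm_debug` is hypothetical.

```python
# Category bits of the drm.debug module parameter (assumed from the kernel's
# parameter description; newer kernels may define additional bits).
DRM_DEBUG_CATEGORIES = {
    0x001: "CORE",
    0x002: "DRIVER",
    0x004: "KMS",
    0x008: "PRIME",
    0x010: "ATOMIC",
    0x020: "VBL",
    0x040: "STATE",
    0x080: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """List the category names enabled by a drm.debug bitmask."""
    names = [name for bit, name in DRM_DEBUG_CATEGORIES.items() if mask & bit]
    # Report any bits this table does not know about instead of hiding them.
    unknown = mask & ~sum(DRM_DEBUG_CATEGORIES)
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return names


print(decode_drm_debug(0x4))
```

Under this table, 0x4 enables only the modesetting (KMS) messages, which keeps the log volume manageable compared with enabling every category.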

Also, when you see the hang, can you still ssh into the machine? Finally, are you using X or Wayland?

I haven't patched anything yet, just stock kernel for now: Linux exia 5.16.14-zen1-1-zen #1 ZEN SMP PREEMPT Fri, 11 Mar 2022 17:40:33 +0000 x86_64 GNU/Linux

Still occurs (using X). Resolution:

HDMI-A-1 connected primary 3440x1440+0+0 (normal left inverted right x axis y axis) 797mm x 333mm
   3440x1440     49.99 +  99.98*   59.97

No second screen. The bug can still be worked around by unplugging the screen and replugging it after ~30 seconds.

Current kernel parameters:

linux	/vmlinuz-linux-zen root=UUID=889bfbb5-f502-4691-bb1c-88fc53c2745a rw amd_iommu=on iommu=pt loglevel=3 quiet amdgpu.runpm=0

Added the drm.debug parameter, will add more info once I have it.

$ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 34, firmware version: 0x00000064
PFP feature version: 34, firmware version: 0x0000009a
CE feature version: 34, firmware version: 0x00000025
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 34, firmware version: 0x00000092
MEC2 feature version: 34, firmware version: 0x00000092
SOS feature version: 0, firmware version: 0x00100750
ASD feature version: 0, firmware version: 0x21000071
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000028
TA DTM feature version: 0x00000000, firmware version: 0x1200000e
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x002a4000 (42.64.0)
SDMA0 feature version: 50, firmware version: 0x00000023
SDMA1 feature version: 50, firmware version: 0x00000023
VCN feature version: 0, firmware version: 0x05110004
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: xxx-xxx-xxx
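
Since two amdgpu_firmware_info dumps now appear in this thread, a quick way to see which firmware versions differ between them is to parse both and compare. This is a minimal sketch with hypothetical helper names (`parse_firmware_info`, `changed_firmware`); it compares only the "firmware version" field and skips lines such as "VBIOS version" that do not follow the common pattern. The sample strings are a few lines copied from the two dumps.

```python
import re

# Matches lines like 'PFP feature version: 33, firmware version: 0x00000097'.
FW_LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\S+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map each firmware name to its firmware version string."""
    versions = {}
    for line in text.splitlines():
        m = FW_LINE_RE.match(line.strip())
        if m:
            versions[m.group("name")] = m.group("fw").lower()
    return versions


def changed_firmware(old, new):
    """Return {name: (old_version, new_version)} for entries that differ."""
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


first = """\
PFP feature version: 33, firmware version: 0x00000097
SOS feature version: 0, firmware version: 0x00100550
SMC feature version: 0, firmware version: 0x002a4000 (42.64.0)
VBIOS version: 111
"""
second = """\
PFP feature version: 34, firmware version: 0x0000009a
SOS feature version: 0, firmware version: 0x00100750
SMC feature version: 0, firmware version: 0x002a4000 (42.64.0)
VBIOS version: xxx-xxx-xxx
"""
diff = changed_firmware(parse_firmware_info(first), parse_firmware_info(second))
print(diff)
```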

edid is empty.

I'll see if I can get the AMD staging kernel installed. I also need zfs on it, because of some questionable decisions.
