RX 6900 XT fails to start: PSP long memory training fails
Brief summary of the problem:
Hello there,
I have been trying to get this card (Sapphire 6900XT SE) to work properly for some time. The issue is that when I start the machine with the card installed, I see random shifts in the video output during the BIOS stage.
When I load the amdgpu module with modprobe, I get this error:
[ 766.771954] [drm] amdgpu kernel modesetting enabled.
[ 766.772094] amdgpu: Virtual CRAT table created for CPU
[ 766.772108] amdgpu: Topology: Add CPU node
[ 766.772347] [drm] initializing kernel modesetting (SIENNA_CICHLID 0x1002:0x73BF 0x1DA2:0xF441 0xC0).
[ 766.772359] [drm] register mmio base: 0xFC700000
[ 766.772360] [drm] register mmio size: 1048576
[ 766.778173] [drm] add ip block number 0 <nv_common>
[ 766.778175] [drm] add ip block number 1 <gmc_v10_0>
[ 766.778176] [drm] add ip block number 2 <navi10_ih>
[ 766.778176] [drm] add ip block number 3 <psp>
[ 766.778178] [drm] add ip block number 4 <smu>
[ 766.778178] [drm] add ip block number 5 <dm>
[ 766.778179] [drm] add ip block number 6 <gfx_v10_0>
[ 766.778180] [drm] add ip block number 7 <sdma_v5_2>
[ 766.778181] [drm] add ip block number 8 <vcn_v3_0>
[ 766.778182] [drm] add ip block number 9 <jpeg_v3_0>
[ 766.778197] amdgpu 0000:06:00.0: amdgpu: Fetched VBIOS from VFCT
[ 766.778199] amdgpu: ATOM BIOS: 113-D4121EXT-WL1
[ 766.806470] [drm] VCN(0) decode is enabled in VM mode
[ 766.806473] [drm] VCN(1) decode is enabled in VM mode
[ 766.806474] [drm] VCN(0) encode is enabled in VM mode
[ 766.806475] [drm] VCN(1) encode is enabled in VM mode
[ 766.816493] [drm] JPEG decode is enabled in VM mode
[ 766.816499] amdgpu 0000:06:00.0: vgaarb: deactivate vga console
[ 766.816502] amdgpu 0000:06:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 766.816515] amdgpu 0000:06:00.0: amdgpu: PCIE atomic ops is not supported
[ 766.816535] amdgpu 0000:06:00.0: amdgpu: MEM ECC is not presented.
[ 766.816536] amdgpu 0000:06:00.0: amdgpu: SRAM ECC is not presented.
[ 766.816550] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 766.816559] amdgpu 0000:06:00.0: amdgpu: VRAM: 16368M 0x0000008000000000 - 0x00000083FEFFFFFF (16368M used)
[ 766.816561] amdgpu 0000:06:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 766.816562] amdgpu 0000:06:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 766.816579] [drm] Detected VRAM RAM=16368M, BAR=256M
[ 766.816580] [drm] RAM width 256bits GDDR6
[ 766.816676] [drm] amdgpu: 16368M of VRAM memory ready
[ 766.816678] [drm] amdgpu: 3931M of GTT memory ready.
[ 766.816692] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 766.816957] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 780.294807] [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* Send long training msg failed.
[ 780.295981] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to process memory training!
[ 780.296157] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <psp> failed -62
[ 780.296332] amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_init failed
[ 780.296336] amdgpu 0000:06:00.0: amdgpu: Fatal error during GPU init
[ 780.296340] amdgpu 0000:06:00.0: amdgpu: amdgpu: finishing device.
[ 780.296559] ------------[ cut here ]------------
[ 780.296560] WARNING: CPU: 9 PID: 1162 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:615 amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 780.296728] Modules linked in: amdgpu(+) qrtr ccm algif_aead crypto_null cbc des_generic libdes ecb algif_skcipher cmac md4 algif_hash af_alg intel_rapl_msr intel_rapl_common edac_mce_amd kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi crct10dif_pclmul snd_hda_intel polyval_clmulni snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep polyval_generic snd_pcm snd_timer gf128mul snd soundcore ghash_clmulni_intel cfg80211 sp5100_tco sha1_ssse3 ccp wmi_bmof i2c_piix4 joydev k10temp rfkill rapl pcspkr gpio_amdpt acpi_cpufreq gpio_generic mac_hid pkcs8_key_parser dm_mod fuse ip_tables x_tables overlay squashfs loop isofs cdrom uas usb_storage usbhid drm_exec amdxcp drm_buddy crc32_pclmul gpu_sched crc32c_intel video sha512_ssse3 i2c_algo_bit sha256_ssse3 drm_suballoc_helper drm_ttm_helper aesni_intel r8169 ttm crypto_simd realtek mdio_devres drm_display_helper cryptd nvme libphy cec nvme_core xhci_pci nvme_common xhci_pci_renesas wmi [last unloaded: amdgpu]
[ 780.296795] CPU: 9 PID: 1162 Comm: modprobe Tainted: G W 6.6.3-arch1-1 #1 6156c717f7d423f5954ce718462aaaaa43b9110d
[ 780.296797] Hardware name: Micro-Star International Co., Ltd. MS-7C56/MPG B550 GAMING PLUS (MS-7C56), BIOS 1.40 10/28/2020
[ 780.296798] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 780.296961] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 7f 62 bf c1 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 6e 62 bf c1 b8 ea ff ff ff e9 64 62 bf c1
[ 780.296963] RSP: 0018:ffffc900012c39f8 EFLAGS: 00010246
[ 780.296965] RAX: ffff888102a2b1f0 RBX: ffff888116400000 RCX: 0000000000000000
[ 780.296966] RDX: 0000000000000000 RSI: ffff888116400c68 RDI: ffff888116400000
[ 780.296967] RBP: ffff888116441502 R08: 0000000000000000 R09: ffffc900012c37d0
[ 780.296968] R10: 0000000000000003 R11: ffff88822f32aee8 R12: ffff888116400010
[ 780.296969] R13: ffff8881164414e2 R14: 000000000000001e R15: 0000000000007301
[ 780.296970] FS: 00007fdae638c740(0000) GS:ffff888226c40000(0000) knlGS:0000000000000000
[ 780.296971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 780.296972] CR2: 00007f013f120000 CR3: 000000011f60c000 CR4: 0000000000f50ee0
[ 780.296973] PKRU: 55555554
[ 780.296974] Call Trace:
[ 780.296976] <TASK>
[ 780.296977] ? amdgpu_irq_put+0x46/0x70 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297140] ? __warn+0x81/0x130
[ 780.297144] ? amdgpu_irq_put+0x46/0x70 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297307] ? report_bug+0x171/0x1a0
[ 780.297311] ? handle_bug+0x3c/0x80
[ 780.297313] ? exc_invalid_op+0x17/0x70
[ 780.297315] ? asm_exc_invalid_op+0x1a/0x20
[ 780.297320] ? amdgpu_irq_put+0x46/0x70 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297482] ? srso_alias_return_thunk+0x5/0x7f
[ 780.297485] gmc_v10_0_hw_fini+0x53/0x80 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297648] amdgpu_device_fini_hw+0x1e8/0x330 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297803] ? blocking_notifier_chain_unregister+0x36/0x50
[ 780.297807] amdgpu_driver_load_kms+0xec/0x190 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.297962] amdgpu_pci_probe+0x150/0x440 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.298115] local_pci_probe+0x45/0xa0
[ 780.298119] pci_device_probe+0xc1/0x260
[ 780.298122] ? sysfs_do_create_link_sd+0x6e/0xe0
[ 780.298127] really_probe+0x19e/0x3e0
[ 780.298131] ? __pfx___driver_attach+0x10/0x10
[ 780.298133] __driver_probe_device+0x78/0x160
[ 780.298135] driver_probe_device+0x1f/0x90
[ 780.298138] __driver_attach+0xd2/0x1c0
[ 780.298140] bus_for_each_dev+0x88/0xd0
[ 780.298143] bus_add_driver+0x116/0x220
[ 780.298146] driver_register+0x59/0x100
[ 780.298148] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu 077fde8cd06fba79c8aeb0897b5bcf4aaec961a1]
[ 780.298304] do_one_initcall+0x5d/0x320
[ 780.298309] do_init_module+0x60/0x240
[ 780.298313] init_module_from_file+0x89/0xe0
[ 780.298318] idempotent_init_module+0x120/0x2b0
[ 780.298322] __x64_sys_finit_module+0x5e/0xb0
[ 780.298324] do_syscall_64+0x60/0x90
[ 780.298328] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298330] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298332] ? syscall_exit_to_user_mode+0x2b/0x40
[ 780.298335] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298337] ? do_syscall_64+0x6c/0x90
[ 780.298339] ? exit_to_user_mode_prepare+0x132/0x1f0
[ 780.298342] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298344] ? syscall_exit_to_user_mode+0x2b/0x40
[ 780.298346] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298348] ? do_syscall_64+0x6c/0x90
[ 780.298349] ? srso_alias_return_thunk+0x5/0x7f
[ 780.298351] ? do_syscall_64+0x6c/0x90
[ 780.298353] ? do_syscall_64+0x6c/0x90
[ 780.298355] ? exc_page_fault+0x7f/0x180
[ 780.298358] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 780.298360] RIP: 0033:0x7fdae649f73d
[ 780.298374] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 95 0c 00 f7 d8 64 89 01 48
[ 780.298375] RSP: 002b:00007fff66c5be48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 780.298377] RAX: ffffffffffffffda RBX: 00005573a80d6020 RCX: 00007fdae649f73d
[ 780.298378] RDX: 0000000000000004 RSI: 00005573a80d77e0 RDI: 0000000000000003
[ 780.298379] RBP: 00005573a80d77e0 R08: 0000000000000070 R09: ffffffffffffff88
[ 780.298380] R10: 0000000000000050 R11: 0000000000000246 R12: 0000000000040000
[ 780.298381] R13: 00005573a80d6150 R14: 00005573a80d77e0 R15: 00005573a80d7810
[ 780.298385] </TASK>
[ 780.298385] ---[ end trace 0000000000000000 ]---
[ 780.298544] amdgpu: probe of 0000:06:00.0 failed with error -62
[ 780.298835] [drm] amdgpu: ttm finalized
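For anyone triaging the log above, here is a minimal sketch that pulls out the failing lines, the numeric error code, and the largest pause between consecutive messages. It assumes dmesg lines in the `[ seconds.micros] message` format shown here; the function names (`parse_errors`, `error_codes`, `largest_gap`) and the `SAMPLE` excerpt are only for illustration and are not part of any amdgpu tooling.

```python
import errno
import re

# Matches dmesg lines such as "[ 780.294807] [drm:func [amdgpu]] *ERROR* text"
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Matches a trailing negative code, e.g. "failed -62" or "failed with error -62"
CODE_RE = re.compile(r"failed (?:with error )?(?P<code>-\d+)\s*$")

# Short excerpt copied from the log in this report.
SAMPLE = """\
[ 766.816957] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 780.294807] [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* Send long training msg failed.
[ 780.295981] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to process memory training!
[ 780.296157] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <psp> failed -62
[ 780.298544] amdgpu: probe of 0000:06:00.0 failed with error -62
"""


def parse_lines(log_text):
    """Return (timestamp, message) for every line that looks like dmesg output."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((float(m.group("ts")), m.group("msg")))
    return rows


def parse_errors(log_text):
    """Keep only the lines that contain '*ERROR*' or the word 'failed'."""
    return [(ts, msg) for ts, msg in parse_lines(log_text)
            if "*ERROR*" in msg or "failed" in msg]


def error_codes(hits):
    """Collect the distinct positive errno values reported at the end of a line."""
    codes = set()
    for _, msg in hits:
        m = CODE_RE.search(msg)
        if m:
            codes.add(-int(m.group("code")))
    return sorted(codes)


def largest_gap(log_text):
    """Return (gap_seconds, message_after_gap) for the longest pause in the log."""
    rows = parse_lines(log_text)
    gaps = [(b[0] - a[0], b[1]) for a, b in zip(rows, rows[1:])]
    return max(gaps) if gaps else (0.0, "")


if __name__ == "__main__":
    hits = parse_errors(SAMPLE)
    for ts, msg in hits:
        print(f"{ts:12.6f}  {msg}")
    for code in error_codes(hits):
        # errno names are platform specific; on Linux, 62 is ETIME.
        print(code, errno.errorcode.get(code, "unknown"))
    gap, msg = largest_gap(SAMPLE)
    print(f"largest gap: {gap:.2f}s before: {msg}")
```

On Linux, errno 62 is ETIME (timer expired), and the log sits idle for roughly 13.5 seconds before the first error. Taken together, that suggests the driver timed out waiting for a reply to the long training message rather than receiving a report of a specific failing address, though the log alone does not confirm that.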
Question: Is there a way to get the actual address that is failing to train?
Hardware description:
- CPU: Ryzen 5 5600X
- GPU: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf]
- System Memory: 8GB DDR4
- Display(s): Several
- Type of Display Connection: DP,HDMI
System information:
- Distro name and Version: Arch
- Kernel version: Linux archiso 6.6.3-arch1-1
- Custom kernel: N/A
- AMD official driver version: N/A
Attached files:
Screenshots/video files
I can see this when booting with nomodeset.
Edited by Waheed Barghouthi