On an Ally X unit running a 6.11.4 kernel variant (with the following patches on top of kernel-ark), gamescope displays a black screen and restarts. This happens after a suspend and within 20 minutes of gameplay (in this case GTA IV).
dmesg reports:
[ 4063.692766] amdgpu 0000:64:00.0: amdgpu: 000000008ac73718 pin failed
[ 4063.692774] [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
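For anyone reading along, the -12 in that message is a negated errno value. A minimal Python sketch of decoding it follows; the regular expression and the helper name `decode_pin_error` are illustrative assumptions, not part of any existing tool.

```python
import errno
import os
import re

# Hypothetical helper: extract the negative errno from the amdgpu message
# quoted above and translate it to its symbolic name and description.
PIN_ERROR = re.compile(r"Failed to pin framebuffer with error (-?\d+)")

def decode_pin_error(line):
    match = PIN_ERROR.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

line = ("[ 4063.692774] [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] "
        "*ERROR* Failed to pin framebuffer with error -12")
print(decode_pin_error(line))
```

Errno 12 is ENOMEM, so the pin failed because memory for the framebuffer could not be obtained, which is consistent with a memory placement problem.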
gamescope reports:
Could not resolve keysym XF86KbdInputAssistNextgrou
Errors from xkbcomp are not fatal to the X server
Errors from xkbcomp are not fatal to the X server
(EE)
(EE) failed to read Wayland events: Broken pipe
failed to read Wayland events: Broken pipe
[gamescope] [Error] drm: drmModeRmFB failed: Bad file descriptor
[gamescope] [Error] drm: drmModeRmFB failed: Bad file descriptor
[gamescope] [Error] drm: drmModeRmFB failed: Bad file descriptor
[gamescope] [Error] drm: drmModeRmFB failed: Bad file descriptor
Hardware description:
CPU: Z1 Extreme
GPU: 780M
System Memory: 24GB (512MB is vram; AutoUMA)
Display(s): 1080p VRR built in display, 120hz.
System information:
Distro name and Version: Bazzite Unstable on F41
Kernel version: 6.11.4 See here. 6.10 had a similar issue and was skipped (w/ AutoUMA as well).
How to reproduce the issue:
See above.
Log files (for system lockups / game freezes / crashes)
By the way, Mario, I tried over 5 times to get the Ally to break due to Modern Standby Assistant and could not. It seems that it does not wake up and works fine, which is why I did not email you. It is a standard feature on most Asus TUF laptops, so it is worth investigating nonetheless.
Due to the setup required to diagnose the issue (gamescope + game + steam) and the time it takes to reproduce the bug, it is unfortunately not practical to bisect the issue. It would take too long.
Any pointers on what to try to revert/change would be appreciated.
There have been the following amdgpu commits between kernel versions 6.9 and 6.10:
❯ git log v6.9..v6.10 --format="%h: %s" drivers/gpu/drm/amd/amdgpu
48880f9686b1: drm/amdgpu: Don't show false warning for reg list
bcfa48ff785b: drm/amdgpu: avoid using null object of framebuffer
74fa02c4a5ea: drm/amdgpu: Fix pci state save during mode-1 reset
f6f49dda49db: drm/amdgpu/atomfirmware: fix parsing of vram_info
ed5a4484f074: drm/amdgpu: init TA fw for psp v14
e356d321d024: drm/amdgpu: cleanup MES11 command submission
8bd82363e2ee: drm/amdgpu: revert "take runtime pm reference when we attach a buffer" v2
49c9ffabde55: drm/amdgpu: Indicate CU havest info to CP
84801d4f1e4f: drm/amdgpu: fix locking scope when flushing tlb
14731a640e55: Merge drm/drm-fixes into drm-misc-fixes
31849bf07e0f: drm/amdgpu: Fix the BO release clear memory warning
bb61cf46b66a: Merge tag 'amd-drm-fixes-6.10-2024-05-30' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
a9bc5a19e495: drm/amdgpu: Make CPX mode auto default in NPS4
1f327dfc846a: drm/amdkfd: simplify APU VRAM handling
a0cf36546cc2: drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
ba46b3bda296: drm/amdgpu: Adjust logic in amdgpu_device_partner_bandwidth()
56fb6f92854f: Merge tag 'drm-next-2024-05-25' of https://gitlab.freedesktop.org/drm/kernel
ec58991054e8: drm/amdgpu: correct hbm field in boot status
2c92ca849fcc: tracing/treewide: Remove second parameter of __assign_str()
e64e8f7c178e: drm/amdgpu/atomfirmware: add intergrated info v2.3 table
f0bae243b2bc: Merge tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
eb853413d02c: drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
2a705f3e49d2: drm/amdkfd: handle duplicate BOs in reserve_bo_and_cond_vms
ff9a79307f89: Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
db5d28c0bfe5: Merge tag 'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
b1992c3772e6: kbuild: use $(src) instead of $(srctree)/$(src) for source directory
05b8b6dd225d: Revert "drm: Switch DRM_DISPLAY_HELPER to depends on"
7fe302ae198a: Revert "drm: Switch DRM_DISPLAY_DP_HELPER to depends on"
95734469533c: Revert "drm: Switch DRM_DISPLAY_HDCP_HELPER to depends on"
d7c128cb775e: Revert "drm: Switch DRM_DISPLAY_HDMI_HELPER to depends on"
4a56c0ed5aa0: Merge tag 'amd-drm-next-6.10-2024-04-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
b84bc948528e: Merge v6.9-rc6 into drm-next
ea686fef5489: drm/amdgpu: fix the warning about the expression (int)size - len
029c2b03892b: drm/amdgpu/mes: add mes mapping legacy queue support
9a5f15d2a29d: drm/amdgpu: fix uninitialized scalar variable warning
2a8f7464d33c: drm/amdgpu: skip ip dump if devcoredump flag is set
e362b7c8f8c7: drm/amdgpu: Modify the contiguous flags behaviour
bd31e5026dc3: drm/amdkfd: Enable SQ watchpoint for gfx10
acce6479e30f: drm/amdgpu: Fix buffer size in gfx_v9_4_3_init_ cp_compute_microcode() and rlc_microcode()
6f3b69139c3c: drm/amdgpu: Fix ras mode2 reset failure in ras aca mode
506c245f3f1c: drm/amdgpu: fix double free err_addr pointer warnings
7bfd16d0ec37: drm/amdgpu: initialize the last_jump_jiffies in atom_exec_context
2d10c3dbde07: drm/amdgpu: add check before free wb entry
cd48b97ce778: drm/amdgpu: add return result for amdgpu_i2c_{get/put}_byte
8b2faf1a4f3b: drm/amdgpu: add error handle to avoid out-of-bounds
2e55bcf3d742: drm/amdgpu: Initialize timestamp for some legacy SOCs
48fa90718b2a: drm/amdgpu: Use new interface to reserve bad page
bcc093488503: drm/amdgpu: Fix address translation defect
e02387408117: drm/amdgpu: support ACA logging ecc errors
370fbff4cc6f: drm/amdgpu: add poison consumption handler
bfa579b38b86: drm/amdgpu: prepare to handle pasid poison consumption
314c38cde687: drm/amdgpu: retire bad pages for umc v12_0
e74313be5a71: drm/amdgpu: add condition check for amdgpu_umc_fill_error_record
2cf8e50ec381: drm/amdgpu: Add delay work to retire bad pages
f27defca6882: drm/amdgpu: umc v12_0 logs ecc errors
b2aa6b108dd3: drm/amdgpu: umc v12_0 converts error address
95b4063de4f4: drm/amdgpu: add interface to update umc v12_0 ecc status
a734adfbcdb0: drm/amdgpu: add poison creation handler
f493dd64ee66: drm/amdgpu: prepare for logging ecc errors
98b5bc878d4b: drm/amdgpu: add message fifo to handle RAS poison events
88a9a467c548: drm/amdgpu: Using uninitialized value *size when calling amdgpu_vce_cs_reloc
eef016ba8986: drm/amdgpu/mes11: Use a separate fence per transaction
497d7cee2457: drm/amdgpu: add a spinlock to wb allocation
754c366e41d2: drm/amdgpu: update fw_share for VCN5
8e1d1905951d: drm/amdgpu: Fix VRAM memory accounting
c551316e150b: drm/amdgpu: update jpeg max decode resolution
af8644121e3e: drm/amdgpu: add ip dump for each ip in devcoredump
e043a35dc244: drm/amdgpu: dump ip state before reset for each ip
c8732c80debb: drm/amdgpu: add support for gfx v10 print
40356542c361: drm/amdgpu: add protype for print ip state
c395dbb68b29: drm/amdgpu: add support of gfx10 register dump
e21d253bd74b: drm/amdgpu: add prototype for ip dump
af730e082035: drm/amdgpu: Add interface to reserve bad page
60c448439f3b: drm/amdgpu: Fix uninitialized variable warnings
f88da7fbf665: drm/amdgpu/mes: fix use-after-free issue
e0a9bbeea002: drm/amdgpu/sdma5.2: use legacy HDP flush for SDMA2/3
a16b95158644: drm/amdgpu: Update CGCG settings for GFXIP 9.4.3
ea9238a81b3a: drm/amdgpu: replace tmz flag into buffer flag
92ed1e9cd5f6: drm/amdgpu: init microcode chip name from ip versions
26de73bc0a73: drm/amdgpu: Fix the ring buffer size for queue VM flush
a522ec528cc7: drm/amdgpu/umsch: don't execute umsch test when GPU is in reset/suspend
6927b0168059: drm/amdgpu: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
e76691f45a60: drm/amdgpu: Update BO eviction priorities
6e042cee748f: drm/amdgpu/vcn: fix unitialized variable warnings
3f0664110a40: drm/amdgpu/mes11: print MES opcodes rather than numbers
2476c6bd950e: drm/amdgpu/vpe: fix vpe dpm setup failed
7b19f1f3466f: drm/amdgpu: Assign correct bits for SDMA HDP flush
939c4751819b: drm/amdgpu: Support setting reset_method at runtime
c058e7a8f8af: Merge drm/drm-next into drm-misc-next
a68c7eaa7a8f: drm/amdgpu: Enable clear page functionality
96950929eb23: drm/buddy: Implement tracking clear page feature
377b5b397d07: Merge tag 'amd-drm-next-6.10-2024-04-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
81bf14519a8c: drm/amdkfd: make sure VM is ready for updating operations
e53a1713de31: drm/amdgpu: Fix leak when GPU memory allocation fails
6a009ca1bf94: drm/amdgpu: remove virt_init_data_exchange from poison consumption handler
8954c3fbe764: drm/amdgpu: Change AID detection logic
93522c19488e: drm/amdgpu: enable redirection of irq's for IH V6.1
ea137071ada1: drm/amdgpu: Skip the coredump collection on reset during driver reload
ca0afa2f4161: drm/amdgpu: enable redirection of irq's for IH V6.0
5adcd78fa2bc: drm:amdgpu: enable IH ring1 for IH v6.1
eefc85a2779d: drm:amdgpu: enable IH RB ring1 for IH v6.0
34633158b8eb: Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
0c1195ca0d02: drm/amd/swsmu: support smu block discovery for smu v14
577cbed31818: drm/amdgpu: rename DBG_DRV to HAD_DRV for psp v14
1347853271ed: drm/amdgpu: refactoring the runtime pm mode detection code
6c6acc5f33ab: drm/amdgpu: Load ipkeymgr drv for psp v14
12b8b4e68510: drm/amdgpu: Add missing space to DRM_WARN() message
959056982a9b: drm/amdgpu: Fix discovery initialization failure during pci rescan
394ae0603a67: drm/amdgpu: fix visible VRAM handling during faults
98856136c485: drm/amdgpu: validate the parameters of bo mapping operations more clearly
f23558627f2b: drm/amdgpu: add new aca smu callback func parse_error_code()
f7c161a4c250: drm/amdgpu: increase mes submission timeout
3c858cf65e9a: drm/amdgpu: add missing vbios version from devcoredump
8b9130bae048: drm/amdgpu/gfx11: properly handle regGRBM_GFX_CNTL in soft reset
c8962679af35: drm/amdgpu: remove invalid resource->start check v2
a0e002cdac42: drm/amdgpu/sdma6: set sdma hang watchdog
6b0d78032f98: drm/amd/amdgpu: Update PF2VF Header
526b184e8883: drm/amdgpu: differentiate external rev id for gfx 11.5.0
d6d6561f936b: drm/amdgpu: fix incorrect number of active RBs for gfx11
d1999b4017d4: amd/amdgpu: improve VF recover time
b41f742d6fa6: drm/amdgpu: Set fatal errror detected flag earlier
05e40141685f: drm/amdgpu: clear set_q_mode_offs when VM changed
4b0cb230bdb7: drm/amdgpu: retire UMC v12 mca_addr_to_pa
f6ac0842364a: drm/amd/amdgpu: support MES command SET_HW_RESOURCE1 in sriov
9ecef5b2d0a0: drm/amdgpu: update check condition for XGMI ACA UE
91bc86011661: drm/amdgpu: Fix VCN allocation in CPX partition
fcc0735b0087: drm/amdgpu: Add support for BAMACO mode checking
327eec542746: drm/amdgpu: Bypass asd if display hw is not available
b2207dc6989f: drm/amdgpu/pm: Add support for MACO flag checking
7c1d9e10e664: drm/amd/pm: fix the high voltage issue after unload
3e2dacca5406: drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwards
b7a1a0ef12b8: drm/amd/amdgpu: add pipe1 hardware support
0453e5f2202e: drm/amdgpu: select HDP ref/mask according to gfx ring pipe
029faefb7302: drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
f5a3507c4abf: drm/amdgpu: add smu 14.0.1 discovery support
8966c3167402: drm/amdgpu : Increase the mes log buffer size as per new MES FW version
e58acb7613aa: drm/amdgpu : Add mes_log_enable to control mes log feature
81d96e8b5a85: drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()
df3c7dc5c58b: drm/amdgpu: Reset dGPU if suspend got aborted
6a0e1bafd70f: drm/amdgpu: add IP's FW information to devcoredump
108ab31be9d5: drm/amdgpu/umsch: reinitialize write pointer in hw init
cd409dbc6986: drm/amdgpu: Refine IB schedule error logging
fee54d08bc83: Merge tag 'drm-misc-next-2024-03-28' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
0d21364c6e8d: Merge drm/drm-next into drm-misc-next
f6d2dc03fa85: drm: Switch DRM_DISPLAY_HDMI_HELPER to depends on
3166e7e6d935: drm: Switch DRM_DISPLAY_HDCP_HELPER to depends on
0323287de87d: drm: Switch DRM_DISPLAY_DP_HELPER to depends on
e075e496f516: drm: Switch DRM_DISPLAY_HELPER to depends on
d7f148764355: drm/amdgpu: always force full reset for SOC21
c25d09bcb79f: drm/amdgpu: fix deadlock while reading mqd from debugfs
b9a8aee136b7: drm/amdgpu: enable UMSCH 4.0.6
f3e698978cfb: drm/amdgpu/umsch: update UMSCH 4.0 FW interface
0355b24bdec3: drm/amd: Flush GFXOFF requests in prepare stage
eb4f6eca2632: drm/amdgpu: Fix truncations in gfx_v11_0_init_microcode()
8e4617c25d53: drm/amdgpu: simplify convert_error_address interface for UMC v12
539ff12ee5e4: drm/amdgpu: Fix truncation issues in gfx_v9_0.c
927a8a800ebb: drm/amdgpu: Fix truncation in gfx_v10_0_init_microcode
20fd14460f45: drm/amdgpu: Fix 'fw_name' buffer size to prevent truncations in amdgpu_mes_init_microcode
7c2bc34ab926: drm/amdgpu: Fix format character cut-off issues in amdgpu_vcn_early_init()
8b3495eafb4d: drm/amdgpu: add socket id parameter for psp query address cmd
f88a7dd06ab4: drm/amdgpu: Add a NULL check for freeing root PT
9022f01b9709: drm/amdgpu: refactor code to split devcoredump code
9ddafd1d1404: drm/amdgpu/vpe: power on vpe when hw_init
31fd330b97ba: drm/amdgpu: add ras event id support for ACA
bd15bf742f6d: drm/amdgpu: avoid update aca bank multi times during ras isr
f7bcfb7a56b2: drm/amdgpu: retrieve umc odecc error count for aca umc v12.0
b6c4f90b3819: drm/amdgpu: sync page table freeing with tlb flush
a61e2ce9d425: drm/amdgpu: Enable smuio v14_0_2 callbacks
d80e44a34e25: drm/amdgpu: Add smuio callback to get gpu clk counter
2d93151de890: drm/amdgpu: Add smuio v14_0_2 ip block support
b93d759f540a: drm/amdgpu: add umc v12.0.0 deferred error support
865d3397630b: drm/amdgpu: add aca deferred error type support
2fc46e0b2fe8: drm/amdgpu: make reset method configurable for RAS poison
e3d4de8d8b24: drm/amdgpu: retire unused aca_bank_report data structure
f26c4e3fc999: drm/amdgpu: Update setting EEPROM table version
69bf42fbb227: drm/amdgpu: refine aca error cache for umc v12.0
87428b405437: drm/amdgpu: refine aca error cache for sdma v4.4.2
62d2aaa7d466: drm/amdgpu: refine aca error cache for xgmi v6.4.0
d8070c424108: drm/amdgpu: support utcl2 RAS poison query for mmhub
176c3e89567f: drm/amdgpu: add utcl2 RAS poison query for mmhub
5275114a7043: drm/amdgpu: refine aca error cache for mmhub v1.8
d8a3f0a0348d: drm/amdgpu: implement TLB flush fence
e6136150cd26: drm/amdgpu: refine aca error cache for gfx v9.4.3
949899cbacf5: drm/amdgpu: add new api to save error count into aca cache
abc3b5d21d34: drm/amdgpu: add new aca_smu_type support
6fe4dab331a7: drm/amdgpu: remove the adev check for NULL
3cfaadbe0fcb: drm/amdgpu: add support for atom fw version v3_5
765bea0d73b1: drm/amdgpu: Apply retry to IP discovery v2 and v4
9dc57c2adf2c: drm/amdgpu: add ras event id support
ab66c832847f: drm/amdgpu: trigger flr_work if reading pf2vf data failed
d72e2bdac4ad: drm/amdgpu: add the hw_ip version of all IP's
6bb89d134042: drm/amdgpu: Skip virt_exchange_init on SDMA poison consumption
dfe9c3cde229: drm/amdgpu: Do a basic health check before reset
97d5aa60306d: drm/amdgpu: cleanup unused variable
0c501d3c11bb: drm/amdgpu: skip GFX FED error in page fault handling
71a8d61ebc38: drm/amdgpu: retire gfx ras query_utcl2_poison_status
3eb899c40a61: drm/amdgpu: add ring buffer information in devcoredump
583681d4a417: drm/amdgpu: add vm fault information to devcoredump
fb0f5f541475: drm/amdgpu: add utcl2 poison query for gfxhub
dc406d92a097: drm/amdgpu: add recent pagefault info in vm_manager
d6eb77731c45: Merge drm/drm-next into drm-misc-next
216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
From those, the following look suspicious:
31849bf07e0f: drm/amdgpu: Fix the BO release clear memory warning
1f327dfc846a: drm/amdkfd: simplify APU VRAM handling
eb853413d02c: drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
2a705f3e49d2: drm/amdkfd: handle duplicate BOs in reserve_bo_and_cond_vms
216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
394ae0603a67: drm/amdgpu: fix visible VRAM handling during faults
e76691f45a60: drm/amdgpu: Update BO eviction priorities
8e1d1905951d: drm/amdgpu: Fix VRAM memory accounting
8b2faf1a4f3b: drm/amdgpu: add error handle to avoid out-of-bounds
And from those, the following have a description that seems relevant
216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
eb853413d02c: drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
394ae0603a67: drm/amdgpu: fix visible VRAM handling during faults
e76691f45a60: drm/amdgpu: Update BO eviction priorities
8e1d1905951d: drm/amdgpu: Fix VRAM memory accounting
And from those, the following have content that looks very relevant:
216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
eb853413d02c: drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
394ae0603a67: drm/amdgpu: fix visible VRAM handling during faults
8e1d1905951d: drm/amdgpu: Fix VRAM memory accounting
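As a rough illustration of how such a first pass over `git log --format="%h: %s"` output could be automated, here is a Python sketch. The keyword list and the function names are assumptions made for this example, not the criteria actually used to build the lists above; a keyword match only narrows what to read, it does not replace looking at the patches.

```python
import re

# Assumed keywords for memory-placement changes (hypothetical choice).
PATTERN = re.compile(r"\b(vram|gtt|bos?|eviction)\b", re.IGNORECASE)

def parse_log(text):
    """Split 'hash: subject' lines into (hash, subject) tuples."""
    commits = []
    for line in text.splitlines():
        sha, sep, subject = line.strip().partition(": ")
        if sep:
            commits.append((sha, subject))
    return commits

def suspicious(commits):
    """Keep commits whose subject mentions one of the keywords."""
    return [(sha, subject) for sha, subject in commits
            if PATTERN.search(subject)]

sample = """48880f9686b1: drm/amdgpu: Don't show false warning for reg list
216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
e76691f45a60: drm/amdgpu: Update BO eviction priorities
8e1d1905951d: drm/amdgpu: Fix VRAM memory accounting"""

for sha, subject in suspicious(parse_log(sample)):
    print(sha, subject)
```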
eb853413d02c: drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs only affects ROCm applications, so it is probably not that one. @ckoenig I think there may have been an issue on APUs caused by the fix for handling buffer migration back to VRAM after suspend on dGPUs for Vulkan. That should not be necessary on APUs. Do you remember the details on that?
Ok messing with a revert of 216c1282dde3: drm/amdgpu: use GTT only as fallback for VRAM|GTT
Cannot get it to crash with AutoUMA for the last hour. Seems like that was it, perhaps combined with something else. Full patch series below. We will continue testing it ofc.
edit: if you can still reproduce, would you be able to attach the memory usage statistics before/after suspend?
(= save the content of /sys/class/drm/renderD128/device/mem_info_gtt_used, /sys/class/drm/renderD128/device/mem_info_vram_used and /sys/class/drm/renderD128/device/mem_info_vis_vram_used)
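A minimal Python sketch of taking such a snapshot follows. The three file names come from the request above; the function name `read_mem_info` and the choice to return `None` for unreadable counters are assumptions for illustration.

```python
from pathlib import Path

# Counter names taken from the request above.
FIELDS = ("mem_info_gtt_used", "mem_info_vram_used", "mem_info_vis_vram_used")

def read_mem_info(device_dir, fields=FIELDS):
    """Return {counter name: value in bytes}, or None where unreadable."""
    device = Path(device_dir)
    stats = {}
    for name in fields:
        try:
            stats[name] = int((device / name).read_text().strip())
        except (OSError, ValueError):
            stats[name] = None  # missing file or insufficient permissions
    return stats

if __name__ == "__main__":
    print(read_mem_info("/sys/class/drm/renderD128/device"))
```

Running it once before suspend and once after resume gives two dictionaries that can be compared directly.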
Hi Pierre,
Yes. I think I hit it once without even suspending or just being in SteamUI.
The pattern was that randomly within 2 hours after booting there would be a crash. I think suspending made it happen a lot faster (within 1-2 minutes).
I cannot reproduce the bug anymore with the new kernel. I played GTA IV for 2 days with dozens of suspends, without rebooting and without hitting that crash. I then got a hard lock with no logs, which is very likely for a completely different reason.
Here is what you asked for with the new kernel (after the revert).
Before suspend:
bazzite@antheas-ally-x:~$ for f in /sys/class/drm/renderD128/device/mem_*; do echo "$f:"; sudo cat $f; done
/sys/class/drm/renderD128/device/mem_info_gtt_total:
12784238592
/sys/class/drm/renderD128/device/mem_info_gtt_used:
1611788288
/sys/class/drm/renderD128/device/mem_info_preempt_used:
0
/sys/class/drm/renderD128/device/mem_info_vis_vram_total:
536870912
/sys/class/drm/renderD128/device/mem_info_vis_vram_used:
504123392
/sys/class/drm/renderD128/device/mem_info_vram_total:
536870912
/sys/class/drm/renderD128/device/mem_info_vram_used:
500355072
After Suspend:
bazzite@antheas-ally-x:~$ for f in /sys/class/drm/renderD128/device/mem_*; do echo "$f:"; sudo cat $f; done
/sys/class/drm/renderD128/device/mem_info_gtt_total:
12784238592
/sys/class/drm/renderD128/device/mem_info_gtt_used:
1611382784
/sys/class/drm/renderD128/device/mem_info_preempt_used:
0
/sys/class/drm/renderD128/device/mem_info_vis_vram_total:
536870912
/sys/class/drm/renderD128/device/mem_info_vis_vram_used:
479014912
/sys/class/drm/renderD128/device/mem_info_vram_total:
536870912
/sys/class/drm/renderD128/device/mem_info_vram_used:
473935872
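To make the two listings easier to compare, here is a small Python sketch that computes the change in each "used" counter. The byte values are copied from the output above; the shortened dictionary keys and the `deltas` helper are just for this example.

```python
MIB = 1024 * 1024

# Byte values copied from the before/after listings above.
before = {"gtt_used": 1611788288, "vis_vram_used": 504123392, "vram_used": 500355072}
after = {"gtt_used": 1611382784, "vis_vram_used": 479014912, "vram_used": 473935872}

def deltas(before, after):
    """Return after - before for every counter present in 'before'."""
    return {key: after[key] - before[key] for key in before}

for key, diff in deltas(before, after).items():
    print(f"{key}: {diff:+d} bytes ({diff / MIB:+.2f} MiB)")
```

On the working kernel, VRAM usage is roughly 25 MiB lower after resume, while GTT usage is nearly unchanged.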
If you think it is worth it I could test with the broken kernel, given it would take me an hour or so.
Yes, ideally reporting these numbers with a broken kernel would be useful.
Capturing: /sys/kernel/debug/dri/0/amdgpu_vram_mm and /sys/kernel/debug/dri/0/amdgpu_gtt_mm could also be helpful to identify a fragmentation issue.
(even better, if possible, would be to log these files regularly, so we can see their evolution until the "Failed to pin" error)
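A minimal Python sketch of such periodic logging is shown below. The two debugfs paths are the ones named above; the function name, the numbered output file naming and the default interval are assumptions, and reading debugfs normally requires root. This is not the capture script attached later in the thread.

```python
import time
from pathlib import Path

# Paths named in the request above.
DEBUG_FILES = ("/sys/kernel/debug/dri/0/amdgpu_vram_mm",
               "/sys/kernel/debug/dri/0/amdgpu_gtt_mm")

def capture(sources, out_dir, interval=10.0, iterations=1):
    """Copy each source file into out_dir once per iteration."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(iterations):
        for src in map(Path, sources):
            dest = out / f"{i:06d}-{src.name}"
            try:
                dest.write_text(src.read_text())
                written.append(dest)
            except OSError:
                pass  # file missing or not readable; skip this sample
        if i + 1 < iterations:
            time.sleep(interval)
    return written

if __name__ == "__main__":
    # Example: sample every 10 seconds for one hour.
    capture(DEBUG_FILES, "amdgpu-mm-log", interval=10.0, iterations=360)
```

Numbered snapshots make it possible to diff consecutive samples and see how the allocations evolve up to the "Failed to pin" error.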
Here you go. Performance up to the crash was significantly worse on the broken kernel.
Two folders: kernel-working and kernel-broken + the capture script
Inside each folder there is a log containing summary information about the capture, and in the folders are the files you asked for. You can monitor the wakeups to see when the suspend happens. Before the suspend, wakeup count increases by one, and after by 2-3.
adev->gmc.xgmi.connected_to_cpu is set for Instinct boards connected to the CPU directly via XGMI (infinity fabric) rather than PCIe. adev->gmc.is_app_apu is set for big APUs like MI300A which have no VRAM at all.
Yeah, the TTM stuff is still on my TODO list to fix up completely. But I clearly remember that we discussed this on my call and I pushed the patch separately as a result. Maybe it got stuck in CI or something.
Anyway let's fix that issue by making the proposed change.