WX7500: screen frozen and [CRTC:56:crtc-0] flip_done timed out
[Symptom]:
The screen freezes during boot.
There is an error message in dmesg:
[CRTC:56:crtc-0] flip_done timed out
[Misc]
- VP with the latest drmtip.
- VP after removing xserver-xorg-video-amdgpu.
- VP after updating to the latest gc_11_0_2_mes_2.bin, dcn_3_1_4_dmcub.bin and dcn_3_1_5_dmcub.bin.
-> This relieves the issue: the error did not show up before reaching the desktop/login page, but after leaving the machine idle for a while the screen froze again.
- VNP with pcie_aspm or amdgpu.aspm disabled.
lspci_wx5700_202306280930
dmesg_amdgpuAspmOff_202306280921
lspci_all_202306281215
dmesg_flipDoneTimedOut_202306281230
- kobako changed the description
- Owner
Does appending amdgpu.ppfeaturemask=0xfff7bffb on the kernel command line help? That disables PCIe DPM. I think some intel platforms have problems with that.
@agd5f amdgpu.ppfeaturemask=0xfff7bffb doesn't help to relieve the issue.
$ sudo dmesg | grep amdgpu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=8fdb0200-74b5-48e2-a5f5-2968850794ab ro debug splash i10nm_edac.dyndbg=+p amdgpu.ppfeaturemask=0xfff7bffb vt.handoff=7
[ 0.037482] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=8fdb0200-74b5-48e2-a5f5-2968850794ab ro debug splash i10nm_edac.dyndbg=+p amdgpu.ppfeaturemask=0xfff7bffb vt.handoff=7
[ 8.333166] [drm] amdgpu kernel modesetting enabled.
[ 8.333806] amdgpu: CRAT table not found
[ 8.333815] amdgpu: Virtual CRAT table created for CPU
[ 8.333910] amdgpu: Topology: Add CPU node
[ 8.334239] amdgpu 0000:57:00.0: enabling device (0146 -> 0147)
[ 8.341816] amdgpu 0000:57:00.0: amdgpu: Fetched VBIOS from VFCT
[ 8.341818] amdgpu: ATOM BIOS: 113-D7491200-100
[ 8.356455] amdgpu 0000:57:00.0: amdgpu: CP RS64 enable
[ 8.375166] amdgpu 0000:57:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 8.379898] amdgpu 0000:57:00.0: vgaarb: deactivate vga console
[ 8.379900] amdgpu 0000:57:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 8.379989] amdgpu 0000:57:00.0: BAR 2: releasing [mem 0x202ff0000000-0x202ff01fffff 64bit pref]
[ 8.379991] amdgpu 0000:57:00.0: BAR 0: releasing [mem 0x202fe0000000-0x202fefffffff 64bit pref]
[ 8.380019] amdgpu 0000:57:00.0: BAR 0: assigned [mem 0x202000000000-0x2021ffffffff 64bit pref]
[ 8.380027] amdgpu 0000:57:00.0: BAR 2: assigned [mem 0x202200000000-0x2022001fffff 64bit pref]
[ 8.380077] amdgpu 0000:57:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 8.380079] amdgpu 0000:57:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 8.380080] amdgpu 0000:57:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 8.380198] [drm] amdgpu: 8176M of VRAM memory ready
[ 8.380200] [drm] amdgpu: 7610M of GTT memory ready.
[ 8.383665] amdgpu 0000:57:00.0: amdgpu: Will use PSP to load VCN firmware
[ 8.545659] amdgpu 0000:57:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 8.553833] amdgpu 0000:57:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 8.553836] amdgpu 0000:57:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 8.621452] amdgpu 0000:57:00.0: amdgpu: SMU is initialized successfully!
[ 8.660695] snd_hda_intel 0000:57:00.1: bound 0000:57:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 8.920126] amdgpu 0000:57:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 8.959064] amdgpu: HMM registered 8176MB device memory
[ 8.960541] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 8.960544] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 8.960582] amdgpu: Virtual CRAT table created for GPU
[ 8.960642] amdgpu: Topology: Add dGPU node [0x7489:0x1002]
[ 8.960644] kfd kfd: amdgpu: added device 1002:7489
[ 8.960655] amdgpu 0000:57:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[ 8.960792] amdgpu 0000:57:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 8.960794] amdgpu 0000:57:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 8.960794] amdgpu 0000:57:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 8.960795] amdgpu 0000:57:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 8.960796] amdgpu 0000:57:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 8.960796] amdgpu 0000:57:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 8.960797] amdgpu 0000:57:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 8.960798] amdgpu 0000:57:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 8.960798] amdgpu 0000:57:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 8.960799] amdgpu 0000:57:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 8.960800] amdgpu 0000:57:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 8.960801] amdgpu 0000:57:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 8.960801] amdgpu 0000:57:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 8.960802] amdgpu 0000:57:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 8.963821] amdgpu 0000:57:00.0: amdgpu: Using BACO for runtime pm
[ 8.964737] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:57:00.0 on minor 0
[ 8.969905] fbcon: amdgpudrmfb (fb0) is primary device
[ 9.221296] amdgpu 0000:57:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 28.745206] amdgpu 0000:57:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out
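For repeated reboot tests it can help to scan saved dmesg text for the error automatically rather than by eye. The following is a minimal sketch, assuming the log lines have the same shape as the one quoted above; the function name `find_flip_timeouts` and the regular expression are illustrative and not part of any amdgpu tooling.

```python
import re

# Matches lines shaped like the one in this report, e.g.
# [ 28.745206] amdgpu 0000:57:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out
# The asterisks around ERROR are optional because some renderers strip them.
FLIP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<bdf>[0-9a-f:.]+):\s+\[drm\]\s+"
    r"\*?ERROR\*?\s+\[CRTC:(?P<crtc>\d+):[^\]]+\]\s+flip_done timed out"
)


def find_flip_timeouts(dmesg_text):
    """Return (timestamp_seconds, pci_address, crtc_id) for each timeout found."""
    return [
        (float(m.group("ts")), m.group("bdf"), int(m.group("crtc")))
        for m in FLIP_RE.finditer(dmesg_text)
    ]


sample = (
    "[ 9.221296] amdgpu 0000:57:00.0: [drm] fb0: amdgpudrmfb frame buffer device\n"
    "[ 28.745206] amdgpu 0000:57:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out\n"
)
print(find_flip_timeouts(sample))
```

An empty result for a given boot corresponds to VNP for that run; any entry corresponds to VP.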
- Developer
On an unpatched build, but with this kernel command line option, can you please read out pp_features from sysfs? Some of those bits might not be effective at module load time; I want to confirm what happened.
@superm1 this kernel is an unpatched one.
$ sudo cat /sys/module/amdgpu/parameters/ppfeaturemask
0xfff7bfff
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=0e821db1-a455-40e9-829c-5969f50d575a ro quiet splash amdgpu.dcdebugmask=0x10 vt.handoff=7
- Developer
When you used 0xfff7bffb on cmdline does it also reflect in there?
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=0e821db1-a455-40e9-829c-5969f50d575a ro quiet splash amdgpu.ppfeaturemask=0xfff7bffb vt.handoff=7
$ sudo cat /sys/module/amdgpu/parameters/ppfeaturemask
0xfff7bffb
$ sudo dmesg -w | grep -ie flip_done
[ 67.655894] amdgpu 0000:57:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out
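The two ppfeaturemask values seen in this thread (0xfff7bfff without the override, 0xfff7bffb with it) differ in a single bit. A minimal sketch of that comparison follows; the helper name `cleared_bits` is illustrative. That the cleared bit is the PCIe DPM bit is taken from the Owner's comment above, not derived by the code.

```python
DEFAULT_MASK = 0xfff7bfff  # read from sysfs without the override
TESTED_MASK = 0xfff7bffb   # passed as amdgpu.ppfeaturemask on the command line


def cleared_bits(before, after):
    """Return the bit positions that are set in `before` but cleared in `after`."""
    diff = before & ~after
    return [bit for bit in range(32) if diff & (1 << bit)]


print(cleared_bits(DEFAULT_MASK, TESTED_MASK))  # [2], i.e. mask value 0x4
```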
- Developer
Wait; this is the wrong thing you're reading. I'm talking about specifically the sysfs file pp_features. Do you have that in the GPU's sysfs directories?
- Developer
/sys/bus/pci/drivers/amdgpu/$BDF
$ sudo cat /sys/bus/pci/drivers/amdgpu/0000\:57\:00.0/pp_features
features high: 0x0001a3bd low: 0x71ffffff
No. Feature Bit : State
00. FW_DATA_READ ( 0) : enabled
01. DPM_GFXCLK ( 1) : enabled
02. DPM_GFX_POWER_OPTIMIZER ( 2) : enabled
03. DPM_UCLK ( 3) : enabled
04. DPM_FCLK ( 4) : enabled
05. DPM_SOCCLK ( 5) : enabled
06. DPM_MP0CLK ( 6) : enabled
07. DPM_LINK ( 7) : enabled
08. DPM_DCN ( 8) : enabled
09. VMEMP_SCALING ( 9) : enabled
10. VDDIO_MEM_SCALING (10) : enabled
11. DS_GFXCLK (11) : enabled
12. DS_SOCCLK (12) : enabled
13. DS_FCLK (13) : enabled
14. DS_LCLK (14) : enabled
15. DS_DCFCLK (15) : enabled
16. DS_UCLK (16) : enabled
17. GFX_ULV (17) : enabled
18. FW_DSTATE (18) : enabled
19. GFXOFF (19) : enabled
20. BACO (20) : enabled
21. MM_DPM (21) : enabled
22. SOC_MPCLK_DS (22) : enabled
23. BACO_MPCLK_DS (23) : enabled
24. THROTTLERS (24) : enabled
25. SMARTSHIFT (25) : disabled
26. GTHR (26) : disabled
27. ACDC (27) : disabled
28. VR0HOT (28) : enabled
29. FW_CTF (29) : enabled
30. FAN_CONTROL (30) : enabled
31. GFX_DCS (31) : disabled
32. GFX_READ_MARGIN (32) : enabled
33. LED_DISPLAY (33) : disabled
34. GFXCLK_SPREAD_SPECTRUM (34) : enabled
35. OUT_OF_BAND_MONITOR (35) : enabled
36. OPTIMIZED_VMIN (36) : enabled
37. GFX_IMU (37) : enabled
38. BOOT_TIME_CAL (38) : disabled
39. GFX_PCC_DFLL (39) : enabled
40. SOC_CG (40) : enabled
41. DF_CSTATE (41) : enabled
42. GFX_EDC (42) : disabled
43. BOOT_POWER_OPT (43) : disabled
44. CLOCK_POWER_DOWN_BYPASS (44) : disabled
45. DS_VCN (45) : enabled
46. BACO_CG (46) : disabled
47. MEM_TEMP_READ (47) : enabled
48. ATHUB_MMHUB_PG (48) : enabled
49. SOC_PCC (49) : disabled
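The point of this readout is that DPM_LINK (bit 7) is still reported as enabled even with the ppfeaturemask override in place. A minimal sketch of checking a single feature from the header line follows, assuming "high" holds bits 32 to 63 and "low" holds bits 0 to 31 (an assumption that agrees with the per-feature states listed above); `feature_enabled` is an illustrative helper name.

```python
def feature_enabled(high, low, bit):
    """Check one bit of the 64-bit feature mask printed as 'high' and 'low' words."""
    mask = (high << 32) | low
    return bool(mask & (1 << bit))


HIGH, LOW = 0x0001a3bd, 0x71ffffff  # from the "features high: ... low: ..." line
DPM_LINK_BIT = 7                    # bit number printed next to DPM_LINK

print(feature_enabled(HIGH, LOW, DPM_LINK_BIT))  # True: link DPM is still on
```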
- Developer
That was with the value 0xfff7bffb set for ppfeaturemask, right?
- Developer
In that case I think this is because the PSP on this dGPU sets adev->scpm_enabled. So the attempts to turn off DPM via this method don't work.
- Developer
MP0/PSP is a microcontroller on the dGPU. This policy is queried from PSP, and it controls what the driver is actually able to change.
So this explains why the kernel module parameter doesn't work in this case to turn off DPM.
- Developer
OK, another way to try instead. Can you populate the values for amdgpu.pcie_lane_cap and amdgpu.pcie_gen_cap to match the max of what the platform can support in this slot?
You can find the definitions in https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/include/amd_pcie.h
For example if it is a Gen3 x16 slot:
amdgpu.pcie_lane_cap=0x00200000 amdgpu.pcie_gen_cap=0x00000004
I would expect this will prevent operating at Gen1 or Gen2 and x8 or x4 width while in ASPM. If this works, this could be another way for quirking.
@superm1 I configured the highest values and got VNP across 5 reboots. Did I understand correctly?
$ sudo cat /sys/module/amdgpu/parameters/pcie_*_cap
16
4194304
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=0e821db1-a455-40e9-829c-5969f50d575a ro quiet splash amdgpu.pcie_lane_cap=0x00400000 amdgpu.pcie_gen_cap=0x00000010 vt.handoff=7
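The sysfs parameters print in decimal while the command line uses hex, so it is worth confirming the two agree. A minimal sketch follows, assuming the shell glob lists pcie_gen_cap before pcie_lane_cap (alphabetical order), so that 16 is the gen cap and 4194304 is the lane cap; `parse_cap` and `caps_match` are illustrative helper names.

```python
def parse_cap(text):
    """Parse a cap value written either as hex (0x...) or as decimal."""
    return int(text, 0)


def caps_match(sysfs_text, cmdline_text):
    """Compare a decimal sysfs readout against the hex command line value."""
    return parse_cap(sysfs_text) == parse_cap(cmdline_text)


print(caps_match("16", "0x00000010"))       # pcie_gen_cap
print(caps_match("4194304", "0x00400000"))  # pcie_lane_cap
print(hex(parse_cap("4194304")))            # 0x400000
```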
- Developer
Yes; this might be root cause then. I'll work out a different patch for you to try and send later on.
- Developer
OK - have a try with this patch.
0001-drm-amd-For-Intel-platforms-cap-lane-width-and-speed.patch
If this works, I would like you to also:
1) Do an experiment where you tear out the hunk for setting just the width cap and see if that works (keep the speed cap)
2) Do an experiment where you tear out the hunk for setting just the speed cap and see if that works (keep the width cap)
(Scratched out in favor of branch)
Edited by Mario Limonciello
- Developer
If you haven't already tried, I might have a more scalable solution. Please try this branch instead. https://gitlab.freedesktop.org/superm1/linux/-/tree/mlimonci/block-lane-switching-intel?ref_type=heads
- Developer
@koba can you please try the branch I shared with no parameters?
@superm1 unfortunately, VP against block-lane-switching-intel w/o parameters
$ sudo cat /sys/module/amdgpu/parameters/pcie_*_cap
0
0
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230511-generic root=UUID=0e821db1-a455-40e9-829c-5969f50d575a ro quiet splash vt.handoff=7
$ sudo dmesg | grep -ie flip_done
[ 31.814910] amdgpu 0000:57:00.0: [drm] *ERROR* [CRTC:72:crtc-0] flip_done timed out
- Developer
Did you cherry pick those two patches? Otherwise that looks like wrong kernel branch.
- Developer
OK thanks. I was going to suggest I'll add some extra dev_info prints around the numbers sent if that doesn't work.
@superm1 cherry-picked two patches from branch block-lane-switching-intel:
- 9b3766ad4266e - drm/amd/pm: conditionally disable pcie lane/speed switching for SMU13 (16 hours ago) <Evan Quan>
- eb132a8bdf809 - drm/amd/pm: share the code around SMU13 pcie parameters update (16 hours ago) <Evan Quan>
VNP after 5 reboots.
$ cat /sys/module/amdgpu/parameters/pcie_*_cap
0
0
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.4.0-060400rc1drmtip20230510-generic root=UUID=c789b261-e0b9-49a9-bca0-e26d4221cf00 ro quiet splash vt.handoff=7
$ sudo dmesg | grep -ie flip_done
- Developer
OK. I'll send them out to M/L then for review.
Can you please try to cherry pick these back to OEM-6.1 as well? I expect they should also work there but want to make sure.
@superm1 tried against oem-6.1: VNP after 30 reboots.
dmesg_oemkReboot30_202307091819
- Developer
Thanks, we'll land this then. Feel free to cherry-pick it to OEM-6.1 earlier if you need it sooner.
- Developer
Yeah, Alex will include it in an upcoming 6.5-fixes PR. You can take the patches now if you want though, they're not going to change.
- Developer
Here's the SHA from the drm-fixes-6.5 PR.
Edited by Mario Limonciello
- Owner
yes.
- Developer
This patch expands the ASPM disablement for Intel across all the dGPUs and adds your system (sapphire rapids). Can you confirm this works?
- Developer
In the past it has been because of things like Intel not supporting dynamic lane width switching but AMD using this while ASPM is enabled. But I would have thought that testing PCIe DPM turned off should have avoided that.
The patch I proposed is a pretty big hammer though, I'm not sure if we should go that route.
@superm1
Rebooted 30 times; the 'flip_done timed out' error message did not appear.
dmesg_reboot30_202307041629
- Developer
Thanks, I CC'ed you on the submission for this. I think we'll do this for now.
- Mario Limonciello added 7000 dGPU series hang/freeze labels
- Mario Limonciello marked this issue as related to #2667
- Mario Limonciello added aspm label
- Evan Quan mentioned in commit superm1/linux@9b3766ad
- Mario Limonciello mentioned in commit agd5f/linux@bf6880df
- Mario Limonciello mentioned in commit agd5f/linux@c2e3f5b5
- Mario Limonciello mentioned in commit agd5f/linux@31c7a3b3
- Developer
This is merged in Linus' tree at 6.5-rc2, closing.
- Mario Limonciello closed
- Mario Limonciello mentioned in commit nouveau@bd8cd38d
- Mario Limonciello mentioned in commit nouveau@a924e0fa