We've noticed that, irrespective of kernel version (5.13, 6.0, 6.3, etc.), setting the profile mode to profile_peak freezes the APU in such a way that it's not even possible to collect logs. The following command reproduces it reliably for me:
while (true); do echo profile_peak > /sys/class/drm/card0/device/power_dpm_force_performance_level; sleep 1; echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level; sleep 1; done
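For anyone who prefers a bounded run over an endless loop, the same toggle can be sketched as a small shell function. This is only an illustration of the one-liner above: the function name `toggle_dpm_level` and its iteration/delay parameters are made up here, and the sysfs path is the one from the report.

```shell
#!/bin/sh
# Bounded variant of the reproducer loop (hypothetical helper, not part of any tool).
# Usage: toggle_dpm_level FILE [ITERATIONS] [DELAY_SECONDS]
toggle_dpm_level() {
    file=$1
    iterations=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Force the peak profile, wait, then hand control back to the driver.
        echo profile_peak > "$file" || return 1
        sleep "$delay"
        echo auto > "$file" || return 1
        sleep "$delay"
        i=$((i + 1))
        echo "iteration $i/$iterations done"
    done
}

# Example (as root, on the affected machine):
# toggle_dpm_level /sys/class/drm/card0/device/power_dpm_force_performance_level 30 1
```

Printing a line per iteration makes it easier to tell how many toggles the APU survived before freezing, since the last line on the terminal (or over ssh) is the last completed iteration.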
It can also be reproduced by using the ioctl (AMDGPU_CTX_OP_SET_STABLE_PSTATE) to set the profile_peak mode.
I'll add more details in the comments; @bnieuwenhuizen / @lostgoat / @hakzsam are well aware of the issue (and affected by it) and can elaborate on the use case for this feature.
EDIT: forgot to mention the full path for the file power_dpm_force_performance_level - thanks @mwen for the heads-up!
Found an interesting data point - a friend has a Rembrandt laptop [0], and judging by the ip_discovery sysfs entries, its gfx_v10 IP block revision seems to be 10,3,3 - very similar to Vangogh's 10,3,1.
When trying to set profile_peak, he instantly got the freeze. On the Deck we can at least set it sometimes (my while test usually survives ~20 seconds at least), but on Rembrandt it seems to freeze the APU right at the first attempt (reproduced every time he tried).
[0] dmesg snippet with version information
[drm] add ip block number 0 <nv_common>
[drm] add ip block number 1 <gmc_v10_0>
[drm] add ip block number 2 <navi10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <smu>
[drm] add ip block number 5 <dm>
[drm] add ip block number 6 <gfx_v10_0>
[drm] add ip block number 7 <sdma_v5_2>
[drm] add ip block number 8 <vcn_v3_0>
[drm] add ip block number 9 <jpeg_v3_0>
amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
amdgpu: ATOM BIOS: 113-REMBRANDT-X35
[...]
[drm] Found VCN firmware Version ENC: 1.21 DEC: 2 VEP: 0 Revision: 10
[...]
[drm] Display Core initialized with v3.2.207!
[drm] DMUB hardware initialized: version=0x04000022
[...]
kfd kfd: amdgpu: added device 1002:1681
[...]
[drm] Initialized amdgpu 3.49.0 20150101 for 0000:05:00.0 on minor 0
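For others who want to compare their graphics IP revision, a rough sketch of reading it from the ip_discovery sysfs entries follows. The directory layout used as the default (die/0/GC/0 with major, minor and revision files) is an assumption on my side and may differ between kernels, and `gc_ip_version` is just a made-up helper name.

```shell
#!/bin/sh
# Print the graphics IP version as major.minor.revision.
# ASSUMPTION: the default directory below matches the ip_discovery layout on
# your kernel; pass another directory as the first argument if it does not.
gc_ip_version() {
    dir=${1:-/sys/class/drm/card0/device/ip_discovery/die/0/GC/0}
    if [ ! -r "$dir/major" ]; then
        echo "no IP discovery entry at $dir" >&2
        return 1
    fi
    printf '%s.%s.%s\n' \
        "$(cat "$dir/major")" "$(cat "$dir/minor")" "$(cat "$dir/revision")"
}

# Example: gc_ip_version
```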
So, through some voodoo magic experiments I was able to change the code in a way that my test is now reliable on the Deck and I don't see freezes anymore. What is more difficult is to explain both why this seems to fix the issue and whether this change modifies the "behavior" of the profile_peak setting, especially in relation to the graphics profiling mentioned by @themaister .
So, I'd like to ask the involved folks here if you could test it and see if it helps (and doesn't break the "semantics" of the profile_peak thing). After that, we could discuss with AMD in case the change really seems to help.
In order to test it, one just needs to grab the tarball at https://people.igalia.com/gpiccoli/gitlab_issue_2545/ and also the "install_kernel.sh" script - this script just unpacks the tarball, puts the kernel/modules in the right places, creates the initramfs image for the Deck and finally updates grub. This kernel is based on amd-staging-drm-next, head at 21c10a781572 - notice we needed to apply the mailing-list patches 0001 and 0002 on top of it (both present in the link above) to be able to boot the kernel, due to a bug related to a recent change (gfxhub/mmhub new layout stuff). My patch is also present in the link above as 0003.
Once this kernel is booted, one can check if it's running the modified amdgpu by checking "dmesg | grep gitlab", which should output 2 messages. For reference I've also included the amdgpu.ko without my change in the tarball, if users wish to test it to reproduce the issue, just play with the ko files on /lib/modules/6.1.11-gpiccoli-amdnext/kernel/drivers/gpu/drm/amd/amdgpu.
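The dmesg check can be wrapped in a tiny helper if you want to script it. This is only a sketch: `has_patched_amdgpu` is a made-up name, and the threshold of 2 simply mirrors the two messages mentioned above.

```shell
#!/bin/sh
# Reads kernel log text on stdin and succeeds if at least two lines
# contain "gitlab" (the marker printed by the modified amdgpu module).
has_patched_amdgpu() {
    # grep -c exits non-zero on zero matches but still prints 0.
    count=$(grep -c gitlab || true)
    [ "$count" -ge 2 ]
}

# Example:
# dmesg | has_patched_amdgpu && echo "modified amdgpu is loaded"
```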
Finally, worth mentioning that I've tested it with the while loop reproducer here playing Cuphead and Streets of Rage 4, no issues. But I've noticed a (recovered) GPU reset when running the test while I kept using Gamescope interface to check my game library...lemme know your results, and thanks in advance!
Nice find! I rummaged around a bit more and looks like GFX9 had an interesting pattern here:
When disabling powergating: Disable gfxoff then powergating. When enabling powergating: enable powergating then gfxoff.
I applied that to GFX10 and I can run the repro script for 10 minutes now (stopped for other reasons, not for triggering), and it doesn't look like we need the duration change. Did you need it for yours?
(FWIW I'm testing on a slightly older amdgpu-staging-drm-next, but it has the issue without fix too)
Nice Bas, thanks! My patch worked as is, but I can try yours and see if I still face the eventual GPU resets when testing with Gamescope..will get back to you soon =)
The patch makes sense to me based on my understanding of how the hardware works. Accessing gfx registers when gfxoff is active can cause a hang. I suspect it sort of works today based on a race between the firmware (dynamically dis/enabling gfxoff) and software. Can you send the patch to amd-gfx?
Hi @bnieuwenhuizen , I've tested that and found an interesting behavior.
First of all, your patch seems correct...not much different than mine (except for the delay thing, that I'm not convinced is required at all), but with the bonus that it's based on some already existing code (gfx_v9)!
Your patch survived the tests of having a game running or keeping the Deck idle just fine. The problem is that when I kept messing around in the Gamescope UI, I hit a non-recoverable reset with your patch more consistently than with mine (which most of the time was recoverable and took longer to appear), see the attached dmesg. I also tested with the delay + your patch, same results...
I'm not really sure what the best approach is or if this reset is really related to that, I'll defer the decision to @agd5f / AMD folks of course. Also, notice we should maybe fix the same gfx_off/powergating call pattern on gfx11, as in the attached diff - it does make sense, right?
If you end up sending your patch to the ML, you can add:
I just asked the team to look into it and they were not able to repro the issue on a steamdeck. Interesting find about the gfxoff and powergating. I'll ask the power team for more clarification.
Thanks Alex! I've noticed the reproducer works way better if we "use" the GPU while the while loop runs (really weird English sentence heh).
So, maybe play a GPU-bound game or even keep playing in the UI, changing menus, etc...at least on my Deck, with the while loop above and messing with the UI I can reliably repro in ~30 sec.
Also, if you have a Rembrandt APU around, as I mentioned before, seems it just freezes at the first write to the dpm file...