Setting performance mode to "profile_peak" freezes the full APU (Vangogh)

This is critically important for Radeon Graphics Profiler.


FWIW, I didn't see any problem when running this command on a Ryzen 5000 series (Cezanne), so I think it's something specific to Vangogh.

(BTW, sorry if I got anything wrong.)

Found an interesting data point: a friend has a Rembrandt laptop [0] and, judging by the ip_discovery sysfs entries, his gfx_v10 IP block revision is 10,3,3, which is very similar to Vangogh's 10,3,1.
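For anyone who wants to do the same check, here's a minimal sketch of reading that version from sysfs. The directory layout below (`ip_discovery/die/0/GC/0/` holding `major`, `minor` and `revision` files with decimal values) is an assumption about how recent amdgpu kernels expose it; the card index, die number and instance may differ on your machine.

```python
from pathlib import Path

# Assumed location of the graphics core (GC) entry; adjust card/die/instance as needed.
DEFAULT_GC_DIR = Path("/sys/class/drm/card0/device/ip_discovery/die/0/GC/0")


def read_ip_version(ip_dir):
    """Return (major, minor, revision) read from one ip_discovery instance directory."""
    ip_dir = Path(ip_dir)
    # Each attribute is assumed to be a small decimal number followed by a newline.
    return tuple(
        int((ip_dir / name).read_text().strip())
        for name in ("major", "minor", "revision")
    )


if __name__ == "__main__":
    if DEFAULT_GC_DIR.is_dir():
        print("GC version: %d,%d,%d" % read_ip_version(DEFAULT_GC_DIR))
    else:
        print(f"{DEFAULT_GC_DIR} not found; ip_discovery may live elsewhere on this system")
```

Reading these files is harmless, unlike writing to the dpm file.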

When he tried to set profile_peak, he got the freeze instantly. On the Deck we can at least set it sometimes (my while-loop test usually survives ~20 seconds at least), but on Rembrandt it seems to freeze the APU on the very first attempt (reproduced every time he tried).

[0] dmesg snippet with version information

[drm] add ip block number 0 <nv_common>
[drm] add ip block number 1 <gmc_v10_0>
[drm] add ip block number 2 <navi10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <smu>
[drm] add ip block number 5 <dm>
[drm] add ip block number 6 <gfx_v10_0>
[drm] add ip block number 7 <sdma_v5_2>
[drm] add ip block number 8 <vcn_v3_0>
[drm] add ip block number 9 <jpeg_v3_0>
amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
amdgpu: ATOM BIOS: 113-REMBRANDT-X35
[...]
[drm] Found VCN firmware Version ENC: 1.21 DEC: 2 VEP: 0 Revision: 10
[...]
[drm] Display Core initialized with v3.2.207!
[drm] DMUB hardware initialized: version=0x04000022
[...]
kfd kfd: amdgpu: added device 1002:1681
[...]
[drm] Initialized amdgpu 3.49.0 20150101 for 0000:05:00.0 on minor 0
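If you want to compare the IP block list between two machines, a small sketch like the following can pull it out of dmesg text. It only relies on the "add ip block number N <name>" line format shown in the snippet above; the function name and the sample string are mine, just for illustration.

```python
import re

# Matches lines such as: [drm] add ip block number 6 <gfx_v10_0>
IP_BLOCK_RE = re.compile(r"add ip block number (\d+) <([^>]+)>")


def parse_ip_blocks(dmesg_text):
    """Return {block_number: block_name} for every 'add ip block' line found."""
    return {int(num): name for num, name in IP_BLOCK_RE.findall(dmesg_text)}


if __name__ == "__main__":
    # Two lines copied from the snippet above, used as sample input.
    sample = (
        "[drm] add ip block number 6 <gfx_v10_0>\n"
        "[drm] add ip block number 7 <sdma_v5_2>\n"
    )
    for number, name in sorted(parse_ip_blocks(sample).items()):
        print(number, name)
```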

So, through some ~~voodoo magic~~ experiments I was able to change the code in a way that makes my test reliable on the Deck, and I don't see freezes anymore. What's more difficult is explaining why this seems to fix the issue, and whether the change modifies the "behavior" of the profile_peak setting, especially with regard to the graphics profiling mentioned by @themaister.

So, I'd like to ask the folks involved here to test it and see if it helps (and doesn't break the "semantics" of the profile_peak thing). After that, we could discuss it with AMD, in case the change really seems to help.

To test it, one just needs to grab the tarball at https://people.igalia.com/gpiccoli/gitlab_issue_2545/ and also the "install_kernel.sh" script. The script unpacks the tarball, puts the kernel/modules in the right places, creates the initramfs image for the Deck and finally updates grub. This kernel is based on amd-staging-drm-next, head at 21c10a781572. Notice we needed to apply the mailing-list patches 0001 and 0002 on top of it (both present in the link above) to be able to boot the kernel, due to a bug related to a recent change (gfxhub/mmhub new layout stuff). My patch is also present in the link above as 0003.

Once this kernel is booted, one can check whether it's running the modified amdgpu with "dmesg | grep gitlab", which should output 2 messages. For reference, I've also included the amdgpu.ko without my change in the tarball; if users wish to test it to reproduce the issue, just swap the ko files in /lib/modules/6.1.11-gpiccoli-amdnext/kernel/drivers/gpu/drm/amd/amdgpu.

Finally, worth mentioning that I've tested it with the while loop reproducer here while playing Cuphead and Streets of Rage 4, with no issues. But I did notice a (recovered) GPU reset when running the test while I kept using the Gamescope interface to browse my game library... lemme know your results, and thanks in advance!

Nice find! I rummaged around a bit more and it looks like GFX9 had an interesting pattern here:

When disabling powergating: disable gfxoff, then powergating. When enabling powergating: enable powergating, then gfxoff.
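To make the ordering concrete, here's a toy model of it. This is not the driver code: `FakeGfx`, `set_gfxoff` and `set_powergating` are hypothetical names that only record the order of the calls.

```python
class FakeGfx:
    """Toy stand-in for the GPU (hypothetical); it only records the call order."""

    def __init__(self):
        self.calls = []

    def set_gfxoff(self, enable):
        self.calls.append(("gfxoff", enable))

    def set_powergating(self, enable):
        self.calls.append(("powergating", enable))


def set_powergating_state(gfx, enable):
    """Apply the ordering described above for GFX9."""
    if enable:
        # Enabling: powergating first, gfxoff last.
        gfx.set_powergating(True)
        gfx.set_gfxoff(True)
    else:
        # Disabling: gfxoff first, so it is already off while powergating is changed.
        gfx.set_gfxoff(False)
        gfx.set_powergating(False)


if __name__ == "__main__":
    gfx = FakeGfx()
    set_powergating_state(gfx, False)
    set_powergating_state(gfx, True)
    print(gfx.calls)
```

The point is just that gfxoff is the first thing switched off and the last thing switched back on.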

I applied that to GFX10 and I can now run the repro script for 10 minutes (I stopped for other reasons, not because it triggered), and it doesn't look like we need the duration change. Did you need it for yours?

(FWIW I'm testing on a slightly older amd-staging-drm-next, but it has the issue without the fix too)

0001-drm-amdgpu-Disable-gfxoff-before-disabling-powergati.patch

Nice Bas, thanks! My patch worked as is, but I can try yours and see if I still hit the occasional GPU resets when testing with Gamescope. Will get back to you soon =)

The patch makes sense to me based on my understanding of how the hardware works. Accessing gfx registers when gfxoff is active can cause a hang. I suspect it sort of works today because of a race between the firmware (dynamically dis/enabling gfxoff) and software. Can you send the patch to amd-gfx?

Hi @bnieuwenhuizen, I've tested that and found some interesting behavior.

First of all, your patch seems correct. It's not much different from mine (except for the delay thing, which I'm not convinced is required at all), with the bonus that it's based on already existing code (gfx_v9)!

Your patch survived the tests with a game running or with the Deck kept idle just fine. The problem is that when I kept messing around in the Gamescope UI, I hit a non-recoverable reset with your patch more consistently than with mine (where the reset was recoverable most of the time and took longer to appear); see the attached dmesg. I also tested the delay + your patch, with the same results...

I'm not really sure what the best approach is, or whether this reset is really related to that; I'll defer the decision to @agd5f / AMD folks of course. Also, notice we should maybe fix the same gfx_off/powergating call pattern on gfx11, as in the attached diff. It does make sense, right?

If you end up sending your patch to the ML, you can add:

Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

Thanks!

dmesg-bas.txt

0001-drm-amdgpu-gfx11-Adjust-gfxoff-before-powergating-on.patch

Patches sent: https://lore.kernel.org/amd-gfx/20230509164947.455753-1-bas@basnieuwenhuizen.nl/

Thanks a lot Bas!

I just asked the team to look into it and they were not able to repro the issue on a Steam Deck. Interesting find about the gfxoff and powergating. I'll ask the power team for more clarification.

Thanks Alex! I've noticed the reproducer works way better if we "use" the GPU while the while runs (really weird English sentence heh).

So, maybe play a GPU-bound game or even keep playing with the UI, changing menus, etc. At least on my Deck, with the while loop above and messing with the UI, I can reliably repro in ~30 sec.

Also, if you have a Rembrandt APU around, as I mentioned before, it seems to just freeze at the first write to the dpm file...
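For reference, a rough sketch of that kind of loop. The exact reproducer isn't quoted in this part of the thread, so the levels it alternates between ("profile_peak" and "auto"), the delay and the card index are all assumptions on my side. The sysfs file is amdgpu's power_dpm_force_performance_level.

```python
import time
from pathlib import Path

# Assumed path; the card index may differ on your machine.
DPM_FILE = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")


def toggle_levels(path, levels=("profile_peak", "auto"), iterations=100, delay=0.1):
    """Write each level to `path` in turn, `iterations` times; return the write count."""
    writes = 0
    for _ in range(iterations):
        for level in levels:
            # Same effect as `echo <level> > file` in a shell loop.
            Path(path).write_text(level + "\n")
            writes += 1
            time.sleep(delay)
    return writes
```

Calling `toggle_levels(DPM_FILE)` as root on an affected kernel may hang the whole APU, so only run it on a machine you can hard-reset, ideally while something is using the GPU as described above.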

mentioned in commit agd5f/linux@8173cab3

mentioned in commit gfx-ci/linux@4177bdc9

mentioned in commit superm1/linux@abfe2ffc

added hang/freeze label

closed

mentioned in commit nouveau@c971ca2b
