Unrecoverable GPU crash when playing a 4K H.265 HDR10+ video

Switching from vo=gpu-next to vo=gpu seems to work around the issue. Unfortunately, gpu-next is required for some videos to play back correctly.

I spoke too soon. Changing vo=gpu-next to vo=gpu does not really fix the problem. What does help is switching the video profile from profile=gpu-hq to profile=default. gpu-hq enables some additional, more expensive postprocessing filters etc., but in any case it shouldn't crash the GPU.

I'm seeing a very similar problem, though in my case I can't seem to trigger the issue on demand.

A few reports were submitted to the Red Hat Bugzilla, and they seem EXTREMELY similar to this:

https://bugzilla.redhat.com/show_bug.cgi?id=2192072

FYI for me sometimes it completely hard-freezes the computer, and sometimes it makes the whole computer act really strangely.

Not an AMD dev, but judging by the backtrace: do you have any runtime power management tweaks enabled? I recall that Fedora comes with the tuned service, which controls PM. Any chance you can disable it or force it not to apply any power-saving features?

I just checked, tuned is not currently installed on my system (sudo dnf list installed | grep tuned).

added VCN label

This commit should help some of the backtrace: https://patchwork.freedesktop.org/patch/535084/

Here is a log with 6.3.1 and the patch applied. I left the computer running for a while once it got into the compromised state to make sure that at least something makes it into the log.

crash_long.log

Are you sure it's applied? That should have cleared up the gmc_v9_0 irq_put warning (the others are separate).

Nonetheless, the warnings are red herrings: they only appear because a GPU reset is attempted and fails. The real problem is the first ring timeout:

kvě 04 11:05:59 Sad-Silke kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=2303, emitted seq=2304
kvě 04 11:05:59 Sad-Silke kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 969 thread kwin_wayla:cs0 pid 1020

amdgpu tries to recover from this by doing a GPU reset, but the reset fails. Can you see if you can still reproduce this with amd_iommu=off?
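
For anyone triaging similar logs, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of such a timeout line. The function name and the regular expression are illustrative assumptions modelled only on the two lines quoted above; they are not part of any amdgpu tooling, and other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the "ring ... timeout" line quoted above (an assumption).
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("signaled")), int(m.group("emitted"))

line = ("kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high "
        "timeout, signaled seq=2303, emitted seq=2304")
ring, signaled, emitted = parse_ring_timeout(line)
# emitted - signaled is the number of fences emitted but not yet signaled.
print(ring, emitted - signaled)
```

For the line above this reports the gfx_high ring with one outstanding fence.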

My apologies. I got distracted and didn't modify the build script correctly. This time I made sure that the patch got applied. FTR, the relevant part of the code now looks like this:


static int gmc_v9_0_hw_fini(void *handle)
{
	struct amdgpu_device *adev = (struct amdgpu_device *)handle;

	gmc_v9_0_gart_disable(adev);

	if (amdgpu_sriov_vf(adev)) {
		/* full access mode, so don't touch any GMC register */
		DRM_DEBUG("For SRIOV client, shouldn't do anything.\n");
		return 0;
	}

	/*
	 * Pair the operations did in gmc_v9_0_hw_init and thus maintain
	 * a correct cached state for GMC. Otherwise, the "gate" again
	 * operation on S3 resuming will fail due to wrong cached state.
	 */
	if (adev->mmhub.funcs->update_power_gating)
		adev->mmhub.funcs->update_power_gating(adev, false);

	amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);

	return 0;
}

I assume this is correct...?

I ran the test again with the patch applied and amd_iommu=off in the kernel command line. The GPU still gets stuck in an unrecoverable state and I seem to be getting the same warnings as before.

[    0.003333] kernel: Kernel command line: cryptdevice=UUID=ec0ca2d0-9e69-4dcb-a0b8-ec7a4021b7d0:cryptroot root=/dev/mapper/cryptroot rw initrd=\amd-ucode.img initrd=\initramfs-linux.img amd_iommu=off
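
A quick way to double-check that a parameter such as amd_iommu=off really reached the kernel is to parse the command line. The following is a minimal Python sketch; the helper name is made up for illustration, and it deliberately ignores quoted values containing spaces. On a running system the same string can be read from /proc/cmdline.

```python
def cmdline_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification: repeated names (e.g. two initrd= entries) keep the last one.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

# Shortened copy of the command line from the log above.
cmdline = "root=/dev/mapper/cryptroot rw initrd=\\amd-ucode.img amd_iommu=off"
params = cmdline_params(cmdline)
print(params.get("amd_iommu"))
```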

kvě 05 11:42:28 Sad-Silke systemd[965]: Started kitty.
kvě 05 11:42:43 Sad-Silke kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=26896, emitted seq=26898
kvě 05 11:42:43 Sad-Silke kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1028 thread kwin_wayla:cs0 pid 1079
kvě 05 11:42:43 Sad-Silke kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
kvě 05 11:42:43 Sad-Silke kernel: ------------[ cut here ]------------
kvě 05 11:42:43 Sad-Silke kernel: WARNING: CPU: 4 PID: 3083 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0x46/0x70 [amdgpu]
kvě 05 11:42:43 Sad-Silke kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm algif_aead des_generic libdes ecb md4 cmac algif_hash algif_skcipher af_alg bnep snd_acp3x_rn snd_acp3x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci iwlmvm snd_sof_xtensa_dsp amdgpu snd_sof snd_sof_utils mac80211 snd_ctl_led sn>
kvě 05 11:42:43 Sad-Silke kernel:  platform_profile cfg80211 videodev mdio_devres ttm typec_ucsi snd_timer snd_soc_acpi irqbypass videobuf2_common sp5100_tco ipmi_devintf drm_display_helper typec ecdh_generic video rapl psmouse mc crc16 k10temp snd_pci_acp3x i2c_piix4 cec snd ipmi_msghandler rfkill libphy roles soundcore wmi mousedev joydev i2c_scmi acpi_cpufreq serial_multi_>
kvě 05 11:42:43 Sad-Silke kernel: CPU: 4 PID: 3083 Comm: kworker/u32:15 Not tainted 6.3.1-arch1-1 #1 5518c76b0c02ba75077b5b1e47d164164ea6f0b2
kvě 05 11:42:43 Sad-Silke kernel: Hardware name: LENOVO 20UDS02D00/20UDS02D00, BIOS R1BET74W(1.43 ) 03/01/2023
kvě 05 11:42:43 Sad-Silke kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
kvě 05 11:42:43 Sad-Silke kernel: RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
kvě 05 11:42:43 Sad-Silke kernel: Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 43 10 02 f0 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 32 10 02 f0 b8 ea ff ff ff e9 28 10 02 f0
kvě 05 11:42:43 Sad-Silke kernel: RSP: 0018:ffffbb3c88773c98 EFLAGS: 00010246
kvě 05 11:42:43 Sad-Silke kernel: RAX: ffff9eeb6481eee0 RBX: 0000000000000001 RCX: 0000000000000000
kvě 05 11:42:43 Sad-Silke kernel: RDX: 0000000000000000 RSI: ffff9eeb83e903b0 RDI: ffff9eeb83e80000
kvě 05 11:42:43 Sad-Silke kernel: RBP: ffff9eeb83e80000 R08: 000000000003ac80 R09: 0000000000000006
kvě 05 11:42:43 Sad-Silke kernel: R10: ffff9ef24f33bd80 R11: 0000000000000000 R12: ffff9eeb83e903b0
kvě 05 11:42:43 Sad-Silke kernel: R13: ffff9eeb83e989a0 R14: ffff9eec5da91e00 R15: 0000000000000000
kvě 05 11:42:43 Sad-Silke kernel: FS:  0000000000000000(0000) GS:ffff9ef22f900000(0000) knlGS:0000000000000000
kvě 05 11:42:43 Sad-Silke kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kvě 05 11:42:43 Sad-Silke kernel: CR2: 00007f7bf4002008 CR3: 00000004b9a20000 CR4: 0000000000350ee0
kvě 05 11:42:43 Sad-Silke kernel: Call Trace:
kvě 05 11:42:43 Sad-Silke kernel:  <TASK>
kvě 05 11:42:43 Sad-Silke kernel:  sdma_v4_0_hw_fini+0x3c/0xa0 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  amdgpu_device_ip_suspend_phase2+0x107/0x1a0 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  ? amdgpu_device_ip_suspend_phase1+0x71/0xe0 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  amdgpu_device_ip_suspend+0x36/0x70 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  amdgpu_device_pre_asic_reset+0xd3/0x2b0 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  amdgpu_device_gpu_recover+0x4c7/0xd60 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  amdgpu_job_timedout+0x18d/0x240 [amdgpu 6adf2ae3ca1229f157d814b809abc4af30950e21]
kvě 05 11:42:43 Sad-Silke kernel:  ? lock_timer_base+0x61/0x80
kvě 05 11:42:43 Sad-Silke kernel:  drm_sched_job_timedout+0x7a/0x110 [gpu_sched 08d5485d7bd381678bad83fecab17b2b94c10c1b]
kvě 05 11:42:43 Sad-Silke kernel:  process_one_work+0x1c7/0x3d0
kvě 05 11:42:43 Sad-Silke kernel:  worker_thread+0x51/0x390
kvě 05 11:42:43 Sad-Silke kernel:  ? __pfx_worker_thread+0x10/0x10
kvě 05 11:42:43 Sad-Silke kernel:  kthread+0xde/0x110
kvě 05 11:42:43 Sad-Silke kernel:  ? __pfx_kthread+0x10/0x10
kvě 05 11:42:43 Sad-Silke kernel:  ret_from_fork+0x2c/0x50
kvě 05 11:42:43 Sad-Silke kernel:  </TASK>
kvě 05 11:42:43 Sad-Silke kernel: ---[ end trace 0000000000000000 ]---
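
To see at a glance which hw_fini call raised the amdgpu_irq_put warning, the reliable frames of the call trace can be listed. This is a minimal Python sketch that assumes the journal format shown above; the regular expression and the function name are illustrative and not part of any kernel tooling. Frames prefixed with "?" are ones the unwinder could not verify, so they are skipped.

```python
import re

# A frame looks like "func+0xOFF/0xLEN"; a leading "? " marks an unverified frame.
FRAME = re.compile(
    r"kernel:\s+(?P<guess>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def reliable_frames(lines):
    """Return function names between <TASK> and </TASK>, skipping '?' frames."""
    frames = []
    in_trace = False
    for line in lines:
        if "<TASK>" in line:
            in_trace = True
        elif "</TASK>" in line:
            in_trace = False
        elif in_trace:
            m = FRAME.search(line)
            if m and not m.group("guess"):
                frames.append(m.group("func"))
    return frames

# A few lines taken from the trace above (date prefix and module hash dropped).
trace = [
    "Sad-Silke kernel: Call Trace:",
    "Sad-Silke kernel:  <TASK>",
    "Sad-Silke kernel:  sdma_v4_0_hw_fini+0x3c/0xa0 [amdgpu]",
    "Sad-Silke kernel:  amdgpu_device_ip_suspend_phase2+0x107/0x1a0 [amdgpu]",
    "Sad-Silke kernel:  ? amdgpu_device_ip_suspend_phase1+0x71/0xe0 [amdgpu]",
    "Sad-Silke kernel:  amdgpu_device_ip_suspend+0x36/0x70 [amdgpu]",
    "Sad-Silke kernel:  </TASK>",
]
print(reliable_frames(trace))
```

In this log the first reliable frame is sdma_v4_0_hw_fini, so the remaining warning comes from the SDMA teardown path rather than from gmc_v9_0_hw_fini.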

Furthermore, it seems that reducing the fanciness of mpv's postprocessing filters greatly reduces the likelihood of this crash, but it doesn't completely eliminate it.

> I assume this is correct...?

Yeah that looks right.

> Furthermore, it seems that reducing the fanciness of mpv's postprocessing filters greatly reduces the likelihood of this crash, but it doesn't completely eliminate it.

The other SDMA warning is known and mentioned in a few issues. There isn't a patch for it yet, but as I mentioned before, it's a red herring.

As you mentioned, this seems to be a new issue with 6.2. Is it possible that you also happened to upgrade the GPU firmware around the time it showed up? If so, could you revert to older GPU firmware to see if it improves?

Is the GPU firmware a part of what Arch Linux ships as the linux-firmware (https://archlinux.org/packages/core/any/linux-firmware/) package? The last update was on 23-04-04, apparently. I can roll back to 23-03-10 and see how that goes.

Mind you, I don't have the same GPU as OP, but I get a very similar crash, and for me this has also started occurring recently (I'm guessing around the release of 6.2 on Fedora).

I manually built an older version of the FW package (20230310). This time I had to play the test video twice to get the GPU to crash. I'm not sure if that's significant because there is no specific point in the video that would trigger the crash so maybe it was just a fluke. Assuming that I'm doing the right thing should I go back to an even older FW?

It might not be caused by the firmware, but as both of your distros track latest kernel and latest firmware it is worth trying to identify which one caused it.

If you're sure it's the kernel and not the GPU firmware, can either of you possibly bisect back to a point where it was stable to identify the first problematic commit?

I'll see what I can do over the weekend. TBH I'm not sure if 6.2 is really the first problematic kernel because playing super high quality videos is not something I'd regularly do. If it helps, disabling VAAPI decoding in MPV had no effect, dialing down the postprocessing quality did. I'll get back to you if I manage to come up with something bisecting. Thanks!

@madcatx1 I'll let you try it once as you seem to be able to reproduce the problem easily. In my case, I gotta let the system run for a while and it happens randomly (As far as I can see).

I'll try to play with it too this weekend though, to see if I can reproduce it more frequently/accurately.

One more piece of information before I attempt to bisect this. I've reverted my system to a "standard" configuration - meaning the latest available firmware package and no additional kernel options - and tried the following:

To make sure that this is indeed a regression that can be tracked down within a reasonable span of kernel versions, I switched to kernel 6.1 which Arch Linux conveniently packages as linux-lts. The exact version was 6.1.27. My test video played fine 3 times in a row. Then I booted back into 6.3.1, expecting a problem. Naturally, there was no crash and everything seemed to work okay. Then I realized that I probably had Firefox running every time I experienced the crash. The moment I launched Firefox while the video was still playing, the GPU froze. I've tried quite hard to reproduce this under 6.1 but 6.1 seems okay. The gpu-hq profile MPV setting was probably a false clue. This is all happening under Wayland if that makes any difference.

We have a winner. I used Greg's repository as the source of stable kernels (https://github.com/gregkh/linux/) and went over all commits between 6.1.9 and 6.2.0. Since that would be almost 17k commits, I narrowed down the scope only to commits that mentioned "drm/amd" in the commit message. I eventually bisected the problem to this particular change:

3f4c175d62d89819121cbbd5a0a30f4b80862025 drm/amdgpu: MCBP based on DRM scheduler (v9)
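
To give a feel for the effort involved, here is a small Python sketch of the arithmetic behind a bisection and of the kind of subject filter described above. It is only an illustration: the helper names are made up, two of the three sample subjects are hypothetical, and a real git bisect may need a step more or less than the estimate.

```python
import math

def bisect_steps(n_commits):
    """Approximate number of builds a binary search over n_commits needs."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    return math.ceil(math.log2(n_commits))

def mentioning(subjects, needle="drm/amd"):
    """Keep only commit subjects that contain the given substring."""
    return [s for s in subjects if needle in s]

subjects = [
    "drm/amdgpu: MCBP based on DRM scheduler (v9)",  # from the thread
    "net: example change",                            # hypothetical
    "drm/amd/display: example change",                # hypothetical
]
print(bisect_steps(17000))        # whole 6.1.9..6.2.0 range, about 17k commits
print(len(mentioning(subjects)))  # subjects that survive the filter
```

Even the unfiltered range would take only about 15 test builds; filtering by subject mainly keeps every tested commit relevant to the driver.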

Then I speculated that the preemption logic introduced by this commit could issue instructions to the GPU in some invalid order. I applied this patch:

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
index 62079f0e3ee8..4063fac6e85f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
@@ -28,7 +28,7 @@
 #include "amdgpu.h"
 
 #define AMDGPU_MUX_RESUBMIT_JIFFIES_TIMEOUT (HZ / 2)
-#define AMDGPU_MAX_LAST_UNSIGNALED_THRESHOLD_US 10000
+#define AMDGPU_MAX_LAST_UNSIGNALED_THRESHOLD_US 200000
 
 static const struct ring_info {
        unsigned int hw_pio;

on top of 6.3.1 and tried to reproduce the problem. As far as I can tell, none of my test cases can crash the GPU any longer with this haxx in place.

FTR, I attached my "log" of the bisection progress: all_drm_amd_commits.txt

@farchord If my understanding is correct, then this issue is specific to the GFX9 class of chips, but your 6600M is a GFX10. Therefore, I think that you're dealing with a different problem.

The preemption discards low-priority IBs (in this case the hardware-accelerated video) when a high-priority IB comes in, then resubmits the skipped IBs. It looks like the preemption/resubmission breaks some dependency in the video IB lists.

@MadCatX The workaround disables the preemption. We hope to find the guilty IB packages. I am trying to reproduce it on my side, but I cannot download the video file from mega.nz. Is there any other video that could be used to reproduce the hang?

Would adding an option to Mesa to not create any high-priority context effectively disable preemption?

If yes I could implement it to be able to easily disable mcbp.

It's nice to have the option as we could disable mcbp easily when the hang happens.

@JiadongZhu Thanks for looking into this. I figured that setting the threshold high enough would effectively disable the preemption.

I think that any high bitrate video would do because I can reproduce the problem with multiple video files. I'll try to get another shareable video for you, in the meantime, here is my setup that can reproduce the problem with almost 100 % reliability.

mpv player (0.35.1) set as follows: vo=gpu-next profile=gpu-hq gpu-context=wayland hwdec=auto

KDE 5.27.4, KWin_wayland

2 screens, internal laptop screen and an external screen, both FHD. (I think I could repro this even with just the laptop screen.)

To repro, I:

1. Start the video.
2. Launch Firefox and flick through the tabs a bit.
3. If the GPU doesn't freeze, close Firefox and launch it again.
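
The player settings listed above are written as mpv configuration keys. As a minimal sketch, here is how they could be turned into an equivalent command line for a scripted repro. The video file name is a placeholder and the helper name is made up for illustration; the sketch only builds the command and does not launch anything.

```python
import shlex

# Settings quoted from the setup description above.
SETTINGS = {
    "vo": "gpu-next",
    "profile": "gpu-hq",
    "gpu-context": "wayland",
    "hwdec": "auto",
}

def mpv_command(video_path, settings=SETTINGS):
    """Build an mpv argv list with each setting passed as --key=value."""
    return ["mpv"] + [f"--{k}={v}" for k, v in settings.items()] + [video_path]

# "test-video.mkv" is a hypothetical file name.
print(shlex.join(mpv_command("test-video.mkv")))
```

The resulting list can be handed to subprocess.run() as is, which avoids shell quoting problems with unusual file names.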

Let me know if you need any more help.

> I am trying to reproduce it on my side but I cannot download the video file on mega.nz Is there any other video could be used to reproduce the hang?

> Let me know if you need any more help.

Can you post the file somewhere else perhaps that can be accessible? AMD I/T blocks mega.nz

https://drive.google.com/file/d/1w8e24mX3jGstmgf8c5Id7hSc6pzk96vy/view?usp=sharing

I set it to be viewable by anyone, so you shouldn't need to login.

Thanks!

No problem. I can't reliably repro this on my system as it's not using the same GPU as OP, but I'm kinda hoping that you guys fixing this fixes it for me too (fingers crossed).

mentioned in issue #2544

This issue is firmware related. After a preemption, the KMD resets the preempt register with a write_data command, and the CP waits for mmCP_VMID_PREEMPT to read all zeros before it finishes the preemption. Sometimes the write_data command does not take effect. The hang shows up when preemption happens more frequently.

root@amd-Majolica-RN:~# umr -r renoir.gfx930.mmCP_VMID_PREEMPT
gfx930.mmCP_VMID_PREEMPT => 0x0000ffff

The MEC version on Ubuntu 23.04 is 1d4:

root@amd-Majolica-RN:/home/strix# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info |grep MEC
MEC feature version: 53, firmware version: 0x000001d4

You could try an older version of mec.bin (version 1d0 works on my side).
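
For checking which MEC firmware a machine is running, here is a minimal Python sketch that parses a line in the format shown above. The helper name is made up for illustration, and the sample text is copied from this thread rather than read from debugfs (reading /sys/kernel/debug/dri/0/amdgpu_firmware_info normally requires root).

```python
import re

# Matches the "MEC feature version" line only, not other entries such as MEC2.
MEC_LINE = re.compile(
    r"^MEC feature version: (\d+), firmware version: (0x[0-9a-fA-F]+)",
    re.MULTILINE,
)

def mec_version(firmware_info):
    """Return (feature_version, firmware_version) or None if not found."""
    m = MEC_LINE.search(firmware_info)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16)

sample = "MEC feature version: 53, firmware version: 0x000001d4\n"
feature, firmware = mec_version(sample)
# 0x1d4 is the version reported as hanging in this thread; 0x1d0 as working.
print(feature, hex(firmware), firmware == 0x1D4)
```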

I had the latest linux-firmware package installed, which contains Renoir MEC firmware 1d4. I went back to an unpatched kernel and 1d0 MEC firmware (appears to be from Feb 2022) and the issue seems to be gone. Hopefully you can sort this out with the firmware guys.

Thanks a lot for figuring this out so quickly!

The firmware issue is a timing issue which could be solved by this patch: https://patchwork.freedesktop.org/series/118260/ You could try the patch built in together with the latest firmware.

Thanks, I'll give it a go!

The patch needed a trivial backport for 6.3.7.

 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index c54d05bdc2d8..e90d04b78a7d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5364,10 +5364,6 @@ static int gfx_v9_0_ring_preempt_ib(struct amdgpu_ring *ring)
        amdgpu_ring_alloc(ring, 13);
        gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
                                 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC | AMDGPU_FENCE_FLAG_INT);
-       /*reset the CP_VMID_PREEMPT after trailing fence*/
-       amdgpu_ring_emit_wreg(ring,
-                             SOC15_REG_OFFSET(GC, 0, mmCP_VMID_PREEMPT),
-                             0x0);
 
        /* assert IB preemption, emit the trailing fence */
        kiq->pmf->kiq_unmap_queues(kiq_ring, ring, PREEMPT_QUEUES_NO_UNMAP,
@@ -5389,7 +5385,10 @@ static int gfx_v9_0_ring_preempt_ib(struct amdgpu_ring *ring)
                r = -EINVAL;
                DRM_WARN("ring %d timeout to preempt ib\n", ring->idx);
        }
-
+       /*reset the CP_VMID_PREEMPT after trailing fence*/
+       amdgpu_ring_emit_wreg(ring,
+                             SOC15_REG_OFFSET(GC, 0, mmCP_VMID_PREEMPT),
+                             0x0);
        amdgpu_ring_commit(ring);
 
        /* deassert preemption condition */

I ran a few tests and things are looking good so far!

I'd really like to also check this one but I can't get this patch applied:

[root@4de1bbf3d81e linux-6.3.7]# patch -p1 < drm_Reset_CP_VMID_PREEMPT_ad.patch 
patching file drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
Hunk #1 FAILED at 5364.
Hunk #2 FAILED at 5389.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c.rej
[root@4de1bbf3d81e linux-6.3.7]# cat drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c.rej
--- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5364,10 +5364,6 @@ static int gfx_v9_0_ring_preempt_ib(struct amdgpu_ring *ring)
        amdgpu_ring_alloc(ring, 13);
        gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
                                 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC | AMDGPU_FENCE_FLAG_INT);
-       /*reset the CP_VMID_PREEMPT after trailing fence*/
-       amdgpu_ring_emit_wreg(ring,
-                             SOC15_REG_OFFSET(GC, 0, mmCP_VMID_PREEMPT),
-                             0x0);
 
        /* assert IB preemption, emit the trailing fence */
        kiq->pmf->kiq_unmap_queues(kiq_ring, ring, PREEMPT_QUEUES_NO_UNMAP,
@@ -5389,7 +5385,10 @@ static int gfx_v9_0_ring_preempt_ib(struct amdgpu_ring *ring)
                r = -EINVAL;
                DRM_WARN("ring %d timeout to preempt ib\n", ring->idx);
        }
-
+       /*reset the CP_VMID_PREEMPT after trailing fence*/
+       amdgpu_ring_emit_wreg(ring,
+                             SOC15_REG_OFFSET(GC, 0, mmCP_VMID_PREEMPT),
+                             0x0);
        amdgpu_ring_commit(ring);
 
        /* deassert preemption condition */

any suggestions @MadCatX?

Are you sure you're patching vanilla 6.3.7 with no other patches applied to gfx_v9_0.c?

Same here, patching fails. I'm using Arch's 6.3.7 sources.

applying patch gpu.patch...
patching file drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
Hunk #1 FAILED at 5364.
Hunk #2 FAILED at 5389.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c.rej
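
The thread does not establish why the hunks were rejected. One cause worth ruling out (an assumption, not something confirmed here) is that a patch copied from a rendered web page has had its tabs turned into spaces, while kernel sources are indented with tabs. The Python sketch below flags diff body lines whose indentation consists of spaces only; the function name is made up for illustration.

```python
def space_indented_lines(diff_text):
    """Return 1-based numbers of diff body lines indented with spaces only."""
    suspicious = []
    in_hunk = False
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if line.startswith("@@"):
            in_hunk = True
            continue
        if not in_hunk or not line or line[0] not in " +-":
            continue
        body = line[1:]  # drop the diff marker (' ', '+' or '-')
        indent = body[: len(body) - len(body.lstrip())]
        if indent and "\t" not in indent:
            suspicious.append(number)
    return suspicious

tabbed = "@@ -1,2 +1,2 @@\n \tamdgpu_ring_commit(ring);\n-\told();\n+\tnew();\n"
spaced = "@@ -1,1 +1,1 @@\n        amdgpu_ring_commit(ring);\n"
print(space_indented_lines(tabbed), space_indented_lines(spaced))
```

If such lines turn up, downloading the raw patch file instead of copying it, or running GNU patch with -l (--ignore-whitespace), may be enough to get the hunks to apply.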

FWIW I'm using this file in my PKGBUILD and it works for me... 0001-Fix-Reset-CP_VMID_PREEMPT-backport-to-6.3.7.patch

Thanks, that patch worked.

+1 that patch can be applied.

[root@4de1bbf3d81e linux-6.3.7]# patch -p1 < 0001-Fix-Reset-CP_VMID_PREEMPT-backport-to-6.3.7.patch 
patching file drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

Kernel is building

34 minutes and no issues so far with that patch. But currently not able to test with my usual setup. Will test longer and report back

Okay seems to work fine and successfully avoids crashing @MadCatX @agd5f @JiadongZhu

Getting the expected log output in dmesg when stressing the APU/'GPU'.

[  +3.437095] [drm] ring 0 timeout to preempt ib
[  +1.724453] [drm] ring 0 timeout to preempt ib
[  +0.667181] [drm] ring 0 timeout to preempt ib
[  +6.961722] [drm] ring 0 timeout to preempt ib
[ +11.361640] [drm] ring 0 timeout to preempt ib
[  +0.167330] [drm] ring 0 timeout to preempt ib
[  +5.683224] [drm] ring 0 timeout to preempt ib
[Jun12 14:25] [drm] ring 0 timeout to preempt ib
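
To quantify how often these messages appear, the relative timestamps can be collected. This is a minimal Python sketch that assumes the delta format shown above, where each "+N" is the time since the previously printed line; the function name is made up for illustration, and lines carrying an absolute timestamp are simply skipped.

```python
import re

# Matches e.g. "[  +3.437095] [drm] ring 0 timeout to preempt ib".
PREEMPT_TIMEOUT = re.compile(
    r"^\[\s*\+(\d+\.\d+)\] \[drm\] ring (\d+) timeout to preempt ib"
)

def preempt_timeout_deltas(lines):
    """Return the '+seconds' deltas of all preempt-timeout messages."""
    deltas = []
    for line in lines:
        m = PREEMPT_TIMEOUT.match(line)
        if m:
            deltas.append(float(m.group(1)))
    return deltas

log = [
    "[  +3.437095] [drm] ring 0 timeout to preempt ib",
    "[  +1.724453] [drm] ring 0 timeout to preempt ib",
    "[  +0.667181] [drm] ring 0 timeout to preempt ib",
    "[Jun12 14:25] [drm] ring 0 timeout to preempt ib",
]
deltas = preempt_timeout_deltas(log)
print(len(deltas), round(sum(deltas), 3))
```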

Currently testing with the 'Cube Diorama' from Blender's test files, using Blender 3.5.1.

The only thing I noticed is pretty hard cursor lagging but I am pretty certain this is due to the integrated APU being completely overloaded

Tested on my usual setup with 2x 4K@60Hz via a USB-C dock: it still works fine and does not crash the system.

The only issue is that the dmesg log is spammed with these drm messages, and every time a new line is written the mouse lags/stutters for a few seconds (likely not caused by the logging but rather by non-functioning preemption).

So the underlying issue must still be fixed via amd-gpu-firmware update by AMD.

"Was nice while it lasted".

Just shy of 72h with zero issues, Parsec crashed my laptop with a similar error:

[ +10.260057] gmc_v9_0_process_interrupt: 553 callbacks suppressed
[  +0.000008] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32770, for process parsecd pid 4859 thread parsecd pid 4890)
[  +0.000010] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000800104008000 from IH client 0x1b (UTCL2)
[  +0.000007] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00601031
[  +0.000003] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  +0.000003] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[...]
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32770, for process parsecd pid 4859 thread parsecd pid 4890)
[  +0.000004] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000080010400c000 from IH client 0x1b (UTCL2)
[  +0.000004] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  +0.000003] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  +0.000001] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  +0.000002] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  +0.002352] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, but soft recovered
[  +0.000958] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, but soft recovered

Though the issue here might be an edge case: changing the resolution of the Parsec host while connected freezes the whole system for a few seconds until Parsec crashes. I had to change the resolution multiple times; after the third time both my screens turned (and stayed) black, and I found myself with the log output above.

Non professional (sorry):

Where do we have to escalate this issue so the firmware bugs will finally be addressed? This has been going on for many months and makes devices so unstable that you can't actually use them productively. I thought the patch was finally a good workaround, but it's only a thin band-aid that falls apart after a few days.

I can't imagine the frustration of most end users who don't even know about this bug and have machines crashing all day long.

The L2 protection fault errors are likely another issue that is not related to IB preemption. There are numerous issues similar to this one already reported here.

You're right @MadCatX, #2627 for example. Hmm, I judged too early because of the *ERROR* ring gfx_low timeout. I will try to follow the more recent issue then, thanks for pointing that out.

At least the fix (patch above) is stable in all the other cases that crashed the system before, which is a huge improvement.

mentioned in issue #2447 (closed)

mentioned in merge request mesa/drm!295 (merged)

mentioned in issue #2574

mentioned in issue #2604

added hang/freeze label

My laptop, which has an AMD Ryzen 7 6800H APU, crashed while the screen was in power-saving mode. I'm running Gentoo with kernel: Linux lenny 6.3.4-gentoo-r1 #1 SMP Mon May 29 07:59:08 PDT 2023 x86_64 AMD Ryzen 7 6800H with Radeon Graphics AuthenticAMD GNU/Linux. There is a built-in Nvidia RTX 3060, which is disabled and occasionally used by QEMU/KVM; it is owned by the kernel driver vfio-pci.

After the crash, X crashes and restarts and everything is okay. I have had something similar while playing windows games on proton on my media PC with my RX6650XT, but I haven't copied syslog, because I thought it was a wine issue until it happened on my laptop.

Also, audio crashes permanently, but I'm not sure that's related.

I'm not sure this post belongs in this thread. If it doesn't, let me know and I'll open a new ticket.

Also, please let me know how I might improve the data for you guys.

Here is the log from /var/log/messages:

Jun  8 07:09:40 lenny kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=444930, emitted seq=444932
Jun  8 07:09:40 lenny kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jun  8 07:09:40 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset begin!
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: MODE2 reset
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset succeeded, trying to resume
Jun  8 07:09:41 lenny kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
Jun  8 07:09:41 lenny kernel: [drm] PSP is resuming...
Jun  8 07:09:41 lenny kernel: [drm] reserve 0xa00000 from 0xf41e000000 for PSP TMR
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SMU is resuming...
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SMU is resumed successfully!
Jun  8 07:09:41 lenny kernel: [drm] DMUB hardware initialized: version=0x0400002E
Jun  8 07:09:41 lenny kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jun  8 07:09:41 lenny kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jun  8 07:09:41 lenny kernel: [drm] JPEG decode initialized successfully.
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: recover vram bo from shadow start
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: recover vram bo from shadow done
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset(1) succeeded!
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103002000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103003000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103033000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103034000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103031000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103032000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103065000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103001000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103000000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103062000 from client 0x1b (UTCL2)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00640051
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:41 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:53 lenny kernel: gmc_v10_0_process_interrupt: 1930 callbacks suppressed
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103009000 from client 0x1b (UTCL2)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00641051
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: TCP (0x8)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x1
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103008000 from client 0x1b (UTCL2)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103009000 from client 0x1b (UTCL2)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:6 pasid:32769, for process Xorg pid 2131 thread Xorg:cs0 pid 2488)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu:   in page starting at address 0x0000800103008000 from client 0x1b (UTCL2)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Jun  8 07:09:53 lenny kernel: amdgpu 0000:35:00.0: amdgpu: \x09 RW: 0x0
Jun  8 07:09:53 lenny kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Jun  8 07:10:03 lenny kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=23929877, emitted seq=23929880
Jun  8 07:10:03 lenny kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2131 thread Xorg:cs0 pid 2488
Jun  8 07:10:03 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset begin!
Jun  8 07:10:03 lenny kernel: amdgpu 0000:35:00.0: amdgpu: MODE2 reset
Jun  8 07:10:03 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset succeeded, trying to resume
Jun  8 07:10:03 lenny kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
Jun  8 07:10:03 lenny kernel: [drm] PSP is resuming...
Jun  8 07:10:03 lenny kernel: [drm] reserve 0xa00000 from 0xf41e000000 for PSP TMR
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SMU is resuming...
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: SMU is resumed successfully!
Jun  8 07:10:04 lenny kernel: [drm] DMUB hardware initialized: version=0x0400002E
Jun  8 07:10:04 lenny kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jun  8 07:10:04 lenny kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jun  8 07:10:04 lenny kernel: [drm] JPEG decode initialized successfully.
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: recover vram bo from shadow start
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: recover vram bo from shadow done
Jun  8 07:10:04 lenny kernel: amdgpu 0000:35:00.0: amdgpu: GPU reset(4) succeeded!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm] Skip scheduling IBs!
Jun  8 07:10:04 lenny kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
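
For anyone reading the fault dumps above: the indented fields under each GCVM_L2_PROTECTION_FAULT_STATUS line are just bit fields of that register value. Below is a minimal Python sketch that decodes the raw value. The bit layout in FIELDS is an assumption on my part, cross-checked only against the values printed in this log (0x00640051 gives PERMISSION_FAULTS 0x5, RW 0x1, client ID 0x0, vmid 6; 0x00641051 gives client ID 0x8); it is not taken from the driver source, and decode_fault_status is a hypothetical helper name.

```python
# Decode a GCVM_L2_PROTECTION_FAULT_STATUS value into its bit fields.
# ASSUMPTION: the (shift, width) layout below is inferred from this log's
# own decoded output, not copied from the amdgpu register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # printed in the log as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict[str, int]:
    """Return each field's value extracted from the raw register value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The three distinct status values that appear in the log above.
    for raw in (0x00640051, 0x00641051, 0x00000000):
        print(f"{raw:#010x}", decode_fault_status(raw))
```

With this layout the only difference between 0x00640051 and 0x00641051 is the client ID field (0x0 for CB/DB versus 0x8 for TCP), and the all-zero status entries decode to all-zero fields, as the log shows.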

IIRC the 6800H comes with a Radeon 680M, which is a different class of chip (your log shows gmc_v10_0, not gmc_v9_0). This issue is specific to gfx9.

Thanks. I'll open another ticket.
