100% reproducible kernel v6.6 amdgpu driver crash with amdgpu.mcbp=1

mentioned in issue #2830 (closed)

Additional info: the crash never occurs with kernel 6.4.14 but occurs with kernels 6.5.x and 6.6.0

So I'm seeing these UBSAN errors; they seem to be a false alarm:

ubsan.txt

@JiadongZhu can you take a look? 6.6 picked up that other fix (mentioned in #2830 (closed)).

Hi @Sfinx, could you try disabling gpu_recovery and collecting the fence info when the crash happens?

sudo vim /etc/modprobe.d/amdgpu.conf
add the line 'options amdgpu lockup_timeout=-1,-1'
sudo update-initramfs -u -k $(uname -r)
sudo reboot
when the crash happens, run: cat /sys/kernel/debug/dri/0/amdgpu_fence_info
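
The collection step can be scripted so the fence info lands in a file before the machine locks up completely. This is only a minimal sketch: the function name `snapshot_fence_info` and the default output directory are assumptions made for illustration, not part of any amdgpu tooling. It needs root, because debugfs is normally readable only by root.

```shell
#!/usr/bin/env bash
# Sketch: copy amdgpu_fence_info into a timestamped file.
# Arg 1: debugfs DRI directory (defaults to the path quoted in this thread).
# Arg 2: output directory (hypothetical default, pick any writable location).
snapshot_fence_info() {
    local dri_dir="${1:-/sys/kernel/debug/dri/0}"
    local out_dir="${2:-$HOME/amdgpu-hang}"
    local stamp out
    stamp="$(date +%Y%m%d-%H%M%S)"
    out="$out_dir/amdgpu_fence_info-$stamp.txt"
    mkdir -p "$out_dir" || return 1
    cat "$dri_dir/amdgpu_fence_info" > "$out" || return 1
    sync    # flush to disk in case the box freezes right afterwards
    printf '%s\n' "$out"    # print the path of the snapshot
}
```

Run it from a root shell (for example over ssh) once the hang has started: `snapshot_fence_info` with no arguments uses the defaults above.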

btw, could you show me the mesa version on your machine with "glxinfo|grep version"? Thanks.

Well, now it totally freezes Xorg: I can't even switch to a terminal, and SAK does not work. Files attached.

amdgpu_fence_info.txt

I should add that I have "amdgpu.audio=0 amdgpu.runpm=0 amdgpu.aspm=0 amdgpu.bapm=0 pcie_aspm=off" on the kernel command line and "options amdgpu si_support=1 cik_support=1 lockup_timeout=-1,-1" now in /etc/modprobe.d/amdgpu.conf.
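
To confirm which of these options the loaded driver actually picked up, the values can be read back from sysfs. A minimal sketch follows, assuming the parameters are exposed under /sys/module/amdgpu/parameters (parameters registered without read permission will not appear there); the helper name `show_amdgpu_params` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: print "name=value" for each requested amdgpu module parameter.
# Arg 1: parameters directory ("" selects the usual sysfs location).
# Remaining args: parameter names to look up.
show_amdgpu_params() {
    local dir="${1:-/sys/module/amdgpu/parameters}"
    shift
    local name
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=<not exposed>\n' "$name"
        fi
    done
}
```

For example, `show_amdgpu_params "" mcbp lockup_timeout runpm aspm` prints the values the running module uses, which may differ from what the config files request if the initramfs was not regenerated.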

You can use ssh to log in to the target machine. When the screen is frozen, the console on the client still works.

From amdgpu_fence_info.txt it looks like no mcbp was triggered when the hang happened. Could you show me the steps to reproduce the crash?

glxinfo|grep version
server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL core profile shading language version string: 4.60
OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL shading language version string: 4.60
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
    GL_EXT_shader_implicit_conversions, GL_EXT_shader_integer_mix,

This board keeps saying that my bug report is detected as spam...

dmesg-freeze.txt

ssh does not work either: it just hangs, even though the box still responds to ping. The fast, 100% reliable reproduction steps are simple:

sudo apt install lact
lact

No need to launch its daemon or set a profile: just launch lact from userspace. Sometimes it needs to be run 2-3 times to crash, but usually the crash happens on the first start

apt install cannot find lact. I installed lact-0.4.5-0.amd64.ubuntu-2204.deb from github. When launched, a dialog shows "Could not connect to daemon, running in embedded mode". The other information looks good. I am using Renoir gfx9. I will get a vega10 card to try.

The box crashes without the above too, but it takes several hours to trigger.

Yep, you can download any version from github; v0.4.5 triggers the crash too, right at this first dialog, with no need to press anything afterwards.

Have you tried a newer mesa version (e.g. 23.2.1)? This could also be a mesa bug.

I do not think this is Mesa related, because:

the crash occurs not in the Mesa layer but in the amdgpu driver (see the attached dmesg)
I never use Mesa or any 3D app on this box; 99% of my app set is just Gmail (browser) and a terminal
the bug definitely disappears with amdgpu.mcbp=0 for kernels > 6.4 and never appears with kernel 6.4.14 with the default kernel cmdline. This means that your logic is broken here: amdgpu.mcbp is a parameter of the amdgpu kernel driver, not of Mesa. The same goes for the kernel version. If the bug were in the Mesa layer, you would have to point to Mesa parameters, not kernel driver ones. This is official Ubuntu 22.04, and there is no "newer mesa version" available except the most recent one appropriate for this distribution, which I think was wisely selected by its maintainers

Anyway, I use neither Mesa nor 3D apps on this box. The reproduction steps are described above and they are pretty clear: the crash occurs right after reading some driver settings from userspace, without any complex 3D app running. On this box (an Acer laptop with a brain-dead BIOS) the crash occurs every time lact is started. I can supply any additional info if needed.

Xorg and wayland use mesa libs nowadays.

Yep, but the crash occurs not in the Mesa layer but in the amdgpu driver. Second: the crash never occurred with kernel 6.4.14 and the same Mesa setup. Simple logic tells me that the bug still exists at the kernel driver layer.

Actually Canonical has rolled out newer mesa version in the life of Ubuntu 22.04. So it's very possible you tested with a different mesa version back when you tested 6.4.y.

apt update tells me different:

root@Shiva:~# date
Tue Nov  7 16:09:17 EET 2023
root@Shiva:~# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@Shiva:~# apt update
Hit:1 https://brave-browser-apt-release.s3.brave.com stable InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://us.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://us.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 http://us.archive.ubuntu.com/ubuntu jammy-proposed InRelease
Hit:7 https://deb.nodesource.com/node_21.x nodistro InRelease
Hit:8 http://dl.google.com/linux/chrome/deb stable InRelease
Hit:9 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:10 http://ports.ubuntu.com/ubuntu-ports jammy InRelease
Get:11 https://dl.yarnpkg.com/debian stable InRelease [17,1 kB]
Hit:12 http://ports.ubuntu.com/ubuntu-ports jammy-updates InRelease
Hit:14 https://ppa.launchpadcontent.net/aglasgall/pipewire-extra-bt-codecs/ubuntu jammy InRelease
Hit:15 http://ports.ubuntu.com/ubuntu-ports jammy-backports InRelease
Hit:16 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:18 http://ports.ubuntu.com/ubuntu-ports jammy-security InRelease
Hit:21 http://ports.ubuntu.com/ubuntu-ports jammy-proposed InRelease
Get:25 https://repo.charm.sh/apt * InRelease
Fetched 23,7 kB in 15s (1.564 B/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.

https://launchpad.net/ubuntu/+source/mesa is the right place to look. You can see mesa in the Jammy release vs Jammy updates pocket.

Anyway, the reason this is relevant is that the mesa libraries are used to queue up jobs on the GPU. If the GPU driver enables preemption (which is new in recent kernels), then the mesa libraries can exacerbate the issue. So they go hand in hand.

If this is indeed triggered by the mesa version in Ubuntu, one of the actions to solve it may be to backport patches from newer mesa or upgrade to a newer mesa in Ubuntu.

First: the Mesa libs are not the reason for the crash, as everything definitely works with kernels < 6.5. The kernel is the trigger

Second: The error is not happening at Mesa level. It happens at kernel level

Third: the kernel driver should not crash if the Mesa driver uses some "wrong" access. You must handle this. Moreover, this "wrong" access is not precisely described and is just hidden for some reason unknown to me

GPU state is too complex for the kernel driver to validate every possible valid or invalid combination of state.

First: the Mesa libs are not the reason for the crash, as everything definitely works with kernels < 6.5. The kernel is the trigger

Can you bisect the kernel?

BTW: is Mesa software bound to the GPU driver? Is there a Mesa for AMD that must be compatible with a particular kernel driver?

I have no desire to bisect the kernel. 6.6 does not work without amdgpu.mcbp=0, so bisect that

With command "sudo lact", I could reproduce the hang on vega10(gfx900) with default installed Mesa 23.0.4-0ubuntu1~22.04.1.

When I installed our internal driver amdgpu-build=1681769 (mesa version 23.3.0-devel), the hang disappeared. The hang is related to some preamble packages added for mcbp.

I have not found which patch fixed it. Maybe @mareko could give us some clues. Thanks.

@spasswolf might have bisected it (if it's the same issue). I put together a PPA on 23.0.4 + that patch to test and see.

First: the Mesa libs are not the reason for the crash, as everything definitely works with kernels < 6.5. The kernel is the trigger

Like I said it's a newer feature in the kernel, but mesa and the kernel work together. The newer feature in the kernel can expose a mesa bug which can cause a GPU fault and manifest as a kernel hang. It's unfortunate but bugs do happen.

I'm pretty happy with the amdgpu.mcbp=0 solution. I just thought you would be interested in the real reason. You can close it - seems like old h/w must die according to AMD's decision.

BTW: lact is not linked with Mesa:

ldd `which lact`|grep -i mesa

ldd `which lact`|grep -i GL
libglib-2.0.so.0 => /lib/x86_64-linux-gnu/libglib-2.0.so.0 (0x00007f0cbcfc5000)
libwayland-egl.so.1 => /lib/x86_64-linux-gnu/libwayland-egl.so.1 (0x00007f0cbdbf2000)

My box is running Xorg and not Wayland

I installed mesa-23.0.4 (compiled from mesa git) on my debian sid and I get a hang at startup with kernel 6.5.9 if I use amdgpu.vm_debug=1 and amdgpu.mcbp=1; mesa-23.2.1 works fine on kernel 6.5.9.

If I use linux-next-20231108 mesa-23.0.4 works even if I use both amdgpu.vm_debug=1 and amdgpu.mcbp=1.

And this patch fixes mesa-23.0.4 when I run linux-6.5.9 with vm_debug=1 and mcbp=1:

commit bc07b1a0bf22054c9a683a43e9f7f7632446431f
Author: Yogesh Mohan Marimuthu <yogesh.mohanmarimuthu@amd.com>
Date:   Fri Jan 20 12:29:00 2023 +0530

    radeonsi: remove some shadow reg optimization for bf1 game
    
    This patch removes below shadow reg optimization. This is done for
    Vega64 battlefield 1 crash when shadow regs enabled.
    
      + reset only dirty states with buffers in si_pm4_reset_emitted()
      + various draw states in si_begin_new_gfx_cs()
    
    v2: remove first_cs parameter from si_pm4_reset_emitted() (Marek Olšák)
    
    Signed-off-by: Yogesh Mohan Marimuthu <yogesh.mohanmarimuthu@amd.com>
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18301>

diff --git a/src/gallium/drivers/radeonsi/si_gfx_cs.c b/src/gallium/drivers/radeonsi/si_gfx_cs.c
index 0c48240a5f0..0a48aee6cef 100644
--- a/src/gallium/drivers/radeonsi/si_gfx_cs.c
+++ b/src/gallium/drivers/radeonsi/si_gfx_cs.c
@@ -415,10 +415,8 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
 
    si_add_all_descriptors_to_bo_list(ctx);
 
-   if (first_cs || !ctx->shadowed_regs) {
-      si_shader_pointers_mark_dirty(ctx);
-      ctx->cs_shader_state.initialized = false;
-   }
+   si_shader_pointers_mark_dirty(ctx);
+   ctx->cs_shader_state.initialized = false;
 
    if (!ctx->has_graphics) {
       ctx->initial_gfx_cs_size = ctx->gfx_cs.current.cdw;
@@ -434,7 +432,7 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
    /* set all valid group as dirty so they get reemited on
     * next draw command
     */
-   si_pm4_reset_emitted(ctx, first_cs);
+   si_pm4_reset_emitted(ctx);
 
    /* The CS initialization should be emitted before everything else. */
    if (ctx->cs_preamble_state) {
@@ -460,7 +458,7 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
 
    /* CLEAR_STATE disables all colorbuffers, so only enable bound ones. */
    bool has_clear_state = ctx->screen->info.has_clear_state;
-   if (has_clear_state || ctx->shadowed_regs) {
+   if (has_clear_state) {
       ctx->framebuffer.dirty_cbufs =
             u_bit_consecutive(0, ctx->framebuffer.state.nr_cbufs);
       /* CLEAR_STATE disables the zbuffer, so only enable it if it's bound. */
@@ -508,22 +506,6 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
       si_mark_atom_dirty(ctx, &ctx->atoms.s.scissors);
       si_mark_atom_dirty(ctx, &ctx->atoms.s.viewports);
 
-      /* Invalidate various draw states so that they are emitted before
-       * the first draw call. */
-      si_invalidate_draw_constants(ctx);
-      ctx->last_index_size = -1;
-      ctx->last_primitive_restart_en = -1;
-      ctx->last_restart_index = SI_RESTART_INDEX_UNKNOWN;
-      ctx->last_prim = -1;
-      ctx->last_multi_vgt_param = -1;
-      ctx->last_vs_state = ~0;
-      ctx->last_gs_state = ~0;
-      ctx->last_ls = NULL;
-      ctx->last_tcs = NULL;
-      ctx->last_tes_sh_base = -1;
-      ctx->last_num_tcs_input_cp = -1;
-      ctx->last_ls_hs_config = -1; /* impossible value */
-
       if (has_clear_state) {
          si_set_tracked_regs_to_clear_state(ctx);
       } else {
@@ -536,6 +518,22 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
       memset(ctx->tracked_regs.spi_ps_input_cntl, 0xff, sizeof(uint32_t) * 32);
    }
 
+   /* Invalidate various draw states so that they are emitted before
+    * the first draw call. */
+   si_invalidate_draw_constants(ctx);
+   ctx->last_index_size = -1;
+   ctx->last_primitive_restart_en = -1;
+   ctx->last_restart_index = SI_RESTART_INDEX_UNKNOWN;
+   ctx->last_prim = -1;
+   ctx->last_multi_vgt_param = -1;
+   ctx->last_vs_state = ~0;
+   ctx->last_gs_state = ~0;
+   ctx->last_ls = NULL;
+   ctx->last_tcs = NULL;
+   ctx->last_tes_sh_base = -1;
+   ctx->last_num_tcs_input_cp = -1;
+   ctx->last_ls_hs_config = -1; /* impossible value */
+
    if (ctx->scratch_buffer) {
       si_context_add_resource_size(ctx, &ctx->scratch_buffer->b.b);
       si_mark_atom_dirty(ctx, &ctx->atoms.s.scratch_state);
diff --git a/src/gallium/drivers/radeonsi/si_pm4.c b/src/gallium/drivers/radeonsi/si_pm4.c
index f8454cd302c..280125b6511 100644
--- a/src/gallium/drivers/radeonsi/si_pm4.c
+++ b/src/gallium/drivers/radeonsi/si_pm4.c
@@ -151,23 +151,8 @@ void si_pm4_emit(struct si_context *sctx, struct si_pm4_state *state)
       state->atom.emit(sctx);
 }
 
-void si_pm4_reset_emitted(struct si_context *sctx, bool first_cs)
+void si_pm4_reset_emitted(struct si_context *sctx)
 {
-   if (!first_cs && sctx->shadowed_regs) {
-      /* Only dirty states that contain buffers, so that they are
-       * added to the buffer list on the next draw call.
-       */
-      for (unsigned i = 0; i < SI_NUM_STATES; i++) {
-         struct si_pm4_state *state = sctx->queued.array[i];
-
-         if (state && state->is_shader) {
-            sctx->emitted.array[i] = NULL;
-            sctx->dirty_states |= 1 << i;
-         }
-      }
-      return;
-   }
-
    memset(&sctx->emitted, 0, sizeof(sctx->emitted));
 
    for (unsigned i = 0; i < SI_NUM_STATES; i++) {
diff --git a/src/gallium/drivers/radeonsi/si_pm4.h b/src/gallium/drivers/radeonsi/si_pm4.h
index 4d1770a96d8..486b627d540 100644
--- a/src/gallium/drivers/radeonsi/si_pm4.h
+++ b/src/gallium/drivers/radeonsi/si_pm4.h
@@ -70,7 +70,7 @@ void si_pm4_clear_state(struct si_pm4_state *state);
 void si_pm4_free_state(struct si_context *sctx, struct si_pm4_state *state, unsigned idx);
 
 void si_pm4_emit(struct si_context *sctx, struct si_pm4_state *state);
-void si_pm4_reset_emitted(struct si_context *sctx, bool first_cs);
+void si_pm4_reset_emitted(struct si_context *sctx);
 
 #ifdef __cplusplus
 }

I actually found it some time ago when I tried to figure out a similar error in mesa-22.3.6 (debian bookworm), but didn't report it because a) I was switching to debian sid and b) the error (at least for me) needed both amdgpu.vm_debug=1 (which I assumed next to nobody would use) and amdgpu.mcbp=1 (which is the default since 6.5).

I'm not really sure if my issue is related to the topic of this post. mesa-23.0.4 works (for me) with linux-next-20231108 because the amdgpu.vm_debug parameter no longer exists.

To see if it's the same issue you had, here's a mesa 23.0.4 build with just that patch added. @Sfinx can you please test this PPA? Add it to your system and upgrade to it. If the issue goes away we can start a conversation with Canonical on pulling this patch into Ubuntu.

sudo add-apt-repository ppa:superm1/gitlab2971
sudo apt upgrade
sudo reboot

If it doesn't fix the issue then you can revert back to the version in Ubuntu using the ppa-purge package.

sudo apt install ppa-purge
sudo ppa-purge ppa:superm1/gitlab2971

added hang/freeze label

bc07b1a0bf22054c9a683a43e9f7f7632446431f is most likely the fix. I would say that the kernel shouldn't have enabled mcbp by default.

Hi, is there a way to set "amdgpu.mcbp=0" in GRUB (Fedora)?

What Fedora version do you have? Can you please confirm your mesa version in Fedora? It would be better to pull in the correct fix if necessary.

If you just want to disable mcbp as a check, though, you can use grubby (look it up).
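
For reference, a minimal sketch of the grubby approach on Fedora. `--update-kernel=ALL` and `--args` are existing grubby options; the helper names `disable_mcbp` and `cmdline_has_arg` are made up for illustration, and the check assumes the parameter appears as a separate space-delimited token in /proc/cmdline.

```shell
#!/usr/bin/env bash
# Sketch: add amdgpu.mcbp=0 to the boot entries of all installed kernels.
disable_mcbp() {
    sudo grubby --update-kernel=ALL --args="amdgpu.mcbp=0"
}

# Succeed if ARG is present on a kernel command line.
# Arg 1: the argument to look for, e.g. amdgpu.mcbp=0
# Arg 2: file holding the command line (defaults to /proc/cmdline)
cmdline_has_arg() {
    local arg="$1" file="${2:-/proc/cmdline}"
    tr ' ' '\n' < "$file" | grep -qxF -- "$arg"
}

# Usage:
#   disable_mcbp && sudo reboot
#   # after the reboot:
#   cmdline_has_arg amdgpu.mcbp=0 && echo "mcbp is disabled on the command line"
```

To undo the change later, grubby's `--remove-args` option takes the same argument string.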

mesa driver 23.2.1-2

Fedora 39. Do you think it is better for me to wait until Fedora updates the mesa package with the fix (bc07b1a0bf22054c9a683a43e9f7f7632446431f)?

The patch (bc07b1a0...) is already in mesa-23.2.1, so you probably have a different problem.

Hmm, I see. I will update first just to check, and if the problem remains I will report it in a new bug report.

Note: if the problem is fixed, why is this issue still open?

Note: if the problem is fixed, why is this issue still open?

The reporter hasn't confirmed that the updated mesa version fixed it or not for them.

So there is a possibility that my problem is the same as the one in this report.

A small update on my problem:

It seems to be a gnome problem (more information here: https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/7034)

but I'm not 100% sure since there are lines in the journal that indicate problems with amdgpu (these appear after the gnome error. So could they be a symptom of the gnome error and not the cause?):

nov 15 23:42:38 fedora-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=6217419, emitted seq=6217421
nov 15 23:42:38 fedora-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Infused-Win64-S pid 191376 thread dxvk-submit pid 1>
nov 15 23:42:38 fedora-pc kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
nov 15 23:42:47 fedora-pc kernel: amdgpu_cs_ioctl: 8 callbacks suppressed
nov 15 23:42:47 fedora-pc kernel: [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Failed to initialize parser -125!
nov 15 23:42:59 fedora-pc kernel: [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 20secs aborting
nov 15 23:42:59 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C9A0 (len 75, WS 0, PS 0) @ 0xC9CF
nov 15 23:43:19 fedora-pc kernel: [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 20secs aborting
nov 15 23:43:19 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C9EC (len 62, WS 0, PS 0) @ 0xCA08
nov 15 23:43:19 fedora-pc kernel: amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
nov 15 23:43:19 fedora-pc kernel: amdgpu 0000:01:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:01:00.0
nov 15 23:43:39 fedora-pc kernel: [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 20secs aborting
nov 15 23:43:39 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C9EC (len 62, WS 0, PS 0) @ 0xCA08
nov 15 23:43:39 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing B694 (len 286, WS 4, PS 0) @ 0xB789
nov 15 23:43:39 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing B5EE (len 78, WS 12, PS 8) @ 0xB5F6
nov 15 23:43:39 fedora-pc kernel: amdgpu 0000:01:00.0: amdgpu: asic atom init failed!
nov 15 23:43:59 fedora-pc kernel: [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 20secs aborting
nov 15 23:43:59 fedora-pc kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing C9EC (len 62, WS 0, PS 0) @ 0xCA08
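
When sorting out which message came first, it can help to reduce the kernel log to just the ring timeout and reset lines. The filter below is a rough sketch: the function name `filter_amdgpu_hang_lines` and the regular expression are assumptions based only on the message formats quoted in this thread, so other kernel versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Sketch: keep only amdgpu log lines that report a ring timeout
# or a GPU/ASIC reset. Reads log text on stdin.
filter_amdgpu_hang_lines() {
    grep -E 'amdgpu.*(ring [[:alnum:]_.]+ timeout|GPU reset|ASIC reset)'
}

# Usage (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | filter_amdgpu_hang_lines
```

Comparing the timestamp of the first line it prints against the first gnome-shell error in the same journal shows which one happened earlier.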

The reporter hasn't confirmed that the updated mesa version fixed it or not for them.

Closing the issue since, as already said above, I'm happy with amdgpu.mcbp=0. I think it is stupid to make mcbp=1 the default value if you can't fix the bugs at the driver level. Sure, you can always open an issue with the mesa team about the crashing kernel driver

closed

mentioned in issue #3006

think it is stupid to make mcbp=1 the default value if you can't fix the bugs at the driver level

JFYI, the default policy was changed back.

агинь ;)

mentioned in issue mesa/mesa#11866 (closed)

mentioned in issue mesa/mesa#11508

mentioned in issue mesa/mesa#12235 (closed)

mentioned in issue mesa/mesa#10883 (closed)
