100% reproducible kernel v6.6 amdgpu driver crash with amdgpu.mcbp=1
[ +11.079617] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=92653, emitted seq=92655
[ +0.000347] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 8601 thread gnome-shel:cs0 pid 8607
[ +0.000327] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ +0.264794] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[ +0.028491] amdgpu 0000:09:00.0: amdgpu: MODE1 reset
[ +0.000023] amdgpu 0000:09:00.0: amdgpu: GPU mode1 reset
[ +0.000125] amdgpu 0000:09:00.0: amdgpu: GPU psp mode1 reset
[ +0.486730] acer_wmi: Unknown function number - 8 - 0
[ +0.017000] [drm] psp mode1 reset succeed
[ +0.019563] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
[ +0.000544] [drm] PCIE GART of 512M enabled.
[ +0.000016] [drm] PTB located at 0x000000F400000000
[ +0.000087] [drm] VRAM is lost due to GPU reset!
[ +0.000011] [drm] PSP is resuming...
[ +0.189211] [drm] reserve 0x400000 from 0xf5fec00000 for PSP TMR
[ +0.147594] [drm] kiq ring mec 2 pipe 1 q 0
[ +0.023112] [drm] UVD and UVD ENC initialized successfully.
[ +0.108729] [drm] VCE initialized successfully.
[ +0.000034] amdgpu 0000:09:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[ +0.000023] amdgpu 0000:09:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[ +0.000054] amdgpu 0000:09:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[ +0.000018] amdgpu 0000:09:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[ +0.000017] amdgpu 0000:09:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[ +0.000016] amdgpu 0000:09:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[ +0.000016] amdgpu 0000:09:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[ +0.000016] amdgpu 0000:09:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[ +0.000016] amdgpu 0000:09:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[ +0.000016] amdgpu 0000:09:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
[ +0.000014] amdgpu 0000:09:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
[ +0.000014] amdgpu 0000:09:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 8
[ +0.000014] amdgpu 0000:09:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 8
[ +0.000015] amdgpu 0000:09:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 8
[ +0.003851] amdgpu 0000:09:00.0: amdgpu: recover vram bo from shadow start
[ +0.001305] amdgpu 0000:09:00.0: amdgpu: recover vram bo from shadow done
[ +0.000152] [drm] Skip scheduling IBs!
[ +0.000009] amdgpu 0000:09:00.0: amdgpu: GPU reset(4) succeeded!
[ +0.000023] [drm] Skip scheduling IBs!
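For anyone triaging similar logs, here is a minimal Python sketch that pulls the ring name and fence sequence numbers out of a "ring ... timeout" line like the first one in the dmesg output. The regular expression is modeled only on the line format seen in this report, and the helper name `parse_ring_timeout` is hypothetical. The difference between the two numbers is read here as the count of fences emitted but not yet signaled, which is an interpretation rather than something the log states.

```python
import re

# Pattern modeled on the "ring ... timeout" line in this report's dmesg output.
# Other kernel versions may format the message differently (assumption).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    # pending = fences emitted but not yet signaled (interpretation, see lead-in)
    return match.group("ring"), signaled, emitted, emitted - signaled


sample = (
    "[ +11.079617] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_low timeout, signaled seq=92653, emitted seq=92655"
)
print(parse_ring_timeout(sample))
```

Feeding each line of `dmesg` through this helper and keeping the non-None results gives a quick list of which rings timed out.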
I should add that I currently have "amdgpu.audio=0 amdgpu.runpm=0 amdgpu.aspm=0 amdgpu.bapm=0 pcie_aspm=off" on the kernel command line and "options amdgpu si_support=1 cik_support=1 lockup_timeout=-1,-1" in /etc/modprobe.d/amdgpu.conf.
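When comparing setups between reporters, it can help to collect the amdgpu options from both places into one view. The following is a rough Python sketch that parses the two strings quoted above. The function name `parse_amdgpu_params` is hypothetical, and letting the kernel command line override the modprobe.d value is an assumption of this sketch, not something verified here.

```python
def parse_amdgpu_params(cmdline="", modprobe_line=""):
    """Collect amdgpu parameters from a kernel cmdline and one modprobe.d line."""
    params = {}

    # modprobe.d syntax: "options <module> key=value ..."
    fields = modprobe_line.split()
    if len(fields) >= 2 and fields[0] == "options" and fields[1] == "amdgpu":
        for item in fields[2:]:
            key, sep, value = item.partition("=")
            if sep:
                params[key] = value

    # Kernel command line syntax: "amdgpu.key=value".
    # Applied second so it overrides modprobe.d (assumption of this sketch).
    prefix = "amdgpu."
    for item in cmdline.split():
        if item.startswith(prefix):
            key, sep, value = item[len(prefix):].partition("=")
            if sep:
                params[key] = value

    return params


cmdline = "amdgpu.audio=0 amdgpu.runpm=0 amdgpu.aspm=0 amdgpu.bapm=0 pcie_aspm=off"
modprobe_line = "options amdgpu si_support=1 cik_support=1 lockup_timeout=-1,-1"
print(parse_amdgpu_params(cmdline, modprobe_line))
```

Note that `pcie_aspm=off` is skipped because it is not an amdgpu parameter, and `amdgpu.mcbp` does not appear in the result because the settings quoted above do not set it.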
glxinfo|grep version
server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL core profile shading language version string: 4.60
OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL shading language version string: 4.60
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
    GL_EXT_shader_implicit_conversions, GL_EXT_shader_integer_mix,
ssh does not work either: it just hangs even though the box still responds to ping. The quick, 100% reproducible steps are simple:
sudo apt install lact
lact
There is no need to launch its daemon or set a profile; just launch lact from userspace. Sometimes it takes 2-3 runs to trigger the crash, but usually it happens on the first start.
apt install cannot find lact, so I installed lact-0.4.5-0.amd64.ubuntu-2204.deb from GitHub. When launched, it shows a dialog saying "Could not connect to daemon, running in embedded mode". The other information looks good. I am using Renoir (gfx9). I will get a vega10 card and try again.
The crash occurs not in the Mesa layer but in the amdgpu driver (see attached dmesg).
I never use any Mesa or 3D app on this box; 99% of what I run is just Gmail (in a browser) and a terminal.
The bug definitely disappears with amdgpu.mcbp=0 on kernels > 6.4 and never appears with kernel 6.4.14 with the default kernel cmdline. This means your logic is broken here: amdgpu.mcbp is a kernel amdgpu driver parameter, not a Mesa one. The same goes for the kernel version. If the bug were in the Mesa layer, it would depend on Mesa parameters, not on the kernel driver's. This is official Ubuntu 22.04 and there is no "newer mesa version" available other than the current one appropriate for this distribution, which I think was wisely selected by its maintainers.
Anyway, I use neither Mesa nor 3D apps on this box. The reproduction steps are described above and are pretty clear: the crash occurs right after reading some driver settings from userspace, without any complex 3D app running. On this box (an Acer laptop with a brain-dead BIOS) the crash occurs every time I start lact. I can supply any additional info if needed.
Yep, but the crash occurs not in the Mesa layer but in the amdgpu driver. Second, the crash never occurred with kernel 6.4.14 and the same Mesa setup. Simple logic tells me the bug still exists at the kernel driver layer.
Actually, Canonical has rolled out newer mesa versions during the life of Ubuntu 22.04. So it's very possible you tested with a different mesa version back when you tested 6.4.y.
Anyway, the reason this is relevant is that the mesa libraries are used to queue up jobs on the GPU. If the GPU driver enables preemption (which is new to recent kernels), then the mesa libraries can exacerbate the issue. So they go hand in hand.
If this is indeed triggered by the mesa version in Ubuntu, one of the actions to solve it may be to backport patches from newer mesa or upgrade to a newer mesa in Ubuntu.
First: the Mesa libs are not the reason for the crash, as everything definitely works with kernels < 6.5. The kernel is the trigger.
Second: the error does not happen at the Mesa level. It happens at the kernel level.
Third: the kernel driver should not crash if the Mesa driver makes some "wrong" access. You must handle this. Moreover, this "wrong" access has not been precisely described and is just hidden for some reason unknown to me.
With the command "sudo lact", I could reproduce the hang on vega10 (gfx900) with the default installed Mesa 23.0.4-0ubuntu1~22.04.1.
When I installed our internal driver amdgpu-build=1681769 (mesa version 23.3.0-devel), the hang disappeared. The hang is related to some preamble packages added for mcbp.
I have not found which patch fixed it. Maybe @mareko could give us some clues. Thanks.
@spasswolf might have bisected it (if it's the same issue). I put together a PPA on 23.0.4 + that patch to test and see.
First: the Mesa libs are not the reason for the crash, as everything definitely works with kernels < 6.5. The kernel is the trigger.
Like I said it's a newer feature in the kernel, but mesa and the kernel work together. The newer feature in the kernel can expose a mesa bug which can cause a GPU fault and manifest as a kernel hang. It's unfortunate but bugs do happen.
I'm pretty happy with the amdgpu.mcbp=0 solution. I just thought you would be interested in the real reason. You can close it; it seems old h/w must die according to AMD's decision.
I installed mesa-23.0.4 (compiled from mesa git) on my debian sid and I get a hang at startup with kernel 6.5.9 if I use amdgpu.vm_debug=1 and amdgpu.mcbp=1. mesa-23.2.1 works fine on kernel 6.5.9.
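To confirm that the workaround is actually in effect on a running system, one option is to read the value the loaded module reports. The Python sketch below assumes the parameter is exposed read-only at /sys/module/amdgpu/parameters/mcbp and that values other than 0 and 1 mean the driver picks its own default. Both points are assumptions to verify against your kernel; the helper names are hypothetical.

```python
from pathlib import Path


def read_mcbp(param_path="/sys/module/amdgpu/parameters/mcbp"):
    """Return the mcbp value as an int, or None if the file is not present.

    The sysfs path is an assumption; it exists only when amdgpu is loaded
    and the parameter is exported by the running kernel.
    """
    path = Path(param_path)
    if not path.exists():
        return None
    return int(path.read_text().strip())


def describe_mcbp(value):
    """Map the raw value to a short description (meaning of -1 is assumed)."""
    if value is None:
        return "amdgpu not loaded or parameter not exposed"
    if value == 0:
        return "disabled"
    if value == 1:
        return "enabled"
    return "driver default (auto)"


print(describe_mcbp(read_mcbp()))
```

With amdgpu.mcbp=0 on the kernel command line, the expectation under these assumptions is that the script prints "disabled".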
And this patch fixes mesa-23.0.4 when I run linux-6.5.9 with vm_debug=1 and mcbp=1:
commit bc07b1a0bf22054c9a683a43e9f7f7632446431f
Author: Yogesh Mohan Marimuthu <yogesh.mohanmarimuthu@amd.com>
Date:   Fri Jan 20 12:29:00 2023 +0530

    radeonsi: remove some shadow reg optimization for bf1 game

    This patch removes below shadow reg optimization. This is done for
    Vega64 battlefield 1 crash when shadow regs enabled.
      + reset only dirty states with buffers in si_pm4_reset_emitted()
      + various draw states in si_begin_new_gfx_cs()

    v2: remove first_cs parameter from si_pm4_reset_emitted() (Marek Olšák)

    Signed-off-by: Yogesh Mohan Marimuthu <yogesh.mohanmarimuthu@amd.com>
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18301>

diff --git a/src/gallium/drivers/radeonsi/si_gfx_cs.c b/src/gallium/drivers/radeonsi/si_gfx_cs.c
index 0c48240a5f0..0a48aee6cef 100644
--- a/src/gallium/drivers/radeonsi/si_gfx_cs.c
+++ b/src/gallium/drivers/radeonsi/si_gfx_cs.c
@@ -415,10 +415,8 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
 
    si_add_all_descriptors_to_bo_list(ctx);
 
-   if (first_cs || !ctx->shadowed_regs) {
-      si_shader_pointers_mark_dirty(ctx);
-      ctx->cs_shader_state.initialized = false;
-   }
+   si_shader_pointers_mark_dirty(ctx);
+   ctx->cs_shader_state.initialized = false;
 
    if (!ctx->has_graphics) {
       ctx->initial_gfx_cs_size = ctx->gfx_cs.current.cdw;
@@ -434,7 +432,7 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
    /* set all valid group as dirty so they get reemited on
     * next draw command
     */
-   si_pm4_reset_emitted(ctx, first_cs);
+   si_pm4_reset_emitted(ctx);
 
    /* The CS initialization should be emitted before everything else. */
    if (ctx->cs_preamble_state) {
@@ -460,7 +458,7 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
 
    /* CLEAR_STATE disables all colorbuffers, so only enable bound ones. */
    bool has_clear_state = ctx->screen->info.has_clear_state;
-   if (has_clear_state || ctx->shadowed_regs) {
+   if (has_clear_state) {
       ctx->framebuffer.dirty_cbufs =
          u_bit_consecutive(0, ctx->framebuffer.state.nr_cbufs);
       /* CLEAR_STATE disables the zbuffer, so only enable it if it's bound. */
@@ -508,22 +506,6 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
    si_mark_atom_dirty(ctx, &ctx->atoms.s.scissors);
    si_mark_atom_dirty(ctx, &ctx->atoms.s.viewports);
 
-   /* Invalidate various draw states so that they are emitted before
-    * the first draw call. */
-   si_invalidate_draw_constants(ctx);
-   ctx->last_index_size = -1;
-   ctx->last_primitive_restart_en = -1;
-   ctx->last_restart_index = SI_RESTART_INDEX_UNKNOWN;
-   ctx->last_prim = -1;
-   ctx->last_multi_vgt_param = -1;
-   ctx->last_vs_state = ~0;
-   ctx->last_gs_state = ~0;
-   ctx->last_ls = NULL;
-   ctx->last_tcs = NULL;
-   ctx->last_tes_sh_base = -1;
-   ctx->last_num_tcs_input_cp = -1;
-   ctx->last_ls_hs_config = -1; /* impossible value */
-
    if (has_clear_state) {
       si_set_tracked_regs_to_clear_state(ctx);
    } else {
@@ -536,6 +518,22 @@ void si_begin_new_gfx_cs(struct si_context *ctx, bool first_cs)
       memset(ctx->tracked_regs.spi_ps_input_cntl, 0xff, sizeof(uint32_t) * 32);
    }
 
+   /* Invalidate various draw states so that they are emitted before
+    * the first draw call. */
+   si_invalidate_draw_constants(ctx);
+   ctx->last_index_size = -1;
+   ctx->last_primitive_restart_en = -1;
+   ctx->last_restart_index = SI_RESTART_INDEX_UNKNOWN;
+   ctx->last_prim = -1;
+   ctx->last_multi_vgt_param = -1;
+   ctx->last_vs_state = ~0;
+   ctx->last_gs_state = ~0;
+   ctx->last_ls = NULL;
+   ctx->last_tcs = NULL;
+   ctx->last_tes_sh_base = -1;
+   ctx->last_num_tcs_input_cp = -1;
+   ctx->last_ls_hs_config = -1; /* impossible value */
+
    if (ctx->scratch_buffer) {
       si_context_add_resource_size(ctx, &ctx->scratch_buffer->b.b);
       si_mark_atom_dirty(ctx, &ctx->atoms.s.scratch_state);
diff --git a/src/gallium/drivers/radeonsi/si_pm4.c b/src/gallium/drivers/radeonsi/si_pm4.c
index f8454cd302c..280125b6511 100644
--- a/src/gallium/drivers/radeonsi/si_pm4.c
+++ b/src/gallium/drivers/radeonsi/si_pm4.c
@@ -151,23 +151,8 @@ void si_pm4_emit(struct si_context *sctx, struct si_pm4_state *state)
       state->atom.emit(sctx);
 }
 
-void si_pm4_reset_emitted(struct si_context *sctx, bool first_cs)
+void si_pm4_reset_emitted(struct si_context *sctx)
 {
-   if (!first_cs && sctx->shadowed_regs) {
-      /* Only dirty states that contain buffers, so that they are
-       * added to the buffer list on the next draw call.
-       */
-      for (unsigned i = 0; i < SI_NUM_STATES; i++) {
-         struct si_pm4_state *state = sctx->queued.array[i];
-
-         if (state && state->is_shader) {
-            sctx->emitted.array[i] = NULL;
-            sctx->dirty_states |= 1 << i;
-         }
-      }
-      return;
-   }
-
    memset(&sctx->emitted, 0, sizeof(sctx->emitted));
 
    for (unsigned i = 0; i < SI_NUM_STATES; i++) {
diff --git a/src/gallium/drivers/radeonsi/si_pm4.h b/src/gallium/drivers/radeonsi/si_pm4.h
index 4d1770a96d8..486b627d540 100644
--- a/src/gallium/drivers/radeonsi/si_pm4.h
+++ b/src/gallium/drivers/radeonsi/si_pm4.h
@@ -70,7 +70,7 @@ void si_pm4_clear_state(struct si_pm4_state *state);
 void si_pm4_free_state(struct si_context *sctx, struct si_pm4_state *state, unsigned idx);
 
 void si_pm4_emit(struct si_context *sctx, struct si_pm4_state *state);
-void si_pm4_reset_emitted(struct si_context *sctx, bool first_cs);
+void si_pm4_reset_emitted(struct si_context *sctx);
 
 #ifdef __cplusplus
 }
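To make the effect of the si_pm4_reset_emitted() change easier to see, here is a toy Python model of which state slots get marked dirty before and after the patch. It is a sketch for illustration only, not the driver's code. The dictionaries stand in for struct si_pm4_state, and the behavior of the loop body after the memset (not shown in the diff) is assumed to mark every queued state dirty.

```python
def reset_emitted_old(queued, first_cs, shadowed_regs):
    """Pre-patch model: with shadowed registers on a non-first CS,
    only queued shader states are marked dirty (the removed fast path)."""
    if not first_cs and shadowed_regs:
        return {i for i, s in enumerate(queued) if s is not None and s["is_shader"]}
    # Fallback path. Assumption: every queued state is marked dirty.
    return {i for i, s in enumerate(queued) if s is not None}


def reset_emitted_new(queued):
    """Post-patch model: the fast path is gone, so the same reset always runs.
    Assumption: every queued state is marked dirty."""
    return {i for i, s in enumerate(queued) if s is not None}


# Hypothetical queue: slots 0 and 3 hold non-shader states, slot 1 a shader, slot 2 is empty.
queued = [{"is_shader": False}, {"is_shader": True}, None, {"is_shader": False}]

old_dirty = reset_emitted_old(queued, first_cs=False, shadowed_regs=True)
new_dirty = reset_emitted_new(queued)
print("old:", sorted(old_dirty))
print("new:", sorted(new_dirty))
print("re-emitted only after the patch:", sorted(new_dirty - old_dirty))
```

Under these assumptions, the non-shader states in slots 0 and 3 are skipped by the old fast path and re-emitted after the patch. That matches the commit message's description of removing the "reset only dirty states with buffers" optimization.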
I actually found it some time ago when I tried to figure out a similar error in mesa-22.3.6 (debian bookworm), but didn't report it because a) I was switching to debian sid and b) the error (at least for me) needed both amdgpu.vm_debug=1 (which I assumed next to nobody would use) and amdgpu.mcbp=1 (which is the default since 6.5).
I'm not really sure if my issue is related to the topic of this post. mesa-23.0.4 works (for me) with linux-next-20231108 because the amdgpu.vm_debug parameter no longer exists.
To see if it's the same issue you had, here's a mesa 23.0.4 build with just that patch added.
@Sfinx can you please test this PPA? Add it to your system and upgrade to it. If the issue goes away we can start a conversation with Canonical on pulling this patch into Ubuntu.
but I'm not 100% sure, since there are lines in the journal that indicate problems with amdgpu (these appear after the gnome error, so could they be a symptom of the gnome error and not the cause?):
The reporter hasn't confirmed whether the updated mesa version fixed it for them.
Closing the issue since, as already said above, I'm happy with amdgpu.mcbp=0. I think it is stupid to make mcbp=1 the default if you can't fix the bugs at the driver level. Sure, you can always open an issue with the mesa team about the crashing kernel driver.