Due to an influx of spam, we have had to impose restrictions on new accounts. Please see this wiki page for instructions on how to get full permissions. Sorry for the inconvenience.
Admin message
Equinix is shutting down its operations with us on April 30, 2025. They have graciously supported us for almost 5 years, but all good things come to an end.
Given the time frame, it's going to be hard to make a smooth transition of the cluster to somewhere else (TBD). Please expect in the next months some hiccups in the service and probably at least a full week of downtime to transfer gitlab to a different place.
NIR splits outputs and inputs in nir_lower_io_to_scalar_early() and unfortunately for FS it means increased number of instructions and increased register pressure.
We need to recombine inputs from scalar to vectors.
Right, I also meet this problem after 18.0 rebase. But due to want to focus on kernel, I haven't solve it from the root. Now I'm doing 18.1 rebase, so will cover these NIR changes in a better way.
In fact, even we don't recombine scalar to vector, ppir should not fail, but only generate longer code. This "regalloc fail" is indeed the ppir need to implement reg spill when out of regs.
As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.
I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.
From what I understand there are two issues,
recombine scalars to vec4 to restore the earlier behavior and
implement register spilling to avoid this kind of problem in the future
Is that correct?
Anybody working in any of these?
Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?
I think you are right. 2 is needed anyway for correctness and 1 is for better performance. But 1 needs to be done also because I wrote ppir for vec4, not for scalar, some refine or recombine may be needed for correctness.
I spent some time trying to reproduce this to see for myself what happens, but I am still unable to reproduce it in lima-18.1.
I tried a lot of things, including creating dozens of varyings with random calculations, with different data sizes, texture lookups, variables with long live range, several variables with random conditional statements, but I was not able to reproduce a single "ppir: regalloc fail".
Can you and/or someone else still reproduce it, maybe there is a difference in our setup?
Below is what I get with kmscube -M rgba on lima-18.1, running on A20.
I know that kmscube -M rgba is not the main problem here, I'm still working on the real issues from the earlier comment.
Just want to check if kmscube -M rgba has actually been fixed by other changes in the latest branch, or there is a difference in our setup that we need to figure out.
Do you have arm64 board? I meet the "kmscube -M rgba" "ppir: regalloc fail" problem only on arm64 board and randomly. I can try to reproduce it to see any difference with yours.
Seems the input is same for two cases. Maybe it's due to the code bug in ppir regalloc appears randomly or the mesa RA random behavior like use pointer hash. BTW. when running some glmark2 tests, I get ppir crash in the mesa RA code, so there should be something we need to fix in the ppir regalloc. Maybe they share the same bug.
@enunes if you can't reproduce this "ppir: regalloc fail" problem, maybe you can run some glmark2 tests like buffer/bump/desktop/ideas which have the ppir regalloc crash problem and can be reproduced on arm32 board.
buffer/bump/desktop don't work for me. Buffer fails due to unsupported nir op 75, both bump and desktop crash in ra_add_node_adjacency():
#0 0x0000ffffbeb30cd0 in ra_add_node_adjacency (g=0xaaaaab0aa460, n1=<optimized out>, n2=2) at register_allocate.c:402 #1 0x0000ffffbeb31578 in ra_add_node_interference (g=0xaaaaab0aa460, n1=1, n2=2) at register_allocate.c:467 #2 0x0000ffffbecc2f3c in ppir_regalloc_prog (comp=comp@entry=0xaaaaaac03ff0) at ir/pp/regalloc.c:342 #3 0x0000ffffbecc06cc in ppir_compile_nir (prog=prog@entry=0xaaaaaabe9980, nir=nir@entry=0xaaaaaabe0680, ra=<optimized out>) at ir/pp/nir.c:482 #4 0x0000ffffbecb7930 in lima_create_fs_state (pctx=<optimized out>, cso=<optimized out>) at lima_program.c:180 #5 0x0000ffffbea8d7c4 in st_create_fp_variant (st=st@entry=0xaaaaaac2f890, stfp=stfp@entry=0xaaaaab08dac0, key=key@entry=0xfffffffff240) at state_tracker/st_program.c:1103 #6 0x0000ffffbea8ee3c in st_get_fp_variant (st=st@entry=0xaaaaaac2f890, stfp=stfp@entry=0xaaaaab08dac0, key=key@entry=0xfffffffff240) at state_tracker/st_program.c:1251 #7 0x0000ffffbea4f268 in st_update_fp (st=0xaaaaaac2f890) at state_tracker/st_atom_shader.c:141 #8 0x0000ffffbea4c37c in st_validate_state (st=st@entry=0xaaaaaac2f890, pipeline=pipeline@entry=ST_PIPELINE_RENDER) at ../../src/util/bitscan.h:104 #9 0x0000ffffbea66b9c in prepare_draw (ctx=0xaaaaaac14550, st=0xaaaaaac2f890) at state_tracker/st_draw.c:123 #10 st_draw_vbo (ctx=0xaaaaaac14550, prims=0xfffffffff3b0, nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', min_index=<optimized out>, max_index=<optimized out>, tfb_vertcount=0x0, stream=0, indirect=0x0) at state_tracker/st_draw.c:153 #11 0x0000ffffbea314b0 in vbo_draw_arrays (ctx=<optimized out>, mode=<optimized out>, start=<optimized out>, count=<optimized out>, numInstances=<optimized out>, baseInstance=<optimized out>, drawID=<optimized out>) at vbo/vbo_exec_array.c:391 #12 0x0000aaaaaab240b8 in Mesh::render_vbo (this=this@entry=0xaaaaaaba3980) at /usr/include/c++/8.1.0/bits/stl_vector.h:805 #13 0x0000aaaaaaacf1cc in SceneBump::draw (this=0xaaaaaaba3880) at ../src/scene-bump.cpp:376 #14 0x0000aaaaaaac2190 in MainLoop::draw (this=0xaaaaaac0e110) at ../src/main-loop.cpp:133 #15 0x0000aaaaaaac2b3c in MainLoop::step (this=0xaaaaaac0e110) at ../src/main-loop.cpp:108 #16 0x0000aaaaaaab81d8 in do_benchmark (canvas=...) at ../src/main.cpp:119 #17 0x0000aaaaaaab5ee8 in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.cpp:214
I pushed a new version at https://gitlab.freedesktop.org/enunes/mesa/commits/lima-18.1-regspill which creates the new instrs and attaches the new nodes (this was missing) and it seems to be minimally working. With my modified gbm-surface-color (to use more registers), the code correctly spills registers, inserts loads and stores, and renders correctly.
There are still artifacts with kmscube -M rgba when I force usage of less registers to force spilling, and that is harder to debug. I'm not sure if the current way of adding instructions doesn't break assumptions previously made by the instruction scheduler. Maybe it would be necessary to rerun the instruction scheduler from scratch (or even earlier steps like ppir_node_to_instr) after adding the load/store nodes to create spilling (but that's not trivial since they add nodes which I don't want to add twice).
Thoughts?
Re-run scheduler or ppir_node_to_instr is not good idea, because current regalloc spill is based on the output of them, if re-run them current situation may change. I think add new load and store won't break anything. Instruction scheduler just re-order instr to reduce the regalloc pressure, it doesn't have any "assumption".
I also see some x11 application fail due to this extension support, like xeyes and xclock. After run xserver with MESA_SHADER_CAPTURE_PATH, I get failed shader used to implement this extension by glamor:
captured when run the hello test, ppir regalloc fail: 21.shader_test.2
captured when run the xeyes, no regalloc fail, but result is not right: 21.shader_test
@yuq825 since the original ppir regalloc issues were fixed at the time with some other commits (some variable initialization and other reports by valgrind if I recall correctly), we never saw regalloc issues anymore, I think the register spilling lost priority and I ended up working on other stuff that I found on the way. For example thay memory leak stuff.
The regalloc work is still on that branch and should be rebaseable to lima-18.3. I probably also have commits in my local branch, like addressing your last review, that I didn't clean up to push to gitlab. It should work and do the store/load correctly on controlled cases. I think things like vec3/vec2 and swizzling need to be further tested and there will probably be bugs.
I am without access to my work environment and won't be able to work on it for a couple of weeks from now. It's good to have some real world examples to try the spilling code on. Are those issues blocking your short term plans for lima?
It's OK, I can wait when you back. Seems we have so many problems to be solved in ppir that I can't complete in a short term.
render extension is frequently used by x11 applications, so it's important for lima x11 support. But it's not working as shown by the above two failed test cases. Besides the regalloc fail, ppir also has other critical issues:
control flow support, regalloc fail occurs in this complicated shader, but this complication comes from the control flow miss (PIPE_SHADER_CAP_MAX_CONTROL_FLOW_DEPTH==0). So ppir has to compute all statements in both if-else branch and use select instr at last.
fake integer support for constant and uniform. I see some int uniform and constant in above shader, but I'm afraid PIPE_SHADER_CAP_INTEGERS/nir_shader_compiler_options.native_integers==0 don't work well for now (I really hope mali can support native int), need more investigation
I got back on this today, here is a quick update with my impressions so far. I can reproduce the ppir regalloc fail with gbm-surface-render-two. My branch is easily rebaseable on lima-18.3. The current register spilling code seems to "solve" the register allocation by spilling two registers, which doesn't seem terrible actually. But the rendering is still not correct. I'm taking a look at what it produced.
gbm-surface-render-two seems to spill correctly, but I don't get the same render as my pc. I get a full textured triangle instead of the cropped one. I wonder if it is a spilling bug or whether it may be something else. Still need to do more testing on that.
https://patchwork.freedesktop.org/patch/268946/ doesn't solve that for me. In fact, applying this patch breaks kmscube for me. So I don't think we can apply this, at least alone. Or do we need to full patchset?
Also still need to generate some dumps again and review the lima_pp_frame_reg part.
I created merge request !77 (merged) for my ppir register spilling implementation.
I think it is more appropriate to move the discussion about spilling to there now.
This Issue is fairly old and has covered many other topics than the "Recombine FS inputs into vectors" in the title. We have vectors now again in lima 18.3 and a configurable option for this so I'm not sure if "Recombining" is still a necessary implementation.
How about we close this Issue and create other ones with more appropriate titles, if necessary?