Recombine FS inputs into vectors

Right, I also hit this problem after the 18.0 rebase. But because I wanted to focus on the kernel, I haven't fixed the root cause. I'm now doing the 18.1 rebase, so I will handle these NIR changes in a better way.

Created by: anarsoul

kmscube -M rgba fails on the lima-18.0 branch because of this issue, with "ppir: ppir: regalloc fail":

Created by: anarsoul

impl main {
	decl_reg vec4 32 r0
	decl_reg vec2 32 r1
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */	/* vVaryingColor */
	vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */	/* vVaryingColor */
	vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */	/* vVaryingColor */
	vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */	/* vVaryingColor */
	r0.x = imov ssa_1
	r0.y = imov ssa_2.x
	r0.z = imov ssa_3.x
	r0.w = imov ssa_4.x
	vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */	/* vTexCoord */
	vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */	/* vTexCoord */
	r1.x = imov ssa_6
	r1.y = imov ssa_7.x
	vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
	vec4 32 ssa_10 = fmul r0, ssa_9
	intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */	/* gl_FragColor */
	/* succs: block_0 */
	block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      mov 11 reg1
        ld_var 9 ssa6
      mov 12 reg1
        ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      ld_coords 16 new
        mov 11 reg1
          ld_var 9 ssa6
        mov 12 reg1
          ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14   null 17   null null null | 
 001: null null null null null null 5    null null | 
 002: 1    null null null null null null null null | 
 003: null null null null null null 6    null null | 
 004: 2    null null null null null null null null | 
 005: null null null null null null 7    null null | 
 006: 3    null null null null null null null null | 
 007: null null null null null null 8    null null | 
 008: 4    null null null null null null null null | 
 009: 16   13   null null null 18   null null null | 
 010: null null null null null null 11   null null | 
 011: 9    null null null null null null null null | 
 012: null null null null null null 12   null null | 
 013: 10   null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail

In fact, even if we don't recombine scalars into vectors, ppir should not fail; it should only generate longer code. This "regalloc fail" really means that ppir needs to implement register spilling for when it runs out of registers.

Created by: anarsoul

As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.

I guess it's here: https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107

Each PP can have a memory stack, which I guess is used to store temporaries.

Created by: anarsoul

There are two stack_address fields: one in lima_pp_frame_reg, the other in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.

I'm not sure what the difference between them is.

The lima_pp_frame_reg one is a dummy; the drm_lima_m400_pp_frame one is used, with one entry for each PP.

Created by: anarsoul

@yuq how do you tell which one is dummy?

In this function: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303

LIMA_PP_FRAME and LIMA_PP_STACK are per PP, so the lima_pp_frame_reg values for them will be overwritten with new values before the task starts.

Created by: anarsoul

@yuq, what about LIMA_PP_STACK_SIZE?

Created by: anarsoul

And why is LIMA_PP_STACK not used for mali450?

LIMA_PP_STACK is used by mali450 here: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321

It's just that bcast would set the same address on all PPs, so I have to set it individually for each PP.

LIMA_PP_STACK_SIZE is the same for all PPs, so the lima_pp_frame_reg one is not a dummy.

Created by: enunes

I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.

From what I understand there are two issues:

1. recombine scalars to vec4 to restore the earlier behavior, and
2. implement register spilling to avoid this kind of problem in the future

Is that correct? Is anybody working on either of these?

Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?

I think you are right. (2) is needed anyway for correctness and (1) is for better performance. But (1) also needs to be done: I wrote ppir for vec4, not for scalars, so some refinement or recombining may be needed for correctness as well.
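
As a rough illustration of item (1), the sketch below groups per-component scalar loads, like the load_input lines in the NIR dump above, back into one vector load per varying. The struct and function names are hypothetical; this is not the NIR or ppir API and only models the (base, component) pairs visible in the dump.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one scalar load_input: (base, component). */
struct scalar_load {
    unsigned base;      /* varying slot, e.g. base=1 */
    unsigned component; /* 0..3 for x, y, z, w */
};

/* Hypothetical recombined load: one per base, with a component mask. */
struct vector_load {
    unsigned base;
    unsigned mask;      /* bit i set means component i is read */
};

/* Merge scalar loads that share a base into vector loads.
 * `out` must have room for n entries.
 * Returns the number of vector loads written to `out`. */
size_t recombine_loads(const struct scalar_load *in, size_t n,
                       struct vector_load *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(in[i].component < 4);
        size_t j = 0;
        while (j < count && out[j].base != in[i].base)
            j++;
        if (j == count) {   /* first load seen for this base */
            out[count].base = in[i].base;
            out[count].mask = 0;
            count++;
        }
        out[j].mask |= 1u << in[i].component;
    }
    return count;
}
```

Fed the six scalar loads from the lima-18.0 dump, this yields two entries: base 1 with mask xyzw and base 0 with mask xy.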

I spent some time trying to reproduce this to see for myself what happens, but I am still unable to reproduce it in lima-18.1.

I tried a lot of things, including creating dozens of varyings with random calculations, with different data sizes, texture lookups, variables with long live range, several variables with random conditional statements, but I was not able to reproduce a single "ppir: regalloc fail".

Can you and/or someone else still reproduce it, maybe there is a difference in our setup?

Below is what I get with kmscube -M rgba on lima-18.1, running on A20.

impl main {
	decl_reg vec2 32 r0
	decl_reg vec4 32 r1
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */	/* vVaryingColor */
	vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */	/* vVaryingColor */
	vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (0, 2) /* base=0 */ /* component=2 */	/* vVaryingColor */
	vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (0, 3) /* base=0 */ /* component=3 */	/* vVaryingColor */
	vec1 32 ssa_5 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */	/* packed:vTexCoord */
	vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */	/* packed:vTexCoord */
	r0.x = imov ssa_5
	r0.y = imov ssa_6.x
	vec4 32 ssa_8 = tex r0 (coord), 0 (texture) 0 (sampler)
	r1.x = fmul ssa_1, ssa_8.x
	r1.y = fmul ssa_2.x, ssa_8.y
	r1.z = fmul ssa_3.x, ssa_8.z
	r1.w = fmul ssa_4.x, ssa_8.w
	intrinsic store_output (r1, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */	/* gl_FragColor */
	/* succs: block_0 */
	block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 14 new
  mul 10 reg1
    ld_var 1 ssa1
    ld_tex 9 ssa8
      mov 7 reg0
        ld_var 5 ssa5
      mov 8 reg0
        ld_var 6 ssa6
  mul 11 reg1
    ld_var 2 ssa2
    +ld_tex 9 ssa8
  mul 12 reg1
    ld_var 3 ssa3
    +ld_tex 9 ssa8
  mul 13 reg1
    ld_var 4 ssa4
    +ld_tex 9 ssa8
====================
ppir: ppir_lower_texture create load_coords node 15 for 9
========prog========
-------block------
st_col 14 new
  mul 10 reg1
    ld_var 1 ssa1
    ld_tex 9 ssa8
      ld_coords 15 new
        mov 7 reg0
          ld_var 5 ssa5
        mov 8 reg0
          ld_var 6 ssa6
  mul 11 reg1
    ld_var 2 ssa2
    +ld_tex 9 ssa8
  mul 12 reg1
    ld_var 3 ssa3
    +ld_tex 9 ssa8
  mul 13 reg1
    ld_var 4 ssa4
    +ld_tex 9 ssa8
====================
ppir: node_to_instr create move 16 from store 14
ppir: insert_load_tex: create move 17 for 9
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null null null 16   null null null | 
 001: null null null null 10   null null null null | 
 002: 1    null null null null null null null null | 
 003: null null null null 11   null null null null | 
 004: 2    null null null null null null null null | 
 005: null null null null 12   null null null null | 
 006: 3    null null null null null null null null | 
 007: null null null null 13   null null null null | 
 008: 4    null null null null null null null null | 
 009: 15   9    null null null 17   null null null | 
 010: null null null null null null 7    null null | 
 011: 5    null null null null null null null null | 
 012: null null null null null null 8    null null | 
 013: 6    null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2][9[10[11]][12[13]]]][3[4][+9]][5[6][+9]][7[8][+9]]]
------------------------
======ppir regalloc result======
011: (5|0|)
010: (7|0|0)
013: (6|4|)
012: (8|0|4)
009: (15|60|0) (9|56|60) (17|8|56)
002: (1|0|)
001: (10|0|0 8)
004: (2|4|)
003: (11|0|4 8)
006: (3|4|)
005: (12|0|4 8)
008: (4|4|)
007: (13|0|4 8)
000: (16|0|0)
--------------------------
========ppir codegen========
011: 02100083 10103c00 00000000 
010: 02182002 3e400000 
013: 02100083 11143c00 00000000 
012: 02302002 3e410004 
009: 021811c6 3f040004 00000000 39001000 20000e4e 000007cf 
002: 02100083 10003c00 00000000 
001: 02180802 00400800 
004: 02100083 11043c00 00000000 
003: 02180802 00410904 
006: 02100083 11083c00 00000000 
005: 02180802 00420a04 
008: 02100083 110c3c00 00000000 
007: 02180802 00430b04 
000: 00001023 00000e40 000007cf

I know that kmscube -M rgba is not the main problem here, I'm still working on the real issues from the earlier comment. Just want to check if kmscube -M rgba has actually been fixed by other changes in the latest branch, or there is a difference in our setup that we need to figure out.

Do you have an arm64 board? I hit the "kmscube -M rgba" "ppir: regalloc fail" problem only on an arm64 board, and only randomly. I can try to reproduce it to see whether anything differs from your output.

Hm, I don't think that it should differ for 64bit. Probably we have a bug in regalloc code somewhere?

Maybe, but I haven't seen this problem on an arm32 board.

I get two results for "kmscube -M rgba":

first run fails: https://gist.github.com/yuq/5dee3579b4bdacfee2ba7991e30296cc
second run is OK: https://gist.github.com/yuq/038f0151fe2c1af12f4bf49763dc84ec

The input seems to be the same in both cases. Maybe a bug in the ppir regalloc code shows up randomly, or the Mesa RA has some nondeterministic behavior, such as hashing on pointers. BTW, when running some glmark2 tests I get a ppir crash in the Mesa RA code, so there must be something we need to fix in the ppir regalloc. Maybe they share the same bug.

@enunes if you can't reproduce this "ppir: regalloc fail" problem, maybe you can run some glmark2 tests like buffer/bump/desktop/ideas, which hit the ppir regalloc crash and can be reproduced on an arm32 board.

buffer/bump/desktop don't work for me. Buffer fails due to unsupported nir op 75, both bump and desktop crash in ra_add_node_adjacency():

#0 0x0000ffffbeb30cd0 in ra_add_node_adjacency (g=0xaaaaab0aa460, n1=<optimized out>, n2=2) at register_allocate.c:402
#1 0x0000ffffbeb31578 in ra_add_node_interference (g=0xaaaaab0aa460, n1=1, n2=2) at register_allocate.c:467
#2 0x0000ffffbecc2f3c in ppir_regalloc_prog (comp=comp@entry=0xaaaaaac03ff0) at ir/pp/regalloc.c:342
#3 0x0000ffffbecc06cc in ppir_compile_nir (prog=prog@entry=0xaaaaaabe9980, nir=nir@entry=0xaaaaaabe0680, ra=<optimized out>) at ir/pp/nir.c:482
#4 0x0000ffffbecb7930 in lima_create_fs_state (pctx=<optimized out>, cso=<optimized out>) at lima_program.c:180
#5 0x0000ffffbea8d7c4 in st_create_fp_variant (st=st@entry=0xaaaaaac2f890, stfp=stfp@entry=0xaaaaab08dac0, key=key@entry=0xfffffffff240) at state_tracker/st_program.c:1103
#6 0x0000ffffbea8ee3c in st_get_fp_variant (st=st@entry=0xaaaaaac2f890, stfp=stfp@entry=0xaaaaab08dac0, key=key@entry=0xfffffffff240) at state_tracker/st_program.c:1251
#7 0x0000ffffbea4f268 in st_update_fp (st=0xaaaaaac2f890) at state_tracker/st_atom_shader.c:141
#8 0x0000ffffbea4c37c in st_validate_state (st=st@entry=0xaaaaaac2f890, pipeline=pipeline@entry=ST_PIPELINE_RENDER) at ../../src/util/bitscan.h:104
#9 0x0000ffffbea66b9c in prepare_draw (ctx=0xaaaaaac14550, st=0xaaaaaac2f890) at state_tracker/st_draw.c:123
#10 st_draw_vbo (ctx=0xaaaaaac14550, prims=0xfffffffff3b0, nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', min_index=<optimized out>, max_index=<optimized out>, tfb_vertcount=0x0, stream=0, indirect=0x0) at state_tracker/st_draw.c:153
#11 0x0000ffffbea314b0 in vbo_draw_arrays (ctx=<optimized out>, mode=<optimized out>, start=<optimized out>, count=<optimized out>, numInstances=<optimized out>, baseInstance=<optimized out>, drawID=<optimized out>) at vbo/vbo_exec_array.c:391
#12 0x0000aaaaaab240b8 in Mesh::render_vbo (this=this@entry=0xaaaaaaba3980) at /usr/include/c++/8.1.0/bits/stl_vector.h:805
#13 0x0000aaaaaaacf1cc in SceneBump::draw (this=0xaaaaaaba3880) at ../src/scene-bump.cpp:376
#14 0x0000aaaaaaac2190 in MainLoop::draw (this=0xaaaaaac0e110) at ../src/main-loop.cpp:133
#15 0x0000aaaaaaac2b3c in MainLoop::step (this=0xaaaaaac0e110) at ../src/main-loop.cpp:108
#16 0x0000aaaaaaab81d8 in do_benchmark (canvas=...) at ../src/main.cpp:119
#17 0x0000aaaaaaab5ee8 in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.cpp:214

It fails on:

402 g->nodes[n1].q_total += g->regs->classes[n1_class]->q[n2_class];

n1_class and n2_class are both -1.
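
To make the failure mode concrete: indexing a class table with -1 reads outside the array, which is consistent with the crash at line 402 above. The sketch below is a simplified stand-in for the graph structure, not Mesa's register_allocate.c. The struct layout, the weight table and the use of -1 to mean "no class assigned yet" are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CLASSES 2
#define NO_CLASS   (-1)

/* Hypothetical, simplified node: only the register class matters here. */
struct node {
    int reg_class;      /* NO_CLASS until a class is assigned */
    unsigned q_total;
};

/* q[c1][c2]: made-up interference weights between classes. */
static const unsigned q[NUM_CLASSES][NUM_CLASSES] = { {1, 2}, {2, 1} };

/* Add n2 as a neighbour of n1. Returns false instead of indexing q[]
 * with an out-of-range class, which would read out of bounds. */
bool add_adjacency(struct node *nodes, int n1, int n2)
{
    assert(n1 >= 0 && n2 >= 0);
    int c1 = nodes[n1].reg_class;
    int c2 = nodes[n2].reg_class;
    if (c1 < 0 || c1 >= NUM_CLASSES || c2 < 0 || c2 >= NUM_CLASSES)
        return false;
    nodes[n1].q_total += q[c1][c2];
    return true;
}
```

In this model, a node that reaches the interference step without a class is reported rather than dereferenced, which would point at the caller that forgot to assign one.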

For the unsupported nir op 75, do you have this patch: 1832704c

I didn't, I applied it and now it fails the same way as bump and desktop.

With the latest fixes, both the "regalloc fail" and the regalloc crash are gone.

So I've been working on this, I think I have temporaries store and load working, and ppir register spilling somewhat working. The work is on this branch: https://gitlab.freedesktop.org/enunes/mesa/commits/lima-18.1-regspill

Still need to test more, check things like register sizes, and overall refine the implementation.

@yuq825 if you can, please take a look to see if the approach looks fine.

I pushed a new version at https://gitlab.freedesktop.org/enunes/mesa/commits/lima-18.1-regspill which creates the new instrs and attaches the new nodes (this was missing) and it seems to be minimally working. With my modified gbm-surface-color (to use more registers), the code correctly spills registers, inserts loads and stores, and renders correctly.

There are still artifacts with kmscube -M rgba when I force the use of fewer registers to force spilling, and that is harder to debug. I'm not sure whether the current way of adding instructions breaks assumptions previously made by the instruction scheduler. Maybe it would be necessary to rerun the instruction scheduler from scratch (or even earlier steps like ppir_node_to_instr) after adding the load/store nodes to create spilling (but that's not trivial since they add nodes which I don't want to add twice). Thoughts?

Re-running the scheduler or ppir_node_to_instr is not a good idea, because the current regalloc spill is based on their output; if they are re-run, the situation may change. I think adding new loads and stores won't break anything. The instruction scheduler just reorders instrs to reduce register pressure; it doesn't make any "assumption".

The current lima_draw.c doesn't implement fragment_stack_address/size for the PP yet. I think you need to set them for each PP in order to store temporaries.
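
The point above about adding loads and stores without re-running the scheduler can be pictured with a toy model. The instruction struct, the op names and spill_register() below are hypothetical and far simpler than ppir. The sketch only shows the general idea: a store is placed after each write of the spilled register and a load before each read, while the existing instructions keep their relative order.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_ALU, OP_LOAD_TEMP, OP_STORE_TEMP };

/* Hypothetical flat instruction: dest = op(src). -1 means unused. */
struct instr {
    enum op op;
    int dest;
    int src;
};

/* Rewrite `in` so that register `spill` lives in memory between uses:
 * a store follows every write to it and a load precedes every read.
 * `out` must have room for 3 * n entries. Returns the new count. */
size_t spill_register(const struct instr *in, size_t n, int spill,
                      struct instr *out)
{
    assert(spill >= 0);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].src == spill)    /* reload just before the use */
            out[k++] = (struct instr){ OP_LOAD_TEMP, spill, -1 };
        out[k++] = in[i];          /* original instruction, same order */
        if (in[i].dest == spill)   /* save right after the definition */
            out[k++] = (struct instr){ OP_STORE_TEMP, -1, spill };
    }
    return k;
}
```

After the rewrite the spilled value is only live for one instruction around each definition and each use, which is what relieves the allocator.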

@enunes how is your work on the ppir regalloc going? I hit a ppir regalloc fail when I:

1. start the X server with Xorg -noreset
2. run this test application: https://github.com/yuq/myx/tree/master/hello

The problem is caused by this X11 render extension trapezoid draw: https://github.com/yuq/myx/blob/master/hello/xrhello.c#L66

I also see some X11 applications, like xeyes and xclock, fail because of how this extension is supported. After running the X server with MESA_SHADER_CAPTURE_PATH set, I got the failing shader that glamor uses to implement this extension:

captured when running the hello test, ppir regalloc fail: 21.shader_test.2
captured when running xeyes, no regalloc fail, but the result is not right: 21.shader_test

It seems to be a complicated fragment shader.

Two gbm-surface tests to reproduce the problem: https://github.com/yuq/gfx/tree/master/gbm-surface-render https://github.com/yuq/gfx/tree/master/gbm-surface-render-two

@yuq825 since the original ppir regalloc issues were fixed at the time by some other commits (some variable initialization and other reports by valgrind, if I recall correctly), we never saw regalloc issues anymore. I think the register spilling lost priority and I ended up working on other stuff that I found on the way, for example that memory leak stuff.

The regalloc work is still on that branch and should be rebaseable to lima-18.3. I probably also have commits in my local branch, like addressing your last review, that I didn't clean up to push to gitlab. It should work and do the store/load correctly on controlled cases. I think things like vec3/vec2 and swizzling need to be further tested and there will probably be bugs.

I am without access to my work environment and won't be able to work on it for a couple of weeks from now. It's good to have some real world examples to try the spilling code on. Are those issues blocking your short term plans for lima?

It's OK, I can wait until you are back. It seems we have so many problems to solve in ppir that I can't finish them in the short term.

The render extension is frequently used by X11 applications, so it's important for lima X11 support. But it's not working, as shown by the two failing test cases above. Besides the regalloc fail, ppir also has other critical issues:

control flow support: the regalloc fail occurs in this complicated shader, but the complexity comes from the missing control flow support (PIPE_SHADER_CAP_MAX_CONTROL_FLOW_DEPTH==0). So ppir has to compute all statements in both branches of an if-else and use a select instr at the end.
fake integer support for constants and uniforms: I see some int uniforms and constants in the above shader, but I'm afraid PIPE_SHADER_CAP_INTEGERS/nir_shader_compiler_options.native_integers==0 doesn't work well for now (I really hope mali can support native int); this needs more investigation.
support for lowering the vector select instr to scalar
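
The first item can be illustrated with a small sketch: without control flow, both sides of an if/else are evaluated and a select picks the result. The function names and the arithmetic are hypothetical, and this is plain C rather than ppir ops.

```c
#include <assert.h>

/* Stand-in for a select instruction: picks a when cond is non-zero. */
float select_f(int cond, float a, float b)
{
    return cond ? a : b;
}

/* Branching form: only one side is evaluated. */
float shade_branch(float x)
{
    if (x > 0.5f)
        return x * 2.0f;
    else
        return x + 1.0f;
}

/* Flattened form: both sides are always computed, then one is selected.
 * This costs extra instructions and keeps more values live at once. */
float shade_flat(float x)
{
    float then_val = x * 2.0f;
    float else_val = x + 1.0f;
    return select_f(x > 0.5f, then_val, else_val);
}
```

Both forms give the same result, but in the flattened one then_val and else_val are live at the same time, which is one way a shader with many conditionals ends up with high register pressure.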

I doubt that utgard supports native integers, since it's not required by GLES 2.0.

See https://static.docs.arm.com/dui0363/d/DUI0363D_opengl_es_app_dev_guide.pdf

It doesn't mention PP, but looks like GP represents integers using floats:

The vertex shader represents integers using floating-point values.

Also, I compiled the shader from gbm-surface-render using the offline compiler, and it uses the same instructions as for floats.
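
For context on what representing integers as floats implies: a 32-bit IEEE 754 float has a 24-bit significand, so it holds every integer up to 2^24 exactly, but not every integer beyond that. The sketch below checks this on the host CPU. The function name is made up for illustration, and nothing in this thread establishes which float width the Mali GP or PP actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if value survives a round trip through a 32-bit float.
 * Intended for values well inside the int32_t range. */
bool float_holds_exactly(int32_t value)
{
    assert(value > -(1 << 30) && value < (1 << 30));
    volatile float f = (float)value;  /* volatile: force storage as float */
    return (int32_t)f == value;
}
```

With this helper, 16777216 (2^24) round-trips but 16777217 does not, so any fake integer support built on 32-bit floats would have to live within that range.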

Thanks, I doubt it too, and I think it's too hard to reverse engineer without a reference. So maybe work on fake integer support in mesa is needed.

Hi guys, for the integer support problem I asked in the mesa mailing list: https://lists.freedesktop.org/archives/mesa-dev/2019-January/212831.html

There seems to be some work in progress: https://patchwork.freedesktop.org/patch/268946/

With this patch applied, https://github.com/yuq/gfx/tree/master/gbm-surface-render works.

Although the https://github.com/yuq/gfx/tree/master/gbm-surface-render-two does not work yet (due to reg alloc fail), I can get a basically working X11 xfce4 desktop:

I tried to run chromium. GPU rendering seems to hit some texture failure, so it may fall back to CPU rendering, but at least it works:

Nice work!

Could you show chrome://gpu page in chrome?

OK, but it will have to be tomorrow or later, as I left the device at home.

I got back on this today, here is a quick update with my impressions so far. I can reproduce the ppir regalloc fail with gbm-surface-render-two. My branch is easily rebaseable on lima-18.3. The current register spilling code seems to "solve" the register allocation by spilling two registers, which doesn't seem terrible actually. But the rendering is still not correct. I'm taking a look at what it produced.

Did you apply this patch? https://patchwork.freedesktop.org/patch/268946/

It's needed for both gbm-surface-render and gbm-surface-render-two.

@anarsoul here is the content of chrome://gpu. As seen in the log, GPU rendering crashed and fell back to software rendering. chrome.log

The log printed by chromium shows that it requests a texture format not supported by lima:

../../mesa/src/mesa/main/teximage.c:2849: _mesa_choose_texture_format: Assertion `f != MESA_FORMAT_NONE' failed.

I just pushed a new version to https://gitlab.freedesktop.org/enunes/mesa/tree/lima-18.3-regspill which is starting to look good. Using this patch on top of it 0001-DEBUG-force-ppir-register-spilling.patch I can force a register spilling call in any example, and it seems to work. I tried kmscube, glmark2 and some other examples and the spilling code seems to behave well.

gbm-surface-render-two seems to spill correctly, but I don't get the same render as my pc. I get a full textured triangle instead of the cropped one. I wonder if it is a spilling bug or whether it may be something else. Still need to do more testing on that.

https://patchwork.freedesktop.org/patch/268946/ doesn't solve that for me. In fact, applying this patch breaks kmscube for me. So I don't think we can apply this, at least not alone. Or do we need the full patchset?

Also still need to generate some dumps again and review the lima_pp_frame_reg part.

I created merge request !77 (merged) for my ppir register spilling implementation. I think it is more appropriate to move the discussion about spilling to there now. This Issue is fairly old and has covered many other topics than the "Recombine FS inputs into vectors" in the title. We have vectors now again in lima 18.3 and a configurable option for this so I'm not sure if "Recombining" is still a necessary implementation. How about we close this Issue and create other ones with more appropriate titles, if necessary?

@enunes sounds good.

OK.

closed

mentioned in issue #76 (closed)

mentioned in issue #82 (closed)

mentioned in issue #85 (closed)
