The current indexed draw command stream is not efficient: for example, with index array [0, 1000, 2], it needs 1001 VS executions and varying output space for 1001 vertices.
But some dump results show we can cut this overhead with optimizations in the command stream.
It likely won't reduce memory usage since the PP uses contiguous varyings, but it should allow us to cut the number of VS invocations and the memory bandwidth.
It's worth noting that this trades CPU cycles for GPU cycles, so we'll need to implement a heuristic that decides when this optimization should be applied.
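A minimal sketch of what such a heuristic could look like; the function name, parameters, and threshold here are hypothetical illustrations, not existing lima code:

```c
#include <stdbool.h>

/* Only pay the CPU cost of scanning/sorting the index buffer when the
 * referenced vertex range is much larger than the number of indices drawn,
 * i.e. when a single VS draw would run the vertex shader on mostly unused
 * vertices (like the [0, 1000, 2] example above). */
static bool
should_split_indexed_draw(unsigned index_count,
                          unsigned min_index, unsigned max_index)
{
   unsigned range = max_index - min_index + 1;

   /* Threshold picked arbitrarily for illustration; it would need tuning
    * against real workloads. */
   return range > 4 * index_count;
}
```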
In your dump, the gl_Pos location for the first VS draw cmd is at 0x10014400 and for the second at 0x10018280; they're exactly 16000 bytes apart (4 components * 4 bytes * 1000 vertices), so it won't save any memory unless we can punch holes in BOs.
@anarsoul @yuq825
I did some work on that and already got the VS CMD adjusted to split the draws.
But I need some help to move forward... :) What are the other places that need to be adjusted? I guess it's lima_update_varying and lima_update_gp_attribute_info?
So basically you're splitting one draw into several draws; see how it's done when we have more than 65535 vertices in lima_draw_vbo_count. I didn't find where it's done for an indexed draw. @enunes, could you point Andreas to the code where we split an indexed draw?
Yeah, it's like what we did in lima_draw_vbo_count but not the same, because we only need to take care of splitting the VS draw cmd and leave other parts like the PLBU cmd unchanged.
Here are some thoughts:
1. Start from lima_draw_vbo_indexed and efficiently go through the index buffer to work out how many VS draws we need (sketched below). Note the index GPU buffer is write-combine mapped to the CPU, so walking it will be slow; maybe it's better to always have a staging or shadow index buffer in user memory. We may also need to handle the >65535 split in this step, but that can be left to the next MR.
2. Separate out the per-VS-draw part, like the attribute/varying info build and VS draw cmd generation, and call it multiple times as step 1 requires.
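A rough sketch of step 1, assuming 16-bit indices and a CPU-readable copy of the index data; the struct and function names are made up for illustration and not part of the driver:

```c
#include <stdint.h>

/* Hypothetical description of one VS draw produced by the split. */
struct vs_draw_range {
   unsigned min_index;   /* first vertex the VS has to process */
   unsigned max_index;   /* last vertex the VS has to process  */
};

/* Walk a sorted copy of the indices and start a new VS draw whenever the gap
 * to the previous index is big enough that skipping the unused vertices is
 * worth an extra draw command. 'ranges' must have room for up to 'count'
 * entries; returns the number of ranges written. */
static unsigned
build_vs_draw_ranges(const uint16_t *sorted_indices, unsigned count,
                     unsigned max_gap, struct vs_draw_range *ranges)
{
   unsigned n = 0;

   for (unsigned i = 0; i < count; i++) {
      if (n == 0 || sorted_indices[i] > ranges[n - 1].max_index + max_gap) {
         ranges[n].min_index = sorted_indices[i];
         ranges[n].max_index = sorted_indices[i];
         n++;
      } else {
         ranges[n - 1].max_index = sorted_indices[i];
      }
   }

   return n;
}
```

For the [0, 1000, 2] example, the sorted indices [0, 2, 1000] would yield two ranges (0-2 and 1000-1000) for any reasonable max_gap, so the VS would run for 4 vertices instead of 1001.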
The start address for varyings and attributes, which is given in the *_info addresses, is always the same. There is no offset for the first draw, even if it doesn't start at index 0. Instead, the blob drops the first unused vertices. This can be seen from the start addresses of the varyings and attributes for the second and following draws.
In the branch above it's already implemented. Ugly, but enough to see how it works. What is still missing is that the attributes and uniforms themselves also have to be dropped... I haven't looked into that yet.
0x00000000, 0xbf800000, 0x00000000, 0x00000000, /* 0x00000000 */ -> 0.0, -1.0, 0.0, 0.0
0x00000000, 0x00000000, 0x3f800000, 0xbf800000, /* 0x00000010 */ -> 0.0, 0.0, 1.0, -1.0 /* 0.0 should be at 0x10014520 */
which are the corresponding vertex buffer values for indices 1, 2 and 3.
So the buffer is 'left shifted' to drop vertices which are not used in the index buffer (beginning from 0).
In case the draw is split, the blob doesn't omit the gap, but sets the attribute_address of the second draw to min_index(draw) - min_index(previous_draw). So it seems that only the vertex data from 0 up to the first used index is dropped. I guess this is also done for unused vertex data at the end of the buffer.
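If I read that right, the second draw's attribute base would be derived roughly like below; the multiplication by the attribute stride is my assumption (the addresses are in bytes), and all names are illustrative:

```c
#include <stdint.h>

/* Illustrative only: offset the attribute base of the current sub-draw by
 * the number of vertices between its first used index and the previous
 * sub-draw's first used index, converted to bytes via the attribute stride. */
static uint32_t
attribute_base_for_draw(uint32_t prev_base, unsigned prev_min_index,
                        unsigned cur_min_index, unsigned stride)
{
   return prev_base + (cur_min_index - prev_min_index) * stride;
}
```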
My question now is if, how and where I can adjust that buffer.
Because it seemed logical to me, I tried setting attribute 1 @ 0x1001450c in the second example in order to point it at vertex_buffer[1], but that doesn't seem to work.
Though there is huge room for refactoring and optimization, that's a working version now.
I'd appreciate it if you could give me some input on what I should look into before doing an MR.
Especially the memcpy and qsort should be the parts where it could slow down?
memcpy from a mmapped GPU buffer is slow. I think we'd better have the index array in CPU memory from the beginning. That means: in the info->has_user_indices case, use the user indices directly for the sort; in the !info->has_user_indices case, support a shadow BO: when a resource is created with PIPE_BIND_INDEX_BUFFER, allocate matching CPU memory for this BO, update both the CPU and GPU memory, and finally use the CPU memory for the sort.
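A minimal sketch of that shadow-copy idea, using a made-up struct instead of the real lima_resource, just to show the double write on upload (PIPE_BIND_INDEX_BUFFER is the real Gallium bind flag; everything else is hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Made-up stand-in for the driver resource: a write-combine GPU mapping plus
 * a plain malloc'ed CPU shadow that is kept byte-identical to the BO. */
struct shadow_index_buffer {
   void *gpu_map;        /* write-combine mapping of the BO    */
   uint8_t *cpu_shadow;  /* CPU copy used for sorting/scanning */
   size_t size;
};

/* Upload path: write both copies so the shadow always matches the BO and the
 * draw path can read/sort from cached CPU memory instead of the mapping. */
static void
shadow_index_buffer_write(struct shadow_index_buffer *buf, size_t offset,
                          const void *data, size_t size)
{
   memcpy((uint8_t *)buf->gpu_map + offset, data, size);
   memcpy(buf->cpu_shadow + offset, data, size);
}
```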
If I understand correctly, in the !has_user_indices case we'd have the index buffer in CPU mem and I can directly sort that one.
But in the has_user_indices case, I will still have to copy the user indices before sorting, because qsort changes that array?
I think you have to do the qsort in another temporary memory in both cases, because the shadow BO should keep the CPU and GPU memory the same (it will be reused), so you can't modify the CPU memory when sorting.
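Something like the following for the temporary copy and sort, using plain libc and assuming 16-bit indices; the shadow BO (or user index array) itself is left untouched:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
compare_u16(const void *a, const void *b)
{
   uint16_t ia = *(const uint16_t *)a;
   uint16_t ib = *(const uint16_t *)b;
   return (ia > ib) - (ia < ib);
}

/* Copy the indices into scratch memory and sort only the copy, so the CPU
 * shadow stays identical to the GPU BO and user index arrays stay unchanged.
 * Caller frees the returned buffer. */
static uint16_t *
sorted_index_copy(const uint16_t *indices, unsigned count)
{
   uint16_t *tmp = malloc(count * sizeof(*tmp));
   if (!tmp)
      return NULL;

   memcpy(tmp, indices, count * sizeof(*tmp));
   qsort(tmp, count, sizeof(*tmp), compare_u16);
   return tmp;
}
```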
It seems to work without issues and causes no regressions in deqp. I still need to find out how to benchmark it...
I thought of using https://gitlab.freedesktop.org/anholt/gpu-trace-perf/-/tree/master with apitrace, but for that I'd have to implement GL_EXT_timer_query first...
Does anyone have a good idea for an alternative way to measure it?
Any comments on the code are welcome anyway :)
[EDIT] What is still missing is some kind of caching to skip the memcpy and qsort in vbo_indexed_draw() if it's not needed...
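One possible shape for that cache, purely as an illustration (the fields and the serial-number idea are assumptions, not existing driver state): remember which index data the sorted copy and draw ranges were computed for, and only redo the memcpy/qsort when that changes.

```c
#include <stdbool.h>

/* Hypothetical cache: the result of the last split is only reused while the
 * draw references the same index data with the same start/count window. */
struct split_cache {
   const void *index_data;   /* resource or user pointer identity         */
   unsigned serial;          /* bumped whenever the index data is written */
   unsigned start, count;    /* index window of the cached draw           */
   /* ... cached sorted indices / VS draw ranges would live here ... */
};

static bool
split_cache_valid(const struct split_cache *c, const void *index_data,
                  unsigned serial, unsigned start, unsigned count)
{
   return c->index_data == index_data && c->serial == serial &&
          c->start == start && c->count == count;
}
```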
I think you first need to test the existing benchmarks to ensure this overhead doesn't cause any performance regression, e.g. q3a, supertuxkart, glmark2 and x11perf.