The current indexed draw command stream is not efficient: for example, with index array [0, 1000, 2], it needs 1001 VS executions and varying output space for 1001 vertices.
But some dump results show we can cut this overhead with optimizations in the command stream.
It likely won't reduce memory usage since the PP uses contiguous varyings, but it should allow us to cut the number of VS invocations and the memory bandwidth.
It's worth noting that this trades CPU cycles for GPU cycles, so we'll need to implement a heuristic that decides when this optimization should be applied.
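A minimal sketch of what such a heuristic could look like; the function name, parameters, and threshold here are hypothetical illustrations, not existing lima code:

```c
#include <stdbool.h>

/* Only pay the CPU cost of scanning/sorting the index buffer when the
 * referenced vertex range is much larger than the number of indices drawn,
 * i.e. when a single VS draw would run the vertex shader on mostly unused
 * vertices (like the [0, 1000, 2] example above). */
static bool
should_split_indexed_draw(unsigned index_count,
                          unsigned min_index, unsigned max_index)
{
   unsigned range = max_index - min_index + 1;

   /* Threshold picked arbitrarily for illustration; it would need tuning
    * against real workloads. */
   return range > 4 * index_count;
}
```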
In your dump, the gl_Pos location for the first VS draw cmd is at 0x10014400 and for the second at 0x10018280; they're exactly 16000 bytes apart (4 components * 4 bytes * 1000 vertices), so it won't save any memory unless we can punch holes in BOs.
@anarsoul @yuq825
I did some work on that and already got the VS CMD adjusted to split the draws.
But I need some help to move forward... :) What are the other places that need to be adjusted? I guess it's lima_update_varying and lima_update_gp_attribute_info?
So basically you're splitting one draw into several draws; see how it's done when we have more than 65535 vertices in lima_draw_vbo_count. I didn't find where it's done for an indexed draw. @enunes, could you point Andreas to the code where we split an indexed draw?
Yeah, it's like what we did in lima_draw_vbo_count but not the same, because we only need to take care of splitting the VS draw cmd and leave other parts like the PLBU cmd unchanged.
Here are some thoughts:
1. Start from lima_draw_vbo_indexed and efficiently go through the index buffer to work out how many VS draws we need (sketched below). Note the index GPU buffer is write-combine mapped to the CPU, so walking it will be slow; maybe it's better to always have a staging or shadow index buffer in user memory. We may also need to handle the >65535 split in this step, but that can be left to the next MR.
2. Separate out the per-VS-draw part, like the attribute/varying info build and VS draw cmd generation, and call it multiple times as step 1 requires.
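A rough sketch of step 1, assuming 16-bit indices and a CPU-readable copy of the index data; the struct and function names are made up for illustration and not part of the driver:

```c
#include <stdint.h>

/* Hypothetical description of one VS draw produced by the split. */
struct vs_draw_range {
   unsigned min_index;   /* first vertex the VS has to process */
   unsigned max_index;   /* last vertex the VS has to process  */
};

/* Walk a sorted copy of the indices and start a new VS draw whenever the gap
 * to the previous index is big enough that skipping the unused vertices is
 * worth an extra draw command. 'ranges' must have room for up to 'count'
 * entries; returns the number of ranges written. */
static unsigned
build_vs_draw_ranges(const uint16_t *sorted_indices, unsigned count,
                     unsigned max_gap, struct vs_draw_range *ranges)
{
   unsigned n = 0;

   for (unsigned i = 0; i < count; i++) {
      if (n == 0 || sorted_indices[i] > ranges[n - 1].max_index + max_gap) {
         ranges[n].min_index = sorted_indices[i];
         ranges[n].max_index = sorted_indices[i];
         n++;
      } else {
         ranges[n - 1].max_index = sorted_indices[i];
      }
   }

   return n;
}
```

For the [0, 1000, 2] example, the sorted indices [0, 2, 1000] would yield two ranges (0-2 and 1000-1000) for any reasonable max_gap, so the VS would run for 4 vertices instead of 1001.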
The start address for varyings and attributes, which is given in the *_info addresses, is always the same. There is no offset for the first draw, even if it doesn't start at index 0. Instead, the blob drops the first unused vertices. This can be seen from the start addresses of the varyings and attributes for the second and following draws.
In the branch above it's already implemented. Ugly, but enough to see how it works. What is still missing is that the attributes and uniforms themselves also have to be dropped... I haven't looked into that yet.
0x00000000, 0xbf800000, 0x00000000, 0x00000000, /* 0x00000000 */ -> 0.0, -1.0, 0.0, 0.0
0x00000000, 0x00000000, 0x3f800000, 0xbf800000, /* 0x00000010 */ -> 0.0, 0.0, 1.0, -1.0 /* 0.0 should be at 0x10014520 */
which are the corresponding vertex buffer values for indices 1, 2 and 3.
So the buffer is 'left shifted' to drop vertices which are not used in the index buffer (beginning from 0).
In case the draw is split, the blob doesn't omit the gap, but sets the attribute_address of the second draw to min_index(draw) - min_index(previous_draw). So it seems that only the vertex data from 0 up to the first used index is dropped. I guess this is also done for unused vertex data at the end of the buffer.
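If I read that right, the second draw's attribute base would be derived roughly like below; the multiplication by the attribute stride is my assumption (the addresses are in bytes), and all names are illustrative:

```c
#include <stdint.h>

/* Illustrative only: offset the attribute base of the current sub-draw by
 * the number of vertices between its first used index and the previous
 * sub-draw's first used index, converted to bytes via the attribute stride. */
static uint32_t
attribute_base_for_draw(uint32_t prev_base, unsigned prev_min_index,
                        unsigned cur_min_index, unsigned stride)
{
   return prev_base + (cur_min_index - prev_min_index) * stride;
}
```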
My question now is if, how and where I can adjust that buffer.
Because it seemed logical to me, I tried setting attribute 1 @ 0x1001450c in the second example in order to point it at vertex_buffer[1], but that doesn't seem to work.
Though there is huge room for refactoring and optimization, that's a working version now.
I'd appreciate it if you could give me some input on what I should look into before doing an MR.
Especially the memcpy and qsort should be the parts where it could slow down?
memcpy from a mmapped GPU buffer is slow. I think we'd better have the index array in CPU memory from the beginning. That means: in the info->has_user_indices case, use the user indices directly for the sort; in the !info->has_user_indices case, support a shadow BO: when a resource is created with PIPE_BIND_INDEX_BUFFER, allocate matching CPU memory for this BO, update both the CPU and GPU memory, and finally use the CPU memory for the sort.
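A minimal sketch of that shadow-copy idea, using a made-up struct instead of the real lima_resource, just to show the double write on upload (PIPE_BIND_INDEX_BUFFER is the real Gallium bind flag; everything else is hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Made-up stand-in for the driver resource: a write-combine GPU mapping plus
 * a plain malloc'ed CPU shadow that is kept byte-identical to the BO. */
struct shadow_index_buffer {
   void *gpu_map;        /* write-combine mapping of the BO    */
   uint8_t *cpu_shadow;  /* CPU copy used for sorting/scanning */
   size_t size;
};

/* Upload path: write both copies so the shadow always matches the BO and the
 * draw path can read/sort from cached CPU memory instead of the mapping. */
static void
shadow_index_buffer_write(struct shadow_index_buffer *buf, size_t offset,
                          const void *data, size_t size)
{
   memcpy((uint8_t *)buf->gpu_map + offset, data, size);
   memcpy(buf->cpu_shadow + offset, data, size);
}
```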
If I understand correctly, in the !has_user_indices case we'd have the index buffer in CPU mem and I can directly sort that one.
But in the has_user_indices case, I will still have to copy the user indices before sorting, because qsort changes that array?
I think you have to do the qsort in another temporary memory in both cases, because the shadow BO should keep the CPU and GPU memory the same (it will be reused), so you can't modify the CPU memory when sorting.
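Something like the following for the temporary copy and sort, using plain libc and assuming 16-bit indices; the shadow BO (or user index array) itself is left untouched:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
compare_u16(const void *a, const void *b)
{
   uint16_t ia = *(const uint16_t *)a;
   uint16_t ib = *(const uint16_t *)b;
   return (ia > ib) - (ia < ib);
}

/* Copy the indices into scratch memory and sort only the copy, so the CPU
 * shadow stays identical to the GPU BO and user index arrays stay unchanged.
 * Caller frees the returned buffer. */
static uint16_t *
sorted_index_copy(const uint16_t *indices, unsigned count)
{
   uint16_t *tmp = malloc(count * sizeof(*tmp));
   if (!tmp)
      return NULL;

   memcpy(tmp, indices, count * sizeof(*tmp));
   qsort(tmp, count, sizeof(*tmp), compare_u16);
   return tmp;
}
```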
It seems to work without issues and causes no regressions in deqp. I still need to find out how to benchmark it...
I thought of using https://gitlab.freedesktop.org/anholt/gpu-trace-perf/-/tree/master with apitrace, but for that I'd have to implement GL_EXT_timer_query first...
Does anyone have a good idea for an alternative way to measure it?
Any comments on the code are welcome anyway :)
[EDIT] What is still missing is some kind of caching to skip the memcpy and qsort in vbo_indexed_draw() if it's not needed...
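One possible shape for that cache, purely as an illustration (the fields and the serial-number idea are assumptions, not existing driver state): remember which index data the sorted copy and draw ranges were computed for, and only redo the memcpy/qsort when that changes.

```c
#include <stdbool.h>

/* Hypothetical cache: the result of the last split is only reused while the
 * draw references the same index data with the same start/count window. */
struct split_cache {
   const void *index_data;   /* resource or user pointer identity         */
   unsigned serial;          /* bumped whenever the index data is written */
   unsigned start, count;    /* index window of the cached draw           */
   /* ... cached sorted indices / VS draw ranges would live here ... */
};

static bool
split_cache_valid(const struct split_cache *c, const void *index_data,
                  unsigned serial, unsigned start, unsigned count)
{
   return c->index_data == index_data && c->serial == serial &&
          c->start == start && c->count == count;
}
```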
I think you first need to test the existing benchmarks to ensure this overhead doesn't cause any performance regression, e.g. q3a, supertuxkart, glmark2 and x11perf.