vk/queue: implement optimized submission merging

When no signals are present, queue submits can be merged before they reach the driver, greatly increasing throughput when many small submits are issued back to back.
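
The merging idea can be sketched roughly as follows. This is only an illustration, not the code in this change: `struct sketch_submit`, `sketch_can_merge` and `sketch_merge_submits` are hypothetical names, and the extra rule that the later submit must not wait on anything is a conservative assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queue submission.
 * This is NOT the real runtime struct; it only tracks counts. */
struct sketch_submit {
   unsigned wait_count;   /* semaphores waited on before execution */
   unsigned signal_count; /* semaphores/fences signaled on completion */
   unsigned cmdbuf_count; /* command buffers in this submission */
};

/* Assumption for this sketch: two adjacent submits may be merged when the
 * earlier one signals nothing and the later one waits on nothing, so no
 * synchronization point sits between them. */
static bool
sketch_can_merge(const struct sketch_submit *prev,
                 const struct sketch_submit *next)
{
   return prev->signal_count == 0 && next->wait_count == 0;
}

/* Merge adjacent submits in place and return the new submit count. */
static size_t
sketch_merge_submits(struct sketch_submit *submits, size_t count)
{
   assert(submits != NULL || count == 0);
   if (count == 0)
      return 0;

   size_t out = 0; /* index of the submit currently being accumulated */
   for (size_t i = 1; i < count; i++) {
      if (sketch_can_merge(&submits[out], &submits[i])) {
         /* Fold the later submit into the accumulated one: keep the
          * earlier waits, append the cmdbufs, inherit the later signals. */
         submits[out].cmdbuf_count += submits[i].cmdbuf_count;
         submits[out].signal_count = submits[i].signal_count;
      } else {
         /* A signal or wait sits between them: start a new submit. */
         submits[++out] = submits[i];
      }
   }
   return out + 1;
}
```

With 50 signal-free submits as input, this sketch hands a single submit to the driver, which corresponds to the submit_50noop pattern in the results.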

vkoverhead results:

intel/dg2
before:
 40, submit_noop, 101336, 100.0%
 41, submit_50noop, 2123, 2.1%
 42, submit_1cmdbuf, 35372, 34.9%
 43, submit_50cmdbuf, 713, 0.7%
 44, submit_50cmdbuf_50submit, 707, 0.7%
after:
 40, submit_noop, 106065, 100.0%
 41, submit_50noop, 105992, 99.9%
 42, submit_1cmdbuf, 35110, 33.1%
 43, submit_50cmdbuf, 709, 0.7%
 44, submit_50cmdbuf_50submit, 702, 0.7%

turnip/a740
before:
 40, submit_noop, 1227546, 100.0%
 41, submit_50noop, 26194, 2.1%
 42, submit_1cmdbuf, 1186327, 96.6%
 43, submit_50cmdbuf, 545341, 44.4%
 44, submit_50cmdbuf_50submit, 16531, 1.3%
after:
 40, submit_noop, 1313550, 100.0%
 41, submit_50noop, 1078383, 82.1%
 42, submit_1cmdbuf, 1129515, 86.0%
 43, submit_50cmdbuf, 329247, 25.1%
 44, submit_50cmdbuf_50submit, 484241, 36.9%

lavapipe
before:
 40, submit_noop, 1972672, 100.0%
 41, submit_50noop, 40334, 2.0%
 42, submit_1cmdbuf, 5994597, 303.9%
 43, submit_50cmdbuf, 2623720, 133.0%
 44, submit_50cmdbuf_50submit, 133453, 6.8%
after:
 40, submit_noop, 1980681, 100.0%
 41, submit_50noop, 1202374, 60.7%
 42, submit_1cmdbuf, 6340872, 320.1%
 43, submit_50cmdbuf, 2482127, 125.3%
 44, submit_50cmdbuf_50submit, 1165495, 58.8%

radv/gfx11
before:
 40, submit_noop, 19569683, 100.0%
 41, submit_50noop, 402324, 2.1%
 42, submit_1cmdbuf, 51356, 0.3%
 43, submit_50cmdbuf, 1840, 0.0%
 44, submit_50cmdbuf_50submit, 1031, 0.0%
after:
 40, submit_noop, 21008648, 100.0%
 41, submit_50noop, 4866415, 23.2%
 42, submit_1cmdbuf, 51294, 0.2%
 43, submit_50cmdbuf, 1823, 0.0%
 44, submit_50cmdbuf_50submit, 1828, 0.0%
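
In general, cases #41 and #44 see increases on the order of 1000% or more. The exception is #44 on intel/dg2 and radv/gfx11, where throughput is flat or less than doubles.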
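
For reference, the percentage figures can be reproduced with a small helper like the one below. It is a sketch only: `sketch_percent_increase` is a hypothetical name, and it assumes the first block of numbers for each driver is the baseline and the second is the result with merging enabled.

```c
#include <assert.h>

/* Percentage increase from `before` to `after`.
 * Example: going from 100 to 200 is a 100% increase. */
static double
sketch_percent_increase(double before, double after)
{
   assert(before > 0.0);
   return (after - before) / before * 100.0;
}
```

Under that assumption, case #41 on intel/dg2 goes from 2123 to 105992, which is an increase of well over 1000%, while case #44 on the same driver (707 to 702) is effectively unchanged.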