venus: optimize template based descriptor set update and push
Summary:
- commit 1~2: tiny issue fixes
- commit 3: simplify push descriptor tracking
- commit 4: optimize template data calculation (also make ptr math legit)
- commit 5: optimize descriptor image info fix
- commit 6: use STACK_ARRAY for template based set update and push to get rid of locking
- commit 7: clean up the prior template set update bits (split from commit 6 to ease review, can stash if preferred)
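For reviewers less familiar with the pattern behind commit 6, here is a minimal standalone sketch of per-call scratch storage replacing a shared, lock-protected buffer. It is not the venus code: `sketch_write`, `sketch_scratch`, `SKETCH_INLINE_COUNT` and `sketch_update_with_template` are hypothetical names, the inline threshold of 8 is an assumption, and the heap fallback only imitates the general idea of a stack array with a malloc slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical inline capacity, chosen only for this sketch. */
#define SKETCH_INLINE_COUNT 8

/* Hypothetical stand-in for one descriptor write produced from a template. */
struct sketch_write {
   uint32_t binding;
   uint64_t handle;
};

/* Per-call scratch: small counts use inline storage, larger ones use malloc. */
struct sketch_scratch {
   struct sketch_write inline_storage[SKETCH_INLINE_COUNT];
   struct sketch_write *data;
   bool on_heap;
};

static bool
sketch_scratch_init(struct sketch_scratch *s, size_t count)
{
   s->on_heap = count > SKETCH_INLINE_COUNT;
   s->data = s->on_heap ? malloc(count * sizeof(*s->data))
                        : s->inline_storage;
   return s->data != NULL;
}

static void
sketch_scratch_finish(struct sketch_scratch *s)
{
   if (s->on_heap)
      free(s->data);
}

/* Fills the scratch from the caller's data and returns the sum of the
 * handles as a stand-in for encoding the update. Each call owns its scratch,
 * so no lock is needed even when several threads update concurrently.
 */
static uint64_t
sketch_update_with_template(const uint64_t *handles, size_t count,
                            bool *used_heap)
{
   assert(handles != NULL || count == 0);

   struct sketch_scratch scratch;
   if (!sketch_scratch_init(&scratch, count))
      return 0;

   for (size_t i = 0; i < count; i++) {
      scratch.data[i].binding = (uint32_t)i;
      scratch.data[i].handle = handles[i];
   }

   uint64_t sum = 0;
   for (size_t i = 0; i < count; i++)
      sum += scratch.data[i].handle;

   *used_heap = scratch.on_heap;
   sketch_scratch_finish(&scratch);
   return sum;
}
```

In this sketch a 16-entry update exceeds the inline capacity and takes the malloc path, which is the kind of slow path the 16 combined sampler test is meant to exercise.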
At the bare minimum, commit 6 introduces no regression. The new vn_descriptor_set_fill_update_with_template has slightly more overhead than the prior vn_update_descriptor_set_with_template_locked when only commit 6 is reverted (with the whole MR applied it is still faster than the call was before this MR). Overall, however, I consistently see a reduction in CPU overhead, because the lock by itself has non-trivial overhead.
Attaching flamegraphs for vkoverhead test 56 descriptor_template_16combined_sampler (to hit the suboptimal path of STACK_ARRAY):
The above were collected with:
- debug venus build with asserts disabled
- release builds of anv and vkr (with the real set update call short-circuited so that the test is CPU bound on the driver side)
An easy way to compare is to check the percentage of an untouched major call, vn_async_vkUpdateDescriptorSets, in both graphs:
- Before this MR: 92.56%
- After this MR: 95.26%
The lock overhead would stand out more if the engine only does tiny updates. For engines that update descriptors with templates and record commands in multiple threads, we also hit lock contention, which makes this worse.
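To turn the two percentages into a rough time estimate, the arithmetic can be sketched as below. This assumes that the absolute cost of vn_async_vkUpdateDescriptorSets is identical in both runs and that each percentage is its share of the whole profile; the function names are hypothetical and the result is only an estimate derived from the graphs, not a separate measurement.

```c
#include <assert.h>

/* If the untouched call costs the same absolute time C in both runs, then
 * C = share_before * T_before = share_after * T_after, so
 * T_after / T_before = share_before / share_after.
 */
static double
sketch_total_time_ratio(double share_before, double share_after)
{
   assert(share_before > 0.0 && share_after > 0.0);
   return share_before / share_after;
}

/* Ratio of the time spent outside the untouched call (after / before),
 * with both sides normalized to the cost of the untouched call.
 */
static double
sketch_other_time_ratio(double share_before, double share_after)
{
   assert(share_before > 0.0 && share_before < 100.0);
   assert(share_after > 0.0 && share_after <= 100.0);
   double other_before = (100.0 - share_before) / share_before;
   double other_after = (100.0 - share_after) / share_after;
   return other_after / other_before;
}
```

With 92.56 and 95.26 as inputs, the first ratio is roughly 0.97 (about 2.8% less total time) and the second is roughly 0.62, i.e. the time spent outside the untouched call drops by a bit more than a third under the stated assumption.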