Support udmabuf to reduce texture copies
As suggested on the mailing list a while ago, a memfd-backed guest resource can be exported through a udmabuf fd and memory mapped. This can avoid texture copies in some cases; see the `need_temp` cases in `vrend_renderer_transfer_write_iov` and `vrend_transfer_send_readpixels`.
I have a prototype as an initial point for discussion:
- virglrenderer branch
- qemu branch (based on 4.0, not master, and on an old branch from @kraxel)
In the current implementation, qemu creates udmabuf fds from the iovecs of a virgl texture resource (guest memory has to be backed by memfd). virglrenderer then memory maps these fds and uses them to avoid temporary copies of textures.
Some more details to answer @gurchetansingh 's points made here:
- Yes, it is possible to use udmabuf to reduce a copy for textures. I used glbench to benchmark this a while back (see crrev.com/c/1325515) and saw a 10%-30% performance increase for larger textures.
This is what I am seeing for the glbench texture-related tests. Average of 10 runs for each setup (noudmabuf, udmabuf); all numbers are in `mtexel_sec`:
TEST: `glbench -notemp -tests texture_upload`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_upload_rgba_teximage2d_32 | 697.376 | 690.131 | -1.04% |
| texture_upload_rgba_teximage2d_128 | 2892.858 | 3470.183 | 19.96% |
| texture_upload_rgba_teximage2d_256 | 3926.493 | 4669.321 | 18.92% |
| texture_upload_rgba_teximage2d_512 | 3963.561 | 4811.793 | 21.40% |
| texture_upload_rgba_teximage2d_768 | 4111.886 | 4525.643 | 10.06% |
| texture_upload_rgba_teximage2d_1024 | 3708.914 | 4530.04 | 22.14% |
| texture_upload_rgba_teximage2d_1536 | 3114.532 | 4165.722 | 33.75% |
| texture_upload_rgba_teximage2d_2048 | 3148.042 | 3916.839 | 24.42% |
| texture_upload_rgba_texsubimage2d_32 | 702.063 | 680.485 | -3.07% |
| texture_upload_rgba_texsubimage2d_128 | 2792.765 | 3485.92 | 24.82% |
| texture_upload_rgba_texsubimage2d_256 | 3807.996 | 4688.385 | 23.12% |
| texture_upload_rgba_texsubimage2d_512 | 4038.557 | 4826.141 | 19.50% |
| texture_upload_rgba_texsubimage2d_768 | 4110.97 | 4537.566 | 10.38% |
| texture_upload_rgba_texsubimage2d_1024 | 3756.719 | 4426.343 | 17.82% |
| texture_upload_rgba_texsubimage2d_1536 | 3157.429 | 4135.161 | 30.97% |
| texture_upload_rgba_texsubimage2d_2048 | 3130.843 | 3991.238 | 27.48% |
TEST: `glbench -notemp -tests texture_update`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_update_rgba_teximage2d_32 | 2.107 | 1.467 | -30.37% |
| texture_update_rgba_teximage2d_128 | 13.554 | 13.021 | -3.93% |
| texture_update_rgba_teximage2d_256 | 24.193 | 24.249 | 0.23% |
| texture_update_rgba_teximage2d_512 | 94.577 | 94.228 | -0.37% |
| texture_update_rgba_teximage2d_768 | 194.752 | 198.76 | 2.06% |
| texture_update_rgba_teximage2d_1024 | 328.036 | 342.557 | 4.43% |
| texture_update_rgba_teximage2d_1536 | 572.834 | 642.325 | 12.13% |
| texture_update_rgba_teximage2d_2048 | 821.91 | 975.586 | 18.70% |
| texture_update_rgba_texsubimage2d_32 | 2.394 | 1.951 | -18.50% |
| texture_update_rgba_texsubimage2d_128 | 18.383 | 12.724 | -30.78% |
| texture_update_rgba_texsubimage2d_256 | 24.573 | 24.882 | 1.26% |
| texture_update_rgba_texsubimage2d_512 | 94.098 | 94.512 | 0.44% |
| texture_update_rgba_texsubimage2d_768 | 195.988 | 205.507 | 4.86% |
| texture_update_rgba_texsubimage2d_1024 | 331.777 | 343.664 | 3.58% |
| texture_update_rgba_texsubimage2d_1536 | 564.851 | 658.82 | 16.64% |
| texture_update_rgba_texsubimage2d_2048 | 835.523 | 977.861 | 17.04% |
TEST: `glbench -notemp -tests texture_reuse`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_reuse_rgba_teximage2d_32 | 11.471 | 12.525 | 9.19% |
| texture_reuse_rgba_teximage2d_128 | 47.441 | 47.625 | 0.39% |
| texture_reuse_rgba_teximage2d_256 | 172.742 | 175.351 | 1.51% |
| texture_reuse_rgba_teximage2d_512 | 505.036 | 566.322 | 12.13% |
| texture_reuse_rgba_teximage2d_768 | 898.004 | 1029.102 | 14.60% |
| texture_reuse_rgba_teximage2d_1024 | 1216.35 | 1435.157 | 17.99% |
| texture_reuse_rgba_teximage2d_1536 | 1602.588 | 2139.039 | 33.47% |
| texture_reuse_rgba_teximage2d_2048 | 1890.152 | 2454.258 | 29.84% |
| texture_reuse_rgba_texsubimage2d_32 | 16.033 | 15.644 | -2.43% |
| texture_reuse_rgba_texsubimage2d_128 | 47.719 | 47.745 | 0.05% |
| texture_reuse_rgba_texsubimage2d_256 | 173.207 | 179.674 | 3.73% |
| texture_reuse_rgba_texsubimage2d_512 | 510.871 | 571.642 | 11.90% |
| texture_reuse_rgba_texsubimage2d_768 | 919.509 | 1022.178 | 11.17% |
| texture_reuse_rgba_texsubimage2d_1024 | 1225.409 | 1446.504 | 18.04% |
| texture_reuse_rgba_texsubimage2d_1536 | 1608.004 | 2120.99 | 31.90% |
| texture_reuse_rgba_texsubimage2d_2048 | 1928.608 | 2481.969 | 28.69% |
Note: I am currently using guest mesa 19.1.3 in my tests. If it's better to test with master, let me know and I'll rerun.
> You probably don't want to use udmabuf always -- just when the cost of the memcpy outweighs the cost of the mapping.
I agree we have to find a sweet spot. The current code attempts to memory map all `VREND_RESOURCE_STORAGE_TEXTURE` resources regardless of size, which is overkill. The glbench `texture_update` tests above show a higher sensitivity to texture size: using mmap for `texture_update_rgba_teximage2d_32` or `texture_update_rgba_teximage2d_128` clearly hurts, while `texture_upload` and `texture_reuse` are less negatively affected at small sizes. I haven't looked into the specific glbench tests yet to understand their differences. Overall, all three test types start to benefit at texture dimensions of 768 or higher.
But perhaps it would be best to make decisions based on real-world apps/games.
> Use Bioshock Infinite trace (see #109) and see if udmabuf makes a difference. It does 4MB buffer uploads (most games do 50kb at most) and with that we can measure the cost of separate memory copies.
The Bioshock traces run horribly slowly even on my hosts (amdgpu and i965), so I haven't tried them on virgl/guest yet. The #109 slowness also seems to happen on host replays for me; not sure why yet. I am using a Release build of apitrace.
> Run any other benchmark you can think of ;-)
I haven't seen a statistically significant improvement for the Unigine Valley and Team Fortress 2 traces (taken from #73 (closed); these run fine on hosts and guests), or for glmark2.
> Since texture uploads aren't as frequent as buffer uploads in games, it may be difficult to see clear-cut performance impact in real-life games.
I see. I assume buffer object uploads are much smaller than texture uploads, so it wouldn't make sense to look into these. But if you think dmabufs could be beneficial for non-texture resources, let me know.
If we want to continue with this optimization, the qemu functionality needs to be submitted to upstream qemu, of course; I have not attempted that yet.