Support udmabuf to reduce texture copies
As suggested on the mailing list a while ago, a memfd-backed guest resource can be exported through a udmabuf fd and memory mapped. This can avoid texture copies in some cases; see the `need_temp` cases in `vrend_renderer_transfer_write_iov` and `vrend_transfer_send_readpixels`.
I have a prototype as an initial point for discussion:
- virglrenderer branch
- qemu branch (based on 4.0, not master, and on an old branch from @kraxel)
In the current implementation, qemu creates udmabuf fds from the iovecs of a virgl texture resource (guest memory has to be backed by memfd). virglrenderer then memory maps these fds and uses them to avoid temporary copies of textures.
Some more details to answer @gurchetansingh 's points made here:
- Yes, it is possible to use udmabuf to reduce a copy for textures. I used glbench to benchmark this a while back (see crrev.com/c/1325515) and saw a 10%-30% performance increase for larger textures.
This is what I am seeing for the glbench texture-related tests. Average of 10 runs for each setup (noudmabuf, udmabuf); all numbers are in `mtexel_sec`:
TEST: `glbench -notemp -tests texture_upload`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_upload_rgba_teximage2d_32 | 697.376 | 690.131 | -1.04% |
| texture_upload_rgba_teximage2d_128 | 2892.858 | 3470.183 | 19.96% |
| texture_upload_rgba_teximage2d_256 | 3926.493 | 4669.321 | 18.92% |
| texture_upload_rgba_teximage2d_512 | 3963.561 | 4811.793 | 21.40% |
| texture_upload_rgba_teximage2d_768 | 4111.886 | 4525.643 | 10.06% |
| texture_upload_rgba_teximage2d_1024 | 3708.914 | 4530.04 | 22.14% |
| texture_upload_rgba_teximage2d_1536 | 3114.532 | 4165.722 | 33.75% |
| texture_upload_rgba_teximage2d_2048 | 3148.042 | 3916.839 | 24.42% |
| texture_upload_rgba_texsubimage2d_32 | 702.063 | 680.485 | -3.07% |
| texture_upload_rgba_texsubimage2d_128 | 2792.765 | 3485.92 | 24.82% |
| texture_upload_rgba_texsubimage2d_256 | 3807.996 | 4688.385 | 23.12% |
| texture_upload_rgba_texsubimage2d_512 | 4038.557 | 4826.141 | 19.50% |
| texture_upload_rgba_texsubimage2d_768 | 4110.97 | 4537.566 | 10.38% |
| texture_upload_rgba_texsubimage2d_1024 | 3756.719 | 4426.343 | 17.82% |
| texture_upload_rgba_texsubimage2d_1536 | 3157.429 | 4135.161 | 30.97% |
| texture_upload_rgba_texsubimage2d_2048 | 3130.843 | 3991.238 | 27.48% |
TEST: `glbench -notemp -tests texture_update`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_update_rgba_teximage2d_32 | 2.107 | 1.467 | -30.37% |
| texture_update_rgba_teximage2d_128 | 13.554 | 13.021 | -3.93% |
| texture_update_rgba_teximage2d_256 | 24.193 | 24.249 | 0.23% |
| texture_update_rgba_teximage2d_512 | 94.577 | 94.228 | -0.37% |
| texture_update_rgba_teximage2d_768 | 194.752 | 198.76 | 2.06% |
| texture_update_rgba_teximage2d_1024 | 328.036 | 342.557 | 4.43% |
| texture_update_rgba_teximage2d_1536 | 572.834 | 642.325 | 12.13% |
| texture_update_rgba_teximage2d_2048 | 821.91 | 975.586 | 18.70% |
| texture_update_rgba_texsubimage2d_32 | 2.394 | 1.951 | -18.50% |
| texture_update_rgba_texsubimage2d_128 | 18.383 | 12.724 | -30.78% |
| texture_update_rgba_texsubimage2d_256 | 24.573 | 24.882 | 1.26% |
| texture_update_rgba_texsubimage2d_512 | 94.098 | 94.512 | 0.44% |
| texture_update_rgba_texsubimage2d_768 | 195.988 | 205.507 | 4.86% |
| texture_update_rgba_texsubimage2d_1024 | 331.777 | 343.664 | 3.58% |
| texture_update_rgba_texsubimage2d_1536 | 564.851 | 658.82 | 16.64% |
| texture_update_rgba_texsubimage2d_2048 | 835.523 | 977.861 | 17.04% |
TEST: `glbench -notemp -tests texture_reuse`

| test | noudmabuf | udmabuf | increase |
| --- | --- | --- | --- |
| texture_reuse_rgba_teximage2d_32 | 11.471 | 12.525 | 9.19% |
| texture_reuse_rgba_teximage2d_128 | 47.441 | 47.625 | 0.39% |
| texture_reuse_rgba_teximage2d_256 | 172.742 | 175.351 | 1.51% |
| texture_reuse_rgba_teximage2d_512 | 505.036 | 566.322 | 12.13% |
| texture_reuse_rgba_teximage2d_768 | 898.004 | 1029.102 | 14.60% |
| texture_reuse_rgba_teximage2d_1024 | 1216.35 | 1435.157 | 17.99% |
| texture_reuse_rgba_teximage2d_1536 | 1602.588 | 2139.039 | 33.47% |
| texture_reuse_rgba_teximage2d_2048 | 1890.152 | 2454.258 | 29.84% |
| texture_reuse_rgba_texsubimage2d_32 | 16.033 | 15.644 | -2.43% |
| texture_reuse_rgba_texsubimage2d_128 | 47.719 | 47.745 | 0.05% |
| texture_reuse_rgba_texsubimage2d_256 | 173.207 | 179.674 | 3.73% |
| texture_reuse_rgba_texsubimage2d_512 | 510.871 | 571.642 | 11.90% |
| texture_reuse_rgba_texsubimage2d_768 | 919.509 | 1022.178 | 11.17% |
| texture_reuse_rgba_texsubimage2d_1024 | 1225.409 | 1446.504 | 18.04% |
| texture_reuse_rgba_texsubimage2d_1536 | 1608.004 | 2120.99 | 31.90% |
| texture_reuse_rgba_texsubimage2d_2048 | 1928.608 | 2481.969 | 28.69% |
Note: I am currently using guest mesa 19.1.3 in my tests. If it's better to test with master, let me know and I'll rerun.
> You probably don't want to use udmabuf always -- just when the cost of the memcpy outweighs the cost of the mapping.
I agree we have to find a sweet spot. The current code attempts to memory map all `VREND_RESOURCE_STORAGE_TEXTURE` resources regardless of size, which is overkill. The glbench `texture_update` tests above show a higher sensitivity to texture size: using mmap for `texture_update_rgba_teximage2d_32` or `texture_update_rgba_teximage2d_128` clearly hurts, while `texture_upload` and `texture_reuse` are less negatively affected at small sizes. I haven't looked into the specific glbench tests yet to understand their differences. Overall, all three test types start to benefit at texture dimensions of 768 or higher.
But perhaps it would be best to make decisions based on real-world apps/games.
> Use Bioshock Infinite trace (see #109) and see if udmabuf makes a difference. It does 4MB buffer uploads (most games do 50kb at most) and with that we can measure the cost of separate memory copies.
The Bioshock traces run horribly slowly even on my hosts (amdgpu and i965), so I haven't tried them on virgl/guest yet. The #109 slowness also seems to happen on host replays for me; not sure why yet. I am using a Release build of apitrace.
> Run any other benchmark you can think of ;-)
I haven't seen a statistically significant improvement for the Unigine Valley and Team Fortress 2 traces (taken from #73 (closed); these run fine on hosts and guests), or for glmark2.
> Since texture uploads aren't as frequent as buffer uploads in games, it may be difficult to see clear-cut performance impact in real-life games.
I see. I assume buffer object uploads are much smaller than texture uploads, so it wouldn't make sense to look into these. But if you think dmabufs could be beneficial for non-texture resources, let me know.
If we want to continue with this optimization, the qemu functionality needs to be submitted to upstream qemu, of course; I have not attempted that yet.