v3d: reduce shader cache lookup overhead
While hacking on my Raspberry Pi over the weekend I noticed some performance issues when changing textures between draw calls (reusing the same fragment / vertex shader).
After digging around, I found the root cause: the Broadcom driver uses the sampler state when compiling shaders, so on every draw where a different texture has been bound it has to check whether the shader needs to be recompiled (https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/gallium/drivers/v3d/v3d_program.c#L490). As part of this check, it builds a rather large ~180 byte key from various state, hashes it, and looks the hash up in the shader cache. The default FNV-1a hash function was eating up a significant amount of the run time hashing these keys.
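To illustrate why FNV-1a gets expensive on keys of this size, here is a minimal sketch of the textbook 32-bit FNV-1a loop. It is not copied from Mesa's `_mesa_hash_data`, and the name `fnv1a_32` is made up for this example; only the constants are the standard FNV ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a constants. */
#define FNV32_OFFSET_BASIS 2166136261u
#define FNV32_PRIME        16777619u

/* Hash `size` bytes starting at `data` with 32-bit FNV-1a.
 * (Illustrative helper, not the Mesa implementation.)
 * Each input byte costs one XOR and one 32-bit multiply, and every
 * step depends on the previous one, so a ~180 byte key means ~180
 * serialized multiplies per lookup. */
static uint32_t
fnv1a_32(const void *data, size_t size)
{
   const uint8_t *bytes = data;
   uint32_t hash = FNV32_OFFSET_BASIS;

   assert(data != NULL || size == 0);

   for (size_t i = 0; i < size; i++) {
      hash ^= bytes[i];
      hash *= FNV32_PRIME;
   }
   return hash;
}
```

Hashes that consume several bytes per step (as xxhash does) need far fewer dependent operations for the same key, which is presumably where the difference in the profiles comes from.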
With a small modification to one of the Mesa demos (DEMO.patch) you can easily see the overhead with perf:
# Samples: 33K of event 'cpu-clock:uhH'
# Event count (approx.): 8456250000
#
# Overhead Command Shared Object Symbol
# ........ ............... .................. ..............................................................
#
18.63% glslstateschang vc4_dri.so [.] _mesa_hash_data
4.54% glslstateschang vc4_dri.so [.] v3d_write_uniforms
3.95% glslstateschang vc4_dri.so [.] v3d_draw_vbo
3.27% glslstateschang libarmmem-v7l.so [.] memcmp
3.21% glslstateschang vc4_dri.so [.] set_search
2.37% glslstateschang vc4_dri.so [.] _mesa_reference_texobj_
1.97% glslstateschang vc4_dri.so [.] _mesa_reference_program_
1.76% glslstateschang vc4_dri.so [.] hash_table_search
1.71% glslstateschang vc4_dri.so [.] v3d41_emit_state
1.40% glslstateschang vc4_dri.so [.] st_setup_arrays
By replacing the FNV-1a hash function with xxhash for these large keys, the key hashing drops from ~18.6% to ~1% of the run time (see https://aras-p.info/img/blog/2016-08/hash2-pc.png for a comparison of hash rates vs data size for various popular hash functions):
# Samples: 33K of event 'cpu-clock:uhH'
# Event count (approx.): 8356750000
#
# Overhead Command Shared Object Symbol
# ........ ............... .................. ..................................................................
#
5.18% glslstateschang vc4_dri.so [.] v3d_write_uniforms
4.69% glslstateschang vc4_dri.so [.] v3d_draw_vbo
3.90% glslstateschang libarmmem-v7l.so [.] memcmp
3.61% glslstateschang vc4_dri.so [.] set_search
2.79% glslstateschang vc4_dri.so [.] _mesa_reference_texobj_
2.24% glslstateschang vc4_dri.so [.] _mesa_reference_program_
2.06% glslstateschang vc4_dri.so [.] v3d41_emit_state
1.86% glslstateschang vc4_dri.so [.] hash_table_search
1.77% glslstateschang vc4_dri.so [.] st_setup_arrays
1.76% glslstateschang vc4_dri.so [.] st_update_rasterizer
1.61% glslstateschang libgcc_s.so.1 [.] __aeabi_uidiv
1.44% glslstateschang libarmmem-v7l.so [.] memcpy
1.42% glslstateschang vc4_dri.so [.] _mesa_update_texture_state
1.40% glslstateschang vc4_dri.so [.] cso_hash_find
1.35% glslstateschang vc4_dri.so [.] _mesa_update_vao_derived_arrays
1.22% glslstateschang vc4_dri.so [.] util_set_vertex_buffers_mask
1.20% glslstateschang vc4_dri.so [.] u_vbuf_set_vertex_buffers
1.17% glslstateschang vc4_dri.so [.] _mesa_reference_shader_program_
1.17% glslstateschang vc4_dri.so [.] v3d_update_compiled_shaders
1.16% glslstateschang vc4_dri.so [.] st_convert_sampler
1.15% glslstateschang vc4_dri.so [.] XXH32 (the new xxhash routine)
I didn't compare against other popular hash routines, since xxhash seems to be a top contender in most benchmarks for larger chunks of data and the code looks well written and portable. I've attached an initial patch (XXHASH.patch) which drops in xxhash and updates the shader cache hash routines to use it, if anyone wants to test it themselves.
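For anyone who wants to compare hash routines outside the driver, a rough standalone timing harness could look like the sketch below. Everything in it (`hash_fn`, `time_key_hash`, `bench_fnv1a_32`, the 180 byte `KEY_SIZE`) is a hypothetical name or value chosen for this example and is not part of either attached patch; it only measures raw hashing cost, not the rest of the shader cache lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define KEY_SIZE 180 /* assumption: roughly the key size mentioned above */

/* Any hash with this shape can be plugged into the harness. */
typedef uint32_t (*hash_fn)(const void *data, size_t size);

/* Textbook 32-bit FNV-1a, used here as the baseline. */
static uint32_t
bench_fnv1a_32(const void *data, size_t size)
{
   const uint8_t *bytes = data;
   uint32_t hash = 2166136261u;

   for (size_t i = 0; i < size; i++) {
      hash ^= bytes[i];
      hash *= 16777619u;
   }
   return hash;
}

/* Hash a KEY_SIZE-byte key `iterations` times with `fn` and return the
 * elapsed CPU time in seconds. The XOR of all hashes is stored through
 * `sink` so the compiler cannot discard the loop. */
static double
time_key_hash(hash_fn fn, unsigned iterations, uint32_t *sink)
{
   uint8_t key[KEY_SIZE];
   uint32_t acc = 0;

   assert(fn != NULL && sink != NULL);

   /* Fill the key with arbitrary but deterministic bytes. */
   for (size_t i = 0; i < sizeof(key); i++)
      key[i] = (uint8_t)(i * 31u + 7u);

   clock_t start = clock();
   for (unsigned i = 0; i < iterations; i++) {
      /* Change a few key bytes each round so every hash differs. */
      memcpy(key, &i, sizeof(i));
      acc ^= fn(key, sizeof(key));
   }
   clock_t end = clock();

   *sink = acc;
   return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To time xxhash with the same harness, one option is a small wrapper of type `hash_fn` that calls `XXH32(data, size, 0)` from `xxhash.h`, then comparing `time_key_hash(bench_fnv1a_32, n, &sink)` against the wrapper for the same `n`.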
Adding a new dependency to solve this isn't my favorite solution, so I'd love to get some feedback and see if anyone has other ideas.