util_cpu_detect is an anti-pattern: it relies on callers high up in the call chain initializing a local implementation detail. As a real example, I added:
...a Mali compiler unit test
...that called bi_imm_f16() to construct an FP16 immediate
...that calls _mesa_float_to_half internally
...that calls util_get_cpu_caps internally, but only on x86_64!
...that relies on util_cpu_detect having been called before.
As a consequence, this unit test:
...crashes on x86_64 with USE_X86_64_ASM set
...passes on every other architecture
...works on my local arm64 workstation and on my test board
...failed CI which runs on x86_64
...needed to have a random util_cpu_detect() call sprinkled in.
This is a bad design decision. It pollutes the tree with magic, it causes mysterious CI failures especially for non-x86_64 developers, and it is not justified by a micro-optimization.
Instead, let's call util_cpu_detect directly from util_get_cpu_caps, avoiding the footgun where it fails to be called. This cleans up Mesa's design, simplifies the tree, and avoids a class of (possibly platform-specific) failures. To mitigate the added overhead, inline util_cpu_detect into util_get_cpu_caps now that it has a single caller.
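To illustrate the idea of detecting on first use, here is a minimal sketch. It is not the code in this change: struct cpu_caps, detect_cpu_caps and get_cpu_caps are hypothetical stand-ins, the detection body is a placeholder, and POSIX pthread_once is assumed as the once-only primitive.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU caps structure. */
struct cpu_caps {
   bool detected;
   int detect_calls; /* only here to make the once-only behaviour observable */
};

static struct cpu_caps caps;
static pthread_once_t caps_once = PTHREAD_ONCE_INIT;

/* Placeholder for the real detection work (the util_cpu_detect body). */
static void
detect_cpu_caps(void)
{
   caps.detect_calls++;
   caps.detected = true;
}

/* The getter triggers detection itself, so no caller has to remember to. */
static inline const struct cpu_caps *
get_cpu_caps(void)
{
   pthread_once(&caps_once, detect_cpu_caps);
   return &caps;
}
```

With this shape, a unit test that reaches the getter through several layers of helpers gets initialized caps without an explicit detect call anywhere in the test.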
In principle, this adds only a single check of overhead to the happy path (for the call-once check). In practice, this overhead might be worse than a single load+branch due to multi-threading. If this is an issue, the CPU caps data structure could be duplicated per thread (thread_local) and populated independently by each thread to avoid that. Nevertheless, given the overwhelming design problems with the status quo, this change is required; if you need additional optimization, the onus is on you to do so in a non-invasive way and to provide real data justifying the change.
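The per-thread alternative mentioned above could look roughly like the following sketch. This is not part of this change: struct tls_cpu_caps and get_cpu_caps_tls are hypothetical names, the detection body is a placeholder, and C11 _Thread_local is assumed to be available.

```c
#include <stdbool.h>

/* Hypothetical per-thread copy of the caps structure. */
struct tls_cpu_caps {
   bool detected;
   int detect_calls; /* counts detection runs on the current thread */
};

/* Each thread gets its own zero-initialized instance. */
static _Thread_local struct tls_cpu_caps tls_caps;

static inline const struct tls_cpu_caps *
get_cpu_caps_tls(void)
{
   /* Plain load+branch: no other thread ever touches this copy. */
   if (!tls_caps.detected) {
      tls_caps.detect_calls++; /* placeholder for the real detection work */
      tls_caps.detected = true;
   }
   return &tls_caps;
}
```

The trade-off is that every thread repeats the detection once and holds its own copy of the data, in exchange for a fast path with no cross-thread synchronization.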
Bifrost shader-db runtime on my Apple M1 (bare metal Linux) is hurt by <0.5%.
Signed-off-by: Alyssa Rosenzweig email@example.com