
nir: add infrastructure to precompile OpenCL kernels, and switch Asahi to it

Alyssa Rosenzweig requested to merge alyssa/mesa:nir+agx/precomp into main

Contains !32207 (closed) and !32209 (closed).

Short version: All we have to write is

KERNEL(32)
libagx_predicate_indirect(global uint32_t *out, constant uint32_t *in,
                          constant uint32_t *draw_count, uint32_t stride_el,
                          uint indexed__2)
{
   uint draw = get_global_id(0);
   uint words = indexed__2 ? 5 : 4;
   bool enabled = draw < *draw_count;
   out += draw * words;
   in += draw * stride_el;

   /* Copy enabled draws, zero predicated draws. */
   for (uint i = 0; i < words; ++i) {
      out[i] = enabled ? in[i] : 0;
   }
}

and then in the driver we dispatch that with just

   uint64_t patched = hk_pool_alloc(cmd, out_stride * maxDrawCount, 4).gpu;
   uint64_t in = hk_buffer_address(buffer, offset);
   uint64_t count_addr = hk_buffer_address(count_buffer, countBufferOffset);

   libagx_predicate_indirect(cs, agx_1d(maxDrawCount), indexed, patched, in,
                             count_addr, stride / 4);

...and that's it. No boilerplate, no explicit data layouts, no meta save/restore overheads, no extra indirections. Just write code for the GPU and call it from the driver.

...Now coming to a driver near you!

The long version:

 * This file contains helpers for precompiling OpenCL kernels with a Mesa driver
 * and dispatching them from within the driver. It is a grab bag of utility
 * functions, rather than an all-in-one solution, to give drivers flexibility to
 * customize the compile pipeline. See asahi_clc for how the pieces fit
 * together, and see libagx for real world examples of this infrastructure.
 *
 * Why OpenCL C?
 *
 * 1. Mesa drivers are generally written in C. OpenCL C is close enough to C11
 *    that we can share driver code between host and device. This is the "killer
 *    feature" and enables implementing device-generated commands in a sane way.
 *    Both generated (e.g. GenXML) headers and entire complex driver logic may
 *    be shared for a major maintenance win.
 *
 * 2. OpenCL C has significantly better ergonomics than GLSL, particularly around
 *    raw pointers. Plainly, GLSL was never designed as a systems language. What
 *    we need for implementing driver features on-device is a systems language,
 *    not a shading language.
 *
 * 3. OpenCL is the compute standard, and it is supported in Mesa via rusticl.
 *    Using OpenCL in our drivers is a way of "eating our own dog food". If Mesa
 *    based OpenCL isn't good enough for us, it's not good enough for our users
 *    either.
 *
 * 4. OpenCL C has enough affordances for GPUs that it is suitable for GPU use,
 *    whereas pure C11 probably would not be.
 *
 * Why precompile?
 *
 * 1. Precompiling lets us do build-time reflection on internal shaders to
 *    generate data layouts and dispatch macros automatically. The precompile
 *    pipeline implemented in this file offers significantly better ergonomics
 *    than handrolling kernels at runtime.
 *
 * 2. Compiling internal shaders at draw-time can introduce jank. Compiling
 *    internal shaders with application shaders slows down application shader
 *    compile time (and might still introduce jank in a hash-and-cache scheme).
 *    Compiling shaders at device creation time slows down initialization. The
 *    only time we can compile with no performance impact is when building the
 *    driver ahead-of-time.
 *
 * 3. Mesa is built (on developer and packager machines) far less often than it
 *    is run (on user machines). Compiling at build-time is simply more
 *    efficient in a global sense.
 *
 * 4. Compiling /all/ internal shaders with the Mesa build can turn runtime
 *    assertion failures into build failures, allowing backend compilers to be
 *    smoke-tested without hardware testing and hence allowing regressions to be
 *    caught sooner.
 *
 * At a high level, a library of kernels is compiled to SPIR-V. That SPIR-V is
 * then translated to NIR and optimized, leaving many entrypoints. Each NIR
 * entrypoint represents one `kernel` to be precompiled.
 *
 * Kernels generally have arguments. Arguments may be either scalars or
 * pointers. It is not necessary to explicitly define a data layout for the
 * arguments. You simply declare arguments to the OpenCL side kernel:
 *
 *    KERNEL(1) void foo(int x, int y) { .. }
 *
 * The data layout is automatically derived from the function signature
 * (nir_precomp_derive_layout). The data layout is exposed to the CPU as
 * structures (nir_precomp_print_layout_struct).
 *
 *    struct foo_args {
 *       uint32_t x;
 *       uint32_t y;
 *    } PACKED;
 *
 * The data is expected to be mapped to something like Vulkan push constants in
 * the hardware. The driver defines a callback to load an argument given a byte
 * offset (e.g. via load_push_constant intrinsics). When building a variant,
 * nir_precomp_build_variant will load the arguments according to the chosen
 * layout:
 *
 *    %0 = load_push_constant 0
 *    %1 = load_push_constant 4
 *    ...
 *
 * This ensures that data layouts match between CPU and GPU, without any
 * boilerplate, while giving drivers control over exactly how arguments are
 * passed. (This can save an indirection compared to stuffing in a UBO.)
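 *
 * As a rough sketch (the exact callback signature is defined by the precomp
 * helpers; the names here are only illustrative), a driver's load callback
 * might emit a push constant load at the requested byte offset:
 *
 *    static nir_def *
 *    load_arg(nir_builder *b, unsigned num_components, unsigned bit_size,
 *             unsigned offset_B)
 *    {
 *       /* Arguments were uploaded as push constants by the dispatch hook */
 *       return nir_load_push_constant(b, num_components, bit_size,
 *                                     nir_imm_int(b, offset_B));
 *    }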
 *
 * To dispatch kernels from the driver, the kernel is "called" like a function:
 *
 *    foo(cmdbuf, grid(4, 4, 1), x, y);
 *
 * This resolves to generated dispatch macros
 * (nir_precomp_print_dispatch_macros), which lay out their arguments according
 * to the derived layout and then call the driver-specific dispatch. To
 * implement that mechanism, a driver must implement the following function
 * signature:
 *
 *    MESA_DISPATCH_PRECOMP(context, grid, kernel index, argument pointer,
 *                          size of arguments)
 *
 * The exact types used are determined by the driver. context is something like
 * a Vulkan command buffer. grid represents the 3D dispatch size. kernel index
 * is the index of the precompiled kernel (nir_precomp_index). argument pointer
 * is a host pointer to the sized argument structure, which the driver must
 * upload and bind (e.g. as push constants).
 *
 * Because the types are left to the driver here, the same mechanism works for
 * both Gallium and Vulkan drivers.
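 *
 * For instance (hk_upload_args and hk_dispatch_internal are hypothetical
 * stand-ins for a driver's own upload and dispatch paths), a Vulkan driver
 * might implement it along these lines:
 *
 *    #define MESA_DISPATCH_PRECOMP(cmd, grid, idx, args, size)              \
 *       do {                                                                 \
 *          /* Upload the packed argument struct and bind it as push data */  \
 *          uint64_t argp = hk_upload_args(cmd, (args), (size));              \
 *          hk_dispatch_internal(cmd, (idx), (grid), argp);                   \
 *       } while (0)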
 *
 * Although the generated header could be consumed by OpenCL code,
 * MESA_DISPATCH_PRECOMP is not intended to be implemented on the device side.
 * Instead, an analogous mechanism can be implemented for device-side enqueue
 * with automatic data layout handling. Device-side enqueue of precompiled
 * kernels has various applications, most obviously for implementing
 * device-generated commands. (This is not upstream yet but I've written the
 * pass and verified this will work.)
 *
 * All precompiled kernels for a given target are zero-indexed and referenced in
 * an array of binaries. These indices are enum values, generated by
 * nir_precomp_print_program_enum. The array of kernels is generated by
 * nir_precomp_print_binary_map. There is generally an array for each hardware
 * target supported by a driver. On device creation, the driver would select the
 * array of binaries for the probed hardware.
 *
 * Sometimes a single binary can be used for multiple targets. In this case, the
 * driver should compile it only once and remap the binary arrays with the
 * callback passed to nir_precomp_print_binary_map.
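 *
 * For example (the array and field names here are hypothetical), a driver
 * probing its hardware at device creation might select the array like so:
 *
 *    dev->precomp = (dev->chip >= CHIP_GEN2) ? precomp_binaries_gen2
 *                                            : precomp_binaries_gen1;
 *
 *    /* later, when dispatching precompiled kernel index i */
 *    const uint32_t *binary = dev->precomp[i];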
 *
 * A single entrypoint may have multiple variants, acting as a small shader key. To
 * support this, the **last** argument of a kernel MAY be suffixed with __n
 * where n is the desired number of variants. That argument will automatically
 * vary from 0 to n - 1. This is the optional variant index. This mechanism is
 * controlled by nir_precomp_nr_variants. For example:
 *
 *    KERNEL(1) void bar(uchar *x, int variant__4) {
 *       for (uint i = 0; i <= variant__4; ++i)
 *          x[i]++;
 *    }
 *
 * will generate 4 binaries with 1, 2, 3, and 4 additions respectively. This
 * mechanism (sigil suffixing) is kinda ugly, but I can't figure out a nicer way
 * to attach metadata to the argument in standard OpenCL.
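 *
 * Going by the dispatch call in the short version above, the variant index is
 * then passed to the generated dispatch macro right after the grid, something
 * like:
 *
 *    bar(cmdbuf, grid(1, 1, 1), 2, x_ptr); /* runs the 3-addition variant */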
 *
 * Kernels must declare their workgroup size with
 * __attribute__((reqd_work_group_size(...))) for two reasons. First, variable
 * workgroup sizes cause tricky register allocation problems in several
 * backends, which are avoided here. Second, it makes more sense to attach the
 * workgroup size to the kernel than to the caller, which improves the
 * ergonomics of the dispatch macros.
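 *
 * Written out in plain OpenCL C, such a declaration reads:
 *
 *    __attribute__((reqd_work_group_size(32, 1, 1)))
 *    kernel void baz(global uint *x) { x[get_global_id(0)]++; }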
 */