CL: Support vload_half and vstore_half (and friends)
Still need to finish running this through the CTS but it's looking good so far.
The vload_half and vstore_half functions are special. They're the only things that you can use on the half type unless the cl_khr_fp16 extension is supported, and they're required to be supported. They operate on arrays of half, and either convert floats into halves for storage, or halves into floats for math. They support all rounding modes.
DXIL has two paths for working with these:
- On hardware which supports native 16-bit ops, we can just use the basic `FPTRUNC` opcode. This works on WARP.
- On hardware which doesn't support native 16-bit ops, we can use the legacy (coming from DXBC) intrinsics which convert to/from fp16 stored in the low 16 bits of an i32 value.
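To make the "fp16 stored in the low 16 bits of an i32" layout concrete, here is a minimal sketch in plain C of decoding such a value into a float with integer ops only. The helper name `half_bits_to_float` is hypothetical and this is not the DXIL intrinsic itself, just an illustration of the data layout; unlike the intrinsics, this sketch keeps half denorms instead of flushing them.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a DXIL or Mesa API): decode an fp16 value held in
 * the low 16 bits of a 32-bit integer into fp32. The upper 16 bits are
 * ignored. Half denorms are preserved, not flushed. */
static float half_bits_to_float(uint32_t packed)
{
    uint32_t h    = packed & 0xffffu;
    uint32_t sign = (h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;
    uint32_t man  = h & 0x3ffu;
    uint32_t bits;

    if (exp == 0x1fu) {
        /* Inf or NaN: all-ones exponent, payload moved up. */
        bits = sign | 0x7f800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Half denorm: value is man * 2^-24. Normalize it, since every
         * half denorm is a normal number in fp32. */
        uint32_t e = 113u; /* fp32 biased exponent for 2^-14 */
        while (!(man & 0x400u)) {
            man <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((man & 0x3ffu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, `half_bits_to_float(0x3c00)` decodes the half encoding of 1.0, and garbage in the upper 16 bits does not change the result.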
However, DXIL doesn't meet OpenCL's requirements for denorm and rounding support: DXIL's intrinsics round toward zero, and denorms are flushed (probably, anyway). This series adds support for them to the DXIL backend anyway, but they're unused for CL right now.
Instead, the CL frontend runs a NIR pass on the shader to convert these f2f16 ops into software conversions. The previous implementation of rounding modes for float->float used float ops (e.g. `flt` or `nextafter`) as part of their implementation, which can flush denorms, and that's not allowed for fp32->fp16 in CL, so I instead routed them to new dedicated opcodes which we can lower in our backend.
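As a rough illustration of what a software fp32->fp16 conversion that never touches float ops can look like, here is a sketch in plain C. It is not the NIR lowering from this series; the names `round_mode` and `float_to_half_bits` are made up for the example, and it assumes IEEE 754 binary32/binary16 layouts. All rounding decisions are made on the integer bit pattern, so denorms can't be flushed along the way.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum round_mode { ROUND_RTE, ROUND_RTZ, ROUND_RTP, ROUND_RTN };

/* Convert fp32 to fp16 bits using only integer ops on the bit pattern. */
static uint16_t float_to_half_bits(float f, enum round_mode mode)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);

    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t mag  = x & 0x7fffffffu;

    if (mag > 0x7f800000u)
        return (uint16_t)(sign | 0x7e00u); /* NaN -> quiet NaN */
    if (mag == 0x7f800000u)
        return (uint16_t)(sign | 0x7c00u); /* Inf */

    /* Exponent rebased to the half bias; <= 0 means a denormal result. */
    int32_t exp = (int32_t)(mag >> 23) - 127 + 15;

    if (exp >= 31) {
        /* Too large for half: Inf or max finite, depending on direction. */
        int to_inf = mode == ROUND_RTE ||
                     (mode == ROUND_RTP && !sign) ||
                     (mode == ROUND_RTN && sign);
        return (uint16_t)(sign | (to_inf ? 0x7c00u : 0x7bffu));
    }

    uint32_t man = mag & 0x7fffffu;
    if (mag >> 23)
        man |= 0x800000u; /* implicit leading one of a normal fp32 */

    /* Normal results drop 13 bits; denormal results drop more. Past 25 bits
     * everything is below half an ulp, so clamping keeps the result right. */
    uint32_t shift = exp >= 1 ? 13u : (uint32_t)(14 - exp);
    if (shift > 25u)
        shift = 25u;

    uint32_t kept = man >> shift;
    uint32_t rem  = man & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1u);

    int up = 0;
    switch (mode) {
    case ROUND_RTE: up = rem > half || (rem == half && (kept & 1u)); break;
    case ROUND_RTZ: up = 0; break;
    case ROUND_RTP: up = rem != 0 && !sign; break;
    case ROUND_RTN: up = rem != 0 && sign; break;
    }

    /* For normals, kept still carries the implicit one in bit 10, so adding
     * (exp - 1) << 10 yields the right exponent field, and a mantissa carry
     * from rounding bumps the exponent (up to Inf) for free. */
    uint32_t h = kept + (uint32_t)up;
    if (exp >= 1)
        h += (uint32_t)(exp - 1) << 10;

    return (uint16_t)(sign | h);
}
```

With this sketch, `float_to_half_bits(65520.0f, ROUND_RTE)` overflows to infinity (`0x7c00`) while `ROUND_RTZ` gives the largest finite half (`0x7bff`), and an input of 2^-24 comes out as the smallest half denorm (`0x0001`) rather than zero.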
Note that the aligned versions of the functions could be implemented better. We could either attach alignment to the loads/stores and use a load/store vectorizer to combine them, or directly generate half vector loads instead of scalar loads.
/cc @karolherbst @jekstrand for the vtn bits.