CL: Add optimize/serialize entrypoints for libclc
Fixes #89
Fully-inlined, unoptimized libclc shaders can contain things like 16 copies of an unused software fma implementation (e.g. from sin/cos). By running an optimization pass on libclc first, we can trim the fat, so that we only inline code that's actually needed. Although there are a lot of functions, none of them is very large, so optimizing the whole libclc shader up front can end up being faster than inlining first and then trying to optimize.