CL: Add support for global memory vector load/store
New version that splits up the lower_{load,store}_global() logic so that parts of it can be reused for local and constant memory.
Note that I tried several ways of simplifying this further, such as splitting 'lower into sub-16-byte accesses', 'lower non-32-bit component extraction' and 'lower load/store global to their DXIL equivalent' into separate passes. The resulting code was not much cleaner, so I went back to a 'do it all in one step' approach.
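For readers unfamiliar with the first two steps named above, here is a minimal standalone sketch of the ideas, not the code from this patch. It assumes an access is cut into chunks of at most 16 bytes and that sub-32-bit components are packed little-end-first into 32-bit words. The names `split_access`, `extract_component` and `struct chunk` are hypothetical and do not exist in the tree.

```c
#include <stdint.h>

/* Hypothetical illustration only; none of these names come from the patch. */
#define MAX_CHUNK_BYTES 16u

struct chunk {
   unsigned offset; /* byte offset from the start of the access */
   unsigned size;   /* chunk size in bytes, between 1 and 16 */
};

/* Split an access of total_bytes into chunks of at most 16 bytes.
 * Returns the number of chunks written to out (capped at max_chunks). */
static unsigned
split_access(unsigned total_bytes, struct chunk *out, unsigned max_chunks)
{
   unsigned count = 0;
   unsigned offset = 0;

   while (offset < total_bytes && count < max_chunks) {
      unsigned remaining = total_bytes - offset;
      unsigned size = remaining < MAX_CHUNK_BYTES ? remaining : MAX_CHUNK_BYTES;

      out[count].offset = offset;
      out[count].size = size;
      count++;
      offset += size;
   }

   return count;
}

/* Extract component `index` of width bit_size (8, 16 or 32) from data that
 * was loaded as 32-bit words. Assumes components are packed starting at the
 * least significant bits of each word. */
static uint32_t
extract_component(const uint32_t *words, unsigned bit_size, unsigned index)
{
   if (bit_size == 32)
      return words[index];

   unsigned comps_per_word = 32 / bit_size;
   unsigned shift = (index % comps_per_word) * bit_size;
   uint32_t mask = (1u << bit_size) - 1;

   return (words[index / comps_per_word] >> shift) & mask;
}
```

Under these assumptions, a 40-byte access would become three chunks (16, 16 and 8 bytes), and a 16-bit component would be recovered from the enclosing 32-bit word with a shift and a mask.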
Edited by Daniel Stone