clover: Do direct GPU copies when possible
Almost all of the clEnqueueCopy* functions currently do a CPU map of both the source and destination and memcpy the data between the two. The two exceptions here are clEnqueueCopyBuffer and clEnqueueCopyImage. However, image <-> buffer copies should be pretty easily doable with a quick meta-kernel.

The extra copies are pretty bad for performance. Even on integrated GPUs, this can lead to as many as three copies instead of the obvious one. It's also often faster to do the copy on the GPU than on the CPU due to memory bandwidth and access patterns. On a discrete GPU, it means we're pulling the data across the PCI BAR twice, which is just dreadful. If we care about perf, we really should fix this.
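
For illustration, here is a rough sketch of what a buffer -> image meta-kernel could look like. It is not existing clover code. The kernel name, its arguments, and the restriction to 2D CL_RGBA / CL_UNSIGNED_INT8 images with a pixel-aligned source offset are all assumptions made to keep the example short. The C function below it is a CPU reference for the same per-pixel index math (the source region is tightly packed, as clEnqueueCopyBufferToImage specifies), useful only for checking the addressing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical meta-kernel source (OpenCL C), one work-item per pixel.
 * Assumptions: 2D CL_RGBA / CL_UNSIGNED_INT8 image, source offset given
 * in pixels, global work size equal to the copy region (width x height). */
static const char *copy_buffer_to_image_rgba8_src =
    "__kernel void copy_buffer_to_image_rgba8(__global const uchar4 *src,\n"
    "                                         uint src_offset_px,\n"
    "                                         uint width,\n"
    "                                         int2 dst_origin,\n"
    "                                         __write_only image2d_t dst)\n"
    "{\n"
    "    int x = (int)get_global_id(0);\n"
    "    int y = (int)get_global_id(1);\n"
    "    uint4 px = convert_uint4(src[src_offset_px + y * width + x]);\n"
    "    write_imageui(dst, dst_origin + (int2)(x, y), px);\n"
    "}\n";

/* CPU reference for the same addressing. The "image" is modelled as a
 * linear array with dst_row_pitch bytes per row, which is an assumption
 * of this sketch; real images may be tiled. */
static void
ref_copy_buffer_to_image_rgba8(const uint8_t *src, size_t src_offset,
                               uint8_t *dst, size_t dst_row_pitch,
                               size_t dst_x, size_t dst_y,
                               size_t width, size_t height)
{
    const size_t bpp = 4; /* bytes per RGBA8 pixel */

    /* The sketch kernel indexes src as uchar4, so the byte offset
     * must be a multiple of the pixel size. */
    assert(src_offset % bpp == 0);

    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            /* Source rows are tightly packed: row pitch == width * bpp. */
            const uint8_t *s = src + src_offset + (y * width + x) * bpp;
            uint8_t *d = dst + (dst_y + y) * dst_row_pitch
                             + (dst_x + x) * bpp;
            memcpy(d, s, bpp);
        }
    }
}
```

A real implementation would need variants (or a more generic kernel) for other image formats and dimensionalities, and would have to handle source offsets that are not pixel-aligned, probably by falling back to the existing path.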