clover: Do direct GPU <-> GPU copies when possible
All of the clEnqueueCopy* functions currently do a CPU map of both the source and destination and memcpy the data between the two. Even on integrated GPUs, this can lead to as many as three copies instead of the obvious one. It's also often faster to do the copy on the GPU than the CPU due to memory bandwidth and access patterns. On a discrete GPU, it means we're pulling the data across the PCI BAR twice, which is just dreadful. If we care about perf, we really should fix this.
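
To make the cost concrete, here is a toy model of the two paths, not clover code. `ToyBuffer`, `CopyStats`, `copy_via_cpu_map` and `copy_direct` are hypothetical names invented for this sketch. The breakdown of the three copies (staging read on map, CPU memcpy, staging write-back on unmap) is an assumed worst case rather than something measured on a particular driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a device buffer; not a clover or Gallium type.
struct ToyBuffer {
    std::vector<unsigned char> storage;  // pretend this lives in device memory
};

struct CopyStats {
    std::size_t copies = 0;     // full passes over the data
    std::size_t bus_bytes = 0;  // bytes that crossed the pretend PCI BAR
};

// Models the map-and-memcpy path described above.
// Assumption: both maps go through a CPU-visible staging buffer (worst case).
CopyStats copy_via_cpu_map(const ToyBuffer &src, ToyBuffer &dst, std::size_t n) {
    assert(n <= src.storage.size() && n <= dst.storage.size());
    CopyStats s;

    // Copy 1: mapping the source reads device memory into staging.
    std::vector<unsigned char> src_staging(n);
    std::memcpy(src_staging.data(), src.storage.data(), n);
    s.copies++;
    s.bus_bytes += n;

    // Copy 2: the CPU memcpy between the two mappings.
    std::vector<unsigned char> dst_staging(n);
    std::memcpy(dst_staging.data(), src_staging.data(), n);
    s.copies++;

    // Copy 3: unmapping the destination writes staging back to the device.
    std::memcpy(dst.storage.data(), dst_staging.data(), n);
    s.copies++;
    s.bus_bytes += n;

    return s;
}

// Models a direct device-side copy: one pass, nothing crosses the bus.
CopyStats copy_direct(const ToyBuffer &src, ToyBuffer &dst, std::size_t n) {
    assert(n <= src.storage.size() && n <= dst.storage.size());
    CopyStats s;
    std::memcpy(dst.storage.data(), src.storage.data(), n);
    s.copies++;
    return s;
}
```

Under these assumptions the mapped path makes three passes over the data and moves 2n bytes across the bus for an n-byte copy, while the direct path makes one pass and moves none.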