This is a rebase of an experimental branch I apparently never finished. I'm not likely to work on this any time soon, so anyone who wants to run with it is welcome to.
There are tradeoffs here. PutImage time is approximately linear in the size of the request, so the bigger the request, the more you starve other clients. The socket buffer in the kernel is only so large no matter what you do, so there's always some amount of chunking happening. As the second commit message hints, you don't want to always flush the write through to the socket: pipelining small PutImages is essential (otherwise you spend all your time ping-ponging with the scheduler and your throughput suffers). The exact number for how big of a request to emit is almost certainly machine-dependent, based on memory and cache speeds. And obviously all of this is inferior to shared memory for pure bandwidth, but shared memory isn't always available or easy for the app to arrange to use.
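To make the request-size tradeoff concrete, here is a minimal sketch of splitting an image into horizontal bands under a byte budget. It is not code from this branch: `rows_per_chunk`, `split_into_bands`, and the `emit_band_fn` callback are hypothetical names, and the budget is simply a tunable parameter, since the right value is likely machine-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Xlib or xcb function): how many whole
 * scanlines fit in one request carrying at most max_bytes of pixel data.
 * Returns at least 1 so the caller always makes progress, even when a
 * single scanline is larger than the budget. */
static unsigned rows_per_chunk(size_t bytes_per_line, size_t max_bytes)
{
    assert(bytes_per_line > 0);
    size_t rows = max_bytes / bytes_per_line;
    return rows ? (unsigned)rows : 1;
}

/* Hypothetical callback standing in for "emit one PutImage for this band". */
typedef void (*emit_band_fn)(unsigned y, unsigned rows, void *closure);

/* Walk the image top to bottom in bands of at most max_bytes each.
 * emit may be NULL, in which case the bands are only counted.
 * Returns the number of bands. */
static unsigned split_into_bands(unsigned height, size_t bytes_per_line,
                                 size_t max_bytes, emit_band_fn emit,
                                 void *closure)
{
    unsigned step = rows_per_chunk(bytes_per_line, max_bytes);
    unsigned bands = 0;
    unsigned y = 0;

    while (y < height) {
        unsigned remaining = height - y;
        unsigned rows = remaining < step ? remaining : step;
        if (emit)
            emit(y, rows, closure);
        y += rows;
        bands++;
    }
    return bands;
}
```

For example, a 500-row image with 2000 bytes per scanline and a 64 KiB budget gives 32 rows per band and 16 bands in total. A smaller budget means more, cheaper requests that interleave better with other clients; a larger one means fewer requests but longer stalls for everyone else.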
As written, this branch is slower than master. The second commit message explains (at least part of) why: xcb needs to account for partially completed writes, and that ends up being surprisingly expensive (something like 50% of the CPU time in x11perf -putimage500). So if you want this to be a win you probably need to extend xcb; possibly this should get implemented in xcb-util-image instead.
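For readers unfamiliar with what "accounting for partially completed writes" involves, the following is a generic POSIX sketch, not xcb's actual implementation. `write_all` is a hypothetical helper: it keeps track of how much of the buffer the kernel has accepted and retries until everything is written, waiting with `poll` if the descriptor is non-blocking and the socket buffer is full.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes to fd, coping with short
 * writes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n > 0) {
            /* Partial or full write: advance past what was accepted. */
            p += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue; /* interrupted by a signal, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Non-blocking fd with a full buffer: wait until writable. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1; /* hard error, or write() unexpectedly returned 0 */
    }
    return 0;
}
```

Every short write costs another trip through the loop and possibly another system call, which gives a rough idea of how this bookkeeping could become a noticeable share of CPU time for large requests.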