util: rewrite bitcount functions to make them faster

The native popcnt instruction is used on x86 if it's available.
_mesa_marshal_DrawElements calls util_bitcount once to bind uploaded user buffers. With this change, the CPU time spent in _mesa_marshal_DrawElements decreases from 11% to 10.6%.
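
For illustration only, the general approach can be sketched as below. This is not the code from this commit: the names sketch_bitcount and sketch_bitcount_fallback are hypothetical, and the sketch assumes a GCC- or Clang-compatible compiler for the builtin path. The builtin typically compiles to a single popcnt instruction when the target enables it (for example with -mpopcnt); otherwise a portable bit-summing fallback is used.

```c
#include <stdint.h>

/* Hypothetical helper, not Mesa's implementation: portable fallback that
 * sums bits in parallel within the 32-bit word. */
static inline unsigned
sketch_bitcount_fallback(uint32_t n)
{
   n = n - ((n >> 1) & 0x55555555u);                 /* 2-bit partial sums */
   n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u); /* 4-bit partial sums */
   n = (n + (n >> 4)) & 0x0f0f0f0fu;                 /* 8-bit partial sums */
   return (unsigned)((n * 0x01010101u) >> 24);       /* add the four bytes */
}

/* Hypothetical wrapper: prefer the compiler builtin where it exists
 * (assumption: GCC or Clang), else use the portable fallback. */
static inline unsigned
sketch_bitcount(uint32_t n)
{
#if defined(__GNUC__) || defined(__clang__)
   return (unsigned)__builtin_popcount(n);
#else
   return sketch_bitcount_fallback(n);
#endif
}
```

Both paths return the same result for any 32-bit input, so callers do not need to know which one was selected at build time.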