Kernel for building a histogram on the GPU

Doing a global atomic addition from every thread, as that kernel does, is going to be very slow: threads that hit the same bin contend for the same global memory location, and their updates are serialized. The link above suggests accumulating the counts in shared memory first, which should improve performance considerably.
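As a rough illustration of the shared-memory approach, here is a minimal sketch, not the code from the link. It assumes 8-bit input values and 256 bins; the names `histogram_shared`, `histogram_gpu` and `NUM_BINS`, as well as the launch configuration, are made up for this example. Each block builds a private histogram in shared memory and only merges it into the global one at the end.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // assumption: one bin per possible byte value

// Each block accumulates into its own shared-memory histogram, then
// merges it into the global histogram once at the end.
__global__ void histogram_shared(const unsigned char* data, size_t n,
                                 unsigned int* bins)
{
    __shared__ unsigned int local[NUM_BINS];

    // Zero the block-private histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop: these atomics only touch shared memory.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge: at most one global atomic per bin per block.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local[i] != 0)
            atomicAdd(&bins[i], local[i]);
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<unsigned int> histogram_gpu(const std::vector<unsigned char>& input)
{
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (input.empty())
        return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    cudaMalloc(&d_data, input.size());
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    // Arbitrary launch configuration chosen for the sketch.
    histogram_shared<<<64, 256>>>(d_data, input.size(), d_bins);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return result;
}
```

With this layout the number of global atomics drops from one per input element to at most `NUM_BINS` per block. Contention on popular bins does not disappear, but it moves into shared memory, where atomics are much cheaper.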