[ANN] CuCountMap.jl - CUDA.jl-enabled faster `StatsBase.countmap` for small types

See https://github.com/xiaodaigh/CuCountMap.jl

For small types, I get about 3x the performance of a purely CPU implementation by running on the GPU via CUDA.jl. This includes the time to transfer the data to the GPU.

using CuCountMap

v = rand(Int16, 1_000_000)

cucountmap(v) # converts v to cu(v) and then runs countmap

using CUDA: cu

cuv = cu(v)

countmap(cuv) # StatsBase.countmap is overloaded for CuArrays
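
For context on why small types suit this kind of counting: every possible value of an `Int16` can get its own slot in a fixed-size table, so counting becomes a matter of incrementing array entries rather than hashing. The CPU-only sketch below illustrates that idea. It is not the package's actual GPU code, and `dense_countmap` is a hypothetical name used only for illustration.

```julia
# Hypothetical helper (not part of CuCountMap.jl): count occurrences of each
# value of a small integer type using a dense table, one slot per possible value.
function dense_countmap(v::AbstractVector{T}) where {T<:Union{Int8,UInt8,Int16,UInt16}}
    lo = Int(typemin(T))
    counts = zeros(Int, Int(typemax(T)) - lo + 1)  # 65_536 slots for Int16
    for x in v
        counts[Int(x) - lo + 1] += 1               # shift the value to a 1-based index
    end
    # Keep only the values that actually occurred, as a Dict of value => count.
    return Dict(T(i - 1 + lo) => c for (i, c) in enumerate(counts) if c > 0)
end

v = rand(Int16, 1_000_000)
counts = dense_countmap(v)
```

A parallel version of this loop would need the increments to be safe against concurrent updates to the same slot (for example with atomic adds), which is an assumption about how one might port it, not a description of the package.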