See CuCountMap.jl on GitHub (xiaodaigh/CuCountMap.jl): fast `StatsBase.countmap` for small types on the GPU via CUDA.jl.
For small types, I get about 3x the performance of a purely CPU implementation by running on the GPU via CUDA.jl. This includes the time to transfer the data to the GPU.
using CuCountMap
v = rand(Int16, 1_000_000)
cucountmap(v) # converts v to cu(v) and then runs countmap
using CUDA: cu
cuv = cu(v)
countmap(cuv) # StatsBase.countmap is overloaded for CuArrays
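To check the speedup on your own hardware, a rough timing comparison could look like the sketch below. It assumes BenchmarkTools and StatsBase are installed alongside CuCountMap and that a CUDA-capable GPU is available; neither the benchmark setup nor the `cpu_counts` name comes from the package itself, and the ratio you see will depend on your CPU, GPU, and element type.

```julia
using StatsBase: countmap   # CPU reference implementation
using BenchmarkTools        # assumed installed; provides @btime
using CuCountMap

v = rand(Int16, 1_000_000)

# CPU result, kept around as a sanity check on the input
cpu_counts = countmap(v)

# Time the CPU version
@btime countmap($v)

# Time the GPU version; v starts on the CPU, so the host-to-device
# transfer is part of what gets measured
@btime cucountmap($v)
```

Because `cucountmap` receives a regular `Vector`, the second timing includes the copy to the GPU, which matches how the 3x figure above is described.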