CUDA.jl kernel is half as fast as C++ kernel

It might be worth checking whether `@fastmath` makes a notable difference. I don’t know that it will, but it’s an easy thing to try.
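As a rough sketch of what that experiment could look like: the kernel below is a hypothetical stand-in (the names `plain_kernel!` and `fast_kernel!`, the math expression, and the launch sizes are all assumptions, not your code). It only shows where `@fastmath` would go and one way to time the two variants with `CUDA.@elapsed`.

```julia
using CUDA

# Hypothetical element-wise kernel; the expression is a placeholder for your real math.
function plain_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = sin(x[i]) * exp(-x[i]) / sqrt(x[i] + 1f0)
    end
    return nothing
end

# Same kernel, with the arithmetic wrapped in @fastmath.
function fast_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = @fastmath sin(x[i]) * exp(-x[i]) / sqrt(x[i] + 1f0)
    end
    return nothing
end

x = CUDA.rand(Float32, 1 << 20)   # assumed problem size
y_plain = similar(x)
y_fast = similar(x)
threads = 256                     # assumed launch configuration
blocks = cld(length(x), threads)

# Warm-up launches so that compilation time is not part of the measurement.
@cuda threads=threads blocks=blocks plain_kernel!(y_plain, x)
@cuda threads=threads blocks=blocks fast_kernel!(y_fast, x)
CUDA.synchronize()

t_plain = CUDA.@elapsed begin
    @cuda threads=threads blocks=blocks plain_kernel!(y_plain, x)
end
t_fast = CUDA.@elapsed begin
    @cuda threads=threads blocks=blocks fast_kernel!(y_fast, x)
end

println("plain:     ", t_plain, " s")
println("@fastmath: ", t_fast, " s")
```

A single timed launch is noisy, so repeating the measurement a few times would give a more trustworthy comparison. Keep in mind that `@fastmath` trades some floating-point accuracy for speed, so it is worth comparing the two outputs as well as the timings.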