Sum is very slow (and I can't figure out why)

I’m building a Monte Carlo code in CUDA that, as an intermediate step, needs to sum several thousand floats. The numbers are produced by a function V(r,dip), where r is a vector of floats and dip is a parameter. The details of the function are irrelevant, as I only work with the output it produces. I want the sum of the vector of floats produced by V(r,dip). The nice thing is that, despite

@btime aux = V($r, $dip)
  33.156 μs (256 allocations: 7.03 KiB)

248000×1 CuArray{Float64,2}:

@btime CUDA.sum($aux)
  49.466 μs (68 allocations: 1.94 KiB)

if I put both things together in a function, times become crazily larger

function Sum_V(r,dip)
    aux = V(r,dip)
    return CUDA.sum(aux)
end

@btime Sum_V($r,$dip)
  5.385 ms (324 allocations: 8.97 KiB)

So I gather the increase in run time is due to allocating the CuArray aux inside the function? But I’m doing the very same operations separately above, and their times add up to less than 100 microseconds. So why this huge execution time penalty in Sum_V?

I’ve also tried declaring aux externally, passing it as a parameter to Sum_V, and replacing aux = V(r,dip) with aux .= V(r,dip), with the same results.

BTW, doing that on the CPU takes 2.609 ms, half the time of my CUDA function, which makes it useless at this point.

Any help will be greatly appreciated :slight_smile:


Try CUDA.@time; with plain @btime you are not measuring the GPU computation time.
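
A minimal sketch of that suggestion. The real V is not shown in this thread, so `V_demo` below is a made-up stand-in, and the array length is simply copied from the question:

```julia
using CUDA

# Hypothetical stand-in for V(r, dip); substitute the real function.
V_demo(r, dip) = dip .* r .^ 2

r = CUDA.rand(Float64, 248_000)   # same length as the array in the question
dip = 0.5

V_demo(r, dip)                    # warm-up call so compilation is not timed
aux = CUDA.@time V_demo(r, dip)   # synchronizes, so the kernel's run time is included
```

Unlike a bare `@btime`, `CUDA.@time` waits for the GPU to finish before it stops the clock, and it also reports GPU-side allocations.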

Thanks, that makes sense…
…but then my measured times are about the same as what I get on the CPU without CUDA (more or less 7 ms), and I thought it would be much faster :frowning:

The Float64 in your CuArray is a bad sign: many GPUs are much slower with Float64 and are mostly used with Float32.
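
One way to check this on a given card is to run the same work in both precisions. This is only a sketch: `f_demo` is an invented, arithmetic-heavy placeholder rather than the real V, and whether Float32 is accurate enough for the Monte Carlo is a separate question.

```julia
using CUDA

# Hypothetical per-element workload; substitute the real computation.
f_demo(x) = sin(x)^2 + sqrt(x)

r64 = CUDA.rand(Float64, 248_000)
r32 = Float32.(r64)               # convert once, then keep all later work in Float32

# Warm-up calls so compilation is not timed.
sum(f_demo.(r64)); sum(f_demo.(r32))

s64 = CUDA.@time sum(f_demo.(r64))
s32 = CUDA.@time sum(f_demo.(r32))
```

How large the gap is depends on the GPU. Note that Float64 literals such as `0.5` promote Float32 data back to Float64 when broadcast, so parameters need to be Float32 as well (for example `0.5f0`).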

Most of the time is probably spent evaluating aux = V($r, $dip), not in the sum, and by not doing @btime CUDA.@sync you’re not accurately measuring the GPU execution time of that step. The sum timing does include it, though, because reducing to a single scalar requires transferring the result to the CPU, so that operation synchronizes implicitly.
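
A sketch of how the three benchmarks could be made comparable, assuming BenchmarkTools is installed. `V_demo` and `Sum_V_demo` are hypothetical stand-ins for the functions in the question:

```julia
using CUDA, BenchmarkTools

# Hypothetical stand-in for V(r, dip); substitute the real function.
V_demo(r, dip) = dip .* r .^ 2

# Same shape as Sum_V in the question: evaluate, then reduce to a scalar.
Sum_V_demo(r, dip) = sum(V_demo(r, dip))

r = CUDA.rand(Float64, 248_000)
dip = 0.5
aux = V_demo(r, dip)

@btime CUDA.@sync V_demo($r, $dip)      # waits for the kernel, so its real cost is measured
@btime CUDA.@sync sum($aux)             # sum already synchronizes; @sync keeps the comparison uniform
@btime CUDA.@sync Sum_V_demo($r, $dip)  # expected to be roughly the sum of the two timings above
```

With the synchronization in place, the first benchmark no longer times only the kernel launch, so the apparent penalty of combining the two steps should go away.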

