Sum is very slow (and I can't figure out why)

Ferran_Mazzanti · January 3, 2021, 7:15pm

Hi,
I’m building a Monte Carlo code in CUDA that, as intermediate steps, needs to perform sums of several thousand floats. The numbers are produced by a function V(r,dip), where r is a vector of floats and dip is a parameter. The details of the function are irrelevant, as I just work with the output it produces. I want the sum of the vector of floats produced by V(r,dip). The odd thing is that, despite the following timings for each step on its own,

@btime aux = V($r, $dip)
  33.156 μs (256 allocations: 7.03 KiB)

248000×1 CuArray{Float64,2}:
 1.8357571063480997
 0.4380869765616565
 0.960063385897547
 5.437484961363393
 0.27597788147989816
...

@btime CUDA.sum($aux)
  49.466 μs (68 allocations: 1.94 KiB)
260151.73723948127

if I put both things together in a function, times become crazily larger

function Sum_V(r,dip)
    aux = V(r,dip)
    CUDA.sum(aux)
end;
@btime Sum_V($r,$dip)
  5.385 ms (324 allocations: 8.97 KiB)
260151.73723948127

So I understand the increase in run time is due to the allocation of the CuArray aux inside the function? But I’m doing the very same operations separately above, and the sum of those times is less than 100 microseconds. So why this huge execution-time penalty in Sum_V?

I’ve also tried declaring aux externally, passing it as a parameter to Sum_V, and replacing aux = V(r,dip) with aux .= V(r,dip), with the same results.

BTW, doing that on the CPU takes 2.609 ms, half the time it takes with my CUDA function, which makes the GPU version useless at this point.

Any help will be greatly appreciated

Ferran.

rveltz · January 3, 2021, 7:35pm

Try CUDA.@time, you are not measuring GPU computation time

Ferran_Mazzanti · January 3, 2021, 7:58pm

Thanks, that makes sense…
…but then my measured times are about the same as what I get on the CPU without CUDA (more or less 7 ms), and I thought the GPU would be much faster.

mcabbott · January 3, 2021, 8:01pm

The CuArray{Float64,2} in your output is a bad sign: lots of GPUs are much, much slower with Float64, and are mostly used with Float32.
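
A rough way to check whether the element type matters on a given card is to time the same computation in both precisions. This is only a sketch: V below is a hypothetical stand-in, since the real V(r,dip) is not shown in this thread, and the input is random data of the same length as in the original post.

```julia
using CUDA

# Hypothetical stand-in for V(r,dip); the real function is not shown in the thread.
# The integer literal 1 does not promote Float32 inputs to Float64.
V(r, dip) = dip ./ (1 .+ r .* r)

r64 = CUDA.rand(Float64, 248_000)   # same length as in the original post
r32 = Float32.(r64)                 # Float32 copy of the same data, still on the GPU
dip64 = 1.0
dip32 = 1.0f0                       # Float32 parameter, to avoid silent promotion

# Warm-up calls so that compilation is not included in the timings.
sum(V(r64, dip64)); sum(V(r32, dip32))

CUDA.@time sum(V(r64, dip64))
CUDA.@time sum(V(r32, dip32))
```

Note that a single Float64 literal such as 1.0 inside the real V would promote the whole computation back to Float64, and that a Float32 sum over this many elements is less precise, so whether the switch is acceptable depends on the Monte Carlo code.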

maleadt · January 4, 2021, 7:25am

Most of the time is probably spent evaluating aux = V($r, $dip), not in the sum. Because you are not using @btime CUDA.@sync, that first benchmark does not accurately measure GPU execution time. The sum benchmark does, though, because you’re reducing to a single scalar that needs to be transferred to the CPU (hence this operation is implicitly synchronizing).
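
The difference can be seen with a benchmark along these lines. It is a minimal sketch, assuming BenchmarkTools is installed; V is again a hypothetical stand-in for the function from the original post, and r and dip are made-up inputs.

```julia
using CUDA, BenchmarkTools

# Hypothetical stand-in for V(r,dip); the real function is not shown in the thread.
V(r, dip) = dip ./ (1 .+ r .* r)

r = CUDA.rand(Float64, 248_000)
dip = 1.0

# GPU work is launched asynchronously, so without a sync this mostly
# measures how long it takes to queue the kernel, not to run it.
@btime V($r, $dip);

# CUDA.@sync blocks until the GPU has finished, so this measures the real cost of V.
@btime CUDA.@sync V($r, $dip);

# sum returns a scalar on the CPU, so it waits for all pending GPU work,
# including a V launched just before it.
aux = V(r, dip)
@btime sum($aux);
```

If the second timing is close to what Sum_V reports, the cost is in V itself and not in the reduction.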
