Hi,

I’m building a Monte Carlo code in CUDA that, as intermediate steps, needs to sum several thousand Floats. The numbers are produced by a function V(r,dip), where r is a vector of floats and dip is a parameter. The details of the function are irrelevant, as I only work with the output it produces. I want the sum of the vector of floats returned by V(r,dip). The puzzling thing is that, despite each step being fast when timed separately,

```
@btime aux = V($r, $dip)
33.156 μs (256 allocations: 7.03 KiB)
248000×1 CuArray{Float64,2}:
1.8357571063480997
0.4380869765616565
0.960063385897547
5.437484961363393
0.27597788147989816
...
@btime CUDA.sum($aux)
49.466 μs (68 allocations: 1.94 KiB)
260151.73723948127
```

if I put both steps together in a function, the time becomes far larger:

```
function Sum_V(r,dip)
aux = V(r,dip)
CUDA.sum(aux)
end;
@btime Sum_V($r,$dip)
5.385 ms (324 allocations: 8.97 KiB)
260151.73723948127
```

So I assume the increase in run time is due to the allocation of the CuArray aux inside the function? But I’m doing the very same operations separately above, and the two times add up to less than 100 microseconds. So why is there such a huge execution-time penalty in Sum_V?

I’ve also tried declaring aux externally, passing it as a parameter to Sum_V, and replacing `aux = V(r,dip)` with `aux .= V(r,dip)`, with the same results.
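For reference, the preallocated variant I tried looks roughly like the sketch below. The `V` defined here is only a hypothetical stand-in so the snippet runs on its own (my real V is different), and the name `Sum_V!` and the value of `dip` are just for illustration.

```julia
using CUDA

# Hypothetical stand-in for the real V(r, dip); the actual function is not shown here.
V(r, dip) = dip .* r .^ 2

# Variant where aux is allocated by the caller and overwritten in place.
function Sum_V!(aux, r, dip)
    aux .= V(r, dip)   # V still returns a new array, which is then copied into aux
    return sum(aux)    # for a CuArray this dispatches to the GPU reduction
end

r   = CUDA.rand(Float64, 248_000)  # same length as the vector in my timings
dip = 1.5                          # illustrative parameter value
aux = similar(r)                   # preallocated output buffer on the GPU

Sum_V!(aux, r, dip)
```

Note that, written this way, `aux .= V(r,dip)` still materializes the result of V before copying it, so it does not remove the allocation by itself.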

BTW, doing the same on the CPU takes 2.609 ms, about half the time of my CUDA function, which makes the GPU version useless at this point.

Any help will be greatly appreciated.

Ferran.