Most of the time is probably spent evaluating aux = V($r, $dip), not in the sum. GPU operations are launched asynchronously, so without @btime CUDA.@sync you're not accurately measuring GPU execution time for that line. The sum, on the other hand, does wait for the GPU: it reduces to a single scalar that has to be transferred to the CPU, so it synchronizes implicitly and ends up being charged for the work that was still pending.
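To see the difference, something like the following should work. This is only a sketch: the V, r and dip here are made-up stand-ins for yours (I don't know your actual definitions), and the array size is arbitrary.

```julia
using CUDA, BenchmarkTools

# Hypothetical stand-ins for the original V, r and dip
V(r, dip) = dip ./ r .^ 3
r = CUDA.rand(Float32, 10^6) .+ 1f0   # shift away from zero to avoid Inf
dip = 1f0

# No synchronization: the call can return before the GPU has finished,
# so this may mostly reflect launch overhead
@btime V($r, $dip);

# Synchronized: waits for the GPU, so this reflects the real cost of V
@btime CUDA.@sync V($r, $dip);

# The reduction timed on its own, with aux already computed
aux = V(r, dip)
CUDA.synchronize()                     # make sure aux is done before timing
@btime CUDA.@sync sum($aux);
```

If the synchronized timing of V comes out much larger than the unsynchronized one, that would confirm the time was being attributed to the sum only because it was the first operation that had to wait.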