Most of the time is probably spent evaluating `aux = V($r, $dip)`, not in the `sum`. Without `@btime CUDA.@sync`, you're not accurately measuring GPU execution time, because GPU operations are launched asynchronously and the call returns before the GPU has finished. The `sum` timing does include the wait, though: reducing to a single scalar requires transferring the result to the CPU, so that operation is implicitly synchronizing.
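
To make the difference concrete, here is a minimal sketch of the three measurements. The body of `V` and the array sizes are placeholders I made up, since the original definition isn't shown; only `CUDA.@sync`, `@btime`, and `sum` are taken from the discussion.

```julia
using CUDA, BenchmarkTools

# Hypothetical stand-in for V; the real function is not shown in the thread.
V(r, dip) = r .* dip

# Assumed inputs: two device arrays of arbitrary size.
r   = CUDA.rand(Float32, 10^6)
dip = CUDA.rand(Float32, 10^6)

# Without synchronization this mostly times the kernel launch,
# not the work the GPU does afterwards.
@btime V($r, $dip);

# CUDA.@sync blocks until the GPU has finished, so the timing
# covers the actual execution of V.
@btime CUDA.@sync V($r, $dip);

# sum returns a scalar on the CPU, so it has to wait for the GPU
# and synchronizes implicitly.
aux = V(r, dip)
@btime sum($aux);
```

If the second timing turns out much larger than the first, that would support the idea that `V` rather than `sum` dominates the runtime.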