Summing a vector is faster than summing a multi-dimensional array of the same length using CUDA

An MWE:

using BenchmarkTools
using CUDA

CUDA.allowscalar(false)
function foo(dims)
    x = rand(Float32, dims)
    print("\nCPU sum: ")
    @btime sum($x)

    print("CPU vec + sum: ")
    @btime sum(vec($x))

    y = CuArray(x)
    print("CUDA sum: ")
    @btime CUDA.@sync sum($y)

    print("CUDA vec + sum: ")
    @btime CUDA.@sync sum(vec($y))
end

Running foo((50,50,1000)) on my computer yields

CPU sum:   348.099 μs (0 allocations: 0 bytes)
CPU vec + sum:   350.919 μs (2 allocations: 80 bytes)
CUDA sum:   283.126 μs (129 allocations: 4.27 KiB)
CUDA vec + sum:   99.471 μs (131 allocations: 3.97 KiB)

and running foo(50*50*1000) produces

CPU sum:   346.042 μs (0 allocations: 0 bytes)
CPU vec + sum:   341.418 μs (0 allocations: 0 bytes)
CUDA sum:   98.861 μs (129 allocations: 3.89 KiB)
CUDA vec + sum:   99.741 μs (129 allocations: 3.89 KiB)

CUDA.versioninfo():

Summary

CUDA runtime 12.5, artifact installation
CUDA driver 12.2
NVIDIA driver 535.183.1

CUDA libraries:

  • CUBLAS: 12.5.3
  • CURAND: 10.3.6
  • CUFFT: 11.2.3
  • CUSOLVER: 11.6.3
  • CUSPARSE: 12.5.1
  • CUPTI: 23.0.0
  • NVML: 12.0.0+535.183.1

Julia packages:

  • CUDA: 5.4.2
  • CUDA_Driver_jll: 0.9.1+1
  • CUDA_Runtime_jll: 0.14.1+0

Toolchain:

  • Julia: 1.10.4
  • LLVM: 15.0.7

1 device:
0: NVIDIA GeForce GTX 1060 6GB (sm_61, 3.809 GiB / 6.000 GiB available)

Are these results expected? If so, does calling sum directly have any advantage over sum(vec())?

Thank you!
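For what it's worth, a quick sanity check (a sketch, under the assumption that vec on a CuArray is a zero-copy reshape that shares the parent's device buffer, so the extra call itself should cost essentially nothing):

```julia
using CUDA

y = CUDA.rand(Float32, 50, 50, 1000)
v = vec(y)                      # reshape to a 1-D CuArray

# If vec does not copy, both arrays wrap the same device memory
@show typeof(v)
@show pointer(v) == pointer(y)

# And both paths should compute (approximately) the same sum
@show sum(v) ≈ sum(y)
```

If the pointers match, the timing difference would come from the reduction itself rather than from any data movement introduced by vec.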

Could we narrow the question to CUDA? The difference on the CPU seems negligible.

My question is, indeed, for CUDA. I’m sorry for the confusion. I just changed the title of my post. :slight_smile: