A minimal working example (MWE):
```julia
using BenchmarkTools
using CUDA
CUDA.allowscalar(false)

function foo(dims)
    x = rand(Float32, dims)
    print("\nCPU sum: ")
    @btime sum($x)
    print("CPU vec + sum: ")
    @btime sum(vec($x))
    y = CuArray(x)
    print("CUDA sum: ")
    @btime CUDA.@sync sum($y)
    print("CUDA vec + sum: ")
    @btime CUDA.@sync sum(vec($y))
end
```
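If I understand correctly, `vec` itself should be essentially free on both backends: it returns a `reshape` of the original array that shares the same memory, so the timing difference shouldn't come from a copy. A quick sanity check I ran to convince myself (a minimal sketch; the array size is just the one from the MWE):

```julia
using CUDA

y = CUDA.rand(Float32, 50, 50, 1000)
v = vec(y)                  # reshape to 1D, no copy
pointer(v) == pointer(y)    # true: both wrap the same GPU buffer
```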
Running `foo((50, 50, 1000))` on my computer yields
```
CPU sum: 348.099 μs (0 allocations: 0 bytes)
CPU vec + sum: 350.919 μs (2 allocations: 80 bytes)
CUDA sum: 283.126 μs (129 allocations: 4.27 KiB)
CUDA vec + sum: 99.471 μs (131 allocations: 3.97 KiB)
```
and running `foo(50*50*1000)` produces
```
CPU sum: 346.042 μs (0 allocations: 0 bytes)
CPU vec + sum: 341.418 μs (0 allocations: 0 bytes)
CUDA sum: 98.861 μs (129 allocations: 3.89 KiB)
CUDA vec + sum: 99.741 μs (129 allocations: 3.89 KiB)
```
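In case it helps with diagnosing this: one way I could dig deeper (a sketch, assuming CUDA.jl's integrated profiler) would be to compare what each variant actually launches:

```julia
using CUDA

y = CUDA.rand(Float32, 50, 50, 1000)

# If the 3D and flattened reductions launch different kernels or
# grid/block configurations, that should show up in the traces.
CUDA.@profile sum(y)
CUDA.@profile sum(vec(y))
```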
`CUDA.versioninfo()`:
```
CUDA runtime 12.5, artifact installation
CUDA driver 12.2
NVIDIA driver 535.183.1

CUDA libraries:
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 23.0.0
- NVML: 12.0.0+535.183.1

Julia packages:
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce GTX 1060 6GB (sm_61, 3.809 GiB / 6.000 GiB available)
```
Are these results expected? If so, is there an advantage to calling `sum` directly on a multidimensional `CuArray` rather than `sum(vec(...))`?

Thank you!