Summing a vector is faster than summing a multi-dimensional array of the same length using CUDA

An MWE:

using BenchmarkTools
using CUDA

CUDA.allowscalar(false)
function foo(dims)
    x = rand(Float32, dims)
    print("\nCPU sum: ")
    @btime sum($x)

    print("CPU vec + sum: ")
    @btime sum(vec($x))

    y = CuArray(x)
    print("CUDA sum: ")
    @btime CUDA.@sync sum($y)

    print("CUDA vec + sum: ")
    @btime CUDA.@sync sum(vec($y))
end

Running foo((50,50,1000)) on my computer yields

CPU sum:   348.099 μs (0 allocations: 0 bytes)
CPU vec + sum:   350.919 μs (2 allocations: 80 bytes)
CUDA sum:   283.126 μs (129 allocations: 4.27 KiB)
CUDA vec + sum:   99.471 μs (131 allocations: 3.97 KiB)

and running foo(50*50*1000) produces

CPU sum:   346.042 μs (0 allocations: 0 bytes)
CPU vec + sum:   341.418 μs (0 allocations: 0 bytes)
CUDA sum:   98.861 μs (129 allocations: 3.89 KiB)
CUDA vec + sum:   99.741 μs (129 allocations: 3.89 KiB)

CUDA.versioninfo():

Summary

CUDA runtime 12.5, artifact installation
CUDA driver 12.2
NVIDIA driver 535.183.1

CUDA libraries:

  • CUBLAS: 12.5.3
  • CURAND: 10.3.6
  • CUFFT: 11.2.3
  • CUSOLVER: 11.6.3
  • CUSPARSE: 12.5.1
  • CUPTI: 23.0.0
  • NVML: 12.0.0+535.183.1

Julia packages:

  • CUDA: 5.4.2
  • CUDA_Driver_jll: 0.9.1+1
  • CUDA_Runtime_jll: 0.14.1+0

Toolchain:

  • Julia: 1.10.4
  • LLVM: 15.0.7

1 device:
0: NVIDIA GeForce GTX 1060 6GB (sm_61, 3.809 GiB / 6.000 GiB available)

Are these results expected? If so, does calling sum directly have any advantage over sum(vec())?

Thank you!
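For what it's worth, a quick sanity check (a sketch, under the assumption that vec on a CuArray is a zero-copy reshape that shares the parent's device buffer, so the extra call itself should cost essentially nothing):

```julia
using CUDA

y = CUDA.rand(Float32, 50, 50, 1000)
v = vec(y)                      # reshape to a 1-D CuArray

# If vec does not copy, both arrays wrap the same device memory
@show typeof(v)
@show pointer(v) == pointer(y)

# And both paths should compute (approximately) the same sum
@show sum(v) ≈ sum(y)
```

If the pointers match, the timing difference would come from the reduction itself rather than from any data movement introduced by vec.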

Could we narrow the question to CUDA? The difference on the CPU seems negligible.

My question is, indeed, for CUDA. I’m sorry for the confusion. I just changed the title of my post. :slight_smile: