Hi,
I recently started experimenting with CUDA and was doing some exploratory work in the REPL.
I was running some addition tests with accumulate() and reduce() when I stumbled on what seems to be an issue with accumulate().
The problem I came across is that, when adding, the last values given by accumulate() along the chosen dimension are not always equal to the result given by reduce() along that same dimension. Below are the tests I performed.
The issue seems to occur when accumulating down the columns (dims = 1) of arrays with more than 1024 rows.
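In short, a minimal reproducer (the full exploration follows):
using CUDA
b = CUDA.ones(1027, 3)
all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), 1027, :)')
# false (expected: true)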
using CUDA
d1 = 4; d2 = 3; # started with a small, simple test, similar to the one in the documentation
a = CUDA.ones(d1, d2)
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
all(reduce(+, a, dims = 2) .== view(accumulate(+, a, dims = 2), :, d2)) # do results match?
# true
all(reduce(+, a, dims = 1) .== view(accumulate(+, a, dims = 1), d1, :)')
# true
reduce(+, a, dims = 1) # the addition is correct for reduce()
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 4.0 4.0 4.0
accumulate(+, a, dims = 1) # the addition is correct for accumulate()
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 2.0 2.0 2.0
# 3.0 3.0 3.0
# 4.0 4.0 4.0
d1 = 1027; d2 = 3; # same test, but now with more than 1024 rows
b = CUDA.ones(d1, d2)
# 1027×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# ⋮
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
all(reduce(+, b, dims = 2) .== view(accumulate(+, b, dims = 2), :, d2)) # accumulate() along the rows matches the results from reduce().
# true
all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), d1, :)') # ...however, accumulate() along the columns does not!
# false
reduce(+, b, dims = 1) # reduce returns 1027.0 for each column as expected
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1027.0 1027.0 1027.0
view(accumulate(+, b, dims = 1), 1023:d1, :) # accumulate() 'resets' the count from row 1025 in column 3
# 5×3 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0
# 1025.0 1025.0 1.0
# 1026.0 1026.0 2.0
# 1027.0 1027.0 3.0
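To rule out a floating-point rounding artifact, I also compared against the CPU result; the discrepancy is far too large for that (a quick check, reusing b from above):
maximum(abs.(Array(accumulate(+, b, dims = 1)) .- accumulate(+, Array(b), dims = 1))) # largest GPU-vs-CPU difference
# 1024.0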
all(1027f0 .== reduce(+, b', dims = 2))
# true
all(1027f0 .== view(accumulate(+, b', dims = 2), :, d1)) # seems to work on the transpose of b
# true
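Given that, a possible workaround for now might be to scan along the rows of the transpose and transpose back (just a sketch based on the observation above; acc_dim1 is a name I made up, and I have not checked how it performs):
acc_dim1(x) = copy(accumulate(+, x', dims = 2)') # hypothetical helper: column-wise scan via the transpose
all(1027f0 .== view(acc_dim1(b), d1, :))
# true, per the transpose test above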
I performed the same test on the CPU:
c = ones(Float32, d1, d2);
reduce(+, c, dims = 1)
# 1×3 Matrix{Float32}:
# 1027.0 1027.0 1027.0
view(accumulate(+, c, dims = 1), 1023:d1, :) # ...works fine
# 5×3 view(::Matrix{Float32}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0
# 1025.0 1025.0 1025.0
# 1026.0 1026.0 1026.0
# 1027.0 1027.0 1027.0
I also ran some tests with more columns, and the issue seems to affect the latter half of the columns in the array, i.e. the rightmost `floor(size(b, 2) / 2)` columns.
d1 = 1027; d2 = 5;
b = CUDA.ones(d1, d2)
# 1027×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0 1.0 1.0
# 1.0 1.0 1.0 1.0 1.0
# ⋮
# 1.0 1.0 1.0 1.0 1.0
# 1.0 1.0 1.0 1.0 1.0
view(accumulate(+, b, dims = 1), 1023:d1, :)
# 5×5 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0 1024.0 1024.0
# 1025.0 1025.0 1025.0 1.0 1.0
# 1026.0 1026.0 1026.0 2.0 2.0
# 1027.0 1027.0 1027.0 3.0 3.0
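To identify the affected columns programmatically, I compared against a CPU reference (a quick sketch):
gpu = Array(accumulate(+, b, dims = 1)) # bring the GPU scan back to the host
cpu = accumulate(+, Array(b), dims = 1) # CPU reference
findall(vec(any(gpu .!= cpu, dims = 1))) # columns where the two disagree
# [4, 5], i.e. the rightmost floor(5 / 2) columns, matching the view above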
I also tried different numbers of rows, and a pattern emerges, as shown below:
r = Bool[]
for i in 1:(1024 * 5) # run on arrays with different numbers of rows
    d1 = i; d2 = 3;
    b = CUDA.ones(d1, d2)
    push!(r, all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), d1, :)'))
end
x = [1:1] # ranges of consecutive identical results
for i in eachindex(r) # group the results into those ranges
    if r[x[end][1]] != r[i]
        push!(x, i:i)
    else
        x[end] = x[end][1]:i
    end
end
print(join(["\n`accumulate()` on arrays with $(i[1] == i[end] ? i[1] : "$(i[1]) to $(i[end])") rows $(true == unique(r[i])[1] ? "matched `reduce()`" : "did not match `reduce()`")." for i in x]))
`accumulate()` on arrays with 1 to 1024 rows matched `reduce()`.
`accumulate()` on arrays with 1025 to 2047 rows did not match `reduce()`.
`accumulate()` on arrays with 2048 rows matched `reduce()`.
`accumulate()` on arrays with 2049 to 3071 rows did not match `reduce()`.
`accumulate()` on arrays with 3072 rows matched `reduce()`.
`accumulate()` on arrays with 3073 to 4095 rows did not match `reduce()`.
`accumulate()` on arrays with 4096 rows matched `reduce()`.
`accumulate()` on arrays with 4097 to 5119 rows did not match `reduce()`.
`accumulate()` on arrays with 5120 rows matched `reduce()`.
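If I read this right, the results match only when the number of rows is at most 1024 or an exact multiple of 1024, which makes me suspect the scan kernel processes rows in blocks of 1024. A quick check of that reading against the r vector from the loop above:
all(r[i] == (i <= 1024 || i % 1024 == 0) for i in eachindex(r)) # does the pattern hold for every tested size?
# true, going by the printout above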
Since I am quite new to CUDA programming, am I missing something here, or is this really an issue?
These are the details I get from CUDA.versioninfo():
julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.4
NVIDIA driver 470.161.3
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3
Toolchain:
- Julia: 1.8.0
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce GTX 1660 Ti (sm_75, 4.902 GiB / 5.805 GiB available)