Is accumulate() with '+' producing the wrong results?

Hi,

I recently started experimenting with CUDA.jl, doing some exploratory work in the REPL.
While running addition tests with accumulate() and reduce(), I stumbled on what seems to be an issue with accumulate().

The problem is that, when adding, the last values returned by accumulate() along the chosen dimension are not always equal to the result of reduce() along the same dimension. Below are the tests I performed.

The problem seems to occur when accumulating along columns that have more than 1024 elements.

using CUDA

d1 = 4; d2 = 3;  # Started off with a small simple test, similar to what is in the documentation.

a = CUDA.ones(d1, d2)
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  1.0  1.0  1.0
#  1.0  1.0  1.0
#  1.0  1.0  1.0
#  1.0  1.0  1.0

all(reduce(+, a, dims = 2) .== view(accumulate(+, a, dims = 2), :, d2))  # do results match?
# true

all(reduce(+, a, dims = 1) .== view(accumulate(+, a, dims = 1), d1, :)')
# true

reduce(+, a, dims = 1)  # the addition is correct for reduce()
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  4.0  4.0  4.0

accumulate(+, a, dims = 1)  # the addition is correct for accumulate()
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  1.0  1.0  1.0
#  2.0  2.0  2.0
#  3.0  3.0  3.0
#  4.0  4.0  4.0

d1 = 1027; d2 = 3;  # changed the dimensions
b = CUDA.ones(d1, d2)
# 1027×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  1.0  1.0  1.0
#  1.0  1.0  1.0
#  ⋮         
#  1.0  1.0  1.0
#  1.0  1.0  1.0
#  1.0  1.0  1.0

all(reduce(+, b, dims = 2) .== view(accumulate(+, b, dims = 2), :, d2))  # accumulate() along the rows matches the results from reduce().
# true

all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), d1, :)')  # ...however, accumulate() along the columns does not!
# false

reduce(+, b, dims = 1)  # reduce returns 1027.0 for each column as expected
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  1027.0  1027.0  1027.0

view(accumulate(+, b, dims = 1), 1023:d1, :)  # accumulate() 'resets' the count from row 1025 in column 3
# 5×3 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
#  1023.0  1023.0  1023.0
#  1024.0  1024.0  1024.0
#  1025.0  1025.0     1.0
#  1026.0  1026.0     2.0
#  1027.0  1027.0     3.0

all(1027f0 .== reduce(+, b', dims = 2))
# true

all(1027f0 .== view(accumulate(+, b', dims = 2), :, d1))  # seems to work on the transpose of b
# true

I performed the same test on the CPU:

c = ones(Float32, d1, d2);
reduce(+, c, dims = 1)
# 1×3 Matrix{Float32}:
#  1027.0  1027.0  1027.0

view(accumulate(+, c, dims = 1), 1023:d1, :)  # ...works fine
# 5×3 view(::Matrix{Float32}, 1023:1027, :) with eltype Float32:
#  1023.0  1023.0  1023.0
#  1024.0  1024.0  1024.0
#  1025.0  1025.0  1025.0
#  1026.0  1026.0  1026.0
#  1027.0  1027.0  1027.0

I also ran tests with more columns, and the issue seems to affect the latter half of the columns in the array, i.e. the rightmost `floor(Int, size(b, 2) / 2)` columns.

d1 = 1027; d2 = 5;
b = CUDA.ones(d1, d2)
# 1027×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  1.0  1.0  1.0  1.0  1.0
#  1.0  1.0  1.0  1.0  1.0
#  ⋮                   
#  1.0  1.0  1.0  1.0  1.0
#  1.0  1.0  1.0  1.0  1.0

view(accumulate(+, b, dims = 1), 1023:d1, :)
# 5×5 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
#  1023.0  1023.0  1023.0  1023.0  1023.0
#  1024.0  1024.0  1024.0  1024.0  1024.0
#  1025.0  1025.0  1025.0     1.0     1.0
#  1026.0  1026.0  1026.0     2.0     2.0
#  1027.0  1027.0  1027.0     3.0     3.0
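The affected columns can also be listed programmatically rather than by eyeballing the output. A sketch (requires the GPU, reusing the same `b` as above; the host copies are just to make the comparison straightforward):

```julia
# Which columns' final cumulative value disagrees with the column total?
totals = vec(Array(reduce(+, b, dims = 1)))          # per-column sums
finals = Array(accumulate(+, b, dims = 1))[end, :]   # last row of the scan
findall(totals .!= finals)                           # indices of affected columns
```

For the 1027×5 example above this should list columns 4 and 5, matching the view printed earlier.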

I also tried different numbers of rows, and a pattern emerges, as shown below:

r = Bool[]
for i in 1:(1024 * 5)  # run on arrays with 1 to 5120 rows
    b = CUDA.ones(i, 3)
    push!(r, all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), i, :)'))
end

x = [1:1]  # group consecutive row counts with the same result into ranges
for i in eachindex(r)
    if r[x[end][1]] != r[i]
        push!(x, i:i)
    else
        x[end] = x[end][1]:i
    end
end

print(join(["\n`accumulate()` on arrays with $(i[1] == i[end] ? i[1] : "$(i[1]) to $(i[end])") rows $(all(r[i]) ? "matched" : "did not match") `reduce()`." for i in x]))

`accumulate()` on arrays with 1 to 1024 rows matched `reduce()`.
`accumulate()` on arrays with 1025 to 2047 rows did not match `reduce()`.
`accumulate()` on arrays with 2048 rows matched `reduce()`.
`accumulate()` on arrays with 2049 to 3071 rows did not match `reduce()`.
`accumulate()` on arrays with 3072 rows matched `reduce()`.
`accumulate()` on arrays with 3073 to 4095 rows did not match `reduce()`.
`accumulate()` on arrays with 4096 rows matched `reduce()`.
`accumulate()` on arrays with 4097 to 5119 rows did not match `reduce()`.
`accumulate()` on arrays with 5120 rows matched `reduce()`.
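For what it's worth, 1024 is the typical CUDA thread-block size, so the break at 1025 rows makes me suspect the scan is computed block-by-block and the cross-block carry is mishandled for some columns. The following CPU-only sketch is pure speculation on my part (not CUDA.jl's actual kernel), but it shows how a missing carry produces exactly the "reset at row 1025" shape seen above; it does not explain why exact multiples of 1024 still come out right.

```julia
# Hypothetical illustration: a block-wise inclusive scan that forgets to
# propagate the carry from the previous block.
function blockscan_missing_carry(v::Vector{Float32}; blocksize = 1024)
    out = similar(v)
    for start in 1:blocksize:length(v)
        stop = min(start + blocksize - 1, length(v))
        # scan within the block only
        out[start:stop] .= accumulate(+, view(v, start:stop))
        # a correct implementation would now add the previous block's
        # carry, out[start - 1], to out[start:stop]
    end
    return out
end

blockscan_missing_carry(ones(Float32, 1027))[1023:1027]
# 1023.0, 1024.0, 1.0, 2.0, 3.0 — the same shape as the GPU result above
```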

Since I am quite new to CUDA programming: am I missing something here, or is this a real issue?

These are the details from CUDA.versioninfo():

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.4
NVIDIA driver 470.161.3

Libraries:

  • CUBLAS: 11.11.3
  • CURAND: 10.3.0
  • CUFFT: 10.9.0
  • CUSOLVER: 11.4.1
  • CUSPARSE: 11.7.5
  • CUPTI: 18.0.0
  • NVML: 11.0.0+470.161.3

Toolchain:

  • Julia: 1.8.0
  • LLVM: 13.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
  • Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
0: NVIDIA GeForce GTX 1660 Ti (sm_75, 4.902 GiB / 5.805 GiB available)

I don't have time for an in-depth look right now, but if a simple operation like accumulate(+) behaves differently on a CuArray than on an Array, something is off. Could you file an issue with a minimal reproducer on the CUDA.jl repository?
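For anyone picking this up, a minimal reproducer could look like the following (a sketch distilled from the tests above; assumes a CUDA-capable GPU):

```julia
using CUDA

b = CUDA.ones(1025, 2)                     # 1025 rows is enough to trigger the mismatch
reduce(+, b, dims = 1)                     # 1×2 result: 1025.0 in each column
Array(accumulate(+, b, dims = 1))[end, :]  # should also be 1025.0 in each column,
                                           # but the last column is expected to be wrong
```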