Hi,
I recently started experimenting with CUDA and was doing some exploratory work in the REPL.
I was running some addition tests with accumulate() and reduce() when I stumbled on what seems to be an issue with accumulate().
The problem I came across is that, when adding, the last values given by accumulate() along the chosen dimension are not always equal to the result given by reduce() along that same dimension. Below are the tests I performed.
The issue seems to occur when accumulating down the columns (dims = 1) of arrays with more than 1024 rows.
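In short, a minimal reproducer (the full exploration follows):
using CUDA
b = CUDA.ones(1027, 3)
all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), 1027, :)')
# false (expected: true)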
using CUDA
d1 = 4; d2 = 3; # started with a small, simple test, similar to the one in the documentation
a = CUDA.ones(d1, d2)
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
all(reduce(+, a, dims = 2) .== view(accumulate(+, a, dims = 2), :, d2)) # do results match?
# true
all(reduce(+, a, dims = 1) .== view(accumulate(+, a, dims = 1), d1, :)')
# true
reduce(+, a, dims = 1) # the addition is correct for reduce()
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 4.0 4.0 4.0
accumulate(+, a, dims = 1) # the addition is correct for accumulate()
# 4×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 2.0 2.0 2.0
# 3.0 3.0 3.0
# 4.0 4.0 4.0
d1 = 1027; d2 = 3; # same test, but now with more than 1024 rows
b = CUDA.ones(d1, d2)
# 1027×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# ⋮
# 1.0 1.0 1.0
# 1.0 1.0 1.0
# 1.0 1.0 1.0
all(reduce(+, b, dims = 2) .== view(accumulate(+, b, dims = 2), :, d2)) # accumulate() along the rows matches the results from reduce().
# true
all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), d1, :)') # ...however, accumulate() along the columns does not!
# false
reduce(+, b, dims = 1) # reduce returns 1027.0 for each column as expected
# 1×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1027.0 1027.0 1027.0
view(accumulate(+, b, dims = 1), 1023:d1, :) # accumulate() 'resets' the count from row 1025 in column 3
# 5×3 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0
# 1025.0 1025.0 1.0
# 1026.0 1026.0 2.0
# 1027.0 1027.0 3.0
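To rule out a floating-point rounding artifact, I also compared against the CPU result; the discrepancy is far too large for that (a quick check, reusing b from above):
maximum(abs.(Array(accumulate(+, b, dims = 1)) .- accumulate(+, Array(b), dims = 1))) # largest GPU-vs-CPU difference
# 1024.0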
all(1027f0 .== reduce(+, b', dims = 2))
# true
all(1027f0 .== view(accumulate(+, b', dims = 2), :, d1)) # seems to work on the transpose of b
# true
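Given that, a possible workaround for now might be to scan along the rows of the transpose and transpose back (just a sketch based on the observation above; acc_dim1 is a name I made up, and I have not checked how it performs):
acc_dim1(x) = copy(accumulate(+, x', dims = 2)') # hypothetical helper: column-wise scan via the transpose
all(1027f0 .== view(acc_dim1(b), d1, :))
# true, per the transpose test above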
I performed the same test on the CPU:
c = ones(Float32, d1, d2);
reduce(+, c, dims = 1)
# 1×3 Matrix{Float32}:
# 1027.0 1027.0 1027.0
view(accumulate(+, c, dims = 1), 1023:d1, :) # ...works fine
# 5×3 view(::Matrix{Float32}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0
# 1025.0 1025.0 1025.0
# 1026.0 1026.0 1026.0
# 1027.0 1027.0 1027.0
I also ran some tests with more columns, and the issue seems to affect the latter half of the columns in the array, i.e. the rightmost `floor(size(b, 2) / 2)` columns.
d1 = 1027; d2 = 5;
b = CUDA.ones(d1, d2)
# 1027×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
# 1.0 1.0 1.0 1.0 1.0
# 1.0 1.0 1.0 1.0 1.0
# ⋮
# 1.0 1.0 1.0 1.0 1.0
# 1.0 1.0 1.0 1.0 1.0
view(accumulate(+, b, dims = 1), 1023:d1, :)
# 5×5 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1023:1027, :) with eltype Float32:
# 1023.0 1023.0 1023.0 1023.0 1023.0
# 1024.0 1024.0 1024.0 1024.0 1024.0
# 1025.0 1025.0 1025.0 1.0 1.0
# 1026.0 1026.0 1026.0 2.0 2.0
# 1027.0 1027.0 1027.0 3.0 3.0
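To identify the affected columns programmatically, I compared against a CPU reference (a quick sketch):
gpu = Array(accumulate(+, b, dims = 1)) # bring the GPU scan back to the host
cpu = accumulate(+, Array(b), dims = 1) # CPU reference
findall(vec(any(gpu .!= cpu, dims = 1))) # columns where the two disagree
# [4, 5], i.e. the rightmost floor(5 / 2) columns, matching the view above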
I also tried different numbers of rows, and a pattern emerges, as shown below:
r = Bool[]
for i in 1:(1024 * 5) # run on arrays with different numbers of rows
    d1 = i; d2 = 3;
    b = CUDA.ones(d1, d2)
    push!(r, all(reduce(+, b, dims = 1) .== view(accumulate(+, b, dims = 1), d1, :)'))
end
x = [1:1] # ranges of consecutive identical results
for i in eachindex(r) # group the results into those ranges
    if r[x[end][1]] != r[i]
        push!(x, i:i)
    else
        x[end] = x[end][1]:i
    end
end
print(join(["\n`accumulate()` on arrays with $(i[1] == i[end] ? i[1] : "$(i[1]) to $(i[end])") rows $(true == unique(r[i])[1] ? "matched `reduce()`" : "did not match `reduce()`")." for i in x]))
`accumulate()` on arrays with 1 to 1024 rows matched `reduce()`.
`accumulate()` on arrays with 1025 to 2047 rows did not match `reduce()`.
`accumulate()` on arrays with 2048 rows matched `reduce()`.
`accumulate()` on arrays with 2049 to 3071 rows did not match `reduce()`.
`accumulate()` on arrays with 3072 rows matched `reduce()`.
`accumulate()` on arrays with 3073 to 4095 rows did not match `reduce()`.
`accumulate()` on arrays with 4096 rows matched `reduce()`.
`accumulate()` on arrays with 4097 to 5119 rows did not match `reduce()`.
`accumulate()` on arrays with 5120 rows matched `reduce()`.
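If I read this right, the results match only when the number of rows is at most 1024 or an exact multiple of 1024, which makes me suspect the scan kernel processes rows in blocks of 1024. A quick check of that reading against the r vector from the loop above:
all(r[i] == (i <= 1024 || i % 1024 == 0) for i in eachindex(r)) # does the pattern hold for every tested size?
# true, going by the printout above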
Since I am quite new to CUDA programming, am I missing something here, or is this really an issue?
These are the details I get from CUDA.versioninfo():
julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.4
NVIDIA driver 470.161.3
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3
Toolchain:
- Julia: 1.8.0
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA GeForce GTX 1660 Ti (sm_75, 4.902 GiB / 5.805 GiB available)