Result of inner product of two CuArrays with views is incorrect

Hello,

I am encountering a discrepancy when taking the inner product of two CUDA matrices, with and without the use of @view. The results differ by approximately 1% in Float32.

Could someone kindly explain what might be causing this and suggest a potential solution?

I initially suspected this could be due to reordering of floating-point operations under the hood, but to my understanding that should only affect the result by roughly O(10*eps(Float32)). Furthermore, the discrepancy only appears when using @view.
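
For reference, a quick back-of-the-envelope check (using the numbers printed by the MWE below) shows the observed deviation is roughly four orders of magnitude larger than what reordering noise could explain:

abs(81.175540f0 - 79.925390f0) / 79.925390f0  # ≈ 1.6e-2, observed relative deviation
10 * eps(Float32)                             # ≈ 1.2e-6, rough scale of reordering noise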

Below is a minimal working example (MWE):

using CUDA, BenchmarkTools, LinearAlgebra

const N = 20
const T = Float32
insideI = CartesianIndices((2:N-1,2:N-1))

# Declare CPU array
A = rand(T,N,N)
B = rand(T,N,N)
# Copy to GPU
Ag = A |> CuArray
Bg = B |> CuArray

println("W/O view:")
# My results were something like: CPU: 79.925385, GPU: 79.925390
# Only a small difference.
println("CPU: ",((A[insideI]) ⋅ (B[insideI])), ", GPU: ", ((Ag[insideI]) ⋅ (Bg[insideI])))
println("W/  view:")
# My results were something like: CPU: 79.925390, GPU: 81.175540
# The GPU result deviates by quite a bit!
println("CPU: ",((@view A[insideI]) ⋅ (@view B[insideI])), ", GPU: ", ((@view Ag[insideI]) ⋅ (@view Bg[insideI])))

# In case you want to see the benchmark results
# bresultCPUNoView = @benchmark ($A[insideI]) ⋅ ($B[insideI])
# bresultCPUView = @benchmark (@view $A[insideI]) ⋅ (@view $B[insideI])
# bresultGPUNoView = @benchmark ($Ag[insideI]) ⋅ ($Bg[insideI])
# bresultGPUView = @benchmark (@view $Ag[insideI]) ⋅ (@view $Bg[insideI])
# display(bresultCPUNoView)
# display(bresultCPUView)
# display(bresultGPUNoView)
# display(bresultGPUView)

My platform is:
GPU: NVIDIA GeForce MX350
Julia: 1.11.1
CUDA.jl: v5.5.2
LinearAlgebra.jl: v1.11.0


Ah interesting, we’re simply dispatching to cublasXdot, which is invalid for your particular 2D views: they are strided but not contiguous (the stride between columns is larger than the column length), so no single BLAS1 increment can describe them. I guess our wrappers should only accept either a contiguous CuArray (which can be reinterpreted as a CuVector) or a strided CuVector (and not a strided CuArray, as they currently do).
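
For illustration, here's a minimal sketch of why a single BLAS1 increment cannot describe such a view (inspecting a CPU view, which has the same strides as the corresponding CuArray view from the MWE):

N = 20
A = rand(Float32, N, N)
V = @view A[CartesianIndices((2:N-1, 2:N-1))]  # same indexing pattern as the MWE

size(V)     # (18, 18)
strides(V)  # (1, 20): each column holds 18 elements, but columns start 20 elements apart
# cublasXdot takes a pointer, a length, and a single increment; no increment can
# reproduce the 2-element gap between consecutive columns, so treating this view
# as a strided vector reads the wrong elements.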

Or maybe, for simplicity, we should only allow these CUBLAS operations on vectors, and implement the generic case using broadcast and mapreduce.
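
In the meantime, a rough user-side workaround sketch (not benchmarked here) is to avoid the BLAS1 path entirely, either by materializing the views into contiguous CuArrays or by reducing over the views with broadcast, which runs as generic GPU kernels:

using CUDA, LinearAlgebra

N = 20
insideI = CartesianIndices((2:N-1, 2:N-1))
Ag = CUDA.rand(Float32, N, N)
Bg = CUDA.rand(Float32, N, N)

# Indexing without @view copies the block into a fresh, contiguous CuArray,
# so the CUBLAS dot sees a plain dense layout.
dot(Ag[insideI], Bg[insideI])

# Alternatively, reduce over the views directly; the broadcast and the sum
# never hit the BLAS1 wrapper.
sum(@view(Ag[insideI]) .* @view(Bg[insideI]))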

Should be fixed by CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. by maleadt Β· Pull Request #2528 Β· JuliaGPU/CUDA.jl Β· GitHub
