Result of inner product of two CuArrays with views is incorrect

Hello,

I am encountering a discrepancy when taking the inner product of two CUDA matrices, with and without the use of @view. The results differ by approximately 1% in Float32.

Could someone kindly explain what might be causing this and suggest a potential solution?

I initially suspected this could be due to reordering of floating-point operations under the hood, but to my understanding that should only affect the result by roughly O(10*eps(Float32)). Furthermore, the discrepancy only appears when using @view.
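
For reference, a quick back-of-the-envelope check (using the numbers printed by the MWE below) shows the observed deviation is roughly four orders of magnitude larger than what reordering noise could explain:

abs(81.175540f0 - 79.925390f0) / 79.925390f0  # ≈ 1.6e-2, observed relative deviation
10 * eps(Float32)                             # ≈ 1.2e-6, rough scale of reordering noise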

Below is a minimal working example (MWE):

using CUDA, BenchmarkTools, LinearAlgebra

const N = 20
const T = Float32
insideI = CartesianIndices((2:N-1,2:N-1))

# Declare CPU array
A = rand(T,N,N)
B = rand(T,N,N)
# Copy to GPU
Ag = A |> CuArray
Bg = B |> CuArray

println("W/O view:")
# My results were something like: CPU: 79.925385, GPU: 79.925390
# Only a small difference.
println("CPU: ",((A[insideI]) ⋅ (B[insideI])), ", GPU: ", ((Ag[insideI]) ⋅ (Bg[insideI])))
println("W/  view:")
# My results were something like: CPU: 79.925390, GPU: 81.175540
# The GPU result deviates by quite a bit!
println("CPU: ",((@view A[insideI]) ⋅ (@view B[insideI])), ", GPU: ", ((@view Ag[insideI]) ⋅ (@view Bg[insideI])))

# In case you want to see the benchmark results
# bresultCPUNoView = @benchmark ($A[insideI]) ⋅ ($B[insideI])
# bresultCPUView = @benchmark (@view $A[insideI]) ⋅ (@view $B[insideI])
# bresultGPUNoView = @benchmark ($Ag[insideI]) ⋅ ($Bg[insideI])
# bresultGPUView = @benchmark (@view $Ag[insideI]) ⋅ (@view $Bg[insideI])
# display(bresultCPUNoView)
# display(bresultCPUView)
# display(bresultGPUNoView)
# display(bresultGPUView)

My platform is:
GPU: NVIDIA GeForce MX350
Julia: 1.11.1
CUDA.jl: v5.5.2
LinearAlgebra.jl: v1.11.0


Ah interesting, we’re simply dispatching to cublasXdot, which is invalid for your particular 2D views: they are strided but not contiguous (the stride between columns is larger than the column length), so no single BLAS1 increment can describe them. I guess our wrappers should only accept either a contiguous CuArray (which can be reinterpreted as a CuVector) or a strided CuVector (and not a strided CuArray, as they currently do).
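
For illustration, here's a minimal sketch of why a single BLAS1 increment cannot describe such a view (inspecting a CPU view, which has the same strides as the corresponding CuArray view from the MWE):

N = 20
A = rand(Float32, N, N)
V = @view A[CartesianIndices((2:N-1, 2:N-1))]  # same indexing pattern as the MWE

size(V)     # (18, 18)
strides(V)  # (1, 20): each column holds 18 elements, but columns start 20 elements apart
# cublasXdot takes a pointer, a length, and a single increment; no increment can
# reproduce the 2-element gap between consecutive columns, so treating this view
# as a strided vector reads the wrong elements.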

Or maybe, for simplicity, we should only allow these CUBLAS operations on vectors, and implement the generic case using broadcast and mapreduce.
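
In the meantime, a rough user-side workaround sketch (not benchmarked here) is to avoid the BLAS1 path entirely, either by materializing the views into contiguous CuArrays or by reducing over the views with broadcast, which runs as generic GPU kernels:

using CUDA, LinearAlgebra

N = 20
insideI = CartesianIndices((2:N-1, 2:N-1))
Ag = CUDA.rand(Float32, N, N)
Bg = CUDA.rand(Float32, N, N)

# Indexing without @view copies the block into a fresh, contiguous CuArray,
# so the CUBLAS dot sees a plain dense layout.
dot(Ag[insideI], Bg[insideI])

# Alternatively, reduce over the views directly; the broadcast and the sum
# never hit the BLAS1 wrapper.
sum(@view(Ag[insideI]) .* @view(Bg[insideI]))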

Should be fixed by CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. by maleadt Β· Pull Request #2528 Β· JuliaGPU/CUDA.jl Β· GitHub
