# Accelerate solving many matrix problems

Say I have a whole lot of independent small matrix problems (posed as regression problems below). Each individual problem is too small for effective CUDA deployment, yet I am wondering if there is an opportunity for substantial performance gains by executing these operations collectively on the gpu. See below for an example.

Obviously the naive gpu approach doesn’t cut the mustard, but I’m not sure how to construct a CUDA kernel for something like this, or (even better) if there is a high-level interface or package that would work here.

Any thoughts or suggestions on how to proceed?

``````###Simple example
using Revise, LinearAlgebra, BenchmarkTools
function manyregmwe(T=1000,N=5000,K=10)

#generate the data
manysmallX = [rand(rand((N÷2):N),K) for i in 1:T]
manysmally = [rand(size(x,1)) for x in manysmallX]

linest(X,y) = cholesky!(X'*X) \ (X'*y)
bs = [Vector{Float64}(undef, K) for i in 1:T]

#cpu version
function cpurunreg()
bs[i] .= linest(manysmallX[i], manysmally[i])
end
end

#convert the data to gpu arrays
manysmallXcu = cu.(manysmallX)
manysmallycu = cu.(manysmally)
bscu = cu.(bs)
function gpurunreg()
for i in 1:T
bscu[i] .= linest(manysmallXcu[i], manysmallycu[i])
end
end

#bench the methods
@info("cpu: ")
@btime \$cpurunreg()

@info("gpu: ")
@btime \$gpurunreg()

#make sure the solutions are equivelent
@assert reduce(hcat, bs) ≈ reduce(hcat, Vector.(bscu))

return nothing
end

manyregmwe()

``````

Output:

``````[ Info: cpu:
17.911 ms (7038 allocations: 1.27 MiB)
[ Info: gpu:
449.887 ms (321747 allocations: 12.32 MiB)
``````

To be fair, you should probably time also the transfer of the data to the GPU, in my experience, it can take a while. If the data is already on the GPU for some other reason, you can of course skip this.

2 Likes

You can also try different factorizations to see if it was the cholesky implementation that happened to be slow. Try `qr(X) \y; svd(X) \y`
Qr should be quite fast.

`qr(X)\y` and `svd(X)\y` just crashes. For `qr(X)\y` the error is `MethodError: no method matching ldiv!`. I can get it to work on the gpu as a qr decomposition if I use

``````  function linestqr(X,y)
q = qr(X)
return (q.R'*q.R)\(X'*y)
end
``````

But the performance is even worse than with the original cholesky! based function (which is what I would expect on the CPU as well).

There’s no accepted approach to do multiple matrix factorizations concurrently on the gpu?

There’s definitely a cuBLAS function for batched QR factorization, but I’m not familiar enough with CUDA.jl to know how you’d call it from Julia.

*edit: `CUDA.jl/test/cublas.jl` has a few examples:

``````            # generate matrices
A = [rand(elty,m,n) for i in 1:10]
# move to device
d_A = CuArray{elty, 2}[]
for i in 1:length(A)
push!(d_A,CuArray(A[i]))
end
tau, d_A = CUBLAS.geqrf_batched!(d_A)
``````
1 Like

Better yet, CUDA has a function for batched least-squares solutions which you can call from CUDA.jl with `CUBLAS.gels_batched!('N', d_A, d_y)`.

1 Like

Seems promising, though I’m having a hard time getting this to work. I was looking at the CUDA.jl code for it and noticed I was hitting the following snippit:

``````...
for (As,Cs) in zip(A,C)
m,n = size(As)
mC,nC = size(Cs)
if n != mC
throw(DimensionMismatch(""))
end
end
...
``````

My understanding from reading the LAPACK docs is it solves the overdetermined system `Ax=C` or in least squares terms `Xb=y`. So why does the number of columns of X need to equal the number of rows of y? I feel like I am missing something super obvious here…

I think the weird dimensionality is a bug, albeit one also in the CUBLAS docs. I’ve opened a PR here which sketches a way to fix it within CUDA.jl.

So funny thing here: Even the fixed `CUBLAS.gels_batched!` is slower than the threaded CPU baseline. However, I was able to stitch together code snippets of other batched solution methods to get a nice performance bump. Note that the arrays above are of heterogeneous height, so I had to write a routine to pad each array with zeros.

``````#code is borrowed liberally from CUDA.jl/lib/cusolver/wrappers.jl
#                            and CUDA.jl/lib/cublas/wrappers.jl
for (rfname, rsname, T) in
((:cusolverDnSpotrfBatched,:cusolverDnSpotrsBatched,:Float32),
(:cusolverDnDpotrfBatched,:cusolverDnDpotrsBatched,:Float64))
@eval begin
function hyperreg(X::Vector{<:CuMatrix{\$T}},
y::Vector{<:CuMatrix{\$T}})
cuuplo = CUDA.CUBLAS.cublasfill('U')

if length(X) != length(y)
throw(DimensionMismatch(""))
end

mx,nx = size(X[1])
my,ny = size(y[1])

if mx != my
throw(ArgumentError("X Matrix must have same num of rows as y matrix"))
end

for (Xs, ys) in zip(X,y)
if (size(Xs) != (mx, nx)) || (size(ys) != (my, ny))
throw(DimensionMismatch("Dimensions of batched array entries must be invariant"))
end
end

A = CUDA.CUBLAS.gemm_batched('T','N', X, X) #X'X
B = CUDA.CUBLAS.gemm_batched('T','N', X, y) #X'y

#Here we compute cholesky!(X'X)
n = size(A[1],1)
lda = max(1,stride(A[1],2))
Aptrs = CUDA.CUBLAS.unsafe_batch(A)
info  = zero(Cint)
infoarray = CUDA.zeros(Cint, length(A))
CUDA.CUSOLVER.\$rfname(
CUDA.CUSOLVER.dense_handle(), cuuplo, n, Aptrs, lda, infoarray, length(A))
if info != 0
throw(ArgumentError,string("Invalid value at ",-info))
end

# now solve C\(X'y) where C is the cholesky decomposition computed above
nrhs = size(B[1])[2]
ldb = max(1,stride(B[1],2))
Bptrs = CUDA.CUBLAS.unsafe_batch(B)
info  = zero(Cint)
infoarray = CUDA.zeros(Cint, length(A))
CUDA.CUSOLVER.\$rsname(
CUDA.CUSOLVER.dense_handle(), cuuplo, n, nrhs, Aptrs, lda, Bptrs, ldb, infoarray, length(A))
CUDA.CUSOLVER.unsafe_free!(Aptrs)
CUDA.CUSOLVER.unsafe_free!(Bptrs)

if info != 0
throw(ArgumentError,string("Invalid value at ",-info))
end

B
end end end

function pad0(M::TM, trows::Int, ::Type{T} = eltype(TM)) where {TM<:AbstractMatrix, T}
padded = vcat(M, TM(zeros(T, trows-size(M,1), size(M,2))))
end

function pad0(V::TV, trows::Int, ::Type{T} = eltype(TV)) where {TV<:AbstractVector, T}
end

function manyregmwe(T=1000,N=5000,K=10)

#generate the data
manysmallX = [rand(rand((N÷2):N),K) for i in 1:T]
manysmally = [rand(size(x,1)) for x in manysmallX]

linest(X,y) = cholesky!(X'*X) \ (X'*y)
bs = [Vector{Float64}(undef, K) for i in 1:T]

#cpu version
function cpurunreg()
@inbounds bs[i] .= linest(manysmallX[i], manysmally[i])
end
end

function gpurunreg()
bscu = hyperreg(manysmallXcu,manysmallycu)
return bscu
end
cpurunreg()
ys = gpurunreg()

@info("cpu1: ")
@btime \$cpurunreg()

@info("gpu: ")
@btime \$gpurunreg()

#make sure the solutions are equivelent
cpusol = reduce(hcat, bs)
gpusol = reduce(hcat, vec.(Matrix.(ys)))
cpusol ≈ gpusol || error("cpu ≠ gpu")

return nothing
end

manyregmwe()
``````

Output:

``````[ Info: cpu1:
18.275 ms (7038 allocations: 1.27 MiB)
[ Info: gpu:
3.965 ms (16141 allocations: 490.14 KiB)
``````