Accelerate solving many matrix problems

Say I have a whole lot of independent small matrix problems (posed as regression problems below). Each individual problem is too small for effective CUDA deployment, yet I am wondering if there is an opportunity for substantial performance gains by executing these operations collectively on the gpu. See below for an example.

Obviously the naive gpu approach doesn’t cut the mustard, but I’m not sure how to construct a CUDA kernel for something like this, or (even better) if there is a high-level interface or package that would work here.

Any thoughts or suggestions on how to proceed?

###Simple example
using Revise, LinearAlgebra, BenchmarkTools
function manyregmwe(T=1000,N=5000,K=10)

  #generate the data
  manysmallX = [rand(rand((N÷2):N),K) for i in 1:T]
  manysmally = [rand(size(x,1)) for x in manysmallX]

  linest(X,y) = cholesky!(X'*X) \ (X'*y)
  bs = [Vector{Float64}(undef, K) for i in 1:T]

  #cpu version
  function cpurunreg()
    Threads.@threads for i in 1:T
      bs[i] .= linest(manysmallX[i], manysmally[i])
    end
  end

  #convert the data to gpu arrays
  manysmallXcu = cu.(manysmallX)
  manysmallycu = cu.(manysmally)
  bscu = cu.(bs)
  function gpurunreg()
    for i in 1:T
      bscu[i] .= linest(manysmallXcu[i], manysmallycu[i])
    end
  end

  #bench the methods
  @info("cpu: ")
  @btime $cpurunreg()

  @info("gpu: ")
  @btime $gpurunreg()

  #make sure the solutions are equivelent
  @assert reduce(hcat, bs) ≈ reduce(hcat, Vector.(bscu))

  return nothing
end


manyregmwe()

Output:

[ Info: cpu:
  17.911 ms (7038 allocations: 1.27 MiB)
[ Info: gpu:
  449.887 ms (321747 allocations: 12.32 MiB)

To be fair, you should probably time also the transfer of the data to the GPU, in my experience, it can take a while. If the data is already on the GPU for some other reason, you can of course skip this.

2 Likes

You can also try different factorizations to see if it was the cholesky implementation that happened to be slow. Try qr(X) \y; svd(X) \y
Qr should be quite fast.

qr(X)\y and svd(X)\y just crashes. For qr(X)\y the error is MethodError: no method matching ldiv!. I can get it to work on the gpu as a qr decomposition if I use

  function linestqr(X,y)
    q = qr(X)
    return (q.R'*q.R)\(X'*y)
  end

But the performance is even worse than with the original cholesky! based function (which is what I would expect on the CPU as well).

There’s no accepted approach to do multiple matrix factorizations concurrently on the gpu?

There’s definitely a cuBLAS function for batched QR factorization, but I’m not familiar enough with CUDA.jl to know how you’d call it from Julia.

*edit: CUDA.jl/test/cublas.jl has a few examples:

            # generate matrices
            A = [rand(elty,m,n) for i in 1:10]
            # move to device
            d_A = CuArray{elty, 2}[]
            for i in 1:length(A)
                push!(d_A,CuArray(A[i]))
            end
            tau, d_A = CUBLAS.geqrf_batched!(d_A)

Better yet, CUDA has a function for batched least-squares solutions which you can call from CUDA.jl with CUBLAS.gels_batched!('N', d_A, d_y).

Seems promising, though I’m having a hard time getting this to work. I was looking at the CUDA.jl code for it and noticed I was hitting the following snippit:

...
            for (As,Cs) in zip(A,C)
                m,n = size(As)
                mC,nC = size(Cs)
                if n != mC
                    throw(DimensionMismatch(""))
                end
            end
...

My understanding from reading the LAPACK docs is it solves the overdetermined system Ax=C or in least squares terms Xb=y. So why does the number of columns of X need to equal the number of rows of y? I feel like I am missing something super obvious here…

I think the weird dimensionality is a bug, albeit one also in the CUBLAS docs. I’ve opened a PR here which sketches a way to fix it within CUDA.jl.

So funny thing here: Even the fixed CUBLAS.gels_batched! is slower than the threaded CPU baseline. However, I was able to stitch together code snippets of other batched solution methods to get a nice performance bump. Note that the arrays above are of heterogeneous height, so I had to write a routine to pad each array with zeros.

#code is borrowed liberally from CUDA.jl/lib/cusolver/wrappers.jl
#                            and CUDA.jl/lib/cublas/wrappers.jl
for (rfname, rsname, T) in
    ((:cusolverDnSpotrfBatched,:cusolverDnSpotrsBatched,:Float32),
    (:cusolverDnDpotrfBatched,:cusolverDnDpotrsBatched,:Float64))
    @eval begin
function hyperreg(X::Vector{<:CuMatrix{$T}},
                      y::Vector{<:CuMatrix{$T}})
    cuuplo = CUDA.CUBLAS.cublasfill('U')

    if length(X) != length(y)
        throw(DimensionMismatch(""))
    end

    mx,nx = size(X[1])
    my,ny = size(y[1])

    if mx != my
        throw(ArgumentError("X Matrix must have same num of rows as y matrix"))
    end

    for (Xs, ys) in zip(X,y)
        if (size(Xs) != (mx, nx)) || (size(ys) != (my, ny))
            throw(DimensionMismatch("Dimensions of batched array entries must be invariant"))
        end
    end

    A = CUDA.CUBLAS.gemm_batched('T','N', X, X) #X'X
    B = CUDA.CUBLAS.gemm_batched('T','N', X, y) #X'y

    #Here we compute cholesky!(X'X)
    n = size(A[1],1)
    lda = max(1,stride(A[1],2))
    Aptrs = CUDA.CUBLAS.unsafe_batch(A)
    info  = zero(Cint)
    infoarray = CUDA.zeros(Cint, length(A))
    CUDA.CUSOLVER.$rfname(
      CUDA.CUSOLVER.dense_handle(), cuuplo, n, Aptrs, lda, infoarray, length(A))
    if info != 0
        throw(ArgumentError,string("Invalid value at ",-info))
    end

    # now solve C\(X'y) where C is the cholesky decomposition computed above
    nrhs = size(B[1])[2]
    ldb = max(1,stride(B[1],2))
    Bptrs = CUDA.CUBLAS.unsafe_batch(B)
    info  = zero(Cint)
    infoarray = CUDA.zeros(Cint, length(A))
    CUDA.CUSOLVER.$rsname(
      CUDA.CUSOLVER.dense_handle(), cuuplo, n, nrhs, Aptrs, lda, Bptrs, ldb, infoarray, length(A))
    CUDA.CUSOLVER.unsafe_free!(Aptrs)
    CUDA.CUSOLVER.unsafe_free!(Bptrs)

    if info != 0
        throw(ArgumentError,string("Invalid value at ",-info))
    end

    B
end end end


function pad0(M::TM, trows::Int, ::Type{T} = eltype(TM)) where {TM<:AbstractMatrix, T}
  padded = vcat(M, TM(zeros(T, trows-size(M,1), size(M,2))))
  sM = view(padded, 1:size(M,1), 1:size(M,2))
  return (sM, padded)
end

function pad0(V::TV, trows::Int, ::Type{T} = eltype(TV)) where {TV<:AbstractVector, T}
  padded = vcat(V, TV(zeros(T, trows-length(V))))
  sV = view(padded, 1:length(V))
  return (sV, padded)
end

function manyregmwe(T=1000,N=5000,K=10)

  #generate the data
  manysmallX = [rand(rand((N÷2):N),K) for i in 1:T]
  manysmally = [rand(size(x,1)) for x in manysmallX]

  linest(X,y) = cholesky!(X'*X) \ (X'*y)
  bs = [Vector{Float64}(undef, K) for i in 1:T]

  #cpu version
  function cpurunreg()
    Threads.@threads for i in 1:T
      @inbounds bs[i] .= linest(manysmallX[i], manysmally[i])
    end
  end

  manysmallXcu = cu.((X->pad0(X,N)[2]).(manysmallX))
  manysmallycu = cu.((y->Matrix(reshape(pad0(y,N)[2], N, 1))).(manysmally))#cu.(manysmally)

  function gpurunreg()
    bscu = hyperreg(manysmallXcu,manysmallycu)
    return bscu
  end
  cpurunreg()
  ys = gpurunreg()

  @info("cpu1: ")
  @btime $cpurunreg()

  @info("gpu: ")
  @btime $gpurunreg()

  #make sure the solutions are equivelent
  cpusol = reduce(hcat, bs)
  gpusol = reduce(hcat, vec.(Matrix.(ys)))
  cpusol ≈ gpusol || error("cpu ≠ gpu")

  return nothing
end


manyregmwe()

Output:

[ Info: cpu1:
  18.275 ms (7038 allocations: 1.27 MiB)
[ Info: gpu:
  3.965 ms (16141 allocations: 490.14 KiB)