CUDA Matrix inverse

Hello!

I'm trying to invert a large matrix on the GPU, and I'm using this code:

using CUDA, LinearAlgebra
function cuinv(m::Matrix{T}) where T
    if size(m, 1) != size(m, 2)
        throw(ArgumentError("Matrix not square."))
    end
    A = CuArray(m)
    B = CuArray(Matrix{T}(I(size(A, 1))))             # identity as the right-hand side
    A, ipiv = CUDA.CUSOLVER.getrf!(A)                  # LU factorization on the GPU
    Matrix{T}(CUDA.CUSOLVER.getrs!('N', A, ipiv, B))   # solve A * X = I, i.e. X = inv(m)
end
m = rand(100, 100)
isapprox(cuinv(m), inv(m))
#true

Is this an optimal solution?

I based the code above on this example:

function Base.:\(_A, _B)
    A, B = copy(_A), copy(_B)
    A, ipiv = CUDA.CUSOLVER.getrf!(A)
    return CUDA.CUSOLVER.getrs!('N', A, ipiv, B)
end

This isn’t optimal because you are taking the inverse of a matrix. Don’t do this. Instead you should use a solver (if you only need the inverse once) or a factorization. This approach will be faster and/or significantly more numerically stable.
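For concreteness, here is a minimal CPU sketch of the "factorize once, solve many right-hand sides" pattern this refers to (variable names are just illustrative; the same idea carries over to the GPU wrappers above):

using LinearAlgebra
A = rand(100, 100)
F = lu(A)                        # factorize once
b1, b2 = rand(100), rand(100)
x1 = F \ b1                      # each solve reuses the LU factors; no explicit inverse is formed
x2 = F \ b2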

Computing the matrix inverse explicitly is often wasteful, compared to computing the solution directly for your right-hand side(s), but it is not numerically unstable.

“Why Shouldn't I Invert That Matrix?” shows that inverting and then multiplying has about 2x the forward error and 10x the backward error of an LU-based solve in the well-conditioned case, and arbitrarily worse error for ill-conditioned matrices. Furthermore, the empirical results are in line with the theoretical ones.
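As a quick, purely illustrative way to see the backward-error gap in Julia, one can compare an LU-based solve with multiplication by the explicit inverse on an ill-conditioned matrix (the Hilbert matrix below is just a convenient example):

using LinearAlgebra
n = 12
A = [1 / (i + j - 1) for i in 1:n, j in 1:n]   # Hilbert matrix, severely ill-conditioned
b = A * ones(n)
x_lu  = A \ b                                  # LU-based solve
x_inv = inv(A) * b                             # explicit inverse, then multiply
norm(A * x_lu  - b) / norm(b)                  # relative backward error of the solve
norm(A * x_inv - b) / norm(b)                  # typically orders of magnitude larger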

My task looks like this: compute x' * inv(m) * x, where m does not change and there are many different x.

using LinearAlgebra, BenchmarkTools

m = rand(10, 10)
x = rand(10, 100)

function f1(m, x)            # factorize once, then right-divide: x' * inv(m) via x' / lu(m)
    lud = lu(m)
    for i = 1:100
        x' / lud
    end
end
function f2(m, x)            # form the explicit inverse once, then multiply
    im = inv(m)
    for i = 1:100
        x' * im
    end
end
function f3(m, x)            # factorize once, then left-divide: inv(m) * x via lu(m) \ x
    lud = lu(m)
    for i = 1:100
        lud \ x
    end
end

@benchmark f1($m, $x)
@benchmark f2($m, $x)
@benchmark f3($m, $x)

Results:

julia> @benchmark f1($m, $x)
BenchmarkTools.Trial:
  memory estimate:  794.78 KiB
  allocs estimate:  102
  --------------
  minimum time:     946.100 μs (0.00% GC)
  median time:      1.200 ms (0.00% GC)
  mean time:        1.646 ms (4.78% GC)
  maximum time:     12.848 ms (90.30% GC)
  --------------
  samples:          3037
  evals/sample:     1

julia> @benchmark f2($m, $x)
BenchmarkTools.Trial:
  memory estimate:  799.94 KiB
  allocs estimate:  104
  --------------
  minimum time:     448.300 μs (0.00% GC)
  median time:      560.500 μs (0.00% GC)
  mean time:        913.309 μs (6.89% GC)
  maximum time:     32.566 ms (0.00% GC)
  --------------
  samples:          5468
  evals/sample:     1

julia> @benchmark f3($m, $x)
BenchmarkTools.Trial:
  memory estimate:  794.78 KiB
  allocs estimate:  102
  --------------
  minimum time:     5.068 ms (0.00% GC)
  median time:      10.902 ms (0.00% GC)
  mean time:        10.238 ms (0.66% GC)
  maximum time:     28.456 ms (0.00% GC)
  --------------
  samples:          489
  evals/sample:     1

Variant 2 is the fastest.

Being worse by a constant factor doesn’t mean it is “numerically unstable”. (The definition of numerical stability is insensitive to constant factors.) This article, for example, argues that multiplication by the matrix inverse is ordinarily backwards stable.


As a general rule of thumb, computing and re-using an explicit matrix inverse will be more efficient than re-using the LU factors for each solve if you have many more vectors than the dimension of your matrix, assuming the matrices are dense (not sparse or otherwise specially structured). (You should also consider pre-allocating result vectors and using mul! to work in-place.) This was the case in your little benchmark, but is it also true of your real application?
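As an illustrative sketch of the preallocation point (the names here are made up), the explicit inverse and an output buffer can both be reused across the different x:

using LinearAlgebra
m   = rand(10, 10)
x   = rand(10, 100)
im  = inv(m)                              # computed once, reused for every x
out = Matrix{Float64}(undef, 100, 10)     # preallocated result buffer
mul!(out, x', im)                         # writes x' * im into out without allocating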

That being said, there might be more efficient methods if you are specifically interested in x^T M^{-1} x. For example, if you are computing the average of x^T M^{-1} x for many random x, then there is an analytical formula for the result in terms of the trace of M^{-1}. More generally, the computation of x^T M^{-1} x is closely related to trace estimation. If M is a large sparse symmetric matrix, for example, there are efficient iterative methods called “Lanczos quadrature” for computing x^T M^{-1} x without computing M^{-1} x.
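For the averaging case, here is a small illustrative check of the trace identity, assuming x has i.i.d. standard-normal entries so that E[x' * inv(M) * x] = tr(inv(M)):

using LinearAlgebra, Statistics
n  = 50
M  = Symmetric(rand(n, n)) + n * I        # a well-conditioned symmetric positive definite matrix
iM = inv(M)
vals = [dot(x, iM * x) for x in eachcol(randn(n, 10_000))]
mean(vals)                                # ≈ tr(iM)
tr(iM)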


We had that discussion already.


I compute a REML objective function for big matrices. For small dimensions everything works fine (I'm using the sweep operator, with mul! and preallocated buffers in the code). But for large matrices I want to try computing part of it on the GPU.

logREML(\theta, \beta) = -\frac{N-p}{2} - \frac{1}{2}\sum_{i=1}^n \log|V_i| - \frac{1}{2}\log\left|\sum_{i=1}^n X_i' V_i^{-1} X_i\right| - \frac{1}{2}\sum_{i=1}^n (y_i - X_i\beta)' V_i^{-1} (y_i - X_i\beta)

Is there some reason you couldn’t write individual terms of that like this?

function individual_neg_log_likelihood(...)
      V  = Symmetric(...)  # assemble your covariance matrix
      Vf = cholesky!(V)    # assuming positive definite
      logdet(Vf) + dot(X, Vf\X) + dot(y - X*beta, Vf\(y - X*beta)) # or something
end

neg_log_lik(...) = sum(individual_neg_log_likelihood, ...)

I’m not being careful with signs or anything (or the fact that the logdet is outside the sum in your second expression), but hopefully you see the point. The operations that you need to do with V_i are certainly doable with a factorization. I’ve also never used a GPU, but I would be pretty shocked if it weren’t possible to compute a Cholesky factorization and do some solves on the GPU.

Quick edit here: If X is a matrix and not a vector, you should change the call to dot in the second term to something like X'*(Vf\X), or something more thoughtful. I should have mentioned that I was assuming X was a vector, which probably doesn't make sense here.
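For what it's worth, here is a rough, untested sketch of how one term of that sum could be evaluated on the GPU using the getrf!/getrs! wrappers from the first post (LU rather than Cholesky for illustration; the corresponding potrf!/potrs! wrappers should be analogous). The function name and argument layout are purely hypothetical:

using CUDA, LinearAlgebra

# Hypothetical helper: log|V_i|, X_i' V_i^{-1} X_i and the residual quadratic form
# for one subject, reusing a single LU factorization of V_i on the GPU.
function gpu_reml_term(V::Matrix{T}, X::Matrix{T}, r::Vector{T}) where T
    dV = CuArray(V)                                     # covariance block V_i
    dX = CuArray(X)                                     # design block X_i
    dr = CuArray(r)                                     # residual y_i - X_i*beta
    dV, ipiv = CUDA.CUSOLVER.getrf!(dV)                 # LU factorization on the GPU
    ViX = CUDA.CUSOLVER.getrs!('N', dV, ipiv, copy(dX))                 # V_i^{-1} X_i
    Vir = CUDA.CUSOLVER.getrs!('N', dV, ipiv, reshape(copy(dr), :, 1))  # V_i^{-1} r_i
    logdetV = sum(log, abs.(diag(dV)))                  # log|V_i| from the U factor (V_i assumed PD)
    XtViX   = Matrix(dX' * ViX)                         # X_i' V_i^{-1} X_i, back on the CPU
    quad    = sum(dr .* vec(Vir))                       # r_i' V_i^{-1} r_i
    return logdetV, XtViX, quad
end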


Bumping this to ask whether there is any overload of inv for CuArray nowadays. Can we just call inv(CuArray(A)) to use the GPU?

No, only for diagonal matrices.

Thank you @maleadt, is that because the functionality isn't available in the underlying CUDA library, or because the Julia package doesn't wrap it yet? :slight_smile:

CUDA provides getri, getrf and getrs, so IIUC it should be possible to implement that operation.