CUDA Matrix inverse

Hello!

I'm trying to invert a large matrix on the GPU, and I'm using this code:

using CUDA, LinearAlgebra
function cuinv(m::Matrix{T}) where T
    if size(m, 1) != size(m, 2)
        throw(ArgumentError("Matrix not square."))
    end
    A = CuArray(m)
    B = CuArray(Matrix{T}(I(size(A, 1))))             # identity as the right-hand side
    A, ipiv = CUDA.CUSOLVER.getrf!(A)                  # LU factorization on the GPU
    Matrix{T}(CUDA.CUSOLVER.getrs!('N', A, ipiv, B))   # solve A * X = I, i.e. X = inv(m)
end
m = rand(100, 100)
isapprox(cuinv(m), inv(m))
#true

Is this an optimal solution?

I based the code above on this example:

function Base.:\(_A, _B)
    A, B = copy(_A), copy(_B)
    A, ipiv = CUDA.CUSOLVER.getrf!(A)
    return CUDA.CUSOLVER.getrs!('N', A, ipiv, B)
end

This isn’t optimal because you are taking the inverse of a matrix. Don’t do this. Instead you should use a solver (if you only need the inverse once) or a factorization. This approach will be faster and/or significantly more numerically stable.
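For concreteness, here is a minimal CPU sketch of the "factorize once, solve many right-hand sides" pattern this refers to (variable names are just illustrative; the same idea carries over to the GPU wrappers above):

using LinearAlgebra
A = rand(100, 100)
F = lu(A)                        # factorize once
b1, b2 = rand(100), rand(100)
x1 = F \ b1                      # each solve reuses the LU factors; no explicit inverse is formed
x2 = F \ b2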

Computing the matrix inverse explicitly is often wasteful, compared to computing the solution directly for your right-hand side(s), but it is not numerically unstable.

“Why Shouldn't I Invert That Matrix?” shows that inverting and then multiplying has about 2x the forward error and 10x the backward error of an LU-based solve in the well-conditioned case, and arbitrarily worse error for ill-conditioned matrices. Furthermore, the empirical results are in line with the theoretical ones.
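As a quick, purely illustrative way to see the backward-error gap in Julia, one can compare an LU-based solve with multiplication by the explicit inverse on an ill-conditioned matrix (the Hilbert matrix below is just a convenient example):

using LinearAlgebra
n = 12
A = [1 / (i + j - 1) for i in 1:n, j in 1:n]   # Hilbert matrix, severely ill-conditioned
b = A * ones(n)
x_lu  = A \ b                                  # LU-based solve
x_inv = inv(A) * b                             # explicit inverse, then multiply
norm(A * x_lu  - b) / norm(b)                  # relative backward error of the solve
norm(A * x_inv - b) / norm(b)                  # typically orders of magnitude larger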

My task looks like this: compute x' * inv(m) * x, where m does not change and there are many different x.

using LinearAlgebra, BenchmarkTools

m = rand(10, 10)
x = rand(10, 100)

function f1(m, x)            # factorize once, then right-divide: x' * inv(m) via x' / lu(m)
    lud = lu(m)
    for i = 1:100
        x' / lud
    end
end
function f2(m, x)            # form the explicit inverse once, then multiply
    im = inv(m)
    for i = 1:100
        x' * im
    end
end
function f3(m, x)            # factorize once, then left-divide: inv(m) * x via lu(m) \ x
    lud = lu(m)
    for i = 1:100
        lud \ x
    end
end

@benchmark f1($m, $x)
@benchmark f2($m, $x)
@benchmark f3($m, $x)

Results:

julia> @benchmark f1($m, $x)
BenchmarkTools.Trial:
  memory estimate:  794.78 KiB
  allocs estimate:  102
  --------------
  minimum time:     946.100 μs (0.00% GC)
  median time:      1.200 ms (0.00% GC)
  mean time:        1.646 ms (4.78% GC)
  maximum time:     12.848 ms (90.30% GC)
  --------------
  samples:          3037
  evals/sample:     1

julia> @benchmark f2($m, $x)
BenchmarkTools.Trial:
  memory estimate:  799.94 KiB
  allocs estimate:  104
  --------------
  minimum time:     448.300 μs (0.00% GC)
  median time:      560.500 μs (0.00% GC)
  mean time:        913.309 μs (6.89% GC)
  maximum time:     32.566 ms (0.00% GC)
  --------------
  samples:          5468
  evals/sample:     1

julia> @benchmark f3($m, $x)
BenchmarkTools.Trial:
  memory estimate:  794.78 KiB
  allocs estimate:  102
  --------------
  minimum time:     5.068 ms (0.00% GC)
  median time:      10.902 ms (0.00% GC)
  mean time:        10.238 ms (0.66% GC)
  maximum time:     28.456 ms (0.00% GC)
  --------------
  samples:          489
  evals/sample:     1

Variant 2 is the fastest.

Being worse by a constant factor doesn’t mean it is “numerically unstable”. (The definition of numerical stability is insensitive to constant factors.) This article, for example, argues that multiplication by the matrix inverse is ordinarily backwards stable.


As a general rule of thumb, computing and re-using an explicit matrix inverse will be more efficient than re-using the LU factors for each solve if you have many more vectors than the dimension of your matrix, assuming the matrices are dense (not sparse or otherwise specially structured). (You should also consider pre-allocating result vectors and using mul! to work in-place.) This was the case in your little benchmark, but is it also true of your real application?
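As an illustrative sketch of the preallocation point (the names here are made up), the explicit inverse and an output buffer can both be reused across the different x:

using LinearAlgebra
m   = rand(10, 10)
x   = rand(10, 100)
im  = inv(m)                              # computed once, reused for every x
out = Matrix{Float64}(undef, 100, 10)     # preallocated result buffer
mul!(out, x', im)                         # writes x' * im into out without allocating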

That being said, there might be more efficient methods if you are specifically interested in x^T M^{-1} x. For example, if you are computing the average of x^T M^{-1} x for many random x, then there is an analytical formula for the result in terms of the trace of M^{-1}. More generally, the computation of x^T M^{-1} x is closely related to trace estimation. If M is a large sparse symmetric matrix, for example, there are efficient iterative methods called “Lanczos quadrature” for computing x^T M^{-1} x without computing M^{-1} x.
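For the averaging case, here is a small illustrative check of the trace identity, assuming x has i.i.d. standard-normal entries so that E[x' * inv(M) * x] = tr(inv(M)):

using LinearAlgebra, Statistics
n  = 50
M  = Symmetric(rand(n, n)) + n * I        # a well-conditioned symmetric positive definite matrix
iM = inv(M)
vals = [dot(x, iM * x) for x in eachcol(randn(n, 10_000))]
mean(vals)                                # ≈ tr(iM)
tr(iM)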


We had that discussion already.


I compute a REML objective function for big matrices. For small dimensions everything works fine (I'm using the sweep operator, with mul! and preallocated buffers in the code). But for large matrices I want to try computing part of it on the GPU.

logREML(\theta, \beta) = -\frac{N-p}{2} - \frac{1}{2}\sum_{i=1}^n \log|V_i| - \frac{1}{2}\log\left|\sum_{i=1}^n X_i' V_i^{-1} X_i\right| - \frac{1}{2}\sum_{i=1}^n (y_i - X_i\beta)' V_i^{-1} (y_i - X_i\beta)

Is there some reason you couldn’t write individual terms of that like this?

function individual_neg_log_likelihood(...)
      V  = Symmetric(...)  # assemble your covariance matrix
      Vf = cholesky!(V)    # assuming positive definite
      logdet(Vf) + dot(X, Vf\X) + dot(y - X*beta, Vf\(y - X*beta)) # or something
end

neg_log_lik(...) = sum(individual_neg_log_likelihood, ...)

I’m not being careful with signs or anything (or the fact that the logdet is outside the sum in your second expression), but hopefully you see the point. The operations that you need to do with V_i are certainly doable with a factorization. I’ve also never used a GPU, but I would be pretty shocked if it weren’t possible to compute a Cholesky factorization and do some solves on the GPU.

Quick edit here: If X is a matrix and not a vector, you should change the call to dot in the second term to something like X'*(Vf\X), or something more thoughtful. I should have mentioned that I was assuming X was a vector, which probably doesn't make sense here.
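For what it's worth, here is a rough, untested sketch of how one term of that sum could be evaluated on the GPU using the getrf!/getrs! wrappers from the first post (LU rather than Cholesky for illustration; the corresponding potrf!/potrs! wrappers should be analogous). The function name and argument layout are purely hypothetical:

using CUDA, LinearAlgebra

# Hypothetical helper: log|V_i|, X_i' V_i^{-1} X_i and the residual quadratic form
# for one subject, reusing a single LU factorization of V_i on the GPU.
function gpu_reml_term(V::Matrix{T}, X::Matrix{T}, r::Vector{T}) where T
    dV = CuArray(V)                                     # covariance block V_i
    dX = CuArray(X)                                     # design block X_i
    dr = CuArray(r)                                     # residual y_i - X_i*beta
    dV, ipiv = CUDA.CUSOLVER.getrf!(dV)                 # LU factorization on the GPU
    ViX = CUDA.CUSOLVER.getrs!('N', dV, ipiv, copy(dX))                 # V_i^{-1} X_i
    Vir = CUDA.CUSOLVER.getrs!('N', dV, ipiv, reshape(copy(dr), :, 1))  # V_i^{-1} r_i
    logdetV = sum(log, abs.(diag(dV)))                  # log|V_i| from the U factor (V_i assumed PD)
    XtViX   = Matrix(dX' * ViX)                         # X_i' V_i^{-1} X_i, back on the CPU
    quad    = sum(dr .* vec(Vir))                       # r_i' V_i^{-1} r_i
    return logdetV, XtViX, quad
end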


Bumping this to ask whether there is any overload of inv for CuArray nowadays. Can we just call inv(CuArray(A)) to use the GPU?

No, only for diagonal matrices.

Thank you @maleadt, is that because the functionality isn't available in the underlying CUDA library, or because the Julia package doesn't wrap it yet? :slight_smile:

CUDA provides getri, getrf and getrs, so IIUC it should be possible to implement that operation.