Linear system solution not working in CUDA

eveningsilverfox · February 27, 2025, 6:29pm

I am trying to do a linear solve using LU decomposition and ldiv! on my GPU. However, the results from the GPU do not match those from the CPU. Any ideas where I am going wrong?

EDIT: I tried using the simple backslash solve too. Even that is giving wrong results.

using MKL
using LinearAlgebra
using CUDA


function cpulusolve(X,V)
    LUc = lu(X, check = false);
    ldiv!(LUc, V);
    return V
end

function gpulusolve(X,V)
    LUg = lu(cu(X), check = false);
    Vg = cu(V);
    ldiv!(LUg, Vg);
    return Array(Vg)
end


function cpusolve(X,V)
    V = X\V;
    return V
end

function gpusolve(X,V)
    Vg = cu(X)\cu(V);
    return Array(Vg)
end

N = 4000;
X = rand(ComplexF64, N,N);
V = rand(ComplexF64, N);

Vcpu = cpulusolve(X,V);
Vgpu = gpulusolve(X,V);

println(norm(Vgpu-Vcpu))

Vcpu1 = cpusolve(X,V);
Vgpu1 = gpusolve(X,V);

println(norm(Vgpu1-Vcpu1))

ChrisRackauckas · March 1, 2025, 9:09am

What do you mean by wrong? LU implementations will likely be off by an amount that is related to the condition number, so you shouldn’t expect that difference to be zero.

cwittens · March 1, 2025, 10:55am

Maybe try something like X = I + 1e-3 * rand(ComplexF64, N,N); to get a better conditioned matrix. Then the divinations should be closer to machine precision.

eveningsilverfox · March 1, 2025, 11:07am

UPDATE: So I was getting big errors on using “cu” to upload the arrays to the gpu. But on changing to “CuArray”, the errors vanished.
But the code now runs much slower. So much so that I am not seeing any speedup even for large matrices for size 10000*10000 and above.
I am trying this.
Note: I did not use @btime because it takes too long for the large sizes, but I find that the conclusions regarding time do remain unchanged.

using MKL
using LinearAlgebra
using CUDA
using BenchmarkTools
using Plots

function cpulusolve(X,V)
    LUc = lu(X, check = false);
    ldiv!(LUc, V);
    return V
end

function gpulusolve(X,V)
    dX = CuArray(X);
    dV = CuArray(V);

    LUc = CUSOLVER.getrf!(dX);
    CUSOLVER.getrs!('N',LUc[1],LUc[2],dV)

    return Array(dV)
end

function cpusolve(X,V)
    V = X\V;
    return V
end

function gpusolve(X,V)
    V =  Array(CuArray(X) \ CuArray(V)); #cu is faster but gives wrong results. CuArray gives no speedup but gives right results
    return Array(V)
end

NN = 10;
NNar = trunc.(Int, range(1000,13000,NN));
cputar = zeros(Float64, NN);
gputar = zeros(Float64, NN);
for hi = 1:NN
    println(hi)

    N = NNar[hi];
    X = rand(ComplexF64, N,N);
    V = rand(ComplexF64, N);

    t5 = time();
    Vcpu1 = cpusolve(X,V);
    t6 = time();
    
    t7 = time();
    Vgpu1 = gpusolve(X,V);
    t8 = time();    

    cputar[hi] = t6-t5;
    gputar[hi] = t8-t7;
end

plot(NNar, cputar, lc=:blue, label="cpu")
plot!(NNar, gputar, lc=:red, label="gpu")

ChrisRackauckas · March 1, 2025, 11:51am

cu changes to Float32 arrays.

Topic		Replies	Views
Solves the linear system using CuArrays.jl GPU	3	1613	December 27, 2019
(fast) sparse solve on GPU General Usage question , gpu	7	1688	December 28, 2019
Sparse GPU linear solve from documentation fails Numerics	2	136	June 17, 2025
Cannot solve Ax=B on GPU with A CuSparseMatrixCSC New to Julia cuda , sparse	9	1103	May 20, 2021
Linear system problems GPU	35	2490	February 4, 2021

Linear system solution not working in CUDA

Related topics