Linear system solution not working in CUDA

I am trying to do a linear solve using LU decomposition and ldiv! on my GPU. However, the results from the GPU do not match those from the CPU. Any ideas where I am going wrong?

EDIT: I tried using the simple backslash solve too. Even that is giving wrong results.

using MKL
using LinearAlgebra
using CUDA


function cpulusolve(X,V)
    LUc = lu(X, check = false);
    ldiv!(LUc, V);
    return V
end

function gpulusolve(X,V)
    LUg = lu(cu(X), check = false);
    Vg = cu(V);
    ldiv!(LUg, Vg);
    return Array(Vg)
end


function cpusolve(X,V)
    V = X\V;
    return V
end

function gpusolve(X,V)
    Vg = cu(X)\cu(V);
    return Array(Vg)
end

N = 4000;
X = rand(ComplexF64, N,N);
V = rand(ComplexF64, N);

Vcpu = cpulusolve(X,V);
Vgpu = gpulusolve(X,V);

println(norm(Vgpu-Vcpu))

Vcpu1 = cpusolve(X,V);
Vgpu1 = gpusolve(X,V);

println(norm(Vgpu1-Vcpu1))

What do you mean by wrong? LU implementations will likely be off by an amount that is related to the condition number, so you shouldn’t expect that difference to be zero.

Maybe try something like X = I + 1e-3 * rand(ComplexF64, N,N); to get a better conditioned matrix. Then the divinations should be closer to machine precision.

UPDATE: So I was getting big errors on using “cu” to upload the arrays to the gpu. But on changing to “CuArray”, the errors vanished.
But the code now runs much slower. So much so that I am not seeing any speedup even for large matrices for size 10000*10000 and above.
I am trying this.
Note: I did not use @btime because it takes too long for the large sizes, but I find that the conclusions regarding time do remain unchanged.

using MKL
using LinearAlgebra
using CUDA
using BenchmarkTools
using Plots

function cpulusolve(X,V)
    LUc = lu(X, check = false);
    ldiv!(LUc, V);
    return V
end

function gpulusolve(X,V)
    dX = CuArray(X);
    dV = CuArray(V);

    LUc = CUSOLVER.getrf!(dX);
    CUSOLVER.getrs!('N',LUc[1],LUc[2],dV)

    return Array(dV)
end

function cpusolve(X,V)
    V = X\V;
    return V
end

function gpusolve(X,V)
    V =  Array(CuArray(X) \ CuArray(V)); #cu is faster but gives wrong results. CuArray gives no speedup but gives right results
    return Array(V)
end

NN = 10;
NNar = trunc.(Int, range(1000,13000,NN));
cputar = zeros(Float64, NN);
gputar = zeros(Float64, NN);
for hi = 1:NN
    println(hi)

    N = NNar[hi];
    X = rand(ComplexF64, N,N);
    V = rand(ComplexF64, N);

    t5 = time();
    Vcpu1 = cpusolve(X,V);
    t6 = time();
    
    t7 = time();
    Vgpu1 = gpusolve(X,V);
    t8 = time();    

    cputar[hi] = t6-t5;
    gputar[hi] = t8-t7;
end

plot(NNar, cputar, lc=:blue, label="cpu")
plot!(NNar, gputar, lc=:red, label="gpu")

cu changes to Float32 arrays.

1 Like