I am trying to do a linear solve using LU decomposition and ldiv! on my GPU. However, the results from the GPU do not match those from the CPU. Any ideas where I am going wrong?
EDIT: I tried using the simple backslash solve too. Even that is giving wrong results.
using MKL
using LinearAlgebra
using CUDA
function cpulusolve(X,V)
LUc = lu(X, check = false);
Vc = copy(V); # solve into a copy: ldiv! overwrites its second argument, and mutating the caller's V would hand later solvers the solution instead of the original right-hand side
ldiv!(LUc, Vc);
return Vc
end
function gpulusolve(X,V)
LUg = lu(cu(X), check = false);
Vg = cu(V);
ldiv!(LUg, Vg);
return Array(Vg)
end
function cpusolve(X,V)
V = X\V;
return V
end
function gpusolve(X,V)
Vg = cu(X)\cu(V);
return Array(Vg)
end
N = 4000;
X = rand(ComplexF64, N,N);
V = rand(ComplexF64, N);
Vcpu = cpulusolve(X,V);
Vgpu = gpulusolve(X,V);
println(norm(Vgpu-Vcpu))
Vcpu1 = cpusolve(X,V);
Vgpu1 = gpusolve(X,V);
println(norm(Vgpu1-Vcpu1))
What do you mean by wrong? LU implementations will likely be off by an amount that is related to the condition number, so you shouldn’t expect that difference to be zero.
Maybe try something like X = I + 1e-3 * rand(ComplexF64, N,N);
to get a better-conditioned matrix. Then the deviations should be closer to machine precision.
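To illustrate the point on the CPU alone (a sketch, not from the thread): simulate a "second implementation" by perturbing the matrix at the eps() level and compare the two solutions. For the well-conditioned matrix the difference is near machine precision; for the plain random matrix it is amplified by the condition number.

```julia
using LinearAlgebra

N = 500
b = rand(ComplexF64, N)
for X in (rand(ComplexF64, N, N), I + 1e-3 * rand(ComplexF64, N, N))
    x  = X \ b
    # Perturb X by a relative eps()-sized amount to mimic a different
    # but equally valid floating-point factorization
    x2 = (X .* (1 .+ eps() .* randn(N, N))) \ b
    println("cond ≈ ", cond(X), ",  relative solution difference ≈ ", norm(x2 - x) / norm(x))
end
```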
UPDATE: So I was getting big errors when using “cu” to upload the arrays to the GPU. On changing to “CuArray”, the errors vanished.
But the code now runs much slower. So much so that I am not seeing any speedup even for large matrices of size 10000×10000 and above.
Here is what I am trying.
Note: I did not use @btime because it takes too long for the large sizes, but I find that the conclusions regarding timing remain unchanged.
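One caveat when timing GPU code with plain time(): GPU operations can run asynchronously, and the first call also pays a one-time compilation cost. A minimal sketch of a fairer timing, assuming CUDA.jl's CUDA.@sync macro (time_gpu_solve is a hypothetical helper name):

```julia
using CUDA, LinearAlgebra

function time_gpu_solve(X, V)
    dX, dV = CuArray(X), CuArray(V)
    dX \ dV                 # warm-up run: triggers compilation, excluded from timing
    t = time()
    CUDA.@sync dX \ dV      # @sync blocks until the GPU has actually finished
    return time() - t
end
```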
using MKL
using LinearAlgebra
using CUDA
using BenchmarkTools
using Plots
function cpulusolve(X,V)
LUc = lu(X, check = false);
Vc = copy(V); # ldiv! overwrites its second argument, so solve into a copy
ldiv!(LUc, Vc);
return Vc
end
function gpulusolve(X,V)
dX = CuArray(X);
dV = CuArray(V);
LUc = CUSOLVER.getrf!(dX); # LU factorization in place: LUc[1] holds the factors, LUc[2] the pivots
CUSOLVER.getrs!('N', LUc[1], LUc[2], dV) # solve using the factors and pivots ('N' = no transpose)
return Array(dV)
end
function cpusolve(X,V)
V = X\V;
return V
end
function gpusolve(X,V)
# cu is faster but gives wrong results (it downcasts to Float32);
# CuArray keeps Float64, gives correct results, but shows no speedup
V = Array(CuArray(X) \ CuArray(V));
return V
end
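If single-precision accuracy is acceptable, the speed of the cu path can be opted into explicitly instead of implicitly, by constructing the device arrays with a Float32 element type. A sketch, assuming CUDA.jl's parameterized CuArray constructors (gpusolve32 is a hypothetical name): most consumer GPUs have far higher Float32 than Float64 throughput, which would explain why the cu path looked fast and the CuArray path does not.

```julia
using CUDA, LinearAlgebra

# Explicitly single-precision GPU solve: same speed behavior as cu,
# but the precision choice is visible at the call site.
function gpusolve32(X, V)
    Vg = CuArray{ComplexF32}(X) \ CuArray{ComplexF32}(V)
    return Array(Vg)   # result is ComplexF32, accurate to single precision only
end
```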
NN = 10;
NNar = trunc.(Int, range(1000,13000,NN));
cputar = zeros(Float64, NN);
gputar = zeros(Float64, NN);
for hi = 1:NN
println(hi)
N = NNar[hi];
X = rand(ComplexF64, N,N);
V = rand(ComplexF64, N);
t5 = time();
Vcpu1 = cpusolve(X,V);
t6 = time();
t7 = time();
Vgpu1 = gpusolve(X,V);
t8 = time();
cputar[hi] = t6-t5;
gputar[hi] = t8-t7;
end
plot(NNar, cputar, lc=:blue, label="cpu")
plot!(NNar, gputar, lc=:red, label="gpu")
“cu” converts arrays to Float32 element types (ComplexF64 becomes ComplexF32), which explains the loss of accuracy; “CuArray” preserves the element type.
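A minimal check of this (assuming a working CUDA.jl installation):

```julia
using CUDA

X = rand(ComplexF64, 4, 4)
println(eltype(cu(X)))       # cu downcasts: ComplexF64 -> ComplexF32
println(eltype(CuArray(X)))  # CuArray preserves the element type: ComplexF64
```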