I’m analyzing the performance of julia linear solver (A,B) vs the Matlab solver.
Julia is fast only on the second call of the solver and only for small (A) matrix. Over size of A 500x500, julia solver is always slower than Matlab.

Anyone did a similar performance comparison ?

I looked for a chart of the julia’s solver (A,B) but i found nothing.
Which file of julia source contain the solver algorithm ? I can reconstruct a flow chart of the algorithm.

My testing platform is :
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core™ i7-7500U CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

Reading the code (thanks ChrisRackauckas) i’ve seen the most of the time is spent during LU factorization made by LAPACK.

To check if Intel MKL is faster i’ve wrote this script… but it doesn’t work…
The idea is to call the routine solver from MKL ( Documentation Library )

# Test MKL solve linear system
N = 1000;
A = rand(N,N);
b = rand(N,1);
c = copy(b);
pivot = zeros(N,1);
# Test MKL
const global librt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])
# Open librt
Libdl.dlopen(librt)
# 101 = LAPACK_ROW_MAJOR
function lin(A::Array{Float64}, b::Array{Float64}, pivot::Array{Float64}, N::Int64 )
ccall(("LAPACKE_sgetrs", librt),
Void, # Return type
(Cint, Cuchar, Int64, Cint, Ptr{Float64}, Ptr{Float64}, Int64, Int64, Ptr{Float64}),
101, 'N', N, 1, A, b, N, N, pivot
)
return b
end
bb = lin(A,b,pivot,N);

I do not understand why bb is not calculated (remain the same, the LAPACKE_sgetrs function should rewrite the solution in b )…
Any suggestion ?

Julia’s arrays are column major so you probably want to use 102.

s is for single precision i.e. Float32 and your input is Float64 so you should use the d version of the LAPACK routine

If you want to measure the factorization you should call dgetrf where f is for factorization instead of the dgetrs which is a solver routine for which takes the output of dgetrf as input (as well as a right-hand-side.)

Yes i want measure the LU factorization time, but for testing purpose i want measure the time requred for the solution.

@dpsanders
With the & ( isn’t deprecated ?, Ref{T} instead) i obtain the error :
“LoadError: MethodError: Cannot convert an object of type Array{Float64,2} to an object of type Float64”

Sorry if this is a little orthogonal, but it might be of interest – Julia’s built-in linear solver was a bit slow for my purposes, so I implemented a simple coordinate descent algorithm that solves ordinary least squares about an order of magnitude faster for most of my use cases, and with a significantly lower memory footprint. I’m sure it could be optimized further, but this is what I’ve been using:

function OLS_cd{T<:Real}(X::Array{T,2}, y::Array{T,1}, tolerance::T=1e-12)
N,P = size(X)
β = zeros(T,P)
r = copy(y)
ρ = ones(T,P)
while norm(ρ,Inf) > tolerance
@inbounds for j ∈ 1:P
ρ[j] = zero(T); @inbounds for i ∈ 1:N ρ[j] += X[i,j]*r[i] end; ρ[j] /= N
@inbounds for i = 1:N r[i] -= ρ[j]*X[i,j] end
β[j] += ρ[j]
end
end
return(β)
end

For a 10,000 by 1,000 dimensional problem, the benchmarks for the built-in method and then the coordinate descent are:

Built-in:

@benchmark OLS_builtin(A,b)
BenchmarkTools.Trial:
memory estimate: 768.22 mb
allocs estimate: 40087
--------------
minimum time: 66.549 s (0.15% GC)
median time: 66.549 s (0.15% GC)
mean time: 66.549 s (0.15% GC)
maximum time: 66.549 s (0.15% GC)
--------------
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%

Coordinate descent:

@benchmark OLS_cd(A,b)
BenchmarkTools.Trial:
memory estimate: 159.30 kb
allocs estimate: 186
--------------
minimum time: 9.301 s (0.00% GC)
median time: 9.301 s (0.00% GC)
mean time: 9.301 s (0.00% GC)
maximum time: 9.301 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%

This looks pretty good. I’d take out the += and use a muladd but that’s about it. If one uses Julia with -O3 (which is the default), this should auto-SIMD anyways, so I’d call it close to optimal. Then to finalize it I’d make an in-place version OLS_cd! where the user could pass in the vectors instead of allocating them. But that’s if you need to avoid BLAS for some reason. It’s probably best to do

@into ρ = X*r

and such (which is using InPlaceOps.jl to make the inplace BLAS call nicer. I always forget the actually name of the function)

Anyways, you might want to check out IterativeSolvers.jl

You live a charmed life if your use cases are well-conditioned enough for simple coordinate descent.

I Second @ChrisRackauckas in urging you to check out lsqr! in IterativeSolvers - it’s a nice implementation of a well-tested iterative scheme which is pretty robust. The algorithm has a scholarly web page which happily mentions the Julia version by Gomez and Holy.

Incidently (attn: @Tecnezio), the left-divide function wrapped in OLS_builtin does in fact just call the LAPACK qr code for this case. For big problems the iterative schemes are indeed often faster.