Linear solver \(A, B) performance vs Matlab A\b

I’m analyzing the performance of julia linear solver (A,B) vs the Matlab solver.
Julia is fast only on the second call of the solver and only for small (A) matrix. Over size of A 500x500, julia solver is always slower than Matlab.

Anyone did a similar performance comparison ?

I looked for a chart of the julia’s solver (A,B) but i found nothing.
Which file of julia source contain the solver algorithm ? I can reconstruct a flow chart of the algorithm.

My testing platform is :
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core™ i7-7500U CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

See the very lengthy discussion here:

Basically, most points there apply here too.

Seems a nice discussion, very useful.

Anyway, I’d like to reconstruct the algorithm used by julia’s solver… no one seems to know it and can be interesting have a flow chart of it. (Like these Solve systems of linear equations Ax = B for x - MATLAB mldivide \ - MathWorks Italia)

Anyone know which file of julia source contain the solver algorithm ?

As an informative example of how to find this:

julia> A = rand(4,4)
4×4 Array{Float64,2}:
 0.791559  0.706394  0.914927  0.00942085
 0.20433   0.414622  0.209776  0.747403
 0.79516   0.356356  0.691833  0.433256
 0.557565  0.131694  0.352743  0.843174

julia> b = rand(4)
4-element Array{Float64,1}:
 0.424751
 0.232078
 0.679178
 0.963042

julia> @which A\b
\(A::AbstractArray{T,2} where T, B::Union{AbstractArray{T,1},AbstractArray{T,2}} where T) in Base.LinAlg at linalg\generic.jl:757

https://github.com/JuliaLang/julia/blob/master/base/linalg/generic.jl#L757

I think quite a few people are familiar with this multi-algorithm.

3 Likes

Reading the code (thanks ChrisRackauckas) i’ve seen the most of the time is spent during LU factorization made by LAPACK.

To check if Intel MKL is faster i’ve wrote this script… but it doesn’t work…
The idea is to call the routine solver from MKL ( Documentation Library )

# Test MKL solve linear system
N = 1000;
A = rand(N,N);
b = rand(N,1);
c = copy(b);
pivot = zeros(N,1);

# Test MKL
const global librt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])

# Open librt
Libdl.dlopen(librt)

# 101  = LAPACK_ROW_MAJOR
function lin(A::Array{Float64}, b::Array{Float64}, pivot::Array{Float64}, N::Int64 )
  ccall(("LAPACKE_sgetrs", librt),
              Void, # Return type
                (Cint, Cuchar, Int64, Cint, Ptr{Float64}, Ptr{Float64}, Int64, Int64, Ptr{Float64}),
                101, 'N', N, 1, A, b, N, N, pivot
                )
  return b
end

bb = lin(A,b,pivot,N);

I do not understand why bb is not calculated (remain the same, the LAPACKE_sgetrs function should rewrite the solution in b )…
Any suggestion ?

Don’t quote me on this, but to test it out you may need to build with MKL. Instructions are here:

https://github.com/JuliaLang/julia#intel-compilers-and-math-kernel-library-mkl

That by default will build master (pre-0.6). So to A/B test you’ll either need to make sure you build the v0.5.0 tag or test vs another nightly build.

He just wants to call a function from an external library so not sure why he would need to rebuild anything?

Yes, i want call MKL as external library.
The execution has no error… but seems that the result (if the calculation is made) isn’t stored.

I think you need some “&” in the ccall.

  • Julia’s arrays are column major so you probably want to use 102.
  • s is for single precision i.e. Float32 and your input is Float64 so you should use the d version of the LAPACK routine
  • If you want to measure the factorization you should call dgetrf where f is for factorization instead of the dgetrs which is a solver routine for which takes the output of dgetrf as input (as well as a right-hand-side.)
1 Like

Yes i want measure the LU factorization time, but for testing purpose i want measure the time requred for the solution.

@dpsanders
With the & ( isn’t deprecated ?, Ref{T} instead) i obtain the error :
“LoadError: MethodError: Cannot convert an object of type Array{Float64,2} to an object of type Float64”

I’ve done it !
Here the post about my test:

Sorry if this is a little orthogonal, but it might be of interest – Julia’s built-in linear solver was a bit slow for my purposes, so I implemented a simple coordinate descent algorithm that solves ordinary least squares about an order of magnitude faster for most of my use cases, and with a significantly lower memory footprint. I’m sure it could be optimized further, but this is what I’ve been using:

function OLS_cd{T<:Real}(X::Array{T,2}, y::Array{T,1}, tolerance::T=1e-12)
    N,P = size(X)
    β = zeros(T,P)
    r = copy(y)
    ρ = ones(T,P)
    while norm(ρ,Inf) > tolerance
        @inbounds for j ∈ 1:P
            ρ[j] = zero(T); @inbounds for i ∈ 1:N ρ[j] += X[i,j]*r[i] end; ρ[j] /= N
            @inbounds for i = 1:N r[i] -= ρ[j]*X[i,j] end
            β[j] += ρ[j]
        end
    end
    return(β)
end

For a 10,000 by 1,000 dimensional problem, the benchmarks for the built-in method and then the coordinate descent are:

Built-in:

@benchmark OLS_builtin(A,b)
BenchmarkTools.Trial: 
  memory estimate:  768.22 mb
  allocs estimate:  40087
  --------------
  minimum time:     66.549 s (0.15% GC)
  median time:      66.549 s (0.15% GC)
  mean time:        66.549 s (0.15% GC)
  maximum time:     66.549 s (0.15% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Coordinate descent:

@benchmark OLS_cd(A,b)
BenchmarkTools.Trial: 
  memory estimate:  159.30 kb
  allocs estimate:  186
  --------------
  minimum time:     9.301 s (0.00% GC)
  median time:      9.301 s (0.00% GC)
  mean time:        9.301 s (0.00% GC)
  maximum time:     9.301 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
1 Like

How did you define OlS_builtin?

I wrapped it with:

function OLS_builtin{T<:Real}(X::Array{T,2}, y::Array{T,1})
    β = X\y
    return(β)
end
1 Like

@roryd
Why not use qr factorization offered by BLAS ?
Try Mkl qr factorization, is very fast :wink:

Anyway, in my opinion your example can have a huge speed-up using vectorial multiplication instead of using for cycle.

This looks pretty good. I’d take out the += and use a muladd but that’s about it. If one uses Julia with -O3 (which is the default), this should auto-SIMD anyways, so I’d call it close to optimal. Then to finalize it I’d make an in-place version OLS_cd! where the user could pass in the vectors instead of allocating them. But that’s if you need to avoid BLAS for some reason. It’s probably best to do

@into ρ = X*r

and such (which is using InPlaceOps.jl to make the inplace BLAS call nicer. I always forget the actually name of the function)

Anyways, you might want to check out IterativeSolvers.jl

In general, it’s more useful if you post complete code, including suitable matrices and vectors in this case.

You live a charmed life if your use cases are well-conditioned enough for simple coordinate descent.

I Second @ChrisRackauckas in urging you to check out lsqr! in IterativeSolvers - it’s a nice implementation of a well-tested iterative scheme which is pretty robust. The algorithm has a scholarly web page which happily mentions the Julia version by Gomez and Holy.

Incidently (attn: @Tecnezio), the left-divide function wrapped in OLS_builtin does in fact just call the LAPACK qr code for this case. For big problems the iterative schemes are indeed often faster.

4 Likes

@dpsanders thanks for the reminder – the missing code would be something like

n = 10000
p = 1000

A = randn(n,p)
b = randn(n)