Linear solver \(A, B) performance vs Matlab A\b

Tecnezio · February 13, 2017, 2:23pm

I’m analyzing the performance of julia linear solver (A,B) vs the Matlab solver.
Julia is fast only on the second call of the solver and only for small (A) matrix. Over size of A 500x500, julia solver is always slower than Matlab.

Anyone did a similar performance comparison ?

I looked for a chart of the julia’s solver (A,B) but i found nothing.
Which file of julia source contain the solver algorithm ? I can reconstruct a flow chart of the algorithm.

My testing platform is :
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core™ i7-7500U CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

Tamas_Papp · February 13, 2017, 2:36pm

See the very lengthy discussion here:

Basically, most points there apply here too.

Tecnezio · February 13, 2017, 3:38pm

Seems a nice discussion, very useful.

Anyway, I’d like to reconstruct the algorithm used by julia’s solver… no one seems to know it and can be interesting have a flow chart of it. (Like these Solve systems of linear equations Ax = B for x - MATLAB mldivide \ - MathWorks Italia)

Anyone know which file of julia source contain the solver algorithm ?

ChrisRackauckas · February 13, 2017, 3:46pm

As an informative example of how to find this:

julia> A = rand(4,4)
4×4 Array{Float64,2}:
 0.791559  0.706394  0.914927  0.00942085
 0.20433   0.414622  0.209776  0.747403
 0.79516   0.356356  0.691833  0.433256
 0.557565  0.131694  0.352743  0.843174

julia> b = rand(4)
4-element Array{Float64,1}:
 0.424751
 0.232078
 0.679178
 0.963042

julia> @which A\b
\(A::AbstractArray{T,2} where T, B::Union{AbstractArray{T,1},AbstractArray{T,2}} where T) in Base.LinAlg at linalg\generic.jl:757

https://github.com/JuliaLang/julia/blob/master/base/linalg/generic.jl#L757

I think quite a few people are familiar with this multi-algorithm.

Tecnezio · February 13, 2017, 8:44pm

Reading the code (thanks ChrisRackauckas) i’ve seen the most of the time is spent during LU factorization made by LAPACK.

To check if Intel MKL is faster i’ve wrote this script… but it doesn’t work…
The idea is to call the routine solver from MKL ( Documentation Library )

# Test MKL solve linear system
N = 1000;
A = rand(N,N);
b = rand(N,1);
c = copy(b);
pivot = zeros(N,1);

# Test MKL
const global librt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])

# Open librt
Libdl.dlopen(librt)

# 101  = LAPACK_ROW_MAJOR
function lin(A::Array{Float64}, b::Array{Float64}, pivot::Array{Float64}, N::Int64 )
  ccall(("LAPACKE_sgetrs", librt),
              Void, # Return type
                (Cint, Cuchar, Int64, Cint, Ptr{Float64}, Ptr{Float64}, Int64, Int64, Ptr{Float64}),
                101, 'N', N, 1, A, b, N, N, pivot
                )
  return b
end

bb = lin(A,b,pivot,N);

I do not understand why bb is not calculated (remain the same, the LAPACKE_sgetrs function should rewrite the solution in b )…
Any suggestion ?

ChrisRackauckas · February 13, 2017, 8:51pm

Don’t quote me on this, but to test it out you may need to build with MKL. Instructions are here:

https://github.com/JuliaLang/julia#intel-compilers-and-math-kernel-library-mkl

That by default will build master (pre-0.6). So to A/B test you’ll either need to make sure you build the v0.5.0 tag or test vs another nightly build.

kristoffer.carlsson · February 13, 2017, 8:52pm

He just wants to call a function from an external library so not sure why he would need to rebuild anything?

Tecnezio · February 13, 2017, 9:00pm

Yes, i want call MKL as external library.
The execution has no error… but seems that the result (if the calculation is made) isn’t stored.

dpsanders · February 13, 2017, 9:01pm

I think you need some “&” in the ccall.

andreasnoack · February 13, 2017, 9:10pm

Julia’s arrays are column major so you probably want to use 102.
s is for single precision i.e. Float32 and your input is Float64 so you should use the d version of the LAPACK routine
If you want to measure the factorization you should call dgetrf where f is for factorization instead of the dgetrs which is a solver routine for which takes the output of dgetrf as input (as well as a right-hand-side.)

Tecnezio · February 13, 2017, 9:19pm

Yes i want measure the LU factorization time, but for testing purpose i want measure the time requred for the solution.

@dpsanders
With the & ( isn’t deprecated ?, Ref{T} instead) i obtain the error :
“LoadError: MethodError: Cannot convert an object of type Array{Float64,2} to an object of type Float64”

Tecnezio · February 14, 2017, 2:53pm

I’ve done it !
Here the post about my test:

roryd · February 20, 2017, 6:01pm

Sorry if this is a little orthogonal, but it might be of interest – Julia’s built-in linear solver was a bit slow for my purposes, so I implemented a simple coordinate descent algorithm that solves ordinary least squares about an order of magnitude faster for most of my use cases, and with a significantly lower memory footprint. I’m sure it could be optimized further, but this is what I’ve been using:

function OLS_cd{T<:Real}(X::Array{T,2}, y::Array{T,1}, tolerance::T=1e-12)
    N,P = size(X)
    β = zeros(T,P)
    r = copy(y)
    ρ = ones(T,P)
    while norm(ρ,Inf) > tolerance
        @inbounds for j ∈ 1:P
            ρ[j] = zero(T); @inbounds for i ∈ 1:N ρ[j] += X[i,j]*r[i] end; ρ[j] /= N
            @inbounds for i = 1:N r[i] -= ρ[j]*X[i,j] end
            β[j] += ρ[j]
        end
    end
    return(β)
end

For a 10,000 by 1,000 dimensional problem, the benchmarks for the built-in method and then the coordinate descent are:

Built-in:

@benchmark OLS_builtin(A,b)
BenchmarkTools.Trial: 
  memory estimate:  768.22 mb
  allocs estimate:  40087
  --------------
  minimum time:     66.549 s (0.15% GC)
  median time:      66.549 s (0.15% GC)
  mean time:        66.549 s (0.15% GC)
  maximum time:     66.549 s (0.15% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Coordinate descent:

@benchmark OLS_cd(A,b)
BenchmarkTools.Trial: 
  memory estimate:  159.30 kb
  allocs estimate:  186
  --------------
  minimum time:     9.301 s (0.00% GC)
  median time:      9.301 s (0.00% GC)
  mean time:        9.301 s (0.00% GC)
  maximum time:     9.301 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

dpsanders · February 20, 2017, 6:14pm

How did you define OlS_builtin?

roryd · February 20, 2017, 6:20pm

I wrapped it with:

function OLS_builtin{T<:Real}(X::Array{T,2}, y::Array{T,1})
    β = X\y
    return(β)
end

Tecnezio · February 20, 2017, 9:41pm

@roryd
Why not use qr factorization offered by BLAS ?
Try Mkl qr factorization, is very fast

Anyway, in my opinion your example can have a huge speed-up using vectorial multiplication instead of using for cycle.

ChrisRackauckas · February 20, 2017, 9:52pm

This looks pretty good. I’d take out the += and use a muladd but that’s about it. If one uses Julia with -O3 (which is the default), this should auto-SIMD anyways, so I’d call it close to optimal. Then to finalize it I’d make an in-place version OLS_cd! where the user could pass in the vectors instead of allocating them. But that’s if you need to avoid BLAS for some reason. It’s probably best to do

@into ρ = X*r

and such (which is using InPlaceOps.jl to make the inplace BLAS call nicer. I always forget the actually name of the function)

Anyways, you might want to check out IterativeSolvers.jl

dpsanders · February 20, 2017, 10:36pm

In general, it’s more useful if you post complete code, including suitable matrices and vectors in this case.

Ralph_Smith · February 21, 2017, 4:07am

You live a charmed life if your use cases are well-conditioned enough for simple coordinate descent.

I Second @ChrisRackauckas in urging you to check out lsqr! in IterativeSolvers - it’s a nice implementation of a well-tested iterative scheme which is pretty robust. The algorithm has a scholarly web page which happily mentions the Julia version by Gomez and Holy.

Incidently (attn: @Tecnezio), the left-divide function wrapped in OLS_builtin does in fact just call the LAPACK qr code for this case. For big problems the iterative schemes are indeed often faster.

roryd · February 21, 2017, 4:25am

@dpsanders thanks for the reminder – the missing code would be something like

n = 10000
p = 1000

A = randn(n,p)
b = randn(n)

Topic		Replies	Views
LU factorization performance issue New to Julia linearalgebra	30	717	June 6, 2022
Performance gotcha in linear algebra lu() General Usage performance , linearalgebra	33	3618	February 11, 2020
Performance of Julia binary wrapper behind Matlab mex-file interface performance Performance matlab	30	914	March 21, 2025
Benchmark MATLAB & Julia for Matrix Operations Performance	148	19793	October 15, 2019
How to solve this Ax=b faster? Numerics linearalgebra , linearsolve	24	2559	June 7, 2022

Linear solver \(A, B) performance vs Matlab A\b

Related topics