You forgot the @inbounds, now your code doesn’t SIMD and it is 4x-8x slower depending on the element type. But once you add @inbounds you need to add code to check the sizes and give good error messages. Also, if you have matrices, make sure to iterate column-wise, otherwise your performance will be trash.
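For reference, the pattern being described looks roughly like this (a generic sketch, not code from this thread; the name scale! is made up):
function scale!(B, A, alpha)
    # Explicit check up front, so the @inbounds below cannot read out of bounds
    axes(B) == axes(A) || throw(DimensionMismatch("B and A must have the same axes"))
    # Innermost loop over the first index matches Julia's column-major layout,
    # and @inbounds lets the loop SIMD
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        B[i, j] = alpha * A[i, j]
    end
    return B
end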
I beg to disagree. True, there is more performance to be squeezed out, but look at this
module linearcombinationtest
using BenchmarkTools
using InteractiveUtils
using Test
lincomb2!(result, alpha, a, beta, b) = begin
    for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i]
    end
    return result
end

lincomb3!(result, alpha, a, beta, b, gamma, c) = begin
    for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i] + gamma * c[i]
    end
    return result
end

function test()
    M, N = 2000, 2000
    result = fill(0.0, M, N)
    alpha, beta, gamma, delta, epsilon, zeta, eta = rand(7)
    a, b, c, d, e, f, g = (rand(M, N) for _ in 1:7)
    println("Fusion versus loops")
    println("Linear Combination of 2 terms")
    @btime begin @. $result = $alpha * $a + $beta * $b; end
    @btime lincomb2!($result, $alpha, $a, $beta, $b)
    println("Linear Combination of 3 terms")
    @btime @. $result = $alpha * $a + $beta * $b + $gamma * $c
    @btime lincomb3!($result, $alpha, $a, $beta, $b, $gamma, $c)
    true
end
end
using .linearcombinationtest
linearcombinationtest.test()
runs in
Fusion versus loops
Linear Combination of 2 terms
9.634 ms (0 allocations: 0 bytes)
10.405 ms (0 allocations: 0 bytes)
Linear Combination of 3 terms
12.922 ms (0 allocations: 0 bytes)
13.469 ms (0 allocations: 0 bytes)
Throw in @inbounds and broadcast and loops are equally fast. All it takes is a few more keystrokes.
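For completeness, the @inbounds version of lincomb2! I have in mind looks like this (a sketch in the same style as the code above; the explicit size check is what makes @inbounds safe):
lincomb2_inbounds!(result, alpha, a, beta, b) = begin
    # Check the indices once, outside the loop, so skipping bounds checks is safe
    axes(result) == axes(a) == axes(b) ||
        throw(DimensionMismatch("result, a and b must have matching axes"))
    @inbounds for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i]
    end
    return result
end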
Also, that code is artificial; when you have larger vector expressions (which is definitely a common use case; that’s why matlab and numpy are enough for 90% of numeric computing) you don’t want to drop back down to indices. Vector operations are nice for a lot of applications; that’s why matlab and numpy were able to develop as well as they did while being atrociously slow in loops.
The question is: is there anything actionable here? As far as I understand, there’s:
- a performance bug with broadcasting, which is apparently just waiting for someone with the right skills and time
- the fact that Julia uses OpenBLAS by default, but MKL is essentially one command away (although there are caveats; see the sketch below)
- the fact that Julia’s default allocator is not as good as it could be for this kind of workflow; the usual answer is to discourage allocations, which is sometimes annoying
- automatic parallelization, for which there is Strided.jl, and which might see some more love once threading gets stable and fast
Is there anything else coming out of the benchmark that doesn’t fit into that?
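On the MKL point above, switching the BLAS backend really is short, assuming a recent Julia (1.7+, with libblastrampoline) and the MKL.jl package; the exact mechanism has changed between versions, so take this as a sketch:
using MKL            # load MKL.jl first so it can swap in MKL as the BLAS backend
using LinearAlgebra

BLAS.get_config()    # on Julia 1.7+ this should now list the MKL library instead of OpenBLAS
LinearAlgebra.peakflops(4096)   # quick sanity check that large dgemm runs at the expected speed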
My argument is as follows:
- Julia indeed should allow the user to leverage MKL easily. You may say it is done given MKL.jl.
- Julia needs to integrate MKL and other Intel packages in a synergistic way.
This is trickier, but can be done. For instance, I’d like to see Julia taking advantage of all of MKL’s APIs. MKL has a JIT and a Direct Call mode to make it faster on smaller matrices. I would like Julia’s smart JIT to see that there is a loop multiplying the same matrix over and over by different matrices and use MKL’s Compact API. There are also ways to utilize the Batch and Packed APIs (see Squeeze More Performance from Intel MKL). I want Julia to use SVML for SIMD in element-wise math operations. Namely, tighter integration. Julia, with the flexibility built into the language, can be the perfect driver to squeeze everything MKL has to offer. This is far from done. Actually, I’m not sure even a single step has been taken.
There is no drama in MKL BLAS + LAPACK being faster than OpenBLAS. On the contrary, it is amazing that they are even close, given the huge amount of resources the MKL team has compared to OpenBLAS.
Also, Julia’s weakness in broadcasting will be solved. The way I see it, there will be some decorations to tell the compiler there is no aliasing, no dependence, etc., just like those pragmas in C (ivdep, vector, aligned, etc.), plus a decoration to force, disable or auto-select SIMD and multi-threading.
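Some of those decorations already exist for explicit loops; a minimal sketch (the function name is made up):
function axpy_ivdep!(y, a, x)
    # @inbounds drops bounds checks; @simd ivdep asserts there are no
    # loop-carried dependencies, so the compiler is free to vectorize
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end
Threads.@threads plays a similar opt-in role on the multi-threading side; broadcasting would presumably need analogous annotations.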
The question is whether one day something like this:
for ii = 1:1000
    tC[:, :, ii] = tA[:, :, ii] * tB[:, :, ii];
end
will be optimized into an MKL Batch API call, so we’ll have big-matrix performance on small to medium matrices.
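In the meantime, a hand-written version of that loop that at least avoids the temporaries created by slicing might look like this (a sketch; the actual batching would still be MKL’s job):
using LinearAlgebra

function batched_mul!(tC, tA, tB)
    for ii in axes(tA, 3)
        # mul! writes directly into a view of tC, so no slice copies are allocated
        @views mul!(tC[:, :, ii], tA[:, :, ii], tB[:, :, ii])
    end
    return tC
end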
Will we one day see something like this:
mA = qr(mB);
mC = qr(mD);
mE = qr(mF);
mG = qr(mH);
transformed into MKL Compact API calls for small matrices?
I also wish to see:
for ii = 1:100
    tC[:, :, ii] = mA * tB[:, :, ii];
end
call the MKL Compact API.
The above has the potential to bring large-matrix performance to many use cases where the user works with small to medium matrices. It is a joker Julia could hold in its hand.
Not to mention having MKL’s sparse solvers, utilizing some of its VML library, etc.
Using SVML would allow Julia to match Numba’s JIT optimization in Python for element-wise mathematical expressions on arrays.
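To make the element-wise case concrete, this is the kind of kernel meant here (a sketch; the function name is made up):
function expcombine!(out, x, y)
    # Even with @inbounds and @simd, the scalar exp call typically keeps this
    # loop from vectorizing; an SVML-style vectorized math library (or a tool
    # like LoopVectorization.jl) is what closes that gap.
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = exp(x[i]) * y[i]
    end
    return out
end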
So, for me, Julia’s elegant machinery can create the perfect driver around those libraries.
I don’t think everything should be written in Julia. I think Julia can also be the intelligent glue that utilizes low-level libraries to the maximum, in a fashion no other language can without being explicit.
Remark
I have no idea where you got “no optimized allocator in Julia” from. Is there a GitHub issue showing that? It seems Julia’s allocator is as efficient as any other. It is more flexible, as you can do “low level” operations which are usually not exposed.
Am I the only one who is a bit uneasy about using a closed-source library with Julia? (MKL…)
Edit: Wouldn’t it be nice to have a fast BLAS implementation in Julia?
I was just in the process of writing basically that: Intel can change the MKL licensing terms again in the future as they see fit. Also, there are more platforms than x86/amd64 out there: ARM, and eventually RISC-V, which would be really interesting.
Edit: And of course POWER9 !
Plus, by boosting MKL with Julia we are hurting an OSS project, OpenBLAS.
Imo, the preferred solution would be to use pure Julia for at least level 1 and 2 BLAS. It will be a ton of work to get it to OpenBLAS, not to mention MKL, speeds, but besides showing off high-performance Julia, it will give us better handling of custom types and mixed-precision operations.
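A sketch of why the pure-Julia route helps with custom types and mixed precision (a generic level-1 style kernel; the name is made up):
function generic_axpy!(y, a, x)
    # Plain Julia, so it works for any element types supporting * and +:
    # BigFloat, dual numbers, or Float32 vectors scaled by a Float64 coefficient.
    @inbounds @simd for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end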
Semi-jokingly saying this, but in many tasks, language (Julia vs. Python) and more cores (AMD 7nm vs. Intel 14nm) will destroy the advantage of MKL, not to mention the potential risk of proprietary software (say they decide to charge you in a future iteration, or be an asshole like Nvidia).
Unless 90% of computing time is strictly dense matrix computation in tight loops, which is rare in science / data science (even in ML you have diff time, which dominates, I think).