Benchmark MATLAB & Julia for Matrix Operations

Smaller runtime is usually better.


I, of course, know that. Just take a look at this image:

It looks like only a 'linear time' difference, but in reality a line in log scale means it is exponentially worse.

For some reason, Cholesky decomposition, least squares, and matrix inversion are faster in MATLAB.


Two lines with the same slope in log scale just means that one line is a constant factor times the other. Is that what you mean with “exponential time worse”?
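Concretely: if the two curves satisfy log t2(n) = log t1(n) + c for every n, then t2(n) = e^c * t1(n), which is a constant factor, not an exponentially growing gap.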


Actually, I think you're right. I was just thinking of a line in log scale by itself.

I used log scale because if I had used a linear scale, nothing would have been revealed and the lines would have fallen on top of each other. The reason is the large variation in x (matrix size). A linear scale would be much more deceiving, as I discussed with Royi in my pull request.

If you want to see the actual run times refer to the CSV files here:
https://github.com/juliamatlab/Julia-Matlab-Benchmark/tree/master/RunTimeData

I wanted to say that this project has been detached and moved to my organization:
https://github.com/juliamatlab/Julia-Matlab-Benchmark
We plan to make MATLAB-friendly APIs written in native Julia, and then test their performance by comparing them to the MATLAB ones.
So much more is coming and the benchmark is going to be much broader.


I updated results with MATLAB R2019b and Julia 1.2 (OpenBLAS, MKL).

The updated results are in the repository - MATLAB & Julia Matrix Operations Benchmark.

I ran the tests with LinearAlgebra.BLAS.set_num_threads(6); on my 6-physical-core machine (see BLAS threads should default to physical not logical core count?).
It is a desktop computer, so thermal throttling is the limiting factor.
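For anyone reproducing this, the setup is just the following (a minimal sketch; adjust 6 to your own physical core count):

  using LinearAlgebra
  BLAS.set_num_threads(6)   # pin BLAS to the number of physical cores
  Threads.nthreads()        # note: Julia's own task threads are a separate setting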

In my opinion, Julia's weak points are BLAS, LAPACK, and mainly the element-wise operations, where MKL just gives MATLAB a substantial advantage (for my use case).
Julia has the potential to utilize MKL better than MATLAB because of its language features - See Squeeze More Performance from Intel MKL.
So the potential is there…


How could there be such a huge performance deficit in Julia in matrix addition?


That is an embarrassingly parallel problem. All ( :wink: ) one has to be careful about is memory access. Is it even possible that the algorithms in MKL and BLAS could be different, since the matrix can be handled as a vector?

I think the difference is that MATLAB does implicit threading by default.

Hold on: so you think the testing in Julia was without threading, but in Matlab with threading? Isn’t that comparing apples and oranges?
Is that what’s going on, @RoyiAvital?

Yes it is, but it's a fair point on usability: adding two vectors in parallel in Julia is kind of annoying, whereas in MATLAB you just do x+y. That might change if we make broadcasting threaded by default.
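For comparison, the explicit version today looks roughly like this (just a sketch; the function name is made up):

  using Base.Threads

  # Explicitly threaded element-wise addition; `z` must be preallocated.
  function threaded_add!(z, x, y)
      @threads for i in eachindex(z, x, y)
          @inbounds z[i] = x[i] + y[i]
      end
      return z
  end

  x = randn(10^7); y = randn(10^7); z = similar(x)
  threaded_add!(z, x, y)   # vs. simply `x + y` in MATLAB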

So the differences are mostly explained by multithreading by default in MATLAB, bad benchmarking methodology (use @btime, or at least do the computation twice and report the second run), and suboptimal implementation in some benchmarks, e.g. in

  mD = abs.(mA) .+ sin.(mB);
  mE = exp.(-(mA .^ 2));
  mF = (-mB .+ sqrt.((mB .^ 2) .- (4 .* mA .* mC))) ./ (2 .* mA);

four dots are missing (.= instead of =, and a dot on -; you can also use @. to make it simpler). I'm not sure what's going on in Cholesky, though. Could you maybe redo the benchmarks taking the above comments into account, and maybe set threading to 1 thread in both Julia and MATLAB?
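In other words, assuming mD, mE, and mF are preallocated to the right size, the fully fused versions would look something like:

  @. mD = abs(mA) + sin(mB)
  @. mE = exp(-(mA ^ 2))
  @. mF = (-mB + sqrt(mB ^ 2 - 4 * mA * mC)) / (2 * mA)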


Please see the discussion below: how were the tests carried out wrt threading? Would it be possible to make them as fair as possible by equalizing the threading environment?

No, it’s just https://github.com/JuliaLang/julia/issues/28126 . Try it with DiffEqBase.@..:

https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl

The way @RoyiAvital measures the performance (using tic/toc and @elapsed) is not reliable. Refer to my benchmarks for a more accurate result.


Windows - System 1 (plots):
- Julia vs Matlab
- Julia openBLAS vs Julia MKL
- Julia SIMD vs Julia openBLAS
- Everything

Ubuntu - System 2 (plots):
- Julia vs Matlab
- Julia openBLAS vs Julia MKL
- Julia SIMD vs Julia openBLAS
- Everything

Also, the overhead of using 6 BLAS threads is not reasonable for these ordinary operations.


Have you seen the code?
It runs 7 times and the measurement is the median.

It's really not cool to be so categorical without checking.

Regarding the code, please review it and point me to any place where you think I should add it, and I will.
Just out of curiosity, I'd understand that for an existing array it is important to broadcast / fuse the assignment with .=, but I'd think that for a newly generated array it isn't. Not that this is the reason the dots are missing; I was just not aware enough of that. I'm not very experienced with Julia.

I think you’re wrong.
First, those results don't make sense. There is no way 1 thread of OpenBLAS can match Intel MKL on 4 threads. I think both @ChrisRackauckas and I pointed this out when you first published those.
I also think your computer, which is a laptop, might be thermally limited and have a single memory channel, which means it becomes memory bound easily. My CPU has quad-channel memory, hence threading has more potential to improve results.

Second, I run @elapsed 7 times and take the median. If Julia uses the correct CPU timers when @elapsed is used, it should be a good measurement technique. Probably this is what @btime is doing under the hood.
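That is, something along these lines (a rough reconstruction, not the exact benchmark code):

  using Statistics

  # Time `f()` a handful of times and report the median of the wall-clock times.
  function median_elapsed(f; iterations = 7)
      times = [@elapsed f() for _ in 1:iterations]
      return median(times)
  end

  mA = randn(2000, 2000); mB = randn(2000, 2000)
  median_elapsed(() -> mA * mB)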

You can end up timing a not-yet-compiled version of a function if you use @elapsed manually. For example, you can do some heavy calculation in global scope and just wrap the whole block with @elapsed; even if you run that block multiple times, it is still sub-optimal.
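A small illustration of that pitfall (my own example, not from the benchmark code):

  f(x, y) = x .+ y
  x = randn(1000); y = randn(1000)

  t_first  = @elapsed f(x, y)   # includes JIT compilation of `f` for these types
  t_second = @elapsed f(x, y)   # closer to the steady-state run time
  println((t_first, t_second))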

But I think in this matrix addition case you’re good:

julia> [MatrixAdditionRunTime(3000, randn(3000), randn(3000))[2] for _=1:10000] |> mean
5.7079562e-6

julia> @benchmark (scalarA .* mX) .+ (scalarB .* mY) setup=(scalarA=rand(); scalarB=rand(); mX=randn(3000); mY=randn(3000))
BenchmarkTools.Trial:
  memory estimate:  23.52 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.000 μs (0.00% GC)
  median time:      2.390 μs (0.00% GC)
  mean time:        5.569 μs (26.64% GC)
  maximum time:     3.148 ms (99.79% GC)
  --------------
  samples:          10000
  evals/sample:     10

First, it is all about the overhead. For ordinary operations, using multiple threads has too much overhead and doesn't speed things up.

If you go to my repository there are two branches that are for two completely different systems and they both show the same result.

Second, 7 runs is very low and susceptible to noise. You should run the function many more times.

In my code for Julia, BenchmarkTools takes 700 samples and their median is calculated. Additionally, I repeat the process 4 times and average those medians.
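Roughly like this (a sketch of the idea; the operation and sizes are placeholders, not the repository's code):

  using BenchmarkTools, Statistics

  mA = randn(500, 500); mB = randn(500, 500)

  # 700 samples per trial, 4 trials; report the mean of the per-trial medians (in ns).
  med_times = [median(@benchmark $mA * $mB samples=700).time for _ in 1:4]
  mean(med_times)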

For MATLAB, timeit is used, which is the function MATLAB recommends for benchmarking. Internally, it calls the function many times. Again, I repeat the process 4 times and average those medians.

Also, when you measure the time like this, there is a big chance that the operation you are interested in is not yet optimized and compiled!

Again, I think both of your systems are memory bound, since there is no way a single thread of GEMM is as fast as 4 threads.
You need to understand this.
It is no surprise that on my machine with quad-channel memory (though modern CPUs can get better bandwidth than my machine even with a dual-channel configuration) you see scaling with threading.

Also, 7 runs are very stable. When I built this I tried many numbers, and actually even 5 is fine.
You need to understand that @btime doesn't do anything magical (nor does MATLAB's timeit(), which internally just does multiple runs using tic() and toc()). It calls the same CPU timers.
I prefer to do that manually, and as you can see in @jling's answer above, since I'm not doing it in global scope the results are correct and reasonable.

Again, it is you who has to explain how GEMM, the most optimized function in history, which scales beautifully with threads, shows no scaling in your tests. Do you suggest that the people of OpenBLAS created a function 4 times faster than the Intel guys (who spent hundreds of work years on this)? Come on…

You need to find better arguments to back up those results.

The way you describe manual measuring is incorrect.

Internally, timeit() calls the function many times and measures the time taken for the whole process, not every individual call!

Just run open timeit to see the code:

% timeit for Matlab:
for k = 1:num_outer_iterations
    tic(); % t1 = tic(); ... t = toc(t1) NOT used because it is not JIT-ed
    for p = 1:num_inner_iterations
        output = f(); %#ok<NASGU>
    end
    runtimes(k) = toc();
end
t = median(runtimes) / num_inner_iterations;

I have another outer loop which calls timeit 4 times and averages those medians (https://github.com/juliamatlab/Julia-Matlab-Benchmark/blob/6e6e015efcce883f47e2942d8b60ed69315b265f/MatlabBench.m#L45).

timeit itself calls the function multiple times and returns the median; I then calculate the average of the different returned medians.

For example, inside timeit for matrix inversion, it runs 11*100 iterations and the median is returned, and my 4 outer iterations wrap that to calculate the average of each 1100-iteration median. If I made the number of outer iterations 50, it would become something like 55,000 iterations, while I explicitly set Julia's sample count to 700 in the code.

In contrast to your code, which measures every call and then calculates the median!

If you are suspicious about the two different branches in my repository, you can run it and make a pull request. Check my Windows system's specifications:

The performance of my system’s CPU is very good.

It is not about OpenBLAS or Intel. It is just a simple parallel-processing concept: the overhead of parallel processing has to be worth it!
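For example, something like this (an illustrative sketch; the sizes are arbitrary) shows when the threading overhead does or does not pay off:

  using LinearAlgebra, BenchmarkTools

  A = randn(50, 50);     B = randn(50, 50)        # small problem
  C = randn(2000, 2000); D = randn(2000, 2000)    # large problem

  BLAS.set_num_threads(1)
  @btime $A * $B
  @btime $C * $D

  BLAS.set_num_threads(6)
  @btime $A * $B
  @btime $C * $D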

In order to analyze your claims, I did the following.
I created the following Julia file - RunTimeValidation.

It evaluates 4 ways to calculate a function run time:

  1. The method I used in my project: use @elapsed per iteration, save each iteration's result to an array, and use 7 iterations. It yielded a minimum run time of 1.4199402 [Sec]. In my project I use median(), yet here I use minimum() in order to match what is done in BenchmarkTools.jl, which I compare against.
  2. The method described in (1), just with 70 iterations instead of 7, in order to see whether 7 is stable enough. It yielded a run time of 1.417183299 [Sec].
  3. The @belapsed macro from BenchmarkTools.jl. It yielded a run time of 1.418070099 [Sec].
  4. The @btime macro from BenchmarkTools.jl. It yielded a run time of 1.419 [Sec].

Methods (1)-(3) were executed within a function (the timed call).
Method (4) was executed with variable interpolation.
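In rough outline, the comparison looks like this (a simplified sketch; the operation and size are stand-ins for what RunTimeValidation actually times):

  using BenchmarkTools, Statistics

  f(mA) = mA * mA                 # placeholder for the benchmarked operation
  mA = randn(2000, 2000)

  t1 = minimum([@elapsed f(mA) for _ in 1:7])    # method (1): 7 x @elapsed, take minimum
  t2 = minimum([@elapsed f(mA) for _ in 1:70])   # method (2): 70 x @elapsed
  t3 = @belapsed f($mA)                          # method (3): BenchmarkTools @belapsed
  @btime f($mA);                                 # method (4): @btime with interpolation
  println((t1, t2, t3))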

I ran the same program with a smaller matrix size (100) to see whether the results are stable in that case as well (small run time).
They were.

I also ran it for a matrix size of 10; these are the results:

  1. 64.5 [Micro Sec].
  2. 65.1 [Micro Sec].
  3. 64.3 [Micro Sec].
  4. 64.8 [Micro Sec].

Since all methods turned out to be stable within single-digit percentages of each other, I am pretty sure the method I used is valid.

Now, I think we can stop talking about the way time was measured and focus on the results.