Benchmark MATLAB & Julia for Matrix Operations

Smaller runtime is usually better.


I, of course, know that. Just take a look at this image:

It looks like only a 'linear time' difference, but in reality a line in log scale means it is exponentially worse.

For some reason, Cholesky decomposition, least squares, and matrix inversion are faster in MATLAB.


Two lines with the same slope in log scale just means that one line is a constant factor times the other. Is that what you mean with “exponential time worse”?
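Concretely: if the two curves satisfy log t2(n) = log t1(n) + c for every n, then t2(n) = e^c * t1(n), which is a constant factor, not an exponentially growing gap.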


Actually, I think you're right. I was just thinking of a line in log scale by itself.

I used log scale because if I had used a linear scale, nothing would have been revealed and the lines would have fallen on top of each other. The reason is the large variation in x (matrix size). A linear scale would be much more deceiving, as I discussed with Royi in my pull request.

If you want to see the actual run times refer to the CSV files here:
https://github.com/juliamatlab/Julia-Matlab-Benchmark/tree/master/RunTimeData

I wanted to say that this project has been detached and moved to my organization:
https://github.com/juliamatlab/Julia-Matlab-Benchmark
We plan to make MATLAB-friendly APIs written in native Julia, and then test their performance by comparing them to the MATLAB ones.
So much more is coming and the benchmark is going to be much broader.


I updated results with MATLAB R2019b and Julia 1.2 (OpenBLAS, MKL).

The updated results are in the repository - MATLAB & Julia Matrix Operations Benchmark.

I ran the tests with LinearAlgebra.BLAS.set_num_threads(6); on my 6-physical-core machine (see BLAS threads should default to physical not logical core count?).
It is a desktop computer, so thermal throttling is the limiting factor.
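For anyone reproducing this, the setup is just the following (a minimal sketch; adjust 6 to your own physical core count):

  using LinearAlgebra
  BLAS.set_num_threads(6)   # pin BLAS to the number of physical cores
  Threads.nthreads()        # note: Julia's own task threads are a separate setting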

In my opinion, Julia's weak points are BLAS, LAPACK, and mainly the element-wise operations, where MKL just gives MATLAB a substantial advantage (for my use case).
Julia has the potential to utilize MKL better than MATLAB because of its language features - See Squeeze More Performance from Intel MKL.
So the potential is there…


How could there be such a huge performance deficit in Julia in matrix addition?


That is an embarrassingly parallel problem. All ( :wink: ) one has to be careful about is memory access. Is it even possible that the algorithms in MKL and BLAS could be different, since the matrix can be handled as a vector?

I think the difference is that MATLAB does implicit threading by default.

Hold on: so you think the testing in Julia was without threading, but in Matlab with threading? Isn’t that comparing apples and oranges?
Is that what’s going on, @RoyiAvital?

Yes it is, but it's a fair point on usability: adding two vectors in parallel in Julia is kind of annoying, whereas in MATLAB you just do x+y. That might change if we make broadcasting threaded by default.
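For comparison, the explicit version today looks roughly like this (just a sketch; the function name is made up):

  using Base.Threads

  # Explicitly threaded element-wise addition; `z` must be preallocated.
  function threaded_add!(z, x, y)
      @threads for i in eachindex(z, x, y)
          @inbounds z[i] = x[i] + y[i]
      end
      return z
  end

  x = randn(10^7); y = randn(10^7); z = similar(x)
  threaded_add!(z, x, y)   # vs. simply `x + y` in MATLAB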

So the differences are mostly explained by multithreading by default in MATLAB, bad benchmarking methodology (use @btime, or at least do the computation twice and report the second run), and suboptimal implementation in some benchmarks, e.g. in

  mD = abs.(mA) .+ sin.(mB);
  mE = exp.(-(mA .^ 2));
  mF = (-mB .+ sqrt.((mB .^ 2) .- (4 .* mA .* mC))) ./ (2 .* mA);

four dots are missing (.= instead of =, and a dot on -; you can also use @. to make it simpler). I'm not sure what's going on in Cholesky, though. Could you maybe redo the benchmarks taking the above comments into account, and maybe set threading to 1 thread in both Julia and MATLAB?
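In other words, assuming mD, mE, and mF are preallocated to the right size, the fully fused versions would look something like:

  @. mD = abs(mA) + sin(mB)
  @. mE = exp(-(mA ^ 2))
  @. mF = (-mB + sqrt(mB ^ 2 - 4 * mA * mC)) / (2 * mA)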


Please see the discussion below: how were the tests carried out wrt threading? Would it be possible to make them as fair as possible by equalizing the threading environment?

No, it’s just https://github.com/JuliaLang/julia/issues/28126 . Try it with DiffEqBase.@..:

https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl

The way @RoyiAvital measures the performance (using tic/toc and @elapsed) is not reliable. Refer to my benchmarks for a more accurate result.


Windows - System 1 (plots):
- Julia vs Matlab
- Julia openBLAS vs Julia MKL
- Julia SIMD vs Julia openBLAS
- Everything

Ubuntu - System 2 (plots):
- Julia vs Matlab
- Julia openBLAS vs Julia MKL
- Julia SIMD vs Julia openBLAS
- Everything

Also, the overhead of using 6 BLAS threads is not reasonable for these ordinary operations.


Have you seen the code?
It runs 7 times and the measurement is the median.

It's really not cool to be so categorical without checking.

Regarding the code, please review it and point me to any place where you think I should add it, and I will.
Just out of curiosity, I'd understand that for an existing array it is important to broadcast / fuse the assignment with .=, but I'd think that for a newly generated array it isn't. Not that this is the reason the dots are missing; I was just not aware enough of that. I'm not very experienced with Julia.

I think you’re wrong.
First, those results don't make sense. There is no way 1 thread of OpenBLAS can match Intel MKL on 4 threads. I think both @ChrisRackauckas and I pointed this out when you first published those.
I also think your computer, which is a laptop, might be thermally limited and have a single memory channel, which means it becomes memory bound easily. My CPU has quad-channel memory, hence threading has more potential to improve results.

Second, I run @elapsed 7 times and take the median. If Julia uses the correct CPU timers when @elapsed is used, it should be a good measurement technique. Probably this is what @btime is doing under the hood.
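That is, something along these lines (a rough reconstruction, not the exact benchmark code):

  using Statistics

  # Time `f()` a handful of times and report the median of the wall-clock times.
  function median_elapsed(f; iterations = 7)
      times = [@elapsed f() for _ in 1:iterations]
      return median(times)
  end

  mA = randn(2000, 2000); mB = randn(2000, 2000)
  median_elapsed(() -> mA * mB)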

You can end up timing a not-yet-compiled version of a function if you use @elapsed manually. For example, you can do some heavy calculation in global scope and just wrap the whole block with @elapsed; even if you run that block multiple times, it is still sub-optimal.
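A small illustration of that pitfall (my own example, not from the benchmark code):

  f(x, y) = x .+ y
  x = randn(1000); y = randn(1000)

  t_first  = @elapsed f(x, y)   # includes JIT compilation of `f` for these types
  t_second = @elapsed f(x, y)   # closer to the steady-state run time
  println((t_first, t_second))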

But I think in this matrix addition case you’re good:

julia> [MatrixAdditionRunTime(3000, randn(3000), randn(3000))[2] for _=1:10000] |> mean
5.7079562e-6

julia> @benchmark (scalarA .* mX) .+ (scalarB .* mY) setup=(scalarA=rand(); scalarB=rand(); mX=randn(3000); mY=randn(3000))
BenchmarkTools.Trial:
  memory estimate:  23.52 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.000 μs (0.00% GC)
  median time:      2.390 μs (0.00% GC)
  mean time:        5.569 μs (26.64% GC)
  maximum time:     3.148 ms (99.79% GC)
  --------------
  samples:          10000
  evals/sample:     10

First, it is all about the overhead. For ordinary operations, using multiple threads has too much overhead and doesn't speed things up.

If you go to my repository there are two branches that are for two completely different systems and they both show the same result.

Second, 7 runs is very low and susceptible to noise. You should run the function many more times.

In my code for Julia, BenchmarkTools takes 700 samples and their median is calculated. Additionally, I repeat the process 4 times and average those medians.
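Roughly like this (a sketch of the idea; the operation and sizes are placeholders, not the repository's code):

  using BenchmarkTools, Statistics

  mA = randn(500, 500); mB = randn(500, 500)

  # 700 samples per trial, 4 trials; report the mean of the per-trial medians (in ns).
  med_times = [median(@benchmark $mA * $mB samples=700).time for _ in 1:4]
  mean(med_times)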

For MATLAB, timeit is used, which is the function MATLAB recommends for benchmarking. Internally, it calls the function many times. Again, I repeat the process 4 times and average those medians.

Also, when you measure the time like this, there is a big chance that the operation you are interested in is not yet optimized and compiled!

Again, I think both of your systems are memory bound, since there is no way a single thread of GEMM is as fast as 4 threads.
You need to understand this.
It is no surprise that on my machine with quad-channel memory (though modern CPUs can get better bandwidth than my machine even with a dual-channel configuration) you see scaling with threading.

Also, 7 runs are very stable. When I built this I tried many numbers, and actually even 5 is fine.
You need to understand that @btime doesn't do anything magical (nor does MATLAB's timeit(), which internally just does multiple runs using tic() and toc()). It calls the same CPU timers.
I prefer to do that manually, and as you can see in @jling's answer above, since I'm not doing it in global scope the results are correct and reasonable.

Again, it is you who has to explain how GEMM, the most optimized function in history, which scales beautifully with threads, shows no scaling in your tests. Do you suggest that the people of OpenBLAS created a function 4 times faster than the Intel guys (who spent hundreds of work years on this)? Come on…

You need to find better arguments to back up those results.

The way you describe manual measuring is incorrect.

Internally, timeit() calls the function many times and measures the time taken for the whole process, not every individual call!

Just run open timeit to see the code:

% timeit for Matlab:
for k = 1:num_outer_iterations
    tic(); % t1 = tic(); ... t = toc(t1) NOT used because it is not JIT-ed
    for p = 1:num_inner_iterations
        output = f(); %#ok<NASGU>
    end
    runtimes(k) = toc();
end
t = median(runtimes) / num_inner_iterations;

I have another outer loop which calls timeit 4 times and averages those medians (https://github.com/juliamatlab/Julia-Matlab-Benchmark/blob/6e6e015efcce883f47e2942d8b60ed69315b265f/MatlabBench.m#L45).

timeit itself calls the function multiple times and returns the median; I then calculate the average of the different returned medians.

For example, inside timeit for matrix inversion, it runs 11*100 iterations and the median is returned, and my 4 outer iterations wrap that to calculate the average of each 1100-iteration median. If I made the number of outer iterations 50, it would become something like 55,000 iterations, while I explicitly set Julia's sample count to 700 in the code.

In contrast to your code, which measures every call and then calculates the median!

If you are suspicious about the two different branches in my repository, you can run it and make a pull request. Check my Windows system's specifications:

The performance of my system’s CPU is very good.

It is not about OpenBLAS or Intel. It is just a simple parallel-processing concept: the overhead of parallel processing has to be worth it!
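For example, something like this (an illustrative sketch; the sizes are arbitrary) shows when the threading overhead does or does not pay off:

  using LinearAlgebra, BenchmarkTools

  A = randn(50, 50);     B = randn(50, 50)        # small problem
  C = randn(2000, 2000); D = randn(2000, 2000)    # large problem

  BLAS.set_num_threads(1)
  @btime $A * $B
  @btime $C * $D

  BLAS.set_num_threads(6)
  @btime $A * $B
  @btime $C * $D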

In order to analyze your claims, I did the following.
I created the following Julia file - RunTimeValidation.

It evaluates 4 ways to calculate a function run time:

  1. The method I used in my project: use @elapsed per iteration, save each iteration's result to an array, and use 7 iterations. It yielded a minimum run time of 1.4199402 [Sec]. In my project I use median(), yet here I use minimum() in order to match what is done in BenchmarkTools.jl, which I compare against.
  2. The method described in (1), just with 70 iterations instead of 7, in order to see whether 7 is stable enough. It yielded a run time of 1.417183299 [Sec].
  3. The @belapsed macro from BenchmarkTools.jl. It yielded a run time of 1.418070099 [Sec].
  4. The @btime macro from BenchmarkTools.jl. It yielded a run time of 1.419 [Sec].

Methods (1)-(3) were executed within a function (the timed call).
Method (4) was executed with variable interpolation.
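In rough outline, the comparison looks like this (a simplified sketch; the operation and size are stand-ins for what RunTimeValidation actually times):

  using BenchmarkTools, Statistics

  f(mA) = mA * mA                 # placeholder for the benchmarked operation
  mA = randn(2000, 2000)

  t1 = minimum([@elapsed f(mA) for _ in 1:7])    # method (1): 7 x @elapsed, take minimum
  t2 = minimum([@elapsed f(mA) for _ in 1:70])   # method (2): 70 x @elapsed
  t3 = @belapsed f($mA)                          # method (3): BenchmarkTools @belapsed
  @btime f($mA);                                 # method (4): @btime with interpolation
  println((t1, t2, t3))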

I ran the same program with a smaller matrix size (100) to see whether the results are stable in that case as well (small run time).
They were.

I also ran it for a matrix size of 10; these are the results:

  1. 64.5 [Micro Sec].
  2. 65.1 [Micro Sec].
  3. 64.3 [Micro Sec].
  4. 64.8 [Micro Sec].

Since all methods turned out to be stable within single-digit percentages of each other, I am pretty sure the method I used is valid.

Now, I think we can stop talking about the way time was measured and focus on the results.