Benchmark MATLAB & Julia for Matrix Operations

This is a huge thread with 80 replies, and it makes for some interesting reading. Just to make sure I am getting the big picture here, what I currently understand is:

  1. In a big benchmark of matrix arithmetic, MATLAB is faster than Julia. Julia 0.6 changes the performance somewhat but does not change that overall conclusion. This does not reflect an overall speed advantage for MATLAB, but it may be important for people whose code is heavily based on linear algebra.
  2. Both MATLAB and Julia rely on external libraries for linear algebra. MATLAB uses MKL, which may be faster, but the numbers don't seem to support that, and MKL is proprietary and so not a good fit for an open-source language.
  3. The other valid candidate to explain the difference is multithreading, but the current benchmarks cannot tell us that; it would require a run where threading is turned off in MATLAB and another where experimental threading is turned on in Julia (a sketch of such a run follows at the end of this post). Such tests are the only way to advance the discussion past an exchange of opinions.
  4. If multithreading is the culprit, this will be solved in the (near?) future because multithreading is an active area of development.
Is this correct? Thanks!
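For concreteness, here is a minimal sketch of the controlled run proposed in point 3. On the MATLAB side, built-in threading can be disabled with maxNumCompThreads(1) or the -singleCompThread startup flag; on the Julia side, something like the following (assuming a Julia version with the experimental threading support, enabled via JULIA_NUM_THREADS):

```julia
using LinearAlgebra  # on Julia 0.5/0.6, BLAS is reachable directly from Base

BLAS.set_num_threads(1)   # pin the BLAS library to one core for the "no threading" run
println("Julia threads: ", Threads.nthreads())  # set via JULIA_NUM_THREADS before startup

A = rand(2000, 2000)
B = rand(2000, 2000)
@time A * B;  # times a single-threaded BLAS matrix multiply
```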

@mkborregaard yes I think that is a fair summary.

@RoyiAvital are you on Windows by any chance? As far as I know, some optimizations (for SIMD etc.) are turned off by default on Windows for compatibility with a larger range of machines. To get full performance you need to rebuild the system image. But someone more knowledgeable about these details, such as @tkelman, can probably say with more certainty.
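For reference, the rebuild itself was roughly this on the 0.5/0.6 series (a sketch following the devdocs recipe of that era; it requires a working compiler toolchain, which is often the sticking point on Windows):

```julia
# Rebuild the system image for the native CPU target so the JIT can use all
# instructions the local machine supports (Julia 0.5/0.6-era recipe).
include(joinpath(JULIA_HOME, Base.DATAROOTDIR, "julia", "build_sysimg.jl"))
build_sysimg(force = true)  # cpu_target defaults to "native"
```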


@mkborregaard,

I think I'd recap the results as follows:

  1. Julia needs loop fusion (if I understand correctly, this is coming in 0.6).
  2. I'm not sure about broadcasting, but it is not up to what you get in MATLAB (MATLAB applies both SIMD and multithreading).
  3. Julia should add multithreading in general, and to the loop-fusion and broadcasting cases in particular. I'd be happy to see it done in Julia style, namely user-controlled by a macro, because sometimes an operation is memory-bound and multithreading only hurts performance. A macro with @Auto (maybe even implicit) and @OFF / @ON variants would be great (a sketch of the explicit control that already exists follows this list).
  4. Some linear algebra algorithms internal to Julia are slower (see sqrtm(), expm()). In both MATLAB and Julia these are mostly implemented in the language itself.
  5. Some BLAS / LAPACK-backed operations are slower in Julia. I'd say this is due to OpenBLAS being slower than Intel MKL.
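To make point 3 concrete, this is what the explicit control that already exists looks like (my sketch, using the experimental Threads.@threads macro rather than the proposed @Auto / @ON / @OFF):

```julia
# Explicit, user-controlled threading with the existing experimental macro;
# enabled by launching with JULIA_NUM_THREADS set, e.g. JULIA_NUM_THREADS=4 julia.
function threaded_axpy!(y, a, x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

x = rand(10^7); y = rand(10^7);
@time threaded_axpy!(y, 2.0, x)
```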

My personal conclusions so far:

  1. Julia is taking its first infant steps, and they are impressive ones. There are some rough edges in the language (it is less predictable than MATLAB when working with numerical arrays, in my opinion), yet the potential is clearly visible.
  2. I really like the spirit of letting the user control everything (for example the @inbounds macro). This helps take things to the extreme when needed.
  3. JuliaPro (the product, not the language) must improve its BLAS / LAPACK engine, probably by working with the OpenBLAS team.
  4. Julia's approach with the . (dot) operator is excellent. I like it. Once it and broadcasting are fully optimized, it will be a killer feature (see the sketch after this list).
  5. Multithreading: if Julia keeps giving the user control over multithreading (at least offering full control when the user wants it) and applies it efficiently, it can become more efficient than MATLAB.
  6. Consistency: Julia's performance is more consistent than MATLAB's. I like it!
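As an illustration of points 2 and 4, a small sketch of the fused dot syntax next to an explicit @inbounds loop (a toy example, not from the benchmark suite):

```julia
x = rand(10^6)
y = similar(x)

# Dot-broadcast fusion: the three elementwise operations below fuse into a
# single loop with no temporary arrays (fully realized in Julia 0.6).
y .= 2 .* x .^ 2 .+ sin.(x)

# The equivalent manual loop, with bounds checks removed at the user's
# explicit request via @inbounds.
function kernel!(y, x)
    @inbounds for i in eachindex(x, y)
        y[i] = 2 * x[i]^2 + sin(x[i])
    end
    return y
end
kernel!(y, x)
```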

I think this is not shown in benchmarks anywhere. The imgur link posted by @tkelman doesn't show this, and neither do the OpenBLAS benchmarks. I also ran the Linpack benchmark on our cluster; the MKL and OpenBLAS results are very close, as in all other tests. I think much of the MKL myth predates OpenBLAS, from when there was a tangible difference between ATLAS and MKL.


Since it was hinted that I messed up my benchmarking, I reran the benchmarks at work and made sure that they used the correct versions. The results were the same; I can't post them now since I am at home.

Feel free to run the benchmarks yourself.
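If you do, a minimal harness with BenchmarkTools.jl could look like this (matrix size and operations chosen purely for illustration):

```julia
using BenchmarkTools, LinearAlgebra

A = rand(1000, 1000)
b = rand(1000)

# @btime reports the minimum over many samples, which is much more robust
# than a single @time call; $-interpolation avoids timing global-variable access.
@btime $A * $A;   # BLAS gemm
@btime $A \ $b;   # LAPACK LU factorization and solve
```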


If youā€™re using binaries of Julia on any platform, we set the C compiler when building Juliaā€™s dependencies and the JIT when building the system image to avoid using instructions that arenā€™t available on older systems.

The only thing that used to be Windows specific was, in 0.4 and earlier, not using the precompiled system image due to debug info issues which led to longer startup time - this was fixed for 0.5.

Thanks. I think I knew that, but my confusion arose because on macOS or Linux I always build Julia myself, whereas on Windows I don't (!), so I obviously have a less optimized setup there. I have also never successfully rebuilt the system image on Windows.

@kristoffer.carlsson,

Are you sure about that? I can't believe they have the same performance on every test unless something else is limiting them both.

I'd expect to see some variability between them; there is a larger difference between a few runs of Julia itself than what you showed.

But if you are sure your results are valid, then:

  1. Julia is limiting OpenBLAS's performance for some reason (or is using it inefficiently).
  2. Julia can and should match MATLAB's performance on any test that uses BLAS (yet it doesn't).

Just to give another view of OpenBLAS vs. Intel MKL, I will time a few of the tests on Octave as well (Octave uses OpenBLAS).

Back at work now, and you are right. Both versions on 0.5 actually ran with MKL, which is obvious in hindsight since the timings on 0.6 were different even though the same underlying library should have been used.

An updated comparison is posted on Imgur.


@kristoffer.carlsson, now your results are in line with mine.
I wrote about the 0.5 vs. 0.6 anomaly on GitHub when you first published your results.

So I think my recap is valid.

@kristoffer.carlsson, Could you download my latest files and check?

Thank you.

I just want to add a point about the argument over whether one should test with or without multithreading. For me it is very often the case that my programs are trivially parallelizable, so I use only one core per job to avoid inefficient parallelization; in that case, single-core speed is what matters. On the other hand, while testing and developing a program I want to use many cores to finish a single run as fast as possible. So I definitely care about both cases.
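In Julia, switching between those two modes is a one-liner (a sketch; Sys.CPU_THREADS is the 1.x name, it was Sys.CPU_CORES on 0.x):

```julia
using LinearAlgebra

# Production: many independent single-core jobs, so keep BLAS on one thread
# and let the scheduler pack one job per core.
BLAS.set_num_threads(1)

# Development: a single interactive run, so let BLAS use the whole machine.
BLAS.set_num_threads(Sys.CPU_THREADS)
```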


To add another datapoint, here are the results on a 32-core node on our cluster, with and without threading and comparing OpenBLAS and MKL:
https://github.com/barche/julia-blas-benchmarks/blob/master/BenchmarkResults.ipynb

I also reran the HPL linpack test, here are the results:

  • Standard HPL OpenBLAS, 32 MPI processes on a single node: 757 Gflops
  • Standard HPL MKL, 32 MPI processes on a single node: 788 Gflops
  • Intel HPL MKL, 32 MPI processes on a single node: 814 Gflops
  • Intel HPL MKL, 2 MPI processes with 16 threads each on a single node: 963 Gflops

From both tests it seems clear to me that MKL wins when threading enters into the equation, but single-core performance is much closer, with the possible exception of the Cholesky and Eigen decompositions.


Thanks! Really well-done benchmarks, @barche!

A couple of things to note:

  1. We did some comparisons of Julia vs. MATLAB a while back and gave a talk at JuliaCon a couple of years ago. You can find the talk online, and while it is probably out of date, we clearly observed that MATLAB's automatic parallelism actually slowed some codes down rather than improving performance. Their heuristics for when to use threads have probably improved since. But I don't believe that automatic parallelism (where how, when, and where threads are used is a complete mystery and not in your control) can match the efficiency of parallel-aware code. For simple stuff, though, it looks pretty cool.

  2. Anyway, there is also https://github.com/IntelLabs/ParallelAccelerator.jl if you want to try automatic parallelism in Julia.

  3. If you use threads in Julia (by setting JULIA_NUM_THREADS) and call into a BLAS library, you will likely end up oversubscribing cores and completely destroying performance. This should be obvious when it happens: the slowdown is pretty massive. The answer is to also set OMP_NUM_THREADS=1, thereby preventing the library from starting its own threads (see the sketch after this list). Of course, this only makes sense if you're using your Julia threads effectively. True and effective nested parallelism is coming to Julia "soon", but it won't help with nesting parallelism between your Julia code and one of these BLAS libraries for a good while. It will help if a BLAS library is written in Julia, though. :slight_smile:
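A hedged sketch of the workaround from point 3 (modern 1.x syntax; the matrix size is arbitrary):

```julia
# Launch with Julia threads on and the BLAS thread pool off, e.g.:
#   JULIA_NUM_THREADS=8 OMP_NUM_THREADS=1 julia script.jl
# The same restriction can also be applied from inside Julia:
using LinearAlgebra
BLAS.set_num_threads(1)

# Julia-level threads now provide the outer parallelism, and each BLAS call
# stays single-threaded, so the cores are not oversubscribed.
results = Vector{Float64}(undef, Threads.nthreads())
Threads.@threads for t in 1:Threads.nthreads()
    A = rand(500, 500)
    results[t] = sum(A * A)  # single-threaded gemm inside each Julia thread
end
```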

HTH.


@kpamnany, I totally agree with you.
In the Julia spirit, parallelism should be controlled by the user:
perhaps a heuristic-driven automatic default, but certainly with the option for the user to turn it off or on (and to set its parameters).

Yet I'd add that from R2016a onward, MATLAB has been improving its JIT significantly with each release.

Please provide a link.

Search for JuliaCon 2015 on YouTube and you'll find a talk about multithreading Julia, somewhere down the playlist.


I updated this repository, and we get more interesting results! See the plots in the repository. The updates:

  • The Julia version is updated to v1.1.1.
  • A Julia + MKL.jl benchmark is added, which improves performance a lot (a minimal usage sketch appears after the example plot below).
  • Better and more accurate benchmarking tools in both Julia and MATLAB.
  • Many other improvements and updates.

For example: [benchmark plot from the repository]
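For reference, on current Julia (1.7+ with libblastrampoline) the MKL.jl swap is a one-liner; on the 1.1 series used here, MKL.jl instead worked by rebuilding against MKL, so take this as a sketch of the modern usage:

```julia
using MKL            # load before running workloads; replaces OpenBLAS as the backend
using LinearAlgebra

BLAS.get_config()    # should now list libmkl_rt rather than libopenblas

A = rand(2000, 2000)
@time A * A;
```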

Edit:
I wanted to say that this project is detached and has moved to my organization.
We plan to make MATLAB-friendly APIs written in native Julia, and then test their performance against the MATLAB ones (a hypothetical illustration follows below).
So much more is coming, and the benchmark is going to be much broader.
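As a hypothetical illustration of the idea (my sketch, not code from the repository): a MATLAB-style meshgrid written in native Julia, which could then be timed against MATLAB's built-in:

```julia
# Hypothetical MATLAB-friendly API in native Julia: meshgrid, which MATLAB
# provides built-in and base Julia deliberately omits.
function meshgrid(x::AbstractVector, y::AbstractVector)
    X = repeat(reshape(collect(x), 1, :), length(y), 1)
    Y = repeat(collect(y), 1, length(x))
    return X, Y
end

X, Y = meshgrid(0:0.1:1, 0:0.1:2)  # X and Y are both 21×11 matrices
```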


Log scale is very deceiving. So, is Julia exponentially worse?