Numpy vs Julia matrix multiplication

I have performed two very simple tests in numpy and julia, and have obtained what I believe are significant differences in performance. This is on Windows 11, CPU: 28 × Intel(R) Core™ i7-14700.

Numpy (python 3.12.7):

In [1]: import numpy as np
In [2]: A = np.random.randn(4000,4000);
In [3]: %timeit A @ A
226 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Julia (1.11.4):

using BenchmarkTools
A = randn(4000,4000);
@benchmark $A * $A
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  377.514 ms … 430.861 ms  ┊ GC (min … max): 0.13% … 4.99%
 Time  (median):     410.958 ms               ┊ GC (median):    2.11%
 Time  (mean ± σ):   408.310 ms ±  15.906 ms  ┊ GC (mean ± σ):  2.35% ± 1.39%

I also tried wrapping the multiplication in a trivial mymul(A) = A*A and using @benchmark on it, but the results are no different. I believe both languages are calling out to BLAS, but I'm surprised that numpy is almost 2x faster.
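Before comparing further, it is worth checking which BLAS each side actually links. A minimal sketch on the numpy side (np.show_config is numpy's standard build-inspection call; the exact output format varies between numpy versions). On the Julia side, the equivalents are LinearAlgebra.BLAS.get_config() and BLAS.get_num_threads().

```python
# Sketch: inspect the BLAS/LAPACK libraries numpy was built against.
# Anaconda-style numpy often reports MKL here, while pip wheels
# usually report OpenBLAS; that alone can explain a ~2x gap.
import numpy as np

np.show_config()  # prints BLAS/LAPACK build info (OpenBLAS, MKL, ...)
```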

You might want to try using MKL, which will switch you to Intel's BLAS; it is a bit faster than the OpenBLAS we ship.


This could also be hardware dependent. On my Linux machine (with the regular LinearAlgebra; MKL is a bit slower on my CPUs), the same computation is twice as fast in Julia, as long as the number of threads used by BLAS is the same. Anyway, since both Julia and numpy are calling precompiled C/Fortran libraries here, there is almost no point in making these benchmarks. It's up to you to test the different possibilities (MKL, AOCL, OpenBLAS, AppleAccelerate…) and pick whichever is best on your hardware.
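To make the thread counts match, the usual trick on the Python side is to set the relevant environment variables before numpy is imported. A sketch (which variable is honored depends on the BLAS build, so setting all three common ones is the safe option; the count 8 is an arbitrary example, to be matched with BLAS.set_num_threads in Julia):

```python
# Sketch: pin the BLAS thread count *before* importing numpy, so a
# numpy-vs-Julia comparison runs with the same parallelism on both sides.
import os

N_THREADS = "8"  # match BLAS.set_num_threads(8) on the Julia side
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = N_THREADS

import numpy as np

A = np.random.randn(1000, 1000)
B = A @ A  # the underlying dgemm now uses (at most) 8 threads
```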


This has nothing to do with the language and everything to do with (a) which BLAS library you are linking and (b) how many threads it uses by default.

(Background: nearly all numerical libraries, from Julia to Python to R to Matlab, call an external library implementing the “BLAS” interface to perform matrix multiplications, at least for single and double precision. So benchmarking the performance of a 4000×4000 matrix multiplication is completely uninteresting as a cross-language comparison: it's purely a question of the BLAS configuration. Julia links the open-source OpenBLAS library by default; the Intel MKL license is more restrictive, so it is offered as an opt-in via MKL.jl. Some Python distros link MKL by default. And different software systems have different defaults for the number of BLAS threads they start with on a multi-core machine.)
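A small sketch of the point above: the host language only hands the arrays off to BLAS, which does the actual work. A naive triple-loop matmul produces the same numbers, just orders of magnitude slower, which is why matmul benchmarks compare BLAS builds rather than languages (naive_matmul below is an illustrative helper, not part of any library):

```python
# Sketch: the same multiplication, once via the linked BLAS and once
# via explicit loops. The results agree; only the speed differs.
import numpy as np

def naive_matmul(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40))
C_blas = A @ A              # dispatched to the linked BLAS (dgemm)
C_loops = naive_matmul(A, A)
print(np.allclose(C_blas, C_loops))  # True
```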


The difference between OpenBLAS and MKL can be pretty stark at times.

Story time:
A few years ago, my research group was mostly MATLAB-based, and we observed worse performance on AMD CPUs compared to equivalent Intel CPUs. It turns out that, by default, Intel MKL would not allow all features to be used on non-Intel platforms, even when they were fully functional.

This behaviour could be turned off by changing some environment variable / registry key, which would then allow MATLAB to use all MKL features, and voilà, equivalent performance. This was officially fixed by MathWorks in later versions by employing the same trick.

Pretty sure Intel lost a lawsuit due to this kind of behaviour.

EDIT: Reddit post discussing the issue 6 years ago
