Numpy vs Julia matrix multiplication

I have performed two very simple tests in numpy and julia, and have obtained what I believe are significant differences in performance. This is on Windows 11, CPU: 28 × Intel(R) Core™ i7-14700.

Numpy (python 3.12.7):

In [1]: import numpy as np
In [2]: A = np.random.randn(4000,4000);
In [3]: %timeit A @ A
226 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Julia (1.11.4):

using BenchmarkTools
A = randn(4000,4000);
@benchmark $A * $A
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  377.514 ms … 430.861 ms  ┊ GC (min … max): 0.13% … 4.99%
 Time  (median):     410.958 ms               ┊ GC (median):    2.11%
 Time  (mean ± σ):   408.310 ms ±  15.906 ms  ┊ GC (mean ± σ):  2.35% ± 1.39%

I also tried wrapping the multiplication in a trivial mymul(A) = A*A and using @benchmark on it, but the results are no different. I believe both languages are calling out to BLAS, but I'm surprised that numpy is almost 2x faster.
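Before comparing further, it is worth checking which BLAS each side actually links. A minimal sketch on the numpy side (np.show_config is numpy's standard build-inspection call; the exact output format varies between numpy versions). On the Julia side, the equivalents are LinearAlgebra.BLAS.get_config() and BLAS.get_num_threads().

```python
# Sketch: inspect the BLAS/LAPACK libraries numpy was built against.
# Anaconda-style numpy often reports MKL here, while pip wheels
# usually report OpenBLAS; that alone can explain a ~2x gap.
import numpy as np

np.show_config()  # prints BLAS/LAPACK build info (OpenBLAS, MKL, ...)
```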

You might want to try using MKL, which will switch you to Intel's BLAS; it is a bit faster than the OpenBLAS we ship.


This could also be hardware dependent. On my Linux machine (with the regular LinearAlgebra; MKL is a bit slower on my CPUs), the same computation is twice as fast in Julia, as long as the number of threads used by BLAS is the same. Anyway, since both Julia and numpy are calling precompiled C/Fortran libraries here, there is almost no point in making these benchmarks. It's up to you to test the different possibilities (MKL, AOCL, OpenBLAS, AppleAccelerate…) and pick whichever is best on your hardware.
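To make the thread counts match, the usual trick on the Python side is to set the relevant environment variables before numpy is imported. A sketch (which variable is honored depends on the BLAS build, so setting all three common ones is the safe option; the count 8 is an arbitrary example, to be matched with BLAS.set_num_threads in Julia):

```python
# Sketch: pin the BLAS thread count *before* importing numpy, so a
# numpy-vs-Julia comparison runs with the same parallelism on both sides.
import os

N_THREADS = "8"  # match BLAS.set_num_threads(8) on the Julia side
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = N_THREADS

import numpy as np

A = np.random.randn(1000, 1000)
B = A @ A  # the underlying dgemm now uses (at most) 8 threads
```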


This has nothing to do with the language and everything to do with (a) which BLAS library you are linking and (b) how many threads it uses by default.

(Background: nearly all numerical libraries, from Julia to Python to R to Matlab, call an external library implementing the “BLAS” interface to perform matrix multiplications, at least for single and double precision. So benchmarking the performance of a 4000×4000 matrix multiplication is completely uninteresting as a cross-language comparison: it's purely a question of the BLAS configuration. Julia links the open-source OpenBLAS library by default; the Intel MKL license is more restrictive, so it is offered as an opt-in via MKL.jl. Some Python distros link MKL by default. And different software systems have different defaults for the number of BLAS threads they start with on a multi-core machine.)
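A small sketch of the point above: the host language only hands the arrays off to BLAS, which does the actual work. A naive triple-loop matmul produces the same numbers, just orders of magnitude slower, which is why matmul benchmarks compare BLAS builds rather than languages (naive_matmul below is an illustrative helper, not part of any library):

```python
# Sketch: the same multiplication, once via the linked BLAS and once
# via explicit loops. The results agree; only the speed differs.
import numpy as np

def naive_matmul(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40))
C_blas = A @ A              # dispatched to the linked BLAS (dgemm)
C_loops = naive_matmul(A, A)
print(np.allclose(C_blas, C_loops))  # True
```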


The difference between OpenBLAS and MKL can be pretty stark at times.

Story time:
A few years ago, my research group was mostly MATLAB-based, and we observed worse performance on AMD CPUs compared to equivalent Intel CPUs. It turns out that, by default, Intel MKL would not allow all features to be used on non-Intel platforms, even when they were fully functional.

This behaviour could be turned off by changing some environment variable / registry key, which would then allow MATLAB to use all MKL features, and voilà, equivalent performance. This was officially fixed by MathWorks in later versions by employing the same trick.

Pretty sure Intel lost a lawsuit due to this kind of behaviour.

EDIT: Reddit post discussing the issue 6 years ago
