Speed comparison matrix multiplication in Julia

jling · August 19, 2021, 9:18am

You know Jax is not single-threaded, right?

In [3]: for x  in range(200):
   ...:     jnp.matmul(a,b)
   ...:

DNF · August 19, 2021, 9:25am

Relative to what? Are you running BLAS single-threaded? You can check with

BLAS.get_num_threads()

and set it to one thread with

BLAS.set_num_threads(1)

As for orientation, there is a huge difference between row and column major here:

jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  170.893 ms (116 allocations: 30.52 MiB)

vs

jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  24.013 ms (116 allocations: 30.52 MiB)

sivakon · August 19, 2021, 9:27am

import jax.numpy as jnp
from jax.config import config
config.update("jax_enable_x64", False)

import os
os.environ["XLA_FLAGS"] = ("--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=0")

a = jnp.arange(2 * 2000 * 400).reshape((2, 2000, 400)) + 1
b = jnp.arange(2 * 400 * 2000).reshape((2, 400, 2000)) + 1

print(a.dtype) # int32

%%timeit
c = jnp.matmul(a,b) # 291ms

jling · August 19, 2021, 9:32am

In [1]: import jax.numpy as jnp
   ...: from jax.config import config
   ...: config.update("jax_enable_x64", False)
   ...: 
   ...: import os
   ...: os.environ["XLA_FLAGS"] = ("--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=0")
   ...: 
   ...: a = jnp.arange(2 * 2000 * 400).reshape((2, 2000, 400)) + 1
   ...: b = jnp.arange(2 * 400 * 2000).reshape((2, 400, 2000)) + 1

In [3]: %timeit jnp.matmul(a,b)
1 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

julia> function mm(A,B)
           C = Array{Int32, 3}(undef, size(A,1),size(A,2), size(B,3))
           for i in axes(A,1)
               @views C[i,:,:] = A[i,:,:] * B[i,:,:]
           end
           C
       end
mm (generic function with 1 method)

julia> a = reshape(Int32.(1:2*2000*400), 2,2000,400);

julia> b = reshape(Int32.(1:2*2000*400), 2,400,2000);

julia> using BenchmarkTools

julia> import LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> @btime mm($a,$b);
  827.518 ms (18 allocations: 61.04 MiB)

Julia already faster

DNF · August 19, 2021, 9:32am

If you don’t correct for major dimension orientation, these comparisons won’t be very meaningful.

sivakon · August 19, 2021, 9:32am

➜ julia -q
julia> using BenchmarkTools, Tullio, LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

julia> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

julia> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  2.304 s (2 allocations: 30.52 MiB)

jling · August 19, 2021, 9:33am

You want the dimensions ordered as 2000, 400, 2. Also, Tullio doesn’t use BLAS.
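To make the index reordering concrete, here is a minimal sketch in plain Python on tiny nested lists. The helper names (`batched_first`, `batched_last`, `move_batch_last`) and the sizes are made up for illustration and are not from the thread. It checks that the batch-last expression c[j, k, i] computes the same products as the batch-first c[i, j, k], provided the inputs are permuted consistently.

```python
# Tiny stand-ins for the thread's arrays; sizes are shrunk and hypothetical.
B, M, K, N = 2, 3, 4, 5  # batch, rows, inner dimension, columns

def batched_first(a, b):
    """c[i][j][k] = sum_q a[i][j][q] * b[i][q][k] (batch index first)."""
    return [[[sum(a[i][j][q] * b[i][q][k] for q in range(K))
              for k in range(N)] for j in range(M)] for i in range(B)]

def batched_last(a, b):
    """c[j][k][i] = sum_q a[j][q][i] * b[q][k][i] (batch index last)."""
    return [[[sum(a[j][q][i] * b[q][k][i] for q in range(K))
              for i in range(B)] for k in range(N)] for j in range(M)]

def move_batch_last(x):
    """Permute x[i][r][c] into y[r][c][i]."""
    return [[[x[i][r][c] for i in range(len(x))]
             for c in range(len(x[0][0]))] for r in range(len(x[0]))]

# Arbitrary integer test data with batch index first.
a = [[[i * 100 + j * 10 + q for q in range(K)] for j in range(M)] for i in range(B)]
b = [[[i * 7 + q * 3 + k for k in range(N)] for q in range(K)] for i in range(B)]

c_first = batched_first(a, b)
c_last = batched_last(move_batch_last(a), move_batch_last(b))

# The two results agree once the batch axis is moved to the end.
same = move_batch_last(c_first) == c_last
print(same)
```

Note that the benchmarks in this thread reshape the same range into a different shape rather than permuting the axes, so they time the same amount of work on different numbers.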

sivakon · August 19, 2021, 9:35am

➜ jupyter console
Jupyter console 6.4.0

Python 3.7.10 (default, Apr 27 2021, 08:49:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import jax.numpy as jnp
   ...: from jax.config import config
   ...: config.update("jax_enable_x64", False)
   ...:
   ...: import os
   ...: os.environ["XLA_FLAGS"] = ("--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=0")
   ...:
   ...: a = jnp.arange(2 * 2000 * 400).reshape((2, 2000, 400)) + 1
   ...: b = jnp.arange(2 * 400 * 2000).reshape((2, 400, 2000)) + 1
   ...:
   ...: print(a.dtype) # int32
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
int32

In [2]: %%timeit
   ...: c = jnp.matmul(a,b)
   ...:
   ...:
395 ms ± 8.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am not sure why your Jax is slower

DNF · August 19, 2021, 9:37am

This is getting confusing.

Changing the number of BLAS threads should change the performance of the mm function. If you are comparing that to Tullio, note that the BLAS setting does not affect Tullio.

In order to compare single-threaded performance, you should:

1. Start Julia with a single thread.
2. Set BLAS threads to 1.
3. Set Jax to use one thread.
4. Then compare all three.

For multi-threading, do the same, but set up each to have the same number of threads as you have physical cores.

DNF · August 19, 2021, 9:38am

And, very importantly: change orientation of your arrays for Julia, since Julia is column major.
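A minimal sketch of why the orientation matters, in plain Python. The helper `element_strides` is hypothetical and only models a dense array: it computes how many elements apart consecutive indices sit in memory for row-major ("C", as in NumPy and Jax) and column-major ("F", as in Julia) storage.

```python
def element_strides(shape, order):
    """Element strides of a dense array; order is "C" (row-major) or "F" (column-major)."""
    strides = [0] * len(shape)
    step = 1
    # Row-major: the last axis varies fastest. Column-major: the first axis does.
    axes = range(len(shape) - 1, -1, -1) if order == "C" else range(len(shape))
    for axis in axes:
        strides[axis] = step
        step *= shape[axis]
    return tuple(strides)

# Jax layout: each of the 2 batch matrices is one contiguous block.
jax_like = element_strides((2, 2000, 400), "C")
# Same shape in Julia: the batch index varies fastest, so every matrix
# is interleaved with the other one in memory.
julia_same_shape = element_strides((2, 2000, 400), "F")
# Batch index last in Julia: each a[:, :, i] slice is contiguous again.
julia_batch_last = element_strides((2000, 400, 2), "F")

print(jax_like)          # (800000, 400, 1)
print(julia_same_shape)  # (1, 2, 4000)
print(julia_batch_last)  # (1, 2000, 800000)
```

With the batch index last, the column-major strides mirror the row-major ones, which is the layout the Julia benchmarks should use for a like-for-like comparison.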

jling · August 19, 2021, 9:39am

It may be due to the CPU. Can you check top and run it in a loop (like I did) to make sure Jax is indeed using a single thread? The fact that single-threaded vs. multi-threaded only changed your timing by 5% is weird; on my end, I see a 1 s → 40 ms difference with 48 cores.

And you really should use the correct orientation (i.e. memory layout) to be fair.
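Besides watching top, one rough way to check whether a workload really stays on one thread is to compare CPU time with wall-clock time. The sketch below uses only the Python standard library; `cpu_to_wall_ratio` and `busy` are hypothetical names, and `busy` is a stand-in for the call being benchmarked. A ratio near 1 suggests one busy core, while a ratio well above 1 suggests several threads are working.

```python
import time

def cpu_to_wall_ratio(fn, repeats=3):
    """Run fn() repeatedly; return (wall seconds, CPU seconds, CPU/wall ratio)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads of this process
    for _ in range(repeats):
        fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, cpu / wall

def busy():
    # Stand-in single-threaded workload; replace with the call under test.
    total = 0
    for n in range(200_000):
        total += n * n
    return total

wall, cpu, ratio = cpu_to_wall_ratio(busy)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s ratio={ratio:.2f}")
```

For Jax, the function passed in would be something like `lambda: jnp.matmul(a, b).block_until_ready()`, since Jax dispatches work asynchronously and the timing should include the actual computation.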

sivakon · August 19, 2021, 10:10am

I see your point. Seems like Jax is using multi-threading on my Mac. Same code and flags give the result you shared on Linux. This might be a bug in Jax on Mac.

sivakon · August 19, 2021, 10:47am

Updated the post, take a look. Thanks for the help

DNF · August 19, 2021, 10:51am

@sivakon I apologize for repeating myself, but why do you not fix the major axis orientation issue? Isn’t it correct that Jax (like NumPy) uses row-major arrays? In that case these results are not comparable. Is there something I’m missing, since you’re not addressing this concern?

In the benchmark I posted further up, changing orientation leads to a 7x speedup on my computer.

jling · August 19, 2021, 11:03am

So the latest result suggests that Julia is not slower than Jax even when we’re using the awkward memory layout (col-major vs. row-major)? Good to know, because Jax is a NumPy rewrite with some TF runtime and an LLVM backend that compiles kernels on its own.

Also, you really should fix col-major vs. row-major to make it apples to apples.

sivakon · August 19, 2021, 11:13am

using Tullio, BenchmarkTools
a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));
b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));
@btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k]; 
# 2.212 s (2 allocations: 30.52 MiB)


using Tullio, BenchmarkTools
a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));
b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));
@btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i]; 
# 1.854 s (2 allocations: 30.52 MiB)

Is this what you mean? I don’t see any speed-up here.

DNF · August 19, 2021, 11:28am

That is strange. Here are my results:

BTW, did you use

using Tullio, LoopVectorization

?

Single-threaded:

jl> Threads.nthreads()
1

jl> using Tullio, LoopVectorization

jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  713.923 ms (2 allocations: 30.52 MiB)

jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  133.608 ms (2 allocations: 30.52 MiB)

8 threads:

jl> Threads.nthreads()
8

jl> using Tullio, LoopVectorization

jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  154.384 ms (117 allocations: 30.52 MiB)

jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  23.660 ms (117 allocations: 30.52 MiB)

DNF · August 19, 2021, 11:32am

Mystery solved. Using just Tullio without LoopVectorization yields times around 1.8 s. If I have

using Tullio, LoopVectorization

it gives me a 2.5x speedup with row-major orientation, but a 13.5x speedup for column-major.
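As a quick check of those factors, the arithmetic can be reproduced from the single-threaded timings posted above. The variable names are made up for illustration, and "around 1.8 s" is taken as exactly 1800 ms, which is an assumption.

```python
# Timings copied from the posts above, in milliseconds.
tullio_only_ms = 1800.0      # "around 1.8 s" without LoopVectorization (assumed exact)
lv_batch_first_ms = 713.923  # with LoopVectorization, arrays shaped (2, 2000, 400)
lv_batch_last_ms = 133.608   # with LoopVectorization, arrays shaped (2000, 400, 2)

speedup_batch_first = tullio_only_ms / lv_batch_first_ms
speedup_batch_last = tullio_only_ms / lv_batch_last_ms
layout_gain = lv_batch_first_ms / lv_batch_last_ms

print(round(speedup_batch_first, 1))  # 2.5
print(round(speedup_batch_last, 1))   # 13.5
print(round(layout_gain, 1))          # 5.3
```

The last figure shows that, with LoopVectorization loaded, the layout change alone accounts for roughly a 5x difference on that machine.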

But anyway, I was just getting a bit exasperated that numerous requests to fix the orientation issue were not acknowledged or noticed.

carstenbauer · August 19, 2021, 11:38am

Can confirm this:

julia> using Tullio, LoopVectorization, BenchmarkTools

julia> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

julia> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

julia> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  1.415 s (2 allocations: 30.52 MiB)

julia> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

julia> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

julia> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  251.543 ms (2 allocations: 30.52 MiB)

Without LoopVectorization it takes 3 - 4 seconds with the latter case being only about 20% faster.

sivakon · August 19, 2021, 11:52am

I wanted to fix my issue first, to get an apples-to-apples speed comparison between Julia and Jax, and only later optimize my Julia code.

julia> using Tullio, LoopVectorization, BenchmarkTools

julia> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

julia> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

julia> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  682.080 ms (2 allocations: 30.52 MiB)

julia> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

julia> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

julia> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  126.235 ms (2 allocations: 30.52 MiB)

My code is hella fast now, thanks.
