Speed comparison matrix multiplication in Julia

That is strange. Here are my results:

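BTW, did you load both packages with `using Tullio, LoopVectorization`? Tullio only uses LoopVectorization's vectorized kernels when LoopVectorization is loaded. (`@btime` below comes from BenchmarkTools.)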

Single-threaded:

```
jl> Threads.nthreads()
1

jl> using Tullio, LoopVectorization

jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  713.923 ms (2 allocations: 30.52 MiB)

jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  133.608 ms (2 allocations: 30.52 MiB)
```
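Comparing the two single-threaded runs, the batch-last layout is roughly 5× faster (133.6 ms vs 713.9 ms). Julia arrays are column-major, so in `a[i, j, q]` with `size(a, 1) == 2` consecutive `j` values sit 2 elements apart, whereas in `a[j, q, i]` they are adjacent, which most likely lets LoopVectorization vectorize along `j`. Below is a minimal sketch, using only Base and LinearAlgebra, that converts batch-first arrays to batch-last with `permutedims` and checks each batch slice of a naive loop against an ordinary matrix product. The helper `batched_mul_naive` is hypothetical and the sizes are tiny on purpose.

```julia
using LinearAlgebra  # Matrix * Matrix for the reference check

# Hypothetical helper: naive batched product c[:, :, i] = a[:, :, i] * b[:, :, i]
# for the batch-last layout (batch index i is the last, slowest-varying dimension).
function batched_mul_naive(a::AbstractArray{T,3}, b::AbstractArray{T,3}) where {T}
    J, Q, B = size(a)
    Q2, K, B2 = size(b)
    (Q == Q2 && B == B2) || throw(DimensionMismatch("inner or batch sizes differ"))
    c = zeros(T, J, K, B)
    # The last range in the loop header is the innermost loop, so j (the
    # contiguous, stride-1 dimension of a and c) varies fastest.
    for i in 1:B, k in 1:K, q in 1:Q, j in 1:J
        c[j, k, i] += a[j, q, i] * b[q, k, i]
    end
    return c
end

# Small batch-first arrays, shaped like the first benchmark (batch size 2 in front).
a_first = reshape(Int32.(1:2*5*3), 2, 5, 3)   # a_first[i, j, q]
b_first = reshape(Int32.(1:2*3*4), 2, 3, 4)   # b_first[i, q, k]

# Move the batch index to the end: a_last[j, q, i] == a_first[i, j, q].
a_last = permutedims(a_first, (2, 3, 1))
b_last = permutedims(b_first, (2, 3, 1))

c_last = batched_mul_naive(a_last, b_last)

# Each batch slice should equal an ordinary matrix product.
for i in 1:2
    @assert c_last[:, :, i] == a_last[:, :, i] * b_last[:, :, i]
end
println(size(c_last))  # (5, 4, 2)
```

With Tullio loaded, the batch-last contraction is the one from the second benchmark, `@tullio c[j, k, i] := a[j, q, i] * b[q, k, i]`; the naive loop above is only a correctness reference and is not expected to match Tullio's speed.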

8 threads:

```
jl> Threads.nthreads()
8

jl> using Tullio, LoopVectorization

jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));

jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));

jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
  154.384 ms (117 allocations: 30.52 MiB)

jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));

jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));

jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
  23.660 ms (117 allocations: 30.52 MiB)
```
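Tullio spreads the contraction over the available threads by itself. On 8 threads the speedups here are about 4.6× (713.9 → 154.4 ms) and 5.6× (133.6 → 23.7 ms), and the extra allocations (117 vs 2) are presumably task bookkeeping. For comparison, here is a minimal hand-written sketch that threads the batch-last loop with `Threads.@threads`, splitting the work over the output columns `k` so that each task writes a disjoint set of columns. The function name `batched_mul_threaded!` is hypothetical, and this is not a description of how Tullio schedules its threads internally.

```julia
using LinearAlgebra                    # Matrix * Matrix for the reference check
using Base.Threads: @threads, nthreads

# Hypothetical threaded variant of c[j, k, i] = sum_q a[j, q, i] * b[q, k, i]
# (batch-last layout). Tasks split the k columns, so each one writes distinct
# columns c[:, k, i] and no locking is needed.
function batched_mul_threaded!(c::AbstractArray{T,3}, a::AbstractArray{T,3},
                               b::AbstractArray{T,3}) where {T}
    J, Q, B = size(a)
    K = size(b, 2)
    # Check sizes up front so the @inbounds loop below is safe.
    (size(b, 1) == Q && size(b, 3) == B && size(c) == (J, K, B)) ||
        throw(DimensionMismatch("incompatible array sizes"))
    fill!(c, zero(T))
    for i in 1:B
        @threads for k in 1:K
            @inbounds for q in 1:Q, j in 1:J   # j innermost: stride-1 access
                c[j, k, i] += a[j, q, i] * b[q, k, i]
            end
        end
    end
    return c
end

a = reshape(Int32.(1:6*4*2), 6, 4, 2)   # a[j, q, i]
b = reshape(Int32.(1:4*5*2), 4, 5, 2)   # b[q, k, i]
c = similar(a, 6, 5, 2)
batched_mul_threaded!(c, a, b)

println("threads = ", nthreads())
for i in 1:2
    @assert c[:, :, i] == a[:, :, i] * b[:, :, i]
end
```

Start Julia with several threads (for example via the `-t` option) to see any parallel speedup. With a single thread the function still runs and gives the same result.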