That is strange. BTW, did you load LoopVectorization along with Tullio, i.e.
using Tullio, LoopVectorization
?
Here are my results.
Single-threaded:
jl> Threads.nthreads()
1
jl> using Tullio, LoopVectorization
jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));
jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));
jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
713.923 ms (2 allocations: 30.52 MiB)
jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));
jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));
jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
133.608 ms (2 allocations: 30.52 MiB)
8 threads:
jl> Threads.nthreads()
8
jl> using Tullio, LoopVectorization
jl> a = Array(reshape(Int32.(1:2*2000*400), 2,2000,400));
jl> b = Array(reshape(Int32.(1:2*2000*400), 2,400,2000));
jl> @btime @tullio c[i, j, k] := $a[i, j, q] * $b[i, q, k];
154.384 ms (117 allocations: 30.52 MiB)
jl> a = Array(reshape(Int32.(1:2*2000*400), 2000,400,2));
jl> b = Array(reshape(Int32.(1:2*2000*400), 400,2000,2));
jl> @btime @tullio c[j, k, i] := $a[j, q, i] * $b[q, k, i];
23.660 ms (117 allocations: 30.52 MiB)
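
For anyone who wants to check that the two index orders really describe the same batched product, here is a minimal sketch in plain Julia (no Tullio needed). The function names `batched_first` and `batched_last` and the small array sizes are made up for illustration. Note that the `reshape` calls in the benchmarks fill the arrays differently in each layout, so those outputs are not comparable element by element; the sketch uses `permutedims` instead to move the batch dimension.

```julia
# Batch index first: c[i, j, k] = sum over q of a[i, j, q] * b[i, q, k]
function batched_first(a, b)
    ni, nj, nq = size(a)
    nk = size(b, 3)
    c = zeros(eltype(a), ni, nj, nk)
    for k in 1:nk, q in 1:nq, j in 1:nj, i in 1:ni
        c[i, j, k] += a[i, j, q] * b[i, q, k]
    end
    return c
end

# Batch index last: c[j, k, i] = sum over q of a[j, q, i] * b[q, k, i]
function batched_last(a, b)
    nj, nq, ni = size(a)
    nk = size(b, 2)
    c = zeros(eltype(a), nj, nk, ni)
    for i in 1:ni, k in 1:nk, q in 1:nq, j in 1:nj
        c[j, k, i] += a[j, q, i] * b[q, k, i]
    end
    return c
end

# Small hypothetical sizes so the check runs instantly.
a1 = Array(reshape(Int32.(1:2*5*4), 2, 5, 4))
b1 = Array(reshape(Int32.(1:2*4*3), 2, 4, 3))

# Move the batch dimension from first to last position.
a2 = permutedims(a1, (2, 3, 1))
b2 = permutedims(b1, (2, 3, 1))

c1 = batched_first(a1, b1)
c2 = batched_last(a2, b2)

# The results agree once c1 is permuted the same way.
same = permutedims(c1, (2, 3, 1)) == c2
```

Julia arrays are column-major, so with the batch index last the `j` and `q` indices run over contiguous memory. That may be part of why the second layout benchmarks faster, but I have not verified that this is the cause.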