It’s not identical, but a MWE with 3 options:
using BenchmarkTools, LinearAlgebra, Random
function myfun!(y, mat1, mat2, iCase)
Tk, S = size(mat1);
Tp, N = size(mat2);
if iCase == 1
yPart = zeros(Tk, Tp);
end
@inbounds @views for n=1:N
for s in 1:S
if iCase == 1
mul!(yPart, mat1[:,s], mat2[:,n]');
y[:,:,s,n] .= yPart;
elseif iCase == 2
y[:,:,s,n] .= mat1[:,s]*mat2[:,n]';
elseif iCase == 3
for i1 = 1 : Tk
for i2 = 1 : Tp
y[i1,i2,s,n] = mat1[i1,s]*mat2[i2,n];
end
end
end
end
end
return nothing
end
function bmark(iCase)
rng = MersenneTwister(123)
Tk=10
Tp=16
S=104
N=336
mat1 = rand(rng, Tk,S)
mat2 = rand(rng, Tp,N)
y = zeros(Tk,Tp,S,N)
b = @benchmarkable myfun!($y, $mat1, $mat2, $iCase);
run(b)
end
Unrolling the matrix multiplication avoids allocations and wins in terms of speed. But mul!
helps some.
julia> bmark(1)
BenchmarkTools.Trial:
memory estimate: 1.33 KiB
allocs estimate: 1
--------------
minimum time: 8.877 ms (0.00% GC)
median time: 9.118 ms (0.00% GC)
mean time: 9.320 ms (0.00% GC)
maximum time: 13.458 ms (0.00% GC)
--------------
samples: 537
evals/sample: 1
julia> bmark(2)
BenchmarkTools.Trial:
memory estimate: 45.32 MiB
allocs estimate: 34945
--------------
minimum time: 11.003 ms (0.00% GC)
median time: 12.034 ms (0.00% GC)
mean time: 18.336 ms (32.68% GC)
maximum time: 76.342 ms (84.29% GC)
--------------
samples: 273
evals/sample: 1
julia> bmark(3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.754 ms (0.00% GC)
median time: 4.003 ms (0.00% GC)
mean time: 4.180 ms (0.00% GC)
maximum time: 11.018 ms (0.00% GC)
--------------
samples: 1196
evals/sample: 1
I’m sure there must be a function that implements A B'
without allocating, but I don’t know what it is.