Thanks, I have compared three methods. It seems the garden-variety, fully explicit matrix multiplication is the fastest on my system. I am using 16 threads on an i7-13700K; my GPU is an Nvidia RTX 3060.
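All three methods compute the same N traces, res[pq] = tr(transpose(A[:, 2*pq-1, :]) * B * C[:, :, 2*pq]) for pq = 1:N, i.e. over the odd columns of A and the even slices of C.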
using LinearAlgebra
using MKL
using NNlib, CUDA, cuDNN
N = 50; Nt = 3000;
A = rand(Nt, 2*N, 2*N); C = rand(Nt, 2*N, 2*N); B = rand(Nt, Nt);  # A, C: Nt × 2N × 2N; B: Nt × Nt
nruns = 100;
ta = 0.0; tb = 0.0; tc = 0.0;  # accumulated average time per method
for nr = 1:nruns
    println(nr)  # progress

    # Method 1: NNlib batched multiplication (⊠); the arrays stay on the CPU
    t0a = time()
    Dtemp = permutedims(A[:, 1:2:end, :], (3, 1, 2)) ⊠ B ⊠ C[:, :, 2:2:end];
    res1 = tr.(eachslice(Dtemp; dims=3));
    t1a = time()
    global ta += (t1a - t0a) / nruns;

    # Method 2: broadcasting * over vectors of matrices
    t0b = time()
    Abatch = [transpose(A[:, 2*pq-1, :]) for pq = 1:N];
    Cbatch = [C[:, :, 2*pq] for pq = 1:N];
    Bbatch = [B for pq = 1:N];
    res2 = tr.(Abatch .* Bbatch .* Cbatch);
    t1b = time()
    global tb += (t1b - t0b) / nruns;

    # Method 3: explicit loop over the N traces, threaded at the Julia level
    res3 = zeros(Float64, N);
    t0c = time()
    Threads.@threads for pq = 1:N  # 16 threads
        res3[pq] = tr(transpose(@view A[:, 2*pq-1, :]) * B * (@view C[:, :, 2*pq]))
    end
    t1c = time()
    global tc += (t1c - t0c) / nruns;
end
println("NNlib+CUDA time = ",t1a-t0a)
println("Vectorised batch time = ",t1b-t0b)
println("Explicit time = ",t1c-t0c)
Results (times averaged over 100 runs):

NNlib batched time = 0.3775529026985167
Vectorised batch time = 0.36217491626739495
Explicit time = 0.23013512134552006
Any ideas how to make it any faster?
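One direction I have not benchmarked yet: the script imports CUDA and cuDNN but never actually moves the arrays off the CPU, so method 1 is really NNlib's batched multiply on the host. A variant like the one below (untested sketch) should make ⊠ dispatch to CUBLAS's strided-batched gemm, and Float32 should suit the RTX 3060 far better than Float64, since GeForce cards have very low FP64 throughput.

A32 = CuArray{Float32}(permutedims(A[:, 1:2:end, :], (3, 1, 2)))  # 2N × Nt × N
B32 = CuArray{Float32}(B)                                         # Nt × Nt, broadcast over the batch
C32 = CuArray{Float32}(C[:, :, 2:2:end])                          # Nt × 2N × N
Dtemp_gpu = CUDA.@sync A32 ⊠ B32 ⊠ C32                            # 2N × 2N × N, computed on the GPU
res_gpu = tr.(eachslice(Array(Dtemp_gpu); dims=3))                # copy back, then trace each slice

Two smaller things I am wondering about: only the diagonal of the final 2N × 2N product is needed, so tr(transpose(Av) * B * Cv) can be rewritten as dot(Av, B * Cv), replacing the second gemm and its temporary with a single Frobenius inner product; and in the explicit loop every * inside Threads.@threads calls multithreaded BLAS, so the 16 Julia threads may be oversubscribing the cores, and BLAS.set_num_threads(1) around that loop might help.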