Batched matrix multiplication in CUDA

Thanks, I have made a comparison of three methods. It seems the garden-variety explicit matrix multiplication is the fastest on my system. I am using 16 threads on an i7-13700K; my GPU is an Nvidia RTX 3060.

using LinearAlgebra
using MKL
using NNlib, CUDA, cuDNN

N = 50; Nt = 3000;    # N traces per method; Nt is the large inner dimension
A = rand(Nt,2*N,2*N); C = rand(Nt,2*N,2*N); B = rand(Nt,Nt);

nruns = 100;
ta = 0; tb = 0; tc = 0;

for nr = 1:nruns
	println(nr)

	# Method 1: NNlib batched multiplication (⊠ = batched_mul).
	# With plain Arrays this runs on the CPU; ⊠ only uses the GPU for CuArrays.
	t0a = time()
	Dtemp = permutedims(A[:, 1:2:end, :],(3,1,2)) ⊠ B ⊠ C[:, :, 2:2:end];
	res1 = tr.(eachslice(Dtemp; dims=3));
	t1a = time()
	global ta = ta + (t1a-t0a)/nruns;

	# Method 2: broadcast * over Vectors of matrix slices.
	t0b = time()
	Abatch = [ transpose(A[:, 2*pq-1, :]) for pq = 1:N ];
	Cbatch = [ C[:, :, 2*pq] for pq = 1:N ];
	Bbatch = [ B for pq = 1:N ];
	res2 = tr.( Abatch .* Bbatch .* Cbatch );
	t1b = time()
	global tb = tb + (t1b-t0b)/nruns;

	# Method 3: explicit threaded loop over slices, with views to avoid copies.
	res3 = zeros(Float64,N);
	t0c = time()
	Threads.@threads for pq = 1:N    # 16 threads
		res3[pq] = tr( transpose(@view A[:, 2*pq-1, :]) * B * (@view C[:, :, 2*pq]) )
	end
	t1c = time()
	global tc = tc + (t1c-t0c)/nruns;
end

println("NNlib+CUDA time = ",t1a-t0a)
println("Vectorised batch time = ",t1b-t0b)
println("Explicit time = ",t1c-t0c)

Results (times averaged over 100 runs):

NNlib+CUDA time = 0.3775529026985167
Vectorised batch time = 0.36217491626739495
Explicit time = 0.23013512134552006
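
One idea I have not benchmarked yet: since only the trace is needed, the final product in method 3 can be skipped via the identity tr(XY) = sum(X .* transpose(Y)), and limiting MKL to one thread avoids oversubscription inside the @threads loop. A sketch (res3b and the BLAS setting are my additions):

BLAS.set_num_threads(1)   # one BLAS thread per Julia thread; 16 × default MKL threads oversubscribes
res3b = zeros(Float64, N);
Threads.@threads for pq = 1:N
	X = transpose(@view A[:, 2*pq-1, :]) * B               # 2N×Nt GEMM, the dominant cost
	res3b[pq] = sum(X .* transpose(@view C[:, :, 2*pq]))   # tr(X*Cpq) without forming the 2N×2N product
end

The skipped multiply is only a ~2N/Nt fraction of the flops here, so the bigger win is likely the threading change.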

Any ideas how to make it any faster?
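
For reference, the timings above use time() averaged over runs, which folds the first run's compilation into the mean; BenchmarkTools would exclude that, e.g. for method 1:

using BenchmarkTools
@btime tr.(eachslice(permutedims($A[:, 1:2:end, :],(3,1,2)) ⊠ $B ⊠ $C[:, :, 2:2:end]; dims=3));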