The first piece of advice I have is to remember that the trace of the product of two matrices only uses the diagonal elements, and therefore you don’t need to calculate the whole matrix. So if you find yourself calculating Tr(A*B) it is probably better to not do the full matrix multiplication, when all you actually need is the sum of the elementwise multiplication of one matrix by the transpose of the other. So working from your vecbatch function it is faster to do something like this:
function vecbatch(A,B,C)
Abatch = [A[:, 2*pq-1, :] for pq = 1:N ];
Cbatch = [ C[:, :, 2*pq] for pq = 1:N ];
res = [sum( Abatch[pq] .* (B * Cbatch[pq])) for pq=1:N];
return res
end
The second piece of advice I have is that the cuBLAS library has a few functions for batched matrix multiplication that you might want to look into if you are planning to use CUDA for this problem.