Batched matrix multiplication in CUDA

bcmichael · November 30, 2023, 3:39pm

My first piece of advice is to remember that the trace of a product of two matrices only uses the diagonal elements of that product, so you don’t need to compute the whole product matrix. If you find yourself calculating Tr(A*B), it is probably better to skip the full matrix multiplication: all you actually need is the sum of the elementwise product of one matrix with the transpose of the other, i.e. Tr(A*B) = sum(transpose(A) .* B). Working from your vecbatch function, it is faster to do something like this:

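function vecbatch(A, B, C)
	# N (the number of slice pairs) comes from the enclosing scope
	Abatch = [A[:, 2*pq-1, :] for pq = 1:N]
	Cbatch = [C[:, :, 2*pq] for pq = 1:N]
	# tr(X * Y) == sum(transpose(X) .* Y), so the outer product is never formed
	res = [sum(transpose(Abatch[pq]) .* (B * Cbatch[pq])) for pq = 1:N]

	return res
end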
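
As a quick sanity check of the trace identity, here is a minimal sketch that compares the two ways of computing the trace on random matrices. The size `n` and the names `M`, `full` and `fast` are made up for this example; `M` just stands in for `B * Cbatch[pq]`.

```julia
using LinearAlgebra

n = 4                # hypothetical size, chosen only for this check
A = rand(n, n)
M = rand(n, n)       # stands in for B * Cbatch[pq]

full = tr(A * M)                 # forms the whole n×n product, O(n^3) work
fast = sum(transpose(A) .* M)    # elementwise product and sum, O(n^2) work

@assert full ≈ fast
```

If your matrices are complex, use `transpose(A)` rather than `A'` here, since `A'` is the adjoint and also conjugates the entries.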

The second piece of advice I have is that the cuBLAS library has a few functions for batched matrix multiplication that you might want to look into if you are planning to use CUDA for this problem.
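
For what it's worth, here is a rough sketch of how that could look from Julia. It assumes CUDA.jl is installed and a CUDA-capable GPU is available, and it uses `CUDA.CUBLAS.gemm_strided_batched`, which is how CUDA.jl exposes the strided batched GEMM as far as I know (check the signature against your CUDA.jl version). The sizes `n` and `N` and the array layouts are guesses based on the indexing in `vecbatch`, not something taken from your actual data.

```julia
using CUDA            # assumption: CUDA.jl installed and a working GPU
using LinearAlgebra

n, N = 8, 16                            # hypothetical sizes
A = CUDA.rand(Float32, n, 2N, n)        # layout guessed from A[:, 2*pq-1, :]
B = CUDA.rand(Float32, n, n)
C = CUDA.rand(Float32, n, n, 2N)        # layout guessed from C[:, :, 2*pq]

# Gather the slices that vecbatch uses into n×n×N stacks.
As = permutedims(A[:, 1:2:2N-1, :], (1, 3, 2))   # As[:, :, pq] == A[:, 2pq-1, :]
Cs = C[:, :, 2:2:2N]                             # Cs[:, :, pq] == C[:, :, 2pq]

# B is shared by every batch entry; copy it along the batch dimension
# via broadcasting so both inputs have the same batch size.
Bs = B .* CUDA.ones(Float32, 1, 1, N)

# One cuBLAS call: BC[:, :, pq] = B * Cs[:, :, pq]
BC = CUDA.CUBLAS.gemm_strided_batched('N', 'N', 1f0, Bs, Cs)

# tr(As_pq * BC_pq) == sum(transpose(As_pq) .* BC_pq), done for all pq at once
res = vec(sum(permutedims(As, (2, 1, 3)) .* BC; dims=(1, 2)))
```

Since `B` is the same for every `pq` in this particular problem, a single ordinary multiplication `B * reshape(Cs, n, n * N)` would likely do the same job without any batching; the batched call becomes more interesting when both factors vary per batch entry.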
