Fastest matrix multiplication

Then write a method traceOutXYZSubSystem? After adapting all the memory layout (and basis choices!) to make the subsystems you actually want to trace out easy to trace out?

If you are fluent enough to understand kron as a shorthand for “trace out subsystem” with all its physical meaning, then by all means use that notation.

That is, imo, extremely advanced notation: naive readers will take the kron and matmul literally, and end up assembling big (possibly dense!) matrices that they multiply, instead of doing what amounts to a glorified memcpy.
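To make that concrete for readers without the fluency: a partial trace never needs a Kronecker-structured operator at all; a reshape plus a contraction over one index does the job. A minimal sketch, assuming a state ρ on H_A ⊗ H_B with Julia's kron index convention (the B index varies fastest); ptrace_B, dA, dB are names I made up for illustration:

using LinearAlgebra

# Partial trace over subsystem B, without ever forming a kron matrix.
function ptrace_B(ρ::AbstractMatrix, dA::Int, dB::Int)
    ρ4 = reshape(ρ, dB, dA, dB, dA)   # ρ4[k, i, l, j] = ⟨i,k|ρ|j,l⟩
    ρA = zeros(eltype(ρ), dA, dA)
    for j in 1:dA, i in 1:dA, k in 1:dB
        ρA[i, j] += ρ4[k, i, k, j]    # contract over the B index only
    end
    return ρA
end

# Sanity check on a product state: tracing out B recovers ρA (up to tr(ρB)).
ρA = [0.7 0.1; 0.1 0.3]; ρB = [0.5 0.0; 0.0 0.5]
ptrace_B(kron(ρA, ρB), 2, 2) ≈ ρA * tr(ρB)  # true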

A further modification I would suggest is the following:

using LinearAlgebra   # for I(sz) and mul!

## -- initialization (hoisted out of the hot loop)
sz = size(Gl, 1)      # I don't want to type expressions including Nf all the time
GlGa = hcat(Gl, Ga)   # [Gl  Ga] as one wide matrix, so the loop needs a single matmul
Vr = Vvi*Gr - I(sz)   # pre-associated: Vvi*(Gr*X) - X == (Vvi*Gr - I)*X
Vl = Vvi*Gl
tmp1 = similar(GlGa)  # buffer for M[:,:,ij] * GlGa
tmp2 = similar(Gl)    # buffer for Vr * (M*Ga)

...

@views begin
    mul!(tmp1, M[:,:,ij], GlGa)           # computes [M*Gl  M*Ga] in one matmul
    mul!(j0[:,:,ij], Vl, tmp1[:, 1:sz])   # Vl * (M*Gl), written in place
    mul!(tmp2, Vr, tmp1[:, sz+1:2*sz])    # Vr * (M*Ga)
    j0[:,:,ij] .+= tmp2
end

In other words:

  1. We can get better memory locality by computing M[:,:,ij]*hcat(Gl, Ga) as a single matmul instead of two separate ones (as @StefanKarpinski remarked upthread).
  2. We can hoist that hcat out of the loop.
  3. We can re-associate expressions like Vvi * (Gr * X) into (Vvi * Gr) * X and then hoist the matmul Vvi * Gr out of the loop (a fully assembled sketch follows below).
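Putting the three changes together, here is a self-contained sketch of the whole computation you can paste into a REPL. The dimensions (sz = 256, nij = 100) and the dense random inputs are placeholders I made up; substitute your actual arrays:

using LinearAlgebra

sz, nij = 256, 100    # placeholder sizes, not from this thread
Gl, Ga, Gr, Vvi = (rand(ComplexF64, sz, sz) for _ in 1:4)
M  = rand(ComplexF64, sz, sz, nij)
j0 = similar(M)

## -- initialization, hoisted out of the loop
GlGa = hcat(Gl, Ga)
Vr = Vvi*Gr - I(sz)
Vl = Vvi*Gl
tmp1 = similar(GlGa)
tmp2 = similar(Gl)

for ij in 1:nij
    @views begin
        mul!(tmp1, M[:,:,ij], GlGa)
        mul!(j0[:,:,ij], Vl, tmp1[:, 1:sz])
        mul!(tmp2, Vr, tmp1[:, sz+1:2*sz])
        j0[:,:,ij] .+= tmp2
    end
end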

PS. This brings it down from 1.6s → 1.2s on my machine.

@eveningsilverfox I think you will need to show us the sparsity structure of M[:,:, ij] if you want more help.

If M[:,:,ij] were dense, I think the remaining matmuls are still big enough for CUDA to pay off. I.e. I think at these sizes, the cost of sending M[:,:,ij] from main memory to GPU memory and fetching back the resulting j0[:,:,ij] should amortize over the matmuls.
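For the record, a rough CUDA.jl sketch of that offload, keeping the small hoisted factors resident on the GPU and streaming the M slices over (names as in the loop above; whether this actually wins depends on the real sizes and your PCIe bandwidth, so treat it as a starting point, not a measurement):

using CUDA, LinearAlgebra

# One-time uploads: the hoisted factors stay resident on the GPU.
d_GlGa = CuArray(GlGa)
d_Vl, d_Vr = CuArray(Vl), CuArray(Vr)
d_tmp1 = CUDA.zeros(ComplexF64, sz, 2*sz)
d_tmp2 = CUDA.zeros(ComplexF64, sz, sz)
d_j0   = CUDA.zeros(ComplexF64, sz, sz)

for ij in 1:nij
    d_M = CuArray(M[:,:,ij])                        # send one slice to the GPU
    mul!(d_tmp1, d_M, d_GlGa)                       # CUBLAS gemm
    mul!(d_j0, d_Vl, @view d_tmp1[:, 1:sz])
    mul!(d_tmp2, d_Vr, @view d_tmp1[:, sz+1:2*sz])
    d_j0 .+= d_tmp2
    j0[:,:,ij] = Array(d_j0)                        # fetch the result back
end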
