Fastest matrix multiplication

The rate determining step in my code is intermediate-sized matrix multiplications, which I have to do many many (around 10⁸) times.

using LinearAlgebra
using MKL
using BenchmarkTools

aa = rand(ComplexF64, 164,164);
bb = rand(ComplexF64, 164,164);
tta = time()
cc = aa*bb;
ttb = time()
println(" (jac) tt = ",ttb-tta)

@btime aa*bb

On my PC, the output of the println is around 0.02 s, while the @btime result is around 90 µs. Are there any ways to make it faster? Note that the input matrices aa and bb are freshly calculated each time I have to perform a multiplication, so converting them to a new type would also add time.
Ideally, I would be dealing with matrices of size 500 × 500.

Note that the first call in Julia compiles the function, so the second call uses the compiled version and is fast.
Also, interpolate the variables when benchmarking: @btime $aa * $bb.
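For example (a minimal sketch; matmul is just a hypothetical wrapper to illustrate the compilation point):

using LinearAlgebra, MKL, BenchmarkTools

matmul(a, b) = a * b          # hypothetical wrapper, only to show first-call compilation

aa = rand(ComplexF64, 164, 164)
bb = rand(ComplexF64, 164, 164)

matmul(aa, bb)                # first call compiles matmul for these argument types
@time matmul(aa, bb)          # second call: only the multiplication is timed
@btime $aa * $bb;             # $-interpolation avoids global-variable overhead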

Some low-hanging fruit you can check:

  • Check Threads.nthreads() to see whether you have enabled multithreading in your Julia session.
  • Do you need ComplexF64, or is ComplexF32 sufficient? (A quick check is sketched below.)
  • Do you have a CUDA card in your computer? That could speed things up massively.
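For the first two points, a quick check could look like this (just a sketch):

using LinearAlgebra, BenchmarkTools

Threads.nthreads()                      # number of Julia threads (set with julia -t ...)

# Double- vs. single-precision complex matmul:
a64 = rand(ComplexF64, 164, 164); b64 = rand(ComplexF64, 164, 164)
a32 = ComplexF32.(a64);           b32 = ComplexF32.(b64)
@btime $a64 * $b64;
@btime $a32 * $b32;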
  1. I am initialising Julia with julia -t16. My CPU is an i7-13700K with 8 performance and 8 efficiency cores, totalling 24 threads.
  2. I will check that.
  3. I have an RTX 3060 and I can use CUDA. But since I am recalculating the input matrices aa and bb each time I do the multiplication, won’t the overhead of transferring to the GPU make it even slower?
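(A sketch of how that overhead could be measured directly, assuming CUDA.jl is set up; for 164×164 matrices the transfers may well dominate:)

using CUDA, BenchmarkTools

aa = rand(ComplexF64, 164, 164)
bb = rand(ComplexF64, 164, 164)

@btime $aa * $bb;                            # CPU baseline
@btime Array(CuArray($aa) * CuArray($bb));   # GPU, including transfers both ways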

Can you share how you recalculate the matrices? Maybe this can be done on the GPU too?

If it’s some for loop style, you could try to translate it to KernelAbstractions.jl.

It is rather complicated. It is certainly too long to elaborate over chat/text.
I am solving a set of non-linear equations for my physics work. These matrices are obtained from the dressed Green’s functions, and enter the Jacobian which I feed to Julia’s nlsolve.

aa is a fixed/constant matrix, which I set just once. In short, in one kind of multiplication, bb = Gr, where Gr = inv(I - gr*V)*gr, and gr and V are matrices which I obtain numerically. In another kind, bb = Gr*Sig*conj(transpose(Gr)), where Sig is again something I know numerically. And the list goes on.
I have to do them for a bunch of parameters (energies, other quantum indices…) and then sum them up to obtain a set of equations for a “current”, which is the thing being solved for by nlsolve.

Doing your benchmarks in ‘global scope’ will not give you useful results. Code that you benchmark should be put in a function.

You can save time on the multiplication itself by doing it in-place, using the mul! function:

mul!(cc, aa, bb) 
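For example (a sketch, reusing aa and bb from the original post; cc is allocated once up front and then reused):

using LinearAlgebra, BenchmarkTools   # for mul! and @btime

cc = similar(aa)             # preallocate the output once
@btime mul!($cc, $aa, $bb);  # in-place multiply, no allocation per call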

The question about threads is a bit of a red herring. BLAS does not use Julia threads, but has its own BLAS threads independent of Julia.
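You can inspect and set the BLAS thread count separately (a sketch):

using LinearAlgebra

Threads.nthreads()       # Julia threads (from julia -t ...)
BLAS.get_num_threads()   # BLAS threads -- an independent pool
BLAS.set_num_threads(8)  # e.g. restrict BLAS to the performance cores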


Not sure, but sounds like all of it could be done on a GPU?

I would really recommend trying KernelAbstractions.jl; it's a neat, simple-to-use package that provides a great speed-up in many cases. An RTX 3060 is a powerful card and could easily provide a 10-15x speed-up.


I will check that out and try to see if I can move it all over to my GPU.

mul!(cc, aa, bb) 

is faster than cc .= aa*bb, if I have already defined cc earlier?


Yes, cc .= aa*bb does not work in-place. It is equivalent to doing

temp = aa * bb  # creates temporary array
cc .= temp # copy into cc
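A quick way to see the difference (a sketch, using the arrays from the original post):

@btime $cc .= $aa * $bb;     # allocates a temporary matrix on every call
@btime mul!($cc, $aa, $bb);  # 0 allocations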

So this is the minimum working example which needs to be optimised. I compute the matrices (here randomly initialised) as a function of other parameters in an outer loop, which also contains this block. I found that doing all of the computation in CUDA, which forces me to build larger block-diagonal matrices such as kron(I(8*Nf+2), Gr) and so on, quickly eats up my RAM (64 GB) and it crashes.

The slowest steps are the ones where I calculate j0 and j1. It might not seem like best practice to put Threads.@threads on the outer for loop, but I find that it significantly speeds up the calculation (I start Julia with julia -t16). Maybe it's because BLAS/MKL doesn't fully utilise all threads?

Nf = 40;

##---Initialisations---
Vvi = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1));
Gr = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1));
Gl = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1));
Ga = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1));
Mi = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2); M = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2);
for ij = 1:2*(4*Nf+1)
    Mi[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1));
    M[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1));
end
m1t = kron(I(2*(2*Nf+1)), [1 0]); m1 = kron(I(2*(2*Nf+1)), [1; 0]); m2t = kron(I(2*(2*Nf+1)), [0 1]); m2 = kron(I(2*(2*Nf+1)), [0; 1]);

j = zeros(ComplexF64, 2*Nf+1,2*Nf+1,2*(4*Nf+1)); j0 = zeros(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1), 8*Nf+2); j1 = zeros(ComplexF64, 2*(2*Nf+1),2*(2*Nf+1), 8*Nf+2);
##---Calculation---
Threads.@threads for ij = 1:2*(4*Nf+1)
    j0[:,:,ij] .= -( (Mi[:,:,ij]*Gl) + (Vvi*( Gr*M[:,:,ij]*Gl + Gl*M[:,:,ij]*Ga )) ); 
    j1[:,:,ij] .= m1t*j0[:,:,ij]*m1 + m2t*j0[:,:,ij]*m2;
    j[:,:,ij] .= j1[1:(2*Nf+1),1:(2*Nf+1),ij] - j1[(2*Nf+1)+1:2*(2*Nf+1),(2*Nf+1)+1:2*(2*Nf+1),ij]; #much faster than the prev. two lines. Not worth optimising
end

Basically every line allocates unnecessary matrices here.

Instead of:

Mi = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2); M = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2);
for ij = 1:2*(4*Nf+1)
    Mi[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1));
    M[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1));
end

just do:

Mi = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2)
M = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2)

The other lines

j0[:,:,ij] .= -( (Mi[:,:,ij]*Gl) + (Vvi*( Gr*M[:,:,ij]*Gl + Gl*M[:,:,ij]*Ga )) )

allocate something like 10 temporary matrices. Each Mi[:,:,ij] actually makes a copy, and each matrix multiplication allocates a new matrix. A remedy for the first issue is putting @views in front of the loop. To fix the other allocations you need to restructure the computation a bit and preallocate some arrays so you can use mul! to avoid the allocations.
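For one of the products that could look roughly like this (a sketch; t1 and t2 are buffers preallocated outside the loop):

t1 = similar(Gl); t2 = similar(Gl)    # allocated once, reused every iteration

@views for ij = 1:2*(4*Nf+1)
    mul!(t1, Gr, M[:,:,ij])   # Gr * M[:,:,ij]; the slice is a view, no copy
    mul!(t2, t1, Gl)          # (Gr * M[:,:,ij]) * Gl, written into t2
    # ... same pattern for the remaining products ...
end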

Also note that in Julia you don’t need to put “;” at the end of lines.


You’ll likely get a lot of mileage out of concatenating matrices that are getting multiplied by the same aa left hand side and doing a single large multiplication, ideally using preallocated matrices and mul! to compute the matmul in place. If you’re then summing the results afterward, it’s possible you could pre-sum the right hand sides to do less work and avoid summing afterwards. That would reduce the big O time of your algorithm substantially, giving much bigger speedups.
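Roughly like this (a sketch, with hypothetical right-hand sides B1, B2, B3 that are all multiplied by the same aa):

Bbig = reduce(hcat, (B1, B2, B3))   # stack right-hand sides as columns of one big matrix
Cbig = similar(Bbig)
mul!(Cbig, aa, Bbig)                # one large matmul instead of three

# If the products get summed anyway, sum the right-hand sides first:
Csum = similar(B1)
mul!(Csum, aa, B1 .+ B2 .+ B3)      # aa*B1 + aa*B2 + aa*B3 with a single matmul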

Once you’ve done that, test various BLAS libraries and thread counts and see which one does best. There is no universally best implementation. My guess is that for smaller matrices MKL is best, but it might not be. You’ll want to test with and without hyperthreading. Without is likely best but you never know.
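Something along these lines for the thread-count scan (a sketch; run it in a fresh session per BLAS backend, since using MKL switches the backend for the whole session):

using LinearAlgebra, BenchmarkTools

A = rand(ComplexF64, 500, 500); B = rand(ComplexF64, 500, 500)
for n in (1, 4, 8, 16)
    BLAS.set_num_threads(n)
    print("BLAS threads = $n: ")
    @btime $A * $B;
end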


On my 8-core Core i7 machine, I get the biggest speedup by just setting

BLAS.set_num_threads(1)

to avoid thread contention between your parallel loop and BLAS. For me this reduces the loop time from about 11 seconds to 6 seconds. I then tried to reduce the unnecessary allocations by using views and mul! as @abraemer suggested above, being careful to use task local storage from the OhMyThreads package:

using LinearAlgebra, MKL
BLAS.set_num_threads(1) # Avoid contention with threaded loop
using OhMyThreads: TaskLocalValue


@views function tryit()
    Nf = 40
    ##---Initialisations---
    Vvi = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Gr = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Gl = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Ga = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Mi = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2); M = zeros(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2)
    for ij = 1:2*(4*Nf+1)
        Mi[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1))
        M[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1))
    end
    m1t = kron(I(2*(2*Nf+1)), [1 0]); m1 = kron(I(2*(2*Nf+1)), [1; 0]); m2t = kron(I(2*(2*Nf+1)), [0 1]); m2 = kron(I(2*(2*Nf+1)), [0; 1])

    j =  zeros(ComplexF64, 2*Nf+1,2*Nf+1,2*(4*Nf+1))
    j0 = zeros(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1), 8*Nf+2)
    j1 = zeros(ComplexF64, 2*(2*Nf+1),2*(2*Nf+1), 8*Nf+2)
    tls = TaskLocalValue{Vector{Matrix{ComplexF64}}}( () -> 
                     begin
                         t1, t2, t3 = (similar(Gr) for _ in 1:3)
                         bs1 = zeros(ComplexF64, size(m2))
                         ss1, ss2 = (zeros(ComplexF64, size(m1t,1), size(m1t,1)) for _ in 1:2)
                         [t1, t2, t3, bs1, ss1, ss2]
                     end)

    
##---Calculation--
    @time Threads.@threads for ij = 1:2*(4*Nf+1)
        t1, t2, t3, bs1, ss1, ss2 = tls[]
        #j0[:,:,ij] .= -( (Mi[:,:,ij]*Gl) + (Vvi*( Gr*M[:,:,ij]*Gl + Gl*M[:,:,ij]*Ga )) )
        mul!(t1, Gl, M[:,:,ij]);  mul!(t2, t1, Ga) # t2 contains Gl * M[:,:,ij] * Ga
        mul!(t1, Gr, M[:,:,ij]);  mul!(t3, t1, Gl) # t3 contains Gr * M[:,:,ij] * Gl
        t3 .+= t2; mul!(t2, Vvi, t3)
        mul!(t3, Mi[:,:,ij], Gl)
        @. j0[:,:,ij] = -(t3 + t2)
        #j1[:,:,ij] .= m1t*j0[:,:,ij]*m1 + m2t*j0[:,:,ij]*m2
        mul!(bs1, j0[:,:,ij], m1); mul!(ss1, m1t, bs1)
        mul!(bs1, j0[:,:,ij], m2); mul!(ss2, m2t, bs1)
        @. j1[:,:,ij] = ss1 + ss2
        j[:,:,ij] .= j1[1:(2*Nf+1),1:(2*Nf+1),ij] - j1[(2*Nf+1)+1:2*(2*Nf+1),(2*Nf+1)+1:2*(2*Nf+1),ij] #much faster than the prev. two lines. Not worth optimising
    end
end

The second execution of tryit() then results in:

julia> tryit()
  5.284830 seconds (6.60 k allocations: 114.961 MiB)

Hopefully you get even more speedup on your 16 core machine. I haven’t tried concatenating as suggested by @StefanKarpinski .


Building on the result of @PeterSimon, I found another big improvement: note that your matrices m1, m1t, m2, m2t are all very sparse, so you should use SparseArrays.jl for them. Doing that I got an improvement:

old: 3.356652 seconds (6.60 k allocations: 114.960 MiB)
new: 2.286051 seconds (805 allocations: 83.535 MiB)

I then noticed that the last operation in the loop is missing a . in front of the "-", so it allocates a new array and then just copies it into j. I also made use of the 5-argument version of mul!:

mul!(A, B, C, x, y) -> A = x*B*C + y*A

to get rid of half the temporary arrays for a ~5-10% speedup.
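For instance (a generic sketch of the pattern, with placeholder A, B, C):

mul!(C, A, B, -1, true)    # C .-= A*B, fused, no temporary
mul!(C, A, B, 2, false)    # C .= 2 .* (A*B); with beta = false, C is not even read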
My final time is

2.067672 seconds (113 allocations: 32.046 MiB)

with code:

using LinearAlgebra, MKL, SparseArrays
BLAS.set_num_threads(1) # Avoid contention with threaded loop
using OhMyThreads: TaskLocalValue


@views function tryit()
    Nf = 40
    ##---Initialisations---
    Vvi = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Gr = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Gl = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Ga = rand(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1))
    Mi = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2)
    M = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1),8*Nf+2)
    # is there a reason for this loop?
    # for ij = 1:2*(4*Nf+1)
    #     Mi[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1))
    #     M[:,:,ij] = rand(ComplexF64, 2*2*(2*Nf+1),2*2*(2*Nf+1))
    # end
    m1t = kron(I(2*(2*Nf+1)), sparse([1 0])); m1 = transpose(m1t); m2t = kron(I(2*(2*Nf+1)), sparse([0 1])); m2 = transpose(m2t)

    j =  zeros(ComplexF64, 2*Nf+1,2*Nf+1,2*(4*Nf+1))
    j0 = zeros(ComplexF64, 4*(2*Nf+1),4*(2*Nf+1), 8*Nf+2)
    j1 = zeros(ComplexF64, 2*(2*Nf+1),2*(2*Nf+1), 8*Nf+2)
    tls = TaskLocalValue{Vector{Matrix{ComplexF64}}}( () ->
                     begin
                         t1, t2 = (similar(Gr) for _ in 1:2)
                         bs1 = zeros(ComplexF64, size(m2))
                         #ss1, ss2 = (zeros(ComplexF64, size(m1t,1), size(m1t,1)) for _ in 1:2)
                         [t1, t2, bs1]#, ss1, ss2]
                     end)


##---Calculation--
    @time Threads.@threads for ij = 1:2*(4*Nf+1)
        t1, t2, bs1 = tls[]
        #j0[:,:,ij] .= -( (Mi[:,:,ij]*Gl) + (Vvi*( Gr*M[:,:,ij]*Gl + Gl*M[:,:,ij]*Ga )) )
        mul!(t1, Gl, M[:,:,ij]);  mul!(t2, t1, Ga) # t2 contains Gl * M[:,:,ij] * Ga
        mul!(t1, Gr, M[:,:,ij]);  mul!(t2, t1, Gl, true, true) # t2 contains Gr * M[:,:,ij] * Gl + Gl * M[:,:,ij] * Ga
        mul!(j0[:,:,ij], Vvi, t2, -1, false) # multiply into j0[:,:,ij]
        mul!(j0[:,:,ij], Mi[:,:,ij], Gl, -1, true) # and subtract other part directly

        #j1[:,:,ij] .= m1t*j0[:,:,ij]*m1 + m2t*j0[:,:,ij]*m2
        # this uses sparse matrices now!
        mul!(bs1, j0[:,:,ij], m1); mul!(j1[:,:,ij], m1t, bs1)
        mul!(bs1, j0[:,:,ij], m2); mul!(j1[:,:,ij], m2t, bs1, true, true)

        # note the additional .
        j[:,:,ij] .= j1[1:(2*Nf+1),1:(2*Nf+1),ij] .- j1[(2*Nf+1)+1:2*(2*Nf+1),(2*Nf+1)+1:2*(2*Nf+1),ij] #much faster than the prev. two lines. Not worth optimising
    end
end

Look, kron is an anti-pattern. It is good for certain mathematical proofs, but almost all code using kron is garbage, with very few exceptions.

You are using m1 only to compute sparse-dense matrix products a la

mul!(bs1, j0[:,:,ij], m1)

This is the same as

bs1 .= view(j0, :, 1:2:size(j0, 2), ij );

Likewise

mul!(view(j1, :,:,ij), m1t, bs1)

is equivalent to

view(j1, :, :, ij) .= view(bs1, 1:2:size(bs1, 1), :);

You can use the broadcasted variant as above, but ideally you would just write the plain loop using linear indexing where appropriate.
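For the "every second column" case that plain loop could look like this (a sketch):

# equivalent to bs1 .= view(j0, :, 1:2:size(j0, 2), ij)
@inbounds for c in axes(bs1, 2)
    jc = 2c - 1                    # every second column of j0
    for r in axes(bs1, 1)
        bs1[r, c] = j0[r, jc, ij]
    end
end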

PS. I am saying “almost all code using kron is garbage” because kron is an example of extremely structured linear operators.

If you know your linear operator and its structure, then it is most of the time a mistake to represent this linear operator as a matrix, dense or sparse – instead you simply know what this operator does, no abstractions or libraries needed.

It is my personal opinion that this also applies to proofs, btw – it is very profitable to be able to think “extract every second column” instead of “multiply these matrices”. Some authors choose to write that as e.g. A*m1, and I have nothing against that notation, as long as it is clear that this is a shorthand for “extract every second column of A” and not an invitation to forget what A*m1 means.


I agree with you that in this case the fastest thing to do would be to just write a loop over the correct indices, which can be deduced rather easily. But I would argue that it is not necessarily worth it, because it is only a very minor speed-up and it makes the code significantly less general (you cannot use an arbitrary projector anymore) and less easy to understand (what are these indices?). I think the krons are somewhat clearer in their intent.
Using this line instead of the products with m1 etc.:

j1[:,:,ij] .= j0[1:2:4*(2*Nf+1), 1:2:4*(2*Nf+1), ij] .+ j0[2:2:4*(2*Nf+1), 2:2:4*(2*Nf+1), ij]
# 2.049803 seconds (97 allocations: 25.637 MiB)

I’d like to point out that this imo does not quite apply to computational quantum physics. Most of our linear operators are just sums of tensor/Kronecker products, but one wouldn't usually want to write functions for their action, as they are too complicated to exploit. In theory one could compute the relevant indices of these operators without kron, but in practice it is rarely worth the hassle if kron does the job as well. So with respect to this field, I would object to the sentiment that "almost all code using kron is garbage" :wink:

On top of that, I would argue that in this domain the kron carries a lot of semantic information and thus the code is easier to read and understand. For example, the code we just optimized clearly reads as tracing out a 2-level subsystem, which is fully lost by just using the indices instead of the matrices (not sure whether the OP does something quantum-mechanics related; to my biased eye it looks a bit like it). Of course some comments in the code could fix this as well… But what if I wanted to trace out some other subsystem at some point? Then the original would be much easier to adapt, with less opportunity for errors. That's why I kept closer to the original code in that respect :slight_smile:


Yes, removing the kron was quite helpful. While it reduces readability as @abraemer mentioned, it drastically reduces the complexity.

Thanks @abraemer and @PeterSimon. I had the loops on Mi and M because they are calculated by a separate function which depends on the loop index ij. Mi and M are actually sparse; that is, M[:,:,ij] is a sparse matrix for each ij. If it is possible to make a structure/array of sparse M and Mi for each ij, that should ideally help.
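Something like this is what I have in mind (a sketch; build_M and build_Mi stand in for my actual functions):

using SparseArrays

# build_M / build_Mi are placeholders for the functions that construct each matrix
M  = [build_M(ij)  for ij in 1:(8*Nf + 2)]   # Vector of SparseMatrixCSC{ComplexF64, Int}
Mi = [build_Mi(ij) for ij in 1:(8*Nf + 2)]

# Inside the loop, use M[ij] / Mi[ij] instead of M[:,:,ij] / Mi[:,:,ij];
# mul! also accepts a sparse matrix as one of the factors.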

I have no experience using kronecker products, but if they carry meaning, perhaps using the Kronecker.jl package would preserve that meaning while still providing the speedup. It’s just a matter of replacing kron with kronecker, AIUI.