Optimizing nested loops in Julia

Hi everyone, I have an optimization problem involving nested loops. I am using the QuantumOptics.jl package (qojulia.org) for a physics calculation. First of all, let me share my code and explain it.

using QuantumOptics, Einsum  # Einsum.jl provides @einsum

function Hubbard_Interaction(P, Pt, cut_mb_basis, cut_off, U)

    # P1 and P1t are just matrices, don't focus on them :)
    P1 = P.data
    P1t = Pt.data

    # Pretty fast calculation with einsum. No problem here
    @einsum coefficient[k,l,m,n] := P1[k,i] * P1[l,i] * P1t[i,m] * P1t[i,n]

    # Sparse operator that we accumulate the result into
    Vint_mb_cut = SparseOperator(cut_mb_basis)
    
    # The problem starts here :/    
    for k in 1:cut_off
        for l in 1:cut_off
            for m in 1:cut_off
                for n in 1:cut_off

                    # These four operators are just matrices but note that they depend on the loop indices!
                    a1t = create(cut_mb_basis, k)
                    a2t = create(cut_mb_basis, l)
                    a2  = destroy(cut_mb_basis, m)      
                    a1  = destroy(cut_mb_basis, n)

                    # Matrix multiplication is pretty fast, no problem here
                    Vint_mb_cut += U/2*coefficient[k,l,m,n]*a1t*a2t*a2*a1
                end
            end
        end
    end
    
    return Vint_mb_cut
end

This function calculates the interaction between two particles; the physics details don't matter here. The main point is the nested loop in my code. When cut_off is a big number, the code runs quite slowly. The big question is: how can I optimize this code for big cut_off values (the loop indices all just run from 1), for example cut_off = 50?

It seems you are rebuilding the creation and annihilation matrices unnecessarily in the innermost loop. For example, a1t = create(cut_mb_basis, k) depends only on the outermost loop variable and could be moved to

for k in 1:cut_off
  a1t = create(cut_mb_basis, k)
  [...]

and so on for the other three operators, which should save about 75% of the create/destroy calls.
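Spelled out, the fully hoisted loop nest would look something like this (a sketch, reusing the names from the function above):

for k in 1:cut_off
    a1t = create(cut_mb_basis, k)           # depends only on k
    for l in 1:cut_off
        a2t = create(cut_mb_basis, l)       # depends only on l
        for m in 1:cut_off
            a2 = destroy(cut_mb_basis, m)   # depends only on m
            for n in 1:cut_off
                a1 = destroy(cut_mb_basis, n)
                Vint_mb_cut += U/2 * coefficient[k,l,m,n] * a1t * a2t * a2 * a1
            end
        end
    end
end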

But actually it might be smarter to create all operators from the get-go. I suppose they are sparse matrices? Shouldn’t take up too much space. I’ve never worked with QuantumOptics, so that’s a guess.


All of these matrices are sparse arrays, which of course helps performance. I tried your suggestion, but the results are not convincing: with cut_off = 75 the code still runs slowly.

Can you post your updated code, maybe? The one above allocates 4 matrices on every inner-loop iteration, which looks really costly (though I am not sure what create and destroy do).

Did you try pre-allocating all operators outside of the loop, e.g.

At = [create(cut_mb_basis, k) for k in 1:cut_off]
A  = [destroy(cut_mb_basis, k) for k in 1:cut_off]

and then indexing into those arrays in the loop?
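The inner loop would then do cheap array lookups instead of rebuilding operators, something like this (a sketch, assuming the At/A arrays above plus coefficient, U and Vint_mb_cut from the original function):

for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off
    # every operator is now a lookup, not a fresh construction
    Vint_mb_cut += U/2 * coefficient[k,l,m,n] * At[k] * At[l] * A[m] * A[n]
end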


Your innermost loop executes cut_off^4 times, so anything you can do to speed it up will be critical. @skleinbo’s suggestion of pre-computing your operators looks like a good first step.

This is still going to allocate in your inner loop. If the matrices are small, you’re going to pay a big overhead price for these allocations, which you can avoid if you pre-allocate scratch space and use in-place operations like mul!.
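A rough sketch of that pattern (everything here is an assumption: the At/A arrays from the previous post, ComplexF64 data, and dense matrices, which won't be viable at very large dimensions):

using LinearAlgebra

# Dense copies of the pre-built operators, made once.
Atd = [Matrix(op.data) for op in At]
Ad  = [Matrix(op.data) for op in A]

dim = size(Atd[1], 1)
T1 = zeros(ComplexF64, dim, dim)     # scratch space, allocated once
T2 = zeros(ComplexF64, dim, dim)
Vdata = zeros(ComplexF64, dim, dim)  # accumulates the interaction matrix

for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off
    mul!(T1, Atd[k], Atd[l])                      # T1 = a1t * a2t, in place
    mul!(T2, T1, Ad[m])                           # T2 = T1 * a2
    mul!(T1, T2, Ad[n])                           # T1 = T2 * a1
    axpy!(U/2 * coefficient[k,l,m,n], T1, Vdata)  # Vdata += coeff * T1
end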

How big are your matrices? If they are a small fixed size (say 10x10 or smaller), you might use StaticArrays.jl. (This is also in the performance tips.)


The first thing to think about is whether you can take advantage of any symmetries here. Assuming your operators are all bosonic, both coefficient and your operator product should be symmetric under the exchanges k \leftrightarrow l and m \leftrightarrow n. Taking advantage of that (for example with SymmetricTensor from Tensors.jl, github.com/Ferrite-FEM/Tensors.jl), you'd only need to calculate (n(n+1))^2/4 instead of n^4 matrix products, roughly a 4x speedup!

You probably also have commutation relations for a_i and a^\dagger_i, so you might even be able to get that down to {n+3 \choose 4} + n(n+1)/4 matrix products, which would be roughly a 24x speedup.
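As a rough sketch of the first symmetry (again a guess, assuming bosonic operators so that both coefficient and the operator product are invariant under k \leftrightarrow l and m \leftrightarrow n; At/A are the pre-built operator arrays from earlier):

for k in 1:cut_off, l in k:cut_off, m in 1:cut_off, n in m:cut_off
    # off-diagonal index pairs stand in for their swapped partners
    sym = (k == l ? 1 : 2) * (m == n ? 1 : 2)
    Vint_mb_cut += sym * U/2 * coefficient[k,l,m,n] * At[k] * At[l] * A[m] * A[n]
end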


Thank you! It's a very effective way to construct these arrays quickly.

I generally work with very big matrices/tensors of shape 9000x9000.