Hi everyone, I have a problem optimizing a computation with nested loops. I use the *QuantumOptics.jl (qojulia.org)* package to simulate a physical system. First of all, let me share my code and explain it.

```
using QuantumOptics  # SparseOperator, create, destroy
using Einsum         # @einsum

function Hubbard_Interaction(P, Pt, cut_mb_basis, cut_off, U)
    # P1 and P1t are just matrices, don't focus on them :)
    P1 = P.data
    P1t = Pt.data
    # Pretty fast calculation with @einsum. No problem here.
    @einsum coefficient[k,l,m,n] := P1[k,i] * P1[l,i] * P1t[i,m] * P1t[i,n]
    # Sparse operator for fast modification
    Vint_mb_cut = SparseOperator(cut_mb_basis)
    # The problem starts here :/
    for k in 1:cut_off
        for l in 1:cut_off
            for m in 1:cut_off
                for n in 1:cut_off
                    # These four operators are just matrices, but note that they depend on the loop indices!
                    a1t = create(cut_mb_basis, k)
                    a2t = create(cut_mb_basis, l)
                    a2 = destroy(cut_mb_basis, m)
                    a1 = destroy(cut_mb_basis, n)
                    # Matrix multiplication is pretty fast, no problem here.
                    Vint_mb_cut += U/2 * coefficient[k,l,m,n] * a1t * a2t * a2 * a1
                end
            end
        end
    end
    return Vint_mb_cut
end
```

This function calculates the interaction between two particles; the physics details don't matter here. The main point is the nested loop in my code. When `cut_off` is a big number, the code runs quite slowly. The big question is: how can I optimize this code for large `cut_off` values, for example `cut_off = 50`?

It seems you are rebuilding the creation and annihilation matrices unnecessarily in the innermost loop. E.g. `a1t = create(cut_mb_basis, k)` only depends on the outermost loop variable and could be moved to

```
for k in 1:cut_off
a1t = create(cut_mb_basis, k)
[...]
```

and so on, which should save about 75%.
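A sketch of the fully hoisted loop (same body as the function above; this assumes `create`/`destroy` depend only on the single index shown, and also hoists the partial product `a1t * a2t` out of the two inner loops):

```
for k in 1:cut_off
    a1t = create(cut_mb_basis, k)
    for l in 1:cut_off
        a2t = create(cut_mb_basis, l)
        a1t_a2t = a1t * a2t  # partial product, reused for every (m, n)
        for m in 1:cut_off
            a2 = destroy(cut_mb_basis, m)
            for n in 1:cut_off
                a1 = destroy(cut_mb_basis, n)
                Vint_mb_cut += U/2 * coefficient[k,l,m,n] * a1t_a2t * a2 * a1
            end
        end
    end
end
```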

But actually it might be smarter to create all operators up front. I suppose they are sparse matrices? They shouldn't take up too much space. I've never worked with QuantumOptics, so that's a guess.


All of these matrices are sparse arrays, and they do help performance. I tried your suggestion, but the results are not convincing: with cut_off = 75 the code still runs slowly.

Can you post your updated code, maybe? The one above allocates 4 matrices in every inner-loop iteration, which looks really costly (though I am not sure what `create` and `destroy` do).

Did you try pre-allocating all operators outside of the loop, e.g

```
At = [create(cut_mb_basis, k) for k in 1:cut_off]   # creation operators
A  = [destroy(cut_mb_basis, k) for k in 1:cut_off]  # annihilation operators
```

and then indexing into those arrays in the loop?
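For reference, a sketch of the loop after that pre-computation (assuming `create` builds the creation operator for mode `k` and `destroy` the annihilation operator, so the creation operators go in `At` and the annihilation operators in `A`):

```
At = [create(cut_mb_basis, k) for k in 1:cut_off]   # creation operators
A  = [destroy(cut_mb_basis, k) for k in 1:cut_off]  # annihilation operators

for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off
    Vint_mb_cut += U/2 * coefficient[k,l,m,n] * At[k] * At[l] * A[m] * A[n]
end
```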


Your innermost loop executes `cut_off^4` times, so anything you can do to speed it up will be critical. @skleinbo’s suggestion of pre-computing your operators looks like a good first step.

This is still going to allocate in your inner loop. If the matrices are small, you’re going to pay a big overhead price for these allocations, which you can avoid if you pre-allocate scratch space and use in-place operations like `mul!`.
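For dense matrices that pattern might look like the following sketch (`At`/`A` are assumed to be arrays of pre-built dense operator matrices of size `dim × dim`; `tmp1`, `tmp2`, and `V` are hypothetical scratch buffers; the final line uses the 5-argument `mul!(C, A, B, α, β)`, which computes `C = α*A*B + β*C` without allocating):

```
using LinearAlgebra

# Scratch space, allocated once outside the loop.
tmp1 = zeros(dim, dim)
tmp2 = zeros(dim, dim)
V    = zeros(dim, dim)

for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off
    mul!(tmp1, At[k], At[l])   # tmp1 = a†_k * a†_l
    mul!(tmp2, tmp1, A[m])     # tmp2 = tmp1 * a_m
    # V += (U/2) * coefficient[k,l,m,n] * tmp2 * a_n, in place:
    mul!(V, tmp2, A[n], U/2 * coefficient[k,l,m,n], 1.0)
end
```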

How big are your matrices? If they are a small fixed size (say 10x10 or smaller), you might use StaticArrays.jl. (This is also in the performance tips.)
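A minimal illustration of the StaticArrays idea (toy 4×4 matrices, not the operators from the original post):

```
using StaticArrays

a = @SMatrix rand(4, 4)   # stack-allocated, fixed-size matrix
b = @SMatrix rand(4, 4)
c = a * b                 # multiplication without heap allocation
```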


The first thing to think about would be whether you can take advantage of any symmetries here. Assuming your operators are all bosonic, both `coefficient` and your operator product should be symmetric under the exchanges $k \leftrightarrow l$ and $m \leftrightarrow n$. Taking advantage of that (for example using `SymmetricTensor` from [Tensors.jl](https://github.com/Ferrite-FEM/Tensors.jl)), you’d only need to calculate $(n(n+1))^2/4$ instead of $n^4$ matrix products, a 4x speedup!

You probably also have commutation relations for $a_i$ and $a^\dagger_i$, so you might even be able to get that down to $\binom{n+3}{4} + n(n+1)/4$ matrix products, which would be a 24x speedup.
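As a sanity check on the counting, here is a self-contained toy version using scalar coefficients instead of operator products (pure Julia, no QuantumOptics; the weight `(2 - (k == l)) * (2 - (m == n))` accounts for the permutations skipped by restricting to $k \le l$ and $m \le n$):

```
cut_off = 4
P  = rand(cut_off, cut_off)
Pt = rand(cut_off, cut_off)

# Toy coefficient tensor, symmetric under k <-> l and m <-> n by construction.
coefficient = [sum(P[k,i]*P[l,i]*Pt[i,m]*Pt[i,n] for i in 1:cut_off)
               for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off]

# Full quadruple sum (scalar stand-in for the operator product).
full = sum(coefficient[k,l,m,n]
           for k in 1:cut_off, l in 1:cut_off, m in 1:cut_off, n in 1:cut_off)

# Restricted sum over k <= l and m <= n with multiplicity weights.
restricted = sum((2 - (k == l)) * (2 - (m == n)) * coefficient[k,l,m,n]
                 for k in 1:cut_off for l in k:cut_off
                 for m in 1:cut_off for n in m:cut_off)

restricted ≈ full
```

In the real code the scalar would be replaced by the operator product $a^\dagger_k a^\dagger_l a_m a_n$, which is valid as long as the operators within each pair commute (i.e. for bosons).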


Thank you! That’s a very effective way to construct the operator arrays quickly.

I generally work with very big matrices/tensors, around 9000×9000 in shape.