# Assistance in speeding up Array operations and summations

I have a problem where I have arrays of matrices that I do operations on and then sum the result. I wonder if I can optimize this part of my code, as it will be used repeatedly. Below is an MWE that reproduces my problem.
(Note: I had P as real in my first post, but it should be complex.)

```julia
using LinearAlgebra, BenchmarkTools

begin # variables that determine the dimensions of the arrays
    N = 50
    dim = 10
    Nsite = 100
end

begin # two arrays used in the calculation
    phi0 = rand(ComplexF64, Nsite, dim, N)
    P = rand(ComplexF64, dim, dim, N)
end

# The main function
function ne(N::Int, phi0::Array{ComplexF64, 3}, P::Array{ComplexF64, 3})
    return @views sum([conj(phi0[:, :, q]) * P[:, :, q] * transpose(phi0[:, :, q]) for q in 1:N])
end

# benchmarking ne
@btime ne(N, phi0, P)
```

With the parameters in the MWE I get

```julia
@btime ne(N, phi0, P)
3.054 ms (299 allocations: 16.65 MiB)
```

I notice there are large allocations that I suspect could hinder performance. In my actual situation, the dimension of the matrices is roughly 10^4, and with those the allocations take up much more memory.

Is there a way I can make this function run faster?

I’m guessing it’s the `sum([… for i in 1:N])` pattern, which allocates an entire array for each term. Have you tried preallocating and using a loop, something like this?

```julia
function ne(N, phi0, P)
    out = zeros(Nsite, Nsite)
    for q in 1:N
        out .+= @views conj(phi0[:, :, q]) * P[:, :, q] * transpose(phi0[:, :, q])
    end
    out
end
```

Btw, `conj(A)*B*transpose(A)` is like `A*conj(B)*A'`, which might be prettier, and I think there may be specialisations for expressions of the form `u*A*u'`, but don’t quote me on that!


Thanks for your reply. I got an `InexactError` when applying your function, I’m assuming because `zeros` created a real matrix while the result is complex, so I changed your function a bit:

```julia
function ne(N, phi0, P)
    out = zeros(ComplexF64, Nsite, Nsite)
    for q in 1:N
        out .+= @views conj(phi0[:, :, q]) * P[:, :, q] * transpose(phi0[:, :, q])
    end
    out
end
```

This does indeed lead to fewer allocations (I will need to check how it scales for larger Nsite). This is the benchmark of your function with Nsite = 100:

```julia
@btime ne(N, phi0, P)
2.185 ms (202 allocations: 9.32 MiB)
```

I’m not sure why `conj(A)*B*transpose(A)` is equivalent to `A*conj(B)*A'`. Unless the result is a real matrix, I don’t see how they are equal, but maybe I misunderstood.

Sorry, I meant `conj(A)*B*transpose(A)` and `A*conj(B)*A'` are conjugate to each other! So if you had

```julia
phi0 = conj(phi0)
```

then your inner loop would look like `phi0[:, :, q]*P[:, :, q]*phi0[:, :, q]'`, which is a little nicer (to me).
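A quick numerical check of the claimed identity (throwaway sketch; the names `A` and `B` are mine, not from the thread):

```julia
using LinearAlgebra

# Check that conj(A)*B*transpose(A) and A*conj(B)*A' are complex conjugates:
# conj(A*conj(B)*A') = conj(A) * B * conj(A') = conj(A) * B * transpose(A).
A = rand(ComplexF64, 4, 3)
B = rand(ComplexF64, 3, 3)

conj(A) * B * transpose(A) ≈ conj(A * conj(B) * A')  # true
```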

It still strikes me as a lot of allocation, since in principle I don’t see why any new arrays need to be allocated, but I’m out of ideas for how to improve it, I’m afraid!

I think the allocations stem from the `conj` (which I don’t think is lazy) and then from one of the mat-muls, which would explain the allocation count (100 loop iterations with 2 allocations each, plus 1 for the output array, is close to 202).
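For reference, `conj` on a matrix is indeed eager (it materializes a new array), whereas `adjoint` and `transpose` return lazy wrappers, which is why the `conj` inside the loop shows up as an allocation:

```julia
using LinearAlgebra

A = rand(ComplexF64, 3, 3)

conj(A) isa Matrix           # true: eager, allocates a new array
adjoint(A) isa Adjoint       # true: lazy wrapper, no copy
transpose(A) isa Transpose   # true: lazy wrapper, no copy
```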

Perhaps you can use Tullio.jl or similar to avoid these intermediate allocations. The code should look like:

```julia
using Tullio

function ne(N, phi0, P)
    @tullio grad=false out[i, j] := conj(phi0[i, k, q]) * P[k, l, q] * phi0[j, l, q]
    out
end
```

I think this will avoid the intermediate allocations, but it might be slower regardless, as it needs to perform more operations (see the Fast & slow section of the Tullio readme).

In the end it might be best to preallocate not just the `out` array but also one or two arrays for intermediate results, and reuse these between iterations. Then you can have the best of both worlds. I can probably write you something tomorrow if needed.


Here’s a version with `mul!` to avoid allocations in the loop.

```julia
julia> @btime ne($N, $phi0, $P);  # original
  min 2.471 ms, mean 4.488 ms (299 allocations, 16.65 MiB)

julia> @btime ne_3($N, $phi0, $P);  # Tullio
  min 14.015 ms, mean 14.152 ms (51 allocations, 159.11 KiB)

julia> function ne_6(N, phi0, P)
           Nsite = size(phi0, 1)
           dim = size(phi0, 2)
           out = zeros(ComplexF64, Nsite, Nsite)
           P_tmp = @views P[:, :, 1] .+ im
           mul_tmp = zeros(ComplexF64, dim, Nsite)
           for q in 1:N
               @views P_tmp .= P[:, :, q]  # complex, to avoid allocations in mul!
               @views mul!(mul_tmp, P_tmp, adjoint(phi0[:, :, q]))  # costs 3/4 of the time
               @views mul!(out, phi0[:, :, q], mul_tmp, true, true)
           end
           out .= conj.(out)
       end
ne_6 (generic function with 1 method)

julia> ne(N, phi0, P) ≈ ne_6(N, phi0, P)
true

julia> @btime ne_6($N, $phi0, $P);
  min 1.985 ms, mean 2.880 ms (4 allocations, 173.81 KiB)

julia> let C = rand(ComplexF64, 3, 3), A = rand(3, 3), B = rand(ComplexF64, 3, 3)
           @less mul!(C, A, B', true, false)
       end   # this calls generic_matmatmul! it seems
```

Notice that `mul!(complex, real, complex)` causes allocations, more in fact than `mul!(complex, real .+ 0im, complex)`. That’s what `P_tmp` is avoiding.

But this is still wasteful. With a bit more thought, you could perhaps instead exploit the fact that `P[:, :, q]` is real, by `reinterpret`-ing the complex matrices. Either by hand, or by arranging to hit `mul!(C::StridedMatrix{Complex{Float64}}, A::StridedVecOrMat{Complex{Float64}}, B::StridedVecOrMat{Float64}, alpha::Real, beta::Real)`
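As an illustration of the `reinterpret` idea (my example, not code from the thread): viewing a `Matrix{ComplexF64}` as `Float64` interleaves real and imaginary parts along the first dimension, and since multiplying by a real matrix from the right acts on the real and imaginary parts independently, the complex-times-real product can be computed entirely in real arithmetic:

```julia
using LinearAlgebra

A = rand(ComplexF64, 4, 3)
B = rand(Float64, 3, 5)              # the real factor, like P in the original MWE
C = Matrix{ComplexF64}(undef, 4, 5)

# The 8×3 real view of A times B equals the real view of A*B.
mul!(reinterpret(Float64, C), reinterpret(Float64, A), B)
C ≈ A * B   # true
```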


Actually, P is in general a complex matrix (it is Hermitian). Sorry, that was an oversight on my part; in my MWE I had it typed as Float64. I will edit the MWE to make P complex.

Your solution with the `mul!` function has the lowest allocations, and it scales nicely with increasing Nsite. I think I will use it, thanks a lot!


Oh I see. Now that I look at it, that does look more natural. I can afford to define `phi0` as `conj(phi0)`. Thanks!


I’m trying to do the sum with zero allocations, as I will eventually want to use multithreading (which seems to prefer zero allocations). I have built on @mcabbott’s solution by preallocating the `mul_tmp` array and also the output, as follows:

```julia
using LinearAlgebra, BenchmarkTools

begin # two arrays used in the calculation
    phi0 = rand(ComplexF64, 1000, 10, 50)
    P = rand(ComplexF64, 10, 10, 50)
end

begin
    tmp = zeros(ComplexF64, 10, 1000)
    out = zeros(ComplexF64, 1000, 1000)  # preallocating the intermediate mul and the output
    function ne(phi0, P, tmp, out)
        for q in 1:50
            @views mul!(tmp, P[:, :, q], adjoint(phi0[:, :, q]))
            @views mul!(out, phi0[:, :, q], tmp, true, true)
        end
        return out
    end
end
```

Now I get zero allocations when the preallocated output is passed to `ne` as an argument:

```julia
@btime ne(phi0, P, tmp, out)
107.077 ms (0 allocations: 0 bytes)
```

But that can’t be correct in the long run, as a second call of the function will keep adding into the `out` variable, which gives the wrong sum, so I know this can’t be the right way. Any help making this idea work?

Is it fine to just zero the array inside the function?

```julia
out = Array{ComplexF64}(undef, 1000, 1000)

function ne(phi0, P, tmp, out)
    fill!(out, 0)
    for q in 1:50
        ...
    end
    return out
end
```

Thanks! That works. I wasn’t aware that using `fill!` would not lead to allocations.
By the way, is there a reason you preallocate `out` as `Array{ComplexF64}(undef, 1000, 1000)` rather than `zeros(ComplexF64, 1000, 1000)`?

What is the reasoning for the `begin ... end` blocks in the code? To my knowledge they have no effect (at least in the way they are being used in your code).

Might be from a Pluto notebook.
