Assistance in speeding up Array operations and summations

I have a problem where I have arrays of matrices that I do operations on and then sum the result. I wonder if I can optimize this part of my code, as it will be used repeatedly. Please find below an MWE that reproduces my problem.
(Note I had P as real in my first post but it should be complex)

using LinearAlgebra, BenchmarkTools

begin #variables that determine the dimension of the arrays
     N = 50 
    dim = 10 
    Nsite = 100
end

begin  #two arrays used in the calculation
phi0 = rand(ComplexF64, Nsite, dim, N)
P = rand(ComplexF64, dim, dim, N)
end
#The main function
function ne(N::Int, phi0::Array{ComplexF64, 3}, P::Array{ComplexF64, 3})
	return @views sum([conj(phi0[:, :, q])*P[:, :, q]*transpose(phi0[:, :, q]) for q in 1:N])
end
#benchmarking ne
@btime ne(N, phi0, P)

With the parameters in the MWE I get

@btime ne(N, phi0, P)
3.054 ms (299 allocations: 16.65 MiB)

I notice there are large allocations that I suspect could hinder performance. In my actual situation the dimension of the matrices is roughly 10^4, and at that size the allocations take even more memory.

Is there a way I can make this function run faster?

I’m guessing it’s the sum([… for i in 1:N]) pattern which is allocating an entire array for each term. Have you tried preallocating and using a loop, something like this?

function ne(N, phi0, P)
    out = zeros(Nsite, Nsite)
    for q in 1:N
        out .+= @views conj(phi0[:, :, q])*P[:, :, q]*transpose(phi0[:, :, q])
    end
    out
end

Btw, doing conj(A)*B*transpose(A) is like A*conj(B)*A', which might be prettier, and I think there may be specialisations for things of the form u*A*u', but don’t quote me on that!


Thanks for your reply. I got an InexactError when applying your function, I’m assuming because zeros created a real-valued matrix while out needs to be complex, so I changed your function a bit:

function ne(N, phi0, P)
    out = zeros(ComplexF64, Nsite, Nsite)
    for q in 1:N
        out .+= @views conj(phi0[:, :, q])*P[:, :, q]*transpose(phi0[:, :, q])
    end
    out
end

This way does indeed lead to fewer allocations (I will need to check how it scales for larger Nsite). This is the benchmark of running your function with Nsite = 100:

@btime ne(N, phi0, P)
2.185 ms (202 allocations: 9.32 MiB)

I’m not sure why conj(A)*B*transpose(A) is equivalent to A*conj(B)*A'. Unless the result is a real matrix, I don’t see how they are equal, but maybe I misunderstood.

Sorry, I meant conj(A)*B*transpose(A) and A*conj(B)*A' are conjugate to each other! So if you had

phi0 = conj(phi0)

then your inner loop would look like phi0[:, :, q]*P[:, :, q]*phi0[:, :, q]', which is a little nicer (to me).
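The relationship can be checked numerically on small throwaway matrices (a quick sketch, names are mine, not from the thread):

```julia
using LinearAlgebra

# Check that conj(A)*B*transpose(A) is the elementwise conjugate
# of A*conj(B)*A' -- they are conjugates, not equal.
A = rand(ComplexF64, 3, 4)
B = rand(ComplexF64, 4, 4)

lhs = conj(A) * B * transpose(A)
rhs = A * conj(B) * A'
@assert lhs ≈ conj(rhs)
@assert !(lhs ≈ rhs)   # not equal in general for complex inputs
```

This works because conj(A') == transpose(A), so conjugating the whole product A*conj(B)*A' recovers conj(A)*B*transpose(A) term by term.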

It still strikes me as a lot of allocation, since in principle I don’t see why any new arrays need to be allocated, but I’m out of ideas for how to improve it, I’m afraid!

I think allocations stem from the conj (which I don’t think is lazy) and then from the mat-muls, which would explain the number of allocations (several allocations per loop iteration across 50 iterations, plus the output array, lands near 202 🙂).
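As a quick sanity check of the non-laziness of conj (throwaway sizes, a sketch rather than a proper benchmark):

```julia
using LinearAlgebra

# Compare allocation of conj (materializes a copy) against adjoint
# (a lazy wrapper around the same memory).
A = rand(ComplexF64, 100, 100)
conj(A); adjoint(A)                # warm up to exclude compilation
conj_bytes = @allocated conj(A)    # copies the whole 100x100 matrix
adj_bytes  = @allocated adjoint(A) # only a thin wrapper
@assert conj_bytes > adj_bytes
@assert conj_bytes >= 100 * 100 * sizeof(ComplexF64)
```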

Perhaps you can use Tullio.jl or similar to avoid these intermediate allocations. The code should look like:

using Tullio

function ne(N, phi0, P)
    @tullio grad=false out[i,j] := conj(phi0[i, k, q])*P[k, l, q]*phi0[j, l, q]
    out
end

I think this will avoid the intermediate allocations, but it might be slower regardless, as it needs to perform more operations (see the Fast & Slow section of the Tullio readme).

In the end it might be best to preallocate not just the out array but also one (or two?) arrays for intermediate results and reuse these between iterations. Then you can have the best of both worlds. I can probably write you something tomorrow if needed.


Here’s a version with mul! to avoid allocations in the loop.

julia> @btime ne($N, $phi0, $P);  # original
  min 2.471 ms, mean 4.488 ms (299 allocations, 16.65 MiB)

julia> @btime ne_3($N, $phi0, $P);  # Tullio
  min 14.015 ms, mean 14.152 ms (51 allocations, 159.11 KiB)

julia> function ne_6(N, phi0, P)
        Nsite = size(phi0, 1)
        out = zeros(ComplexF64, Nsite, Nsite)
        P_tmp = @views P[:, :, 1] .+ im
        mul_tmp = zeros(ComplexF64, size(P, 1), Nsite)  # size(P, 1) rather than the global dim
        for q in 1:N
          @views P_tmp .= P[:, :, q]  # complex, to avoid allocations in mul!
          @views mul!(mul_tmp, P_tmp, adjoint(phi0[:, :, q]))  # costs 3/4 of the time
          @views mul!(out, phi0[:, :, q], mul_tmp, true, true)
        end
        out .= conj.(out)
       end
ne_6 (generic function with 1 method)

julia> ne(N, phi0, P) ≈ ne_6(N, phi0, P)
true

julia> @btime ne_6($N, $phi0, $P);
  min 1.985 ms, mean 2.880 ms (4 allocations, 173.81 KiB)

julia> let C = rand(ComplexF64,3,3), A = rand(3,3), B = rand(ComplexF64,3,3)
         @less mul!(C,A,B',true,false)
       end   # this calls generic_matmatmul! it seems

Notice that mul!(complex, real, complex) causes allocations, more than mul!(complex, real .+ 0im, complex) in fact. That’s what P_tmp is avoiding.

But this is still wasteful. With a bit more thought, you could perhaps instead exploit the fact that P[:, :, q] is real, by reinterpret-ing the complex matrices. Either by hand, or by arranging to hit mul!(C::StridedMatrix{Complex{Float64}}, A::StridedVecOrMat{Complex{Float64}}, B::StridedVecOrMat{Float64}, alpha::Real, beta::Real)
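A hand-rolled sketch of that reinterpret idea (names and sizes are mine, not from the thread): reinterpreting a ComplexF64 matrix as Float64 interleaves re/im down the columns, so multiplying by a real matrix on the right can stay entirely in real arithmetic.

```julia
using LinearAlgebra

# Multiply a complex matrix by a real one via reinterpretation,
# avoiding mixed complex-real matmul paths.
A = rand(ComplexF64, 6, 4)   # stands in for a phi0[:, :, q] slice
B = rand(4, 4)               # stands in for a real P[:, :, q]

Ar = reinterpret(Float64, A)       # 12x4 real view, re/im interleaved
Cr = Ar * B                        # plain real matmul
C  = reinterpret(ComplexF64, Cr)   # back to a 6x4 complex view
@assert C ≈ A * B
```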


Actually P is in general a complex matrix (it is Hermitian). Sorry, that was an oversight on my part that my MWE had it as type Float64. I will edit my MWE to have P as a complex matrix.

Your solution with the mul! function seems to have the lowest allocations, and it scales nicely with increasing Nsite. I think I will use it, thanks a lot!


Oh I see. Now that I look at it, I think this looks more natural. I can afford to define phi0 as conj(phi0). Thanks!


I’m trying to do the sum with zero allocations, as I will eventually want to use multithreading (which seems to prefer zero allocations). I have built on @mcabbott’s solution by preallocating the mul_tmp array and also the output, as follows:

using LinearAlgebra, BenchmarkTools  

begin  #two arrays used in the calculation
phi0 = rand(ComplexF64, 1000, 10, 50)
P = rand(ComplexF64, 10, 10, 50)
end

begin
tmp = zeros(ComplexF64, 10, 1000);
out = zeros(ComplexF64, 1000, 1000);   #preallocating the intermediate mul and the output
function ne(phi0, P, tmp, out)
    for q in 1:50
        @views mul!(tmp, P[:, :, q], adjoint(phi0[:, :, q]))
        @views mul!(out, phi0[:, :, q], tmp, true, true)
    end
    return out
end
end

Now I get zero allocations when the preallocated arrays are passed into ne as arguments:

@btime ne(phi0, P, tmp, out)
107.077 ms (0 allocations: 0 bytes)

But that can’t be correct in the long run, as a second application of the function will keep adding onto the out variable, which gives the wrong sum, so I know this can’t be the right way. Any help making this idea work?

Is it fine to just zero the array inside the function?

out = Array{ComplexF64}(undef, 1000, 1000)
function ne(phi0, P, tmp, out)
    fill!(out, 0)
    for q in 1:50
        ...
    end
    return out
end
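A minimal check of this pattern, filling in the elided loop body from the earlier posts and using smaller sizes than the thread’s for speed:

```julia
using LinearAlgebra

# With fill! zeroing `out` on entry, repeated calls are reproducible
# instead of accumulating into the preallocated buffer.
phi0 = rand(ComplexF64, 20, 5, 50)
P    = rand(ComplexF64, 5, 5, 50)
tmp  = zeros(ComplexF64, 5, 20)
out  = Array{ComplexF64}(undef, 20, 20)

function ne(phi0, P, tmp, out)
    fill!(out, 0)
    for q in 1:50
        @views mul!(tmp, P[:, :, q], adjoint(phi0[:, :, q]))
        @views mul!(out, phi0[:, :, q], tmp, true, true)
    end
    return out
end

first_run  = copy(ne(phi0, P, tmp, out))
second_run = ne(phi0, P, tmp, out)
@assert first_run ≈ second_run   # no accumulation across calls
```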

Thanks! That works. I wasn’t aware that using fill! would not lead to allocations.
By the way, is there a reason you preallocate out as Array{ComplexF64}(undef, 1000, 1000) as compared to zeros(ComplexF64, 1000, 1000)?

What is the reasoning for the begin ... end blocks in the code? To my knowledge they have no effect (at least in the way they are being used in your code).

Might be from a Pluto notebook.
