I have arrays of matrices on which I perform some operations and then sum the results. I wonder whether I can optimize this part of my code, since it will be called repeatedly. Below is an MWE that reproduces my problem.

(Note: I had P as real in my first post, but it should be complex.)

```
using LinearAlgebra, BenchmarkTools
begin #variables that determine the dimension of the arrays
N = 50
dim = 10
Nsite = 100
end
begin #two arrays used in the calculation
phi0 = rand(ComplexF64, Nsite, dim, N)
P = rand(ComplexF64, dim, dim, N)
end
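# The main function: sums conj(phi) * P * transpose(phi) over the third index.
# P is complex (see the note above), so its type annotation must be ComplexF64.
function ne(N::Int, phi0::Array{ComplexF64, 3}, P::Array{ComplexF64, 3})
    return @views sum([conj(phi0[:, :, q]) * P[:, :, q] * transpose(phi0[:, :, q]) for q in 1:N])
end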
#benchmarking ne
@btime ne(N, phi0, P)
```

With the parameters in the MWE I get:

```
@btime ne(N, phi0, P)
3.054 ms (299 allocations: 16.65 MiB)
```

I notice large allocations, which I suspect could hinder performance. In my actual use case the dimension of the matrices is roughly 10^4, and the allocations then take up even more memory.

Is there a way I can make this function run faster?
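
For context, one direction I have been considering is to preallocate the output and scratch buffers and accumulate in place with `mul!`, so that no temporary Nsite × Nsite matrix is created per term. The following is only a minimal sketch under that assumption: the name `ne_prealloc!` and the buffers `out`, `tmp` and `cphi` are hypothetical helpers introduced here for illustration, and I have not checked how it behaves at the larger sizes.

```julia
using LinearAlgebra

# Hypothetical in-place variant (not part of the MWE above).
# out  : Nsite × Nsite accumulator, overwritten with the result
# tmp  : Nsite × dim scratch buffer holding conj(phi) * P
# cphi : Nsite × dim scratch buffer holding conj(phi)
function ne_prealloc!(out, tmp, cphi, phi0, P)
    fill!(out, zero(eltype(out)))
    @views for q in axes(phi0, 3)
        cphi .= conj.(phi0[:, :, q])              # elementwise conjugate, no new array
        mul!(tmp, cphi, P[:, :, q])               # tmp = conj(phi) * P
        # 5-argument mul!: out = tmp * transpose(phi) * 1 + out * 1
        mul!(out, tmp, transpose(phi0[:, :, q]), true, true)
    end
    return out
end

# Same sizes as in the MWE
Nsite, dim, N = 100, 10, 50
phi0 = rand(ComplexF64, Nsite, dim, N)
P = rand(ComplexF64, dim, dim, N)

# Buffers are allocated once and can be reused across repeated calls
out = zeros(ComplexF64, Nsite, Nsite)
tmp = zeros(ComplexF64, Nsite, dim)
cphi = similar(tmp)

ne_prealloc!(out, tmp, cphi, phi0, P)
```

When benchmarking something like this, BenchmarkTools recommends interpolating global variables, e.g. `@btime ne_prealloc!($out, $tmp, $cphi, $phi0, $P)`, so that the timing is not affected by non-constant globals. I am not sure whether this is the best approach, so other suggestions are welcome.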