Large allocations using @Threads.threads in a loop lead to a slowdown

I have the following problem that I don’t know how to solve while using multithreading to create a matrix. I have a function that returns a single matrix element after operating on arrays that are large in size. I’m using @views to avoid copying the data.

function F(m, n, phi0, K_array, P_list)::ComplexF64
    return @views ... # do simple computation over phi0, K_array, P_list
end

Applying F once gives (using @btime from BenchmarkTools)

55.524 μs (700 allocations: 39.84 KiB)

Now I use a machine with 100 cores to construct a matrix of size Nsite × Nsite using @Threads.threads as follows:

function F_array(phi0, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F(m, n, phi0, K_array, P_list)
        end
    end

    return result
end

When I run F_array (with Nsite = 4000) I get

416.801 s (8381104705 allocations: 345.49 GiB)

which shows a huge number of allocations that make the computation very slow.

I have tried a simple function in place of F with similar run time and allocations; when I construct the matrix using that simple function, it takes roughly 500 ms, so I was expecting a similar time for my F_array function.
I suspect this has to do with the fact that the function F uses large arrays in its definition. Is there a way to fix this problem?
Any other insight will be greatly appreciated.

Since your F function returns ComplexF64, there is no gain in using @views:

result = zeros(ComplexF64, 10, 10)
@views result[1] === result[1] # true
@views result[1,1] === result[1,1] # true

Specifically, scalar getindex calls are simply left as-is by @views. (Later addition: after you shared the MWE in a later post, it became obvious that you also have slicing operations, so @views does make sense in that context.)

In your F, you have the "...do simple computation over phi0, K_array, P_list" comment, but simple doesn’t necessarily imply non-allocating.
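For instance (a generic illustration, since we haven’t seen the body of F yet), a one-line slicing expression silently allocates a copy unless it is wrapped in @views:

using BenchmarkTools

row_sum(x) = sum(x[1, :])              # x[1, :] allocates a copy of the row
row_sum_view(x) = @views sum(x[1, :])  # the view typically avoids that copy

x = rand(100, 100);
@btime row_sum($x)       # allocates
@btime row_sum_view($x)  # usually 0 allocations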

At this point, I think it is relevant to share more about what happens inside F.

Can you provide a minimal working example that we can run?


Thanks for your response. The computation inside the function F is close to the following:

function F(m, n, phi0, K_array, P_list)::ComplexF64
    return @views sum(K_array[q][m, n]*adjoint(phi0[q][n, :])*P_list[q]*phi0[q][m, :] for q in 1:N)
end

where N = 50. K_array is a length-N vector of matrices, each of size Nsite × Nsite; phi0 is a length-N vector of matrices, each of size Nsite × 10; and P_list is a length-N vector of matrices, each of size 10 × 10. When I run the function F I get 700 allocations, which I assume is not terrible?

It seems you are running that for 4000 × 4000 elements (so in itself F is not expensive, but when running it 16,000,000 times… it can get expensive overall).

At this point, I would need to initialize those inputs by hand… so I haven’t run your code yet.

One suggestion: make sure that the computation for each element is expensive enough to overcome the overhead associated with @threads usage.
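A quick way to see that overhead for yourself (a hypothetical sketch with a trivially cheap body, not your actual F):

using BenchmarkTools, Base.Threads

# identical, trivially cheap loop bodies, with and without @threads;
# the difference in timings is the task spawning/scheduling overhead
function serial_fill!(out)
    for i in eachindex(out)
        out[i] = sqrt(i)
    end
end

function threaded_fill!(out)
    @threads for i in eachindex(out)
        out[i] = sqrt(i)
    end
end

out = zeros(10_000);
@btime serial_fill!($out)
@btime threaded_fill!($out)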

I think we could provide more value if you could share an MWE (minimal working example), just to make sure we don’t make assumptions when writing the code by hand. Also, being able to just copy-paste-run is better than having to duplicate the work you already did.

Note that these matrices require a lot of memory, so I have sized things down to something that can run on a laptop. An MWE is the following (started with julia -t <number of threads>):

using LinearAlgebra, Base.Threads, BenchmarkTools
Nsite = 100
N = 50
dim = 10
K_array = [rand(ComplexF64, (Nsite, Nsite)) for i in 1:N]
phi0 = [rand(ComplexF64, (Nsite, dim)) for i in 1:N]
P_list = [rand(Float64, (dim, dim)) for i in 1:N]
function F(m, n, phi0, K_array, P_list, N)::ComplexF64
    return @views sum(K_array[q][m, n]*adjoint(phi0[q][n, :])*P_list[q]*phi0[q][m, :] for q in 1:N)
end
@btime F(2, 3, phi0, K_array, P_list, N) # a test of F's run time and allocations

#construct the array 
function F_array(phi0, K_array, P_list, Nsite, N)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F(m, n, phi0, K_array, P_list, N)
        end
    end

    return result
end

OK - first of all, please take a look at the Performance Tips section of the Julia manual.

In your code, you define a series of global variables without assigning types or using const. That can impair the performance of your code significantly.

Try to adjust your code to what seems obvious from the performance tips section - and we can take it from there afterward.

P.S. The allocations related to F seem to depend on N. For N = 50, I get 50 allocations for F (even if I set Nsite to 4000). Are you getting 700 allocations for F when running the exact code that you shared?
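One thing that can inflate those numbers is the benchmark itself: calling @btime on non-const globals also measures the dynamic-dispatch overhead they cause, which is why BenchmarkTools recommends interpolating them with $. A sketch using the variables from your MWE:

# without interpolation, the non-const globals add dispatch overhead
# (time and allocations) to the measurement itself
@btime F(2, 3, phi0, K_array, P_list, N)

# with $-interpolation, only the call to F is measured
@btime F(2, 3, $phi0, $K_array, $P_list, $N)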

Thanks a lot for your suggestions. I think I’m getting close to solving my problem, and I’m learning a lot as well. So here is a more complete MWE for my function F:

using LinearAlgebra, Base.Threads, BenchmarkTools

begin # Global variables used in the function
    const G = rand(Float64)
    const N = 50
    Ne = 5
    const dim = 2*Ne
    const Nsite = 100
    const phi0 = [rand(ComplexF64, (Nsite, dim)) for i in 1:N]
    const K_array = [rand(ComplexF64, (Nsite, Nsite)) for i in 1:N]
    const unitcell_rot = [rand(Float64, 2) for i in 1:Nsite]
end

begin
    function kminusq_modG(k::Int, q::Int, N::Int)::Int
        if k-q > N÷2-1
            return k-q-N÷2+1
        elseif k-q < -N÷2
            return k-q+N+N÷2+1
        else
            return k-q+N÷2+1
        end
    end

    function l(k::Int, q::Int, N::Int)::Float64
        if k-q > N÷2-1
            return 1.0
        elseif k-q < -N÷2
            return -1.0
        else
            return 0.0
        end
    end

    function P_list_const(Ne, N)
        P_list = zeros(Float64, 2*Ne, 2*Ne)
        for i in 1:Ne
            P_list[i, i] = 1.0
        end
        # note: all N entries of this vector alias the same matrix
        return [P_list for i in 1:N]
    end
    const P_init = P_list_const(Ne, N)
end

# The function F

function F(tau1::Int, tau2::Int, k::Int, G::Float64, unitcell_rot::Vector{Vector{Float64}}, phi0::Vector{Matrix{ComplexF64}}, N::Int, K_array::Vector{Matrix{ComplexF64}}, P_list::Vector{Matrix{Float64}})::ComplexF64
    return @views sum(K_array[kminusq_modG(k, q, N)][tau1, tau2]*exp(-im*l(k, q, N)*G*(unitcell_rot[tau1][2]-unitcell_rot[tau2][2]))*adjoint(phi0[q][tau2, :])*P_list[q]*phi0[q][tau1, :] for q in 1:N)
end

@btime F(60, 40, 12, G, unitcell_rot, phi0, 50, K_array, P_init) # example run of F

# Function to construct the array in parallel
function F_array(k::Int, G::Float64, unitcell_rot::Vector{Vector{Float64}}, phi0::Vector{Matrix{ComplexF64}}, N::Int, K_array::Vector{Matrix{ComplexF64}}, P_list::Vector{Matrix{Float64}}, Nsite::Int)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for n in 1:Nsite
        for m in n+1:Nsite
            result[m, n] = F(m, n, k, G, unitcell_rot, phi0, N, K_array, P_list)
        end
    end

    return result
end

@btime F_array(18, G, unitcell_rot, phi0, 50, K_array, P_init, Nsite) # an example run of F_array

Running F on my laptop gives the following benchmark:
32.500 μs (100 allocations: 21.88 KiB)

These allocations look like twice the number of terms in the sum. Could I do better?
I’m still getting a significant slowdown when I use multithreading to construct my array.

One thing to note is that you should probably also switch the order of the loops and then parallelize the loop that corresponds to column indices of the matrix (such that each task operates on a column of the output matrix).

Oh, I actually thought that in my code @Threads.threads parallelized the two loops together. So is it only parallelizing the outer loop?

  • Switching the order of the loops, so that the inner loop index is the row index, is generally better in Julia because arrays are stored in column-major order. This has nothing to do with multithreading (see the sketch below).
  • @threads only parallelizes one loop, the one that immediately follows the macro. While you could also parallelize the inner loop, I think you should avoid nested multithreading for now.
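For concreteness, a minimal sketch of the column-major point, independent of threading (generic fill functions, not your F):

using BenchmarkTools

# both functions write the same elements; only the access order differs
function fill_by_rows!(A)
    for m in axes(A, 1), n in axes(A, 2) # inner loop over n: strided writes
        A[m, n] = m + n
    end
end

function fill_by_cols!(A)
    for n in axes(A, 2), m in axes(A, 1) # inner loop over m: contiguous writes
        A[m, n] = m + n
    end
end

A = zeros(4000, 4000);
@btime fill_by_rows!($A)
@btime fill_by_cols!($A) # faster: Julia arrays are column-major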

Also note that your latest MWE is incomplete in the sense that I can’t copy-paste and run it. For example, it no longer contains the multithreaded variant or the @btime calls. It is much easier to help you if you provide copy-pasteable MWEs.

Sorry, I was providing the MWE for the function F by itself, since it seemed I could optimize a lot in it per @algunion's suggestions before the parallelization. I will include the multithreading part too. Thanks for pointing that out. (The MWE should now include the parallel part.)

I just want to point out that these const declarations are only relevant if the data is not passed as a parameter to the functions using it, which doesn’t appear to be the case here (you are passing everything as parameters, as far as I can see). Making everything constant is of course not very practical if you want to change the values.
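To illustrate the distinction (a generic sketch, not your actual code):

const A_const = rand(100)  # const global: its type is known to the compiler
B_global = rand(100)       # non-const global: could be rebound to any type

f_const() = sum(A_const)   # type-stable: reads a const global
f_global() = sum(B_global) # type-unstable: B_global's type is not fixed

f_arg(x) = sum(x)          # passing the data as an argument avoids the issue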


Just to clarify: when we say MWE, we usually refer to a snippet that we can copy-paste-run. A function definition alone does not (usually) fit these criteria.


If the costly part is the inner product you are computing there, note that the layout of the matrices is very important:

julia> x = rand(100,100); y = rand(100,100);

julia> f(x,y) = @views adjoint(x[1,:])*y[5,:] # What you are doing
f (generic function with 1 method)

julia> @btime f($x,$y)
  47.868 ns (0 allocations: 0 bytes)
23.175988601542105

julia> g(x,y) = @views adjoint(x[:,1])*y[:,5] # What you should be doing
g (generic function with 1 method)

julia> @btime g($x,$y)
  22.821 ns (0 allocations: 0 bytes)
29.39968033475811

You might try Polyester.jl. I’ve had very good luck with it, and it has been (at least for me) faster than @threads.
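For reference, a minimal sketch of what that could look like here, assuming Polyester’s @batch macro (its analogue of @threads) as a drop-in replacement in your F_array:

using Polyester # provides @batch

function F_array_batch(k, G, unitcell_rot, phi0, N, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    # @batch uses lightweight, non-allocating tasks, which often have
    # lower per-iteration overhead than Threads.@threads
    @batch for n in 1:Nsite
        for m in n+1:Nsite
            result[m, n] = F(m, n, k, G, unitcell_rot, phi0, N, K_array, P_list)
        end
    end
    return result
end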

On the off-chance that this isn’t quite a complete MWE, it’s perhaps worth noting that large allocations can happen in @threads loops as a result of many threads trying to reassign the same variable — not only would this cause allocations, but it can also cause all sorts of race conditions and contention, making performance abysmal (and likely the wrong answer, too).
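A minimal illustration of that pattern (hypothetical code, not taken from the MWE):

using Base.Threads

# `acc` is captured and shared by every thread: each iteration performs a
# read-modify-write on the same boxed variable, causing allocations and a
# data race (the result is usually wrong, too)
function bad_threaded_sum(xs)
    acc = 0.0
    @threads for i in eachindex(xs)
        acc += xs[i]
    end
    return acc
end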


Thanks everyone for your helpful replies. After a second look at my data structures and the kind of computations my functions are doing, I found that I can express all of my calculations as products of matrices, or sums over products of matrices. This approach seems much better than my earlier attempt (especially since parallelization didn’t help).
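For the F in the MWE above, such a matrix-product version could look like the following sketch. It uses the identity adjoint(phi0[q][tau2, :]) * P * phi0[q][tau1, :] == (phi0[q] * transpose(P) * adjoint(phi0[q]))[tau1, tau2] and writes the phase factor as a rank-one outer product (unlike F_array, it fills the full matrix rather than one triangle):

function F_array_matmul(k, G, unitcell_rot, phi0, N, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    for q in 1:N
        lq = l(k, q, N)
        # the phase factor over (tau1, tau2) is a rank-one outer product
        u = [exp(-im * lq * G * unitcell_rot[tau][2]) for tau in 1:Nsite]
        # one dense matrix product per q instead of Nsite^2 small vector products
        M = phi0[q] * transpose(P_list[q]) * adjoint(phi0[q])
        result .+= K_array[kminusq_modG(k, q, N)] .* (u * adjoint(u)) .* M
    end
    return result
end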

I do still have some questions about optimizing the vectorized version (memory allocations again!), but since they are not directly related to this topic, I assume I should ask them as a new question? I’m not sure what the usual practice is in this situation.
