Large allocations using @Threads.threads in a loop lead to a slowdown

I have the following problem that I don’t know how to solve while using multithreading to create a matrix. I have a function that returns a single matrix element after operating on arrays that are large in size. I’m using @views to avoid copying the data.

function F(m, n, phi0, K_array, P_list)::ComplexF64
    return @views ... # do simple computation over phi0, K_array, P_list
end

Applying F once gives (using @btime from BenchmarkTools)

55.524 μs (700 allocations: 39.84 KiB)

Now I use a machine with 100 cores to construct a matrix of size Nsite × Nsite using @Threads.threads as follows:

function F_array(phi0, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F(m, n, phi0, K_array, P_list)
        end
    end

    return result
end

When I run F_array (with Nsite = 4000) I get

416.801 s (8381104705 allocations: 345.49 GiB)

which shows a huge number of allocations that make the computation very slow.

I have tried a simple function in place of F with similar run time and allocations; when I construct the matrix using that simple function, it takes roughly 500 ms, so I was expecting a similar time for my F_array function.
I suspect this has to do with the fact that the function F uses large arrays in its definition. Is there a way to fix this problem?
Any other insight will be greatly appreciated.

Since your F function returns ComplexF64, there is no gain in using @views:

result = zeros(ComplexF64, 10, 10)
@views result[1] === result[1] # true
@views result[1,1] === result[1,1] # true

Specifically, scalar getindex calls are simply left as-is by @views. (Later addition: after you shared the MWE in a later post, it became obvious that you also have slicing operations, so @views does make sense in that context.)

In your F, you have the "...do simple computation over phi0, K_array, P_list" comment, but simple doesn’t necessarily imply non-allocating.
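For instance (a generic illustration, since we haven’t seen the body of F yet), a one-line slicing expression silently allocates a copy unless it is wrapped in @views:

using BenchmarkTools

row_sum(x) = sum(x[1, :])              # x[1, :] allocates a copy of the row
row_sum_view(x) = @views sum(x[1, :])  # the view typically avoids that copy

x = rand(100, 100);
@btime row_sum($x)       # allocates
@btime row_sum_view($x)  # usually 0 allocations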

At this point, I think it is relevant to share more about what happens inside F.

Can you provide a minimal working example that we can run?


Thanks for your response. The computation inside the function F is close to the following:

function F(m, n, phi0, K_array, P_list)::ComplexF64
    return @views sum(K_array[q][m, n]*adjoint(phi0[q][n, :])*P_list[q]*phi0[q][m, :] for q in 1:N)
end

where N = 50. K_array is a length-N vector of matrices, each of size Nsite × Nsite; phi0 is a length-N vector of matrices, each of size Nsite × 10; and P_list is a length-N vector of matrices, each of size 10 × 10. When I run the function F I get 700 allocations, which I assume is not terrible?

It seems you are running that for 4000 × 4000 elements (so in itself F is not expensive, but when running it 16,000,000 times… it can get expensive overall).

At this point, I would need to initialize those inputs by hand… so I haven’t run your code yet.

One suggestion: make sure that the computation for each element is expensive enough to overcome the overhead associated with @threads usage.
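A quick way to see that overhead for yourself (a hypothetical sketch with a trivially cheap body, not your actual F):

using BenchmarkTools, Base.Threads

# identical, trivially cheap loop bodies, with and without @threads;
# the difference in timings is the task spawning/scheduling overhead
function serial_fill!(out)
    for i in eachindex(out)
        out[i] = sqrt(i)
    end
end

function threaded_fill!(out)
    @threads for i in eachindex(out)
        out[i] = sqrt(i)
    end
end

out = zeros(10_000);
@btime serial_fill!($out)
@btime threaded_fill!($out)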

I think we could provide more value if you could share an MWE (minimal working example), just to make sure we don’t make assumptions when writing the code by hand. Also, being able to just copy-paste-run is better than having to duplicate the work you already did.

Note that these matrices require a lot of memory, so I have sized things down to something that can run on a laptop. An MWE is the following (started with julia -t <number of threads>):

using LinearAlgebra, Base.Threads, BenchmarkTools
Nsite = 100
N = 50
dim = 10
K_array = [rand(ComplexF64, (Nsite, Nsite)) for i in 1:N]
phi0 = [rand(ComplexF64, (Nsite, dim)) for i in 1:N]
P_list = [rand(Float64, (dim, dim)) for i in 1:N]
function F(m, n, phi0, K_array, P_list, N)::ComplexF64
    return @views sum(K_array[q][m, n]*adjoint(phi0[q][n, :])*P_list[q]*phi0[q][m, :] for q in 1:N)
end
@btime F(2, 3, phi0, K_array, P_list, N) # a test of F's run time and allocations

#construct the array 
function F_array(phi0, K_array, P_list, Nsite, N)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F(m, n, phi0, K_array, P_list, N)
        end
    end

    return result
end

OK - first of all, please take a look at the Performance Tips section of the Julia manual.

In your code, you define a series of global variables without assigning types or using const. That can impair the performance of your code significantly.

Try to adjust your code to what seems obvious from the performance tips section - and we can take it from there afterward.

P.S. The allocations related to F seem to depend on N. For N = 50, I get 50 allocations for F (even if I set Nsite to 4000). Are you getting 700 allocations for F when running the exact code that you shared?
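One thing that can inflate those numbers is the benchmark itself: calling @btime on non-const globals also measures the dynamic-dispatch overhead they cause, which is why BenchmarkTools recommends interpolating them with $. A sketch using the variables from your MWE:

# without interpolation, the non-const globals add dispatch overhead
# (time and allocations) to the measurement itself
@btime F(2, 3, phi0, K_array, P_list, N)

# with $-interpolation, only the call to F is measured
@btime F(2, 3, $phi0, $K_array, $P_list, $N)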

Thanks a lot for your suggestions. I think I’m getting close to solving my problem, and I’m learning a lot as well. So here is a more complete MWE for my function F:

using LinearAlgebra, Base.Threads, BenchmarkTools

begin # Global variables used in the function
    const G = rand(Float64)
    const N = 50
    Ne = 5
    const dim = 2*Ne
    const Nsite = 100
    const phi0 = [rand(ComplexF64, (Nsite, dim)) for i in 1:N]
    const K_array = [rand(ComplexF64, (Nsite, Nsite)) for i in 1:N]
    const unitcell_rot = [rand(Float64, 2) for i in 1:Nsite]
end

begin
    function kminusq_modG(k::Int, q::Int, N::Int)::Int
        if k-q > N÷2-1
            return k-q-N÷2+1
        elseif k-q < -N÷2
            return k-q+N+N÷2+1
        else
            return k-q+N÷2+1
        end
    end

    function l(k::Int, q::Int, N::Int)::Float64
        if k-q > N÷2-1
            return 1.0
        elseif k-q < -N÷2
            return -1.0
        else
            return 0.0
        end
    end

    function P_list_const(Ne, N)
        P_list = zeros(Float64, 2*Ne, 2*Ne)
        for i in 1:Ne
            P_list[i, i] = 1.0
        end
        # note: all N entries of this vector alias the same matrix
        return [P_list for i in 1:N]
    end
    const P_init = P_list_const(Ne, N)
end

# The function F

function F(tau1::Int, tau2::Int, k::Int, G::Float64, unitcell_rot::Vector{Vector{Float64}}, phi0::Vector{Matrix{ComplexF64}}, N::Int, K_array::Vector{Matrix{ComplexF64}}, P_list::Vector{Matrix{Float64}})::ComplexF64
    return @views sum(K_array[kminusq_modG(k, q, N)][tau1, tau2]*exp(-im*l(k, q, N)*G*(unitcell_rot[tau1][2]-unitcell_rot[tau2][2]))*adjoint(phi0[q][tau2, :])*P_list[q]*phi0[q][tau1, :] for q in 1:N)
end

@btime F(60, 40, 12, G, unitcell_rot, phi0, 50, K_array, P_init) # example run of F

# Function to construct the array in parallel
function F_array(k::Int, G::Float64, unitcell_rot::Vector{Vector{Float64}}, phi0::Vector{Matrix{ComplexF64}}, N::Int, K_array::Vector{Matrix{ComplexF64}}, P_list::Vector{Matrix{Float64}}, Nsite::Int)
    result = zeros(ComplexF64, Nsite, Nsite)

    @Threads.threads for n in 1:Nsite
        for m in n+1:Nsite
            result[m, n] = F(m, n, k, G, unitcell_rot, phi0, N, K_array, P_list)
        end
    end

    return result
end

@btime F_array(18, G, unitcell_rot, phi0, 50, K_array, P_init, Nsite) # an example run of F_array

Running F on my laptop gives the following benchmark:
32.500 μs (100 allocations: 21.88 KiB)

These allocations look like twice the number of terms in the sum. Could I do better?
I’m still getting a significant slowdown when I use multithreading to construct my array.

One thing to note is that you should probably also switch the order of the loops and then parallelize the loop that corresponds to column indices of the matrix (such that each task operates on a column of the output matrix).

Oh, I actually thought that in my code @Threads.threads parallelized the two loops together. So is it only parallelizing the outer loop?

  • Switching the order of the loops, so that the inner loop index is the row index, is generally better in Julia because arrays are stored in column-major order. This has nothing to do with multithreading (see the sketch below).
  • @threads only parallelizes one loop, the one that immediately follows the macro. While you could also parallelize the inner loop, I think you should avoid nested multithreading for now.
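For concreteness, a minimal sketch of the column-major point, independent of threading (generic fill functions, not your F):

using BenchmarkTools

# both functions write the same elements; only the access order differs
function fill_by_rows!(A)
    for m in axes(A, 1), n in axes(A, 2) # inner loop over n: strided writes
        A[m, n] = m + n
    end
end

function fill_by_cols!(A)
    for n in axes(A, 2), m in axes(A, 1) # inner loop over m: contiguous writes
        A[m, n] = m + n
    end
end

A = zeros(4000, 4000);
@btime fill_by_rows!($A)
@btime fill_by_cols!($A) # faster: Julia arrays are column-major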

Also note that your latest MWE is incomplete in the sense that I can’t copy-paste and run it. For example, it no longer contains the multithreaded variant or the @btime calls. It is much easier to help you if you provide copy-pasteable MWEs.

Sorry, I was providing the MWE for the function F by itself, since it seemed I could optimize a lot in it per @algunion's suggestions before the parallelization. I will include the multithreading part too. Thanks for pointing that out. (The MWE should now include the parallel part.)

I just want to point out that these const declarations are only relevant if the data is not passed as a parameter to the functions using it, which doesn’t appear to be the case here (you are passing everything as parameters, as far as I can see). Making everything constant is of course not very practical if you want to change the values.
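To illustrate the distinction (a generic sketch, not your actual code):

const A_const = rand(100)  # const global: its type is known to the compiler
B_global = rand(100)       # non-const global: could be rebound to any type

f_const() = sum(A_const)   # type-stable: reads a const global
f_global() = sum(B_global) # type-unstable: B_global's type is not fixed

f_arg(x) = sum(x)          # passing the data as an argument avoids the issue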


Just to clarify: when we say MWE, we usually refer to a snippet that we can copy-paste-run. A function definition alone does not (usually) fit these criteria.


If the costly part is the inner product you are computing there, note that the layout of the matrices is very important:

julia> x = rand(100,100); y = rand(100,100);

julia> f(x,y) = @views adjoint(x[1,:])*y[5,:] # What you are doing
f (generic function with 1 method)

julia> @btime f($x,$y)
  47.868 ns (0 allocations: 0 bytes)
23.175988601542105

julia> g(x,y) = @views adjoint(x[:,1])*y[:,5] # What you should be doing
g (generic function with 1 method)

julia> @btime g($x,$y)
  22.821 ns (0 allocations: 0 bytes)
29.39968033475811

You might try Polyester.jl. I’ve had very good luck with it, and it has been (at least for me) faster than @threads.
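For reference, a minimal sketch of what that could look like here, assuming Polyester’s @batch macro (its analogue of @threads) as a drop-in replacement in your F_array:

using Polyester # provides @batch

function F_array_batch(k, G, unitcell_rot, phi0, N, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    # @batch uses lightweight, non-allocating tasks, which often have
    # lower per-iteration overhead than Threads.@threads
    @batch for n in 1:Nsite
        for m in n+1:Nsite
            result[m, n] = F(m, n, k, G, unitcell_rot, phi0, N, K_array, P_list)
        end
    end
    return result
end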

On the off-chance that this isn’t quite a complete MWE, it’s perhaps worth noting that large allocations can happen in @threads loops as a result of many threads trying to reassign the same variable — not only would this cause allocations, but it can also cause all sorts of race conditions and contention, making performance abysmal (and likely the wrong answer, too).
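A minimal illustration of that pattern (hypothetical code, not taken from the MWE):

using Base.Threads

# `acc` is captured and shared by every thread: each iteration performs a
# read-modify-write on the same boxed variable, causing allocations and a
# data race (the result is usually wrong, too)
function bad_threaded_sum(xs)
    acc = 0.0
    @threads for i in eachindex(xs)
        acc += xs[i]
    end
    return acc
end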


Thanks everyone for your helpful replies. After a second look at my data structures and the kind of computations my functions are doing, I found that I can express all of my calculations as products of matrices, or sums over products of matrices. This approach seems much better than my earlier attempt (especially since parallelization didn’t help).
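For the F in the MWE above, such a matrix-product version could look like the following sketch. It uses the identity adjoint(phi0[q][tau2, :]) * P * phi0[q][tau1, :] == (phi0[q] * transpose(P) * adjoint(phi0[q]))[tau1, tau2] and writes the phase factor as a rank-one outer product (unlike F_array, it fills the full matrix rather than one triangle):

function F_array_matmul(k, G, unitcell_rot, phi0, N, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    for q in 1:N
        lq = l(k, q, N)
        # the phase factor over (tau1, tau2) is a rank-one outer product
        u = [exp(-im * lq * G * unitcell_rot[tau][2]) for tau in 1:Nsite]
        # one dense matrix product per q instead of Nsite^2 small vector products
        M = phi0[q] * transpose(P_list[q]) * adjoint(phi0[q])
        result .+= K_array[kminusq_modG(k, q, N)] .* (u * adjoint(u)) .* M
    end
    return result
end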

I do still have some questions about optimizing the vectorized version (memory allocations again!), but since they are not directly related to this topic, I assume I should ask them as a new question? I’m not sure what the usual practice is in this situation.
