Multithreading doesn't improve performance

Hello, I’m trying to speed up my for-loop with multithreading, but there’s no increase in computing speed. Here’s the structure of my code.

B = [[[zeros(ComplexF64,1,6) for a in 1:3] for _ in 1:Nw] for z in 1:Threads.nthreads()]
#I create one array per thread to store that thread's results
chunked = chunks(k1_range, n=Threads.nthreads())
    @time @threads for ich in 1:Threads.nthreads()
        calC1mnkw1_temp_in = [zeros(ComplexF64,6,6) for _ in 1:Nw]
        Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]
        x = 0
        for k1 in chunked[ich]
            println(k1)
            for k2 in k2_range
                x = x + 1
                Tk = eigenmatrix_array[x]
                eigenValueMatrix = eigenmatrix_value_array[x]
                eigenmatrix = eigenmatrix_array[x]
                eigValDiff, omega = omega_freq(eigenValueMatrix,omega,eigValDiff)
                ......#Here are some functions related to "k1" and "k2"
                for i in 1:Nw
w1 = ((ws+i*dw)*1.602176634e-19)/hbar
                    w2 = w1    
                    div1, div2 = div_sion(w1,w2,eigValDiff,div1, div2)
                    ......#Here are some functions related to "i", "k1" and "k2".
                    for d in 1:3
                        for a in 1:3
                            A[d,a,i] = A[d,a,i] + tr(tempA) #Here I sum my results
                            for b in 1:3
                                ......
                            end
                            for n in 1:6
                                ......
                                for m in 1:6
                                    ......
                                end
                            end
                        end
                    end
                end
            end
        end
        B[ich] = B   # The results of each thread are stored here
        ......
    end
 

I have already tried to reduce memory allocation; there are only 3 allocations per k2-loop iteration. I wonder if there are other factors affecting the efficiency of multithreading.
Sorry if I didn’t show the code clearly enough. It’s hard to show everything because there’s too much code in my program.

The number one thing to do is to put your code in a function.
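
For example, roughly like this (just a sketch with placeholder names, assuming chunks comes from ChunkSplitters.jl):

using ChunkSplitters

function run_chunks!(B, k1_range, k2_range, Nw, eigenmatrix_array, eigenmatrix_value_array)
    chunked = chunks(k1_range, n=length(B))   # one chunk per slot of B
    Threads.@threads for ich in 1:length(chunked)
        for k1 in chunked[ich]
            # ... the work for this chunk, writing into B[ich] ...
        end
    end
    return B
end

Passing everything the loop uses as arguments also ensures you are not reading non-constant globals, which would slow things down considerably.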

Also, your workload seems to scale with the number of available threads:

for ich in 1:Threads.nthreads()

as in, the more threads you have, the more work you have to do. Perhaps this is not representative of what your code is actually doing, but it makes it very difficult to interpret what is going on.

There’s also printing, which should probably be removed, and then you allocate memory inside each iteration, which is suboptimal:

calC1mnkw1_temp_in = [zeros(ComplexF64,6,6) for _ in 1:Nw]
Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]
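
If those allocations happen more than once, one option is to create one set of buffers per chunk up front, outside the threaded loop, and just index into them (a sketch; the buffer-collection names are made up):

nchunks = Threads.nthreads()
temp_buffers = [[zeros(ComplexF64, 6, 6) for _ in 1:Nw] for _ in 1:nchunks]
Q1A_buffers  = [[zeros(ComplexF64, 6, 6) for _ in 1:3]  for _ in 1:nchunks]
Threads.@threads for ich in 1:nchunks
    calC1mnkw1_temp_in = temp_buffers[ich]   # reuse this chunk's buffers
    Q1A = Q1A_buffers[ich]
    # ... the rest of the chunk's work ...
end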

Can you provide a self-contained, runnable example (an “MWE”) so that people can try to run the code for themselves?


I am using “chunks” to divide my tasks by thread count. The k1-loop is divided according to the number of threads, and the tasks are distributed over the threads by the ich-loop with @threads.
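
For example, the chunking itself looks like this:

chunked = chunks(k1_range, n=Threads.nthreads())
length(chunked)    # == Threads.nthreads(), one chunk per thread
chunked[1]         # the subset of the k1-loop handled by the first task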

Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]

This definition creates private variables for each thread. It is executed only once per thread, which helps reduce memory allocation because I keep reusing these buffers inside my loops.
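
Inside the loops the reuse looks roughly like this (a simplified sketch, not the real inner code):

for k1 in chunked[ich]
    for k2 in k2_range
        for d in 1:3
            fill!(Q1A[d], 0)   # reset the preallocated 6x6 buffer instead of allocating a new one
            # ... accumulate this iteration's result into Q1A[d] in place ...
        end
    end
end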

I see. But do you think you can either fill in the necessary code, or pare it down to the point where it is runnable? It is much harder to help without some code to run.

BTW, did you make sure to put your code in a function, and pass all variables as parameters to that function? That is part of what a self-contained example should do as well.

I will try to make an executable example, but it’s hard to isolate that part from the whole program, so I can’t guarantee it.
And yes, the loop is executed in a function, and the variables are passed as parameters. But I will check again to see if I have missed anything. Thanks.

It’s good to show this in your post, otherwise that will be the main thing other posters focus on.

The best thing you could do is to create a smaller example that still demonstrates your performance issue. Also, remember to provide some way to generate dummy input data, otherwise we can’t run the function.
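
For example, something along these lines would already help (the names and sizes are guesses based on your snippet):

function make_dummy_inputs(; Nw=10, nk1=20, nk2=20)
    k1_range = 1:nk1
    k2_range = 1:nk2
    eigenmatrix_array       = [rand(ComplexF64, 6, 6) for _ in 1:nk1*nk2]
    eigenmatrix_value_array = [rand(ComplexF64, 6, 6) for _ in 1:nk1*nk2]
    return k1_range, k2_range, eigenmatrix_array, eigenmatrix_value_array, Nw
end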

Until you post an MWE, here are some general pointers:

i) Unless it matters that your arrays are initialized to zero, in my experience it’s typically faster to use the constructor with the undef argument.

B = Array{ComplexF64}(undef, 6, 3, Nw, Threads.nthreads())

Do note that this will be indexed as B[i, j, k, l].

ii) As per the style guide, I would suggest you rename chunks to chunks! and pass B directly as an argument for it to mutate.
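
Something like this (a sketch, combining it with the array layout from point i):

function chunks!(B, k1_range, k2_range, Nw)   # the ! signals that B is mutated
    chunked = chunks(k1_range, n=size(B, 4))
    Threads.@threads for ich in 1:size(B, 4)
        Bich = @view B[:, :, :, ich]   # each task fills only its own slice of B
        fill!(Bich, 0)                 # needed because B was created with undef
        # ... loop over chunked[ich] and k2_range, accumulating into Bich ...
    end
    return B
end

B = Array{ComplexF64}(undef, 6, 3, Nw, Threads.nthreads())
chunks!(B, k1_range, k2_range, Nw)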

iii) Make use of the @code_warntype and @profile macros, as well as the VS Code-specific @profview and @profview_allocs. Also, try benchmarking the functions that get called inside your chunks function with BenchmarkTools.jl.
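
For example (a sketch; inner_step! stands in for whatever your inner loop actually calls):

using BenchmarkTools, Profile

@code_warntype chunks!(B, k1_range, k2_range, Nw)   # look for Any/Union types in the output
@btime inner_step!($Q1A, $Tk, $eigenValueMatrix)    # time one inner call with interpolated arguments
@profile chunks!(B, k1_range, k2_range, Nw)         # then Profile.print(), or @profview in VS Code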

iv) It’s unclear from the snippet you posted, but if you keep rebinding names such as eigenmatrix to freshly allocated results, instead of assigning values into preallocated memory, you might trigger the garbage collector.
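
For instance, inside the hot loop there is a big difference between these two (the matrix product is only an illustration):

using LinearAlgebra

tempA = Q1A[d] * Tk       # rebinds tempA to a freshly allocated 6x6 result every iteration
mul!(tempA, Q1A[d], Tk)   # writes the product into the existing tempA buffer, no allocation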

v) Sometimes this slips by, but did you check that your Julia session actually starts with multiple threads? You can check by executing either versioninfo() or Threads.nthreads() in the REPL.
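
For example, start Julia with extra threads and check in the REPL:

$ julia --threads=8   # or: julia -t 8, or set the JULIA_NUM_THREADS environment variable

julia> Threads.nthreads()
8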