Multithreading doesn't improve performance

Hello, I’m trying to speed up my for-loop with multithreading, but there’s no increase in computing speed. Here’s the structure of my code.

B = [[[zeros(ComplexF64,1,6) for a in 1:3] for _ in 1:Nw] for z in 1:Threads.nthreads()]
#I create one array per thread to store that thread's results
chunked = chunks(k1_range, n=Threads.nthreads())
    @time @threads for ich in 1:Threads.nthreads()
        calC1mnkw1_temp_in = [zeros(ComplexF64,6,6) for _ in 1:Nw]
        Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]
        x = 0
        for k1 in chunked[ich]
            println(k1)
            for k2 in k2_range
                x = x + 1
                Tk = eigenmatrix_array[x]
                eigenValueMatrix = eigenmatrix_value_array[x]
                eigenmatrix = eigenmatrix_array[x]
                eigValDiff, omega = omega_freq(eigenValueMatrix,omega,eigValDiff)
                ......#Here are some functions related to "k1" and "k2"
                for i in 1:Nw
w1 = ((ws+i*dw)*1.602176634e-19)/hbar
                    w2 = w1    
                    div1, div2 = div_sion(w1,w2,eigValDiff,div1, div2)
                    ......#Here are some functions related to "i", "k1" and "k2".
                    for d in 1:3
                        for a in 1:3
                            A[d,a,i] = A[d,a,i] + tr(tempA) #Here I sum my results
                            for b in 1:3
                                ......
                            end
                            for n in 1:6
                                ......
                                for m in 1:6
                                    ......
                                end
                            end
                        end
                    end
                end
            end
        end
        B[ich] = B   # The results of each thread are stored here
        ......
    end
 

I have already tried to reduce memory allocation; there are only 3 allocations per k2-loop iteration. I wonder if there are other factors affecting the efficiency of multithreading.
Sorry if I didn’t show the code clearly enough. It’s hard to show everything because there’s too much code in my program.

The number one thing to do is to put your code in a function.
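
For example, roughly like this (just a sketch with placeholder names, assuming chunks comes from ChunkSplitters.jl):

using ChunkSplitters

function run_chunks!(B, k1_range, k2_range, Nw, eigenmatrix_array, eigenmatrix_value_array)
    chunked = chunks(k1_range, n=length(B))   # one chunk per slot of B
    Threads.@threads for ich in 1:length(chunked)
        for k1 in chunked[ich]
            # ... the work for this chunk, writing into B[ich] ...
        end
    end
    return B
end

Passing everything the loop uses as arguments also ensures you are not reading non-constant globals, which would slow things down considerably.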

Also, your workload seems to scale with the number of available threads:

for ich in 1:Threads.nthreads()

as in, the more threads you have, the more work you have to do. Perhaps this is not representative of what your code is actually doing, but it makes it very difficult to interpret what is going on.

There’s also printing, which should probably be removed, and then you allocate memory inside each iteration, which is suboptimal:

calC1mnkw1_temp_in = [zeros(ComplexF64,6,6) for _ in 1:Nw]
Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]
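
If those allocations happen more than once, one option is to create one set of buffers per chunk up front, outside the threaded loop, and just index into them (a sketch; the buffer-collection names are made up):

nchunks = Threads.nthreads()
temp_buffers = [[zeros(ComplexF64, 6, 6) for _ in 1:Nw] for _ in 1:nchunks]
Q1A_buffers  = [[zeros(ComplexF64, 6, 6) for _ in 1:3]  for _ in 1:nchunks]
Threads.@threads for ich in 1:nchunks
    calC1mnkw1_temp_in = temp_buffers[ich]   # reuse this chunk's buffers
    Q1A = Q1A_buffers[ich]
    # ... the rest of the chunk's work ...
end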

Can you provide a self-contained, runnable example (an “MWE”) so that people can try to run the code for themselves?


I am using “chunks” to divide my tasks by thread count. The k1-loop is divided according to the number of threads, and the tasks are distributed over the threads by the ich-loop with @threads.
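
For example, the chunking itself looks like this:

chunked = chunks(k1_range, n=Threads.nthreads())
length(chunked)    # == Threads.nthreads(), one chunk per thread
chunked[1]         # the subset of the k1-loop handled by the first task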

Q1A = [zeros(ComplexF64,6,6) for _ in 1:3]

This definition creates private variables for each thread. It is executed only once per thread, which helps reduce memory allocation because I keep reusing these buffers inside my loops.
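
Inside the loops the reuse looks roughly like this (a simplified sketch, not the real inner code):

for k1 in chunked[ich]
    for k2 in k2_range
        for d in 1:3
            fill!(Q1A[d], 0)   # reset the preallocated 6x6 buffer instead of allocating a new one
            # ... accumulate this iteration's result into Q1A[d] in place ...
        end
    end
end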

I see. But do you think you can either fill in the necessary code, or pare it down to the point where it is runnable? It is much harder to help without some code to run.

BTW, did you make sure to put your code in a function, and pass all variables as parameters to that function? That is part of what a self-contained example should do as well.

I will try to make an executable example, but it’s hard to isolate that part from the whole program, so I can’t guarantee it.
And yes, the loop is executed in a function, and the variables are passed as parameters. But I will check again to see if I have missed anything. Thanks.

It’s good to show this in your post, otherwise that will be the main thing other posters focus on.

The best thing you could do is to create a smaller example that still demonstrates your performance issue. Also, remember to provide some way to generate dummy input data, otherwise we can’t run the function.
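
For example, something along these lines would already help (the names and sizes are guesses based on your snippet):

function make_dummy_inputs(; Nw=10, nk1=20, nk2=20)
    k1_range = 1:nk1
    k2_range = 1:nk2
    eigenmatrix_array       = [rand(ComplexF64, 6, 6) for _ in 1:nk1*nk2]
    eigenmatrix_value_array = [rand(ComplexF64, 6, 6) for _ in 1:nk1*nk2]
    return k1_range, k2_range, eigenmatrix_array, eigenmatrix_value_array, Nw
end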

Until you post an MWE, here are some general pointers:

i) Unless it matters that your arrays are initialized to zero, in my experience it’s typically faster to use the constructor with the undef argument.

B = Array{ComplexF64}(undef, 6, 3, Nw, Threads.nthreads())

Do note that this will be indexed as B[i, j, k, l].

ii) As per the style guide, I would suggest you rename chunks to chunks! and pass B directly as an argument for it to mutate.
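
Something like this (a sketch, combining it with the array layout from point i):

function chunks!(B, k1_range, k2_range, Nw)   # the ! signals that B is mutated
    chunked = chunks(k1_range, n=size(B, 4))
    Threads.@threads for ich in 1:size(B, 4)
        Bich = @view B[:, :, :, ich]   # each task fills only its own slice of B
        fill!(Bich, 0)                 # needed because B was created with undef
        # ... loop over chunked[ich] and k2_range, accumulating into Bich ...
    end
    return B
end

B = Array{ComplexF64}(undef, 6, 3, Nw, Threads.nthreads())
chunks!(B, k1_range, k2_range, Nw)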

iii) Make use of the @code_warntype and @profile macros, as well as the VS Code-specific @profview and @profview_allocs. Also, try benchmarking the functions that get called inside your chunks function with BenchmarkTools.jl.
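
For example (a sketch; inner_step! stands in for whatever your inner loop actually calls):

using BenchmarkTools, Profile

@code_warntype chunks!(B, k1_range, k2_range, Nw)   # look for Any/Union types in the output
@btime inner_step!($Q1A, $Tk, $eigenValueMatrix)    # time one inner call with interpolated arguments
@profile chunks!(B, k1_range, k2_range, Nw)         # then Profile.print(), or @profview in VS Code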

iv) It’s unclear from the snippet you posted, but if you keep rebinding names such as eigenmatrix to freshly allocated results, instead of assigning values into preallocated memory, you might trigger the garbage collector.
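
For instance, inside the hot loop there is a big difference between these two (the matrix product is only an illustration):

using LinearAlgebra

tempA = Q1A[d] * Tk       # rebinds tempA to a freshly allocated 6x6 result every iteration
mul!(tempA, Q1A[d], Tk)   # writes the product into the existing tempA buffer, no allocation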

v) Sometimes this slips by, but did you check that your Julia session actually starts with multiple threads? You can check by executing either versioninfo() or Threads.nthreads() in the REPL.
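
For example, start Julia with extra threads and check in the REPL:

$ julia --threads=8   # or: julia -t 8, or set the JULIA_NUM_THREADS environment variable

julia> Threads.nthreads()
8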