Hello All.

I have a program where I basically solve a system of ODEs at N time points using the Euler method and add the solution to an array. N is ~1000.

Then I wrap this in another loop which runs the inner loop M times, where M is large, typically ~1000000. Each solution is superimposed on the previous one, and at the end I divide the array by M to obtain the average solution.

The code:

```julia
function runthis()
    N = 1000
    M = 500000

    dt = 0.01
    a = 1

    data = zeros(N, 1)

    x = 0.0
    xnew = 0.0
    v = 0.0
    vnew = 0.0

    Threads.@threads for i = 1:M  # outer loop: accumulate M realizations
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end

    f = open("output.txt", "w")
    for i = 1:N
        println(f, round(i * dt, digits=2), " ", data[i])
    end
    close(f)
end
@time runthis()
```

The execution times are:
2.481255 seconds (6.01 k allocations: 883.625 KiB)
110.841950 seconds (7.24 G allocations: 107.955 GiB, 17.62% gc time, 0.06% compilation time)

Also, to be noted: if I define `noise=0.0` before the outer for loop, the execution slows down further, by about a hundred times!

Can anyone explain why my multithreaded program runs so slowly, and how to fix it? Also, why does the scope of `noise` affect the runtime so much?

Any help will be very appreciated!

This looks like a race condition, and it also means all the threads are contending for the same memory. If I were you, I would use a separate buffer for each thread.

But doesn’t a race condition mean that I’ll be reading a variable accessed by other threads, which causes ambiguity because nobody knows which thread affected it?

Here I don’t care what `data[j]` was; I just want to add something to it.

Also, how do I create separate buffers? Manually create 4 arrays (I run `julia -p 4`) and then wrap everything in yet another loop for creating threads?

I think the actual issue is that you define `x`, `xnew`, `v`, and `vnew` before the threaded loop but then reassign those variables in every iteration. Eliminating those outer definitions seems to solve the problem:

```julia
function runthis()
    N = 1000
    M = 500000
    dt = 0.01
    a = 1
    data = zeros(N, 1)

    Threads.@threads for i = 1:M
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end
    data
end

@time runthis()
# 0.749008 seconds (48.86 k allocations: 2.591 MiB, 5.07% compilation time)
```

Interesting. Removing those four variable definitions from the beginning gets rid of the huge number of allocations, which is probably what was causing the slowdown.
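To see that mechanism directly: a variable that is defined before the `@threads` loop but reassigned inside its body gets captured by the closure `@threads` builds, so Julia wraps it in a `Core.Box` and every access allocates. A minimal toy sketch of my own (not the original code; `boxed`/`unboxed` are made-up names) that shows the difference with `@allocated`:

```julia
using Base.Threads

# `s` is defined outside the @threads body and reassigned inside it,
# so the closure captures it as a Core.Box -> an allocation per access.
# (The update is also racy; this function exists only to count allocations.)
function boxed()
    s = 0.0
    @threads for i in 1:100_000
        s = s + 1.0
    end
    s
end

# Same work, but the accumulator is a plain local of the loop body,
# so nothing is boxed and only the @threads task setup allocates.
function unboxed()
    @threads for i in 1:100_000
        s = 0.0
        s = s + 1.0
    end
end

boxed(); unboxed()              # warm up so compilation isn't counted
@show @allocated boxed()        # large: roughly one box write per iteration
@show @allocated unboxed()      # small: just the per-call task overhead
```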

Imagine what happens if two threads execute `data[j] += something` at the same time for the same `j`. Realize that this line corresponds (crudely) to three steps:

1. read `data[j]`
2. add it to `something`
3. store the result back into `data[j]`

Suppose that two threads are doing this at once; since the update here is not “atomic”, these three steps can be interleaved in any possible way. For example, suppose that thread 1 executes steps 1–2, then thread 2 executes steps 1–2, then thread 1 executes step 3, then thread 2 executes step 3. The result is that the update from thread 1 is overwritten by thread 2 and lost.

If you google “race condition” you’ll find lots more information on this difficulty, and ways to avoid it.

```julia
using Base.Threads

function runthis()
    N = 1000
    M = 500000

    dt = 0.01
    a = 1

    data = zeros(N, nthreads())  # one column per thread

    @threads for i = 1:M
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j, threadid()] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end

    data_sum = sum(data, dims=2)

    f = open("output.txt", "w")
    for i = 1:N
        println(f, round(i * dt, digits=2), " ", data_sum[i])
    end
    close(f)
end

@time runthis()
@time runthis()
```

The above code is an attempt to remove the race condition. It seems to work well and tests slightly faster for me.
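One caveat on the code above: since Julia 1.8, tasks spawned by `@threads` may in principle migrate between threads (unless the `:static` schedule is used), so indexing a buffer by `threadid()` is discouraged in newer versions. A sketch of the alternative, splitting the `M` realizations into explicit chunks that each own a column (`runthis_chunked` is my own name, with the same `N`, `M`, `dt`, `a` as above):

```julia
using Base.Threads

# Sketch: give each chunk of the M realizations its own accumulator
# column, indexed by the chunk number rather than by threadid().
function runthis_chunked(; N=1000, M=500_000, dt=0.01, a=1)
    nchunks = nthreads()
    data = zeros(N, nchunks)
    @threads for c in 1:nchunks
        # cld(M, nchunks) may run a few extra realizations when
        # nchunks does not divide M; fine for a rough average.
        for i in 1:cld(M, nchunks)
            x = 0.0
            v = 0.0
            for j in 1:N
                noise = randn() / sqrt(dt)
                xnew = x + dt * v
                vnew = v + dt * (-a * x + noise)
                data[j, c] += (x * x) / M
                x = xnew
                v = vnew
            end
        end
    end
    vec(sum(data, dims=2))  # combine the per-chunk columns
end
```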


This definitely worked! I created another loop wrapping these two, `for P=1:4`, gave one iteration to each thread, and then set `data[j,P] += (x*x)/M`. Yours looks prettier. @mihalybaci

All this definitely speeds up my work considerably and I can get much better results.
Thanks very much! Thanks to everyone! I love Julia!


Glad it works for you!