Hello All.

I have a program where I basically solve a system of ODEs at N time points using the Euler method and add the solution to an array. N is ~1000.

Then I wrap this in another loop which runs the inner loop M times, where M is large, typically ~1000000. Each solution is superimposed on the previous one, and at the end I divide the array by M to obtain the average solution.

The code:

```julia
function runthis()
    N = 1000
    M = 500000

    dt = 0.01
    a = 1

    data = zeros(N, 1)

    x = 0.0
    xnew = 0.0
    v = 0.0
    vnew = 0.0

    Threads.@threads for i = 1:M  # outer loop: accumulate M realizations
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end

    f = open("output.txt", "w")
    for i = 1:N
        println(f, round(i * dt, digits=2), " ", data[i])
    end
    close(f)
end
@time runthis()
```

The execution times are:
2.481255 seconds (6.01 k allocations: 883.625 KiB)
110.841950 seconds (7.24 G allocations: 107.955 GiB, 17.62% gc time, 0.06% compilation time)

Also, to be noted: if I define `noise=0.0` before the outer for loop, the execution slows down further, by about a hundred times!

Can anyone explain why my multithreaded program runs so slowly, and how to fix it? Also, why does the scope of `noise` affect the runtime so much?

Any help will be very appreciated!

This looks like a race condition, and it also means all the threads are contending for the same memory. If I were you, I would use a separate buffer for each thread.

But doesn’t a race condition mean that I’ll be reading a variable accessed by other threads, which causes ambiguity because nobody knows which thread affected it?

Here I don’t care what `data[j]` was; I just want to add something to it.

Also, how do I create separate buffers? Manually create 4 arrays (I run `julia -p 4`) and then wrap everything in yet another loop for creating threads?

I think the actual issue is that you define `x`, `xnew`, `v`, and `vnew` before the threaded loop but then reassign those variables in every iteration. Eliminating those outer definitions seems to solve the problem:

```julia
function runthis()
    N = 1000
    M = 500000
    dt = 0.01
    a = 1
    data = zeros(N, 1)

    Threads.@threads for i = 1:M
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end
    data
end

@time runthis()
# 0.749008 seconds (48.86 k allocations: 2.591 MiB, 5.07% compilation time)
```

Interesting. Removing those four variable definitions from the beginning gets rid of the huge number of allocations, which is probably what was causing the slowdown.
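To see that mechanism directly: a variable that is defined before the `@threads` loop but reassigned inside its body gets captured by the closure `@threads` builds, so Julia wraps it in a `Core.Box` and every access allocates. A minimal toy sketch of my own (not the original code; `boxed`/`unboxed` are made-up names) that shows the difference with `@allocated`:

```julia
using Base.Threads

# `s` is defined outside the @threads body and reassigned inside it,
# so the closure captures it as a Core.Box -> an allocation per access.
# (The update is also racy; this function exists only to count allocations.)
function boxed()
    s = 0.0
    @threads for i in 1:100_000
        s = s + 1.0
    end
    s
end

# Same work, but the accumulator is a plain local of the loop body,
# so nothing is boxed and only the @threads task setup allocates.
function unboxed()
    @threads for i in 1:100_000
        s = 0.0
        s = s + 1.0
    end
end

boxed(); unboxed()              # warm up so compilation isn't counted
@show @allocated boxed()        # large: roughly one box write per iteration
@show @allocated unboxed()      # small: just the per-call task overhead
```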

Imagine what happens if two threads execute `data[j] += something` at the same time for the same `j`. Realize that this line corresponds (crudely) to three steps:

1. read `data[j]`
2. add it to `something`
3. store the result back into `data[j]`

Suppose that two threads are doing this at once; since the update here is not “atomic”, these three steps can be interleaved in any possible way. For example, suppose that thread 1 executes steps 1–2, then thread 2 executes steps 1–2, then thread 1 executes step 3, then thread 2 executes step 3. The result is that the update from thread 1 is overwritten by thread 2 and lost.

If you google “race condition” you’ll find lots more information on this difficulty, and ways to avoid it.

```julia
using Base.Threads

function runthis()
    N = 1000
    M = 500000

    dt = 0.01
    a = 1

    data = zeros(N, nthreads())  # one column per thread

    @threads for i = 1:M
        x = 0.0
        v = 0.0

        for j = 1:N
            noise = randn() / sqrt(dt)

            xnew = x + dt * v
            vnew = v + dt * (-a * x + noise)

            data[j, threadid()] += (x * x) / M  # to find <x(t)^2>
            x = xnew
            v = vnew
        end
    end

    data_sum = sum(data, dims=2)

    f = open("output.txt", "w")
    for i = 1:N
        println(f, round(i * dt, digits=2), " ", data_sum[i])
    end
    close(f)
end

@time runthis()
@time runthis()
```

The above code is an attempt to remove the race condition. It seems to work well and tests slightly faster for me.
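One caveat on the code above: since Julia 1.8, tasks spawned by `@threads` may in principle migrate between threads (unless the `:static` schedule is used), so indexing a buffer by `threadid()` is discouraged in newer versions. A sketch of the alternative, splitting the `M` realizations into explicit chunks that each own a column (`runthis_chunked` is my own name, with the same `N`, `M`, `dt`, `a` as above):

```julia
using Base.Threads

# Sketch: give each chunk of the M realizations its own accumulator
# column, indexed by the chunk number rather than by threadid().
function runthis_chunked(; N=1000, M=500_000, dt=0.01, a=1)
    nchunks = nthreads()
    data = zeros(N, nchunks)
    @threads for c in 1:nchunks
        # cld(M, nchunks) may run a few extra realizations when
        # nchunks does not divide M; fine for a rough average.
        for i in 1:cld(M, nchunks)
            x = 0.0
            v = 0.0
            for j in 1:N
                noise = randn() / sqrt(dt)
                xnew = x + dt * v
                vnew = v + dt * (-a * x + noise)
                data[j, c] += (x * x) / M
                x = xnew
                v = vnew
            end
        end
    end
    vec(sum(data, dims=2))  # combine the per-chunk columns
end
```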


This definitely worked! I created another loop wrapping these two, `for P=1:4`, gave one iteration to each thread, and then set `data[j,P] += (x*x)/M`. Yours looks prettier. @mihalybaci

All this definitely speeds up my work considerably and I can get much better results.
Thanks very much! Thanks to everyone! I love Julia!


Glad it works for you!