Julia Threads.@threads slower than single thread performance

cricci · September 18, 2020, 5:04am

I am trying to solve numerically the heat equation in 1d:

I am using finite differences and I have some trouble using the @threads instruction in Julia. In particular below there are two version of the same code: the first one is single thread while the other uses @threads (they are identical apart from the @thread instruction)

function heatSecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)

    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        for i=2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

function heatParLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)
    
    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        Threads.@threads for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

The issue is that the sequential one is faster than the multi-thread one. Here are the timing (after one run to compile)

julia> Threads.nthreads()
2

julia> @time heatParLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  5.417182 seconds (12.14 M allocations: 9.125 GiB, 6.59% gc time)

julia> @time heatSecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  3.892801 seconds (1.00 M allocations: 7.629 GiB, 8.06% gc time)

Of course the heat equation is just an example for a more intricate problem. I also tried using other libraries such as SharedArrays together with Distributed but with worse results.

Any help is appreciated.

rveltz · September 18, 2020, 6:07am

1000 unknowns is probably to small. Your problem is not limited by computation but by the speed of access to memory. Hence, threads will not help here. You can try perhaps @avx from LoopVectorization.

rveltz · September 18, 2020, 6:24am

you should remove the dot, although it should have no impact

cricci · September 18, 2020, 6:43am

Thanks! will do

lungben · September 18, 2020, 6:56am

In general, spawning threads takes some time, therefore it is usually better to write @threads on your outermost loop. This way, you save thread spawning time, but still all your CPU cores will be occupied.

cricci · September 18, 2020, 6:58am

@rveltz
I also tried \Delta t = 1e-8 and \Delta x = 1e-4 with similar results for what concern @threads

julia> @time heatParLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
 31.064373 seconds (11.19 M allocations: 1.496 GiB, 0.72% gc time)

julia> @time heatSecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
 21.752236 seconds (109 allocations: 162.859 KiB)

However the package LoopVectorization and @avx it’s awesome

using LoopVectorization
function heatVecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)
    
    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        @avx for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end
julia> @time heatVecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  5.577220 seconds (109 allocations: 162.859 KiB)

Thanks!

lungben · September 18, 2020, 6:59am

As far as I can see, this copy is not required. It causes a large amount of allocations and probably quite some execution time.

cricci · September 18, 2020, 7:00am

@lungben
That I knew. However in this situation I don’t know if there is a way to do that, since the outer most loop cannot be executed asynchronously since each step require the previous one to execute.

However you are somewhat right about the copy instruction. It is strictly needed, but perhaps is better by assigning with .= into a preallcated variable to save time. However all the benchmarks show similar results as before
(edited original code to avoid copy)

cricci · September 18, 2020, 7:21am

FYI, the results are still kind of the same even by taking

 Δt = 1e-10
 Δx = 1e-5

@time heatSecLoop(T=0.00001)
 26.710780 seconds (109 allocations: 1.532 MiB)
@time heatParLoop(T=0.00001)
 23.646293 seconds (1.20 M allocations: 158.215 MiB, 0.18% gc time)
@time heatVecLoop(T=0.00001)
  9.190979 seconds (109 allocations: 1.532 MiB)

now the @thread version is slightly faster than the single threaded one (not by much) and the version with LoopVectorization still crushes them.

Elrod · September 18, 2020, 9:54am

I get better performance from this:

function heatFMLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4
    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)
    u[round(Int,Nx/2)] = 1
    u_old = similar(u)
    println("starting loop")
    for t=1:Nt-1
        u_old, u = u, u_old
        @inbounds @fastmath for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end
        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

Results:

julia> @time heatVecLoop(T = 0.01); # after modifying it similarly
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  1.874391 seconds (104 allocations: 162.641 KiB)

julia> @time heatFMLoop(T = 0.01);
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  1.756562 seconds (104 allocations: 162.641 KiB)

I’ll have to look into what LoopVectorization is doing wrong here.

cricci · September 27, 2020, 4:17am

FYI, i tried the same approach with the same problem (heat equation) but in 2D. Same exact results if not that in this case the use of @threads is even more dreadful than in 1D (also here I think is simply related to memory problem).
At present, I haven’t understood yet how to write efficient parallel code to solve such type of PDEs with Julia (or any other time-evolution process where you cannot parallelize the outer most loop)

cocoa1231 · April 24, 2023, 8:57am

Hello, my code is facing similar issues. I’m iterating over a Tables.namedtupleiterator (after calling collect on the iterator) whose rows are the bins of a histogram and the multi threaded code seems to perform worse than the single threaded version. I’ve added Threads.@threads on the outer most loop. Has anybody had success with this so far?

Topic		Replies	Views
Question for lower performance by using @threads in for loop New to Julia question	13	1052	July 9, 2021
Overhead of `Threads.@threads` Performance question , multithreading	30	5342	March 13, 2021
@threads vs @parallel, a simple fail case for @threads Performance	3	1398	October 31, 2017
Threads.@threads parallelism and htop Performance multithreading , threads	2	454	May 25, 2021
Loosing performance with `Threads.@threads` for loop Performance parallel , multithreading , threads	10	703	October 7, 2021

Julia Threads.@threads slower than single thread performance

Related topics