Julia Threads.@threads slower than single thread performance

I am trying to solve numerically the heat equation in 1d:

[image: the 1D heat equation, ∂u/∂t = ν ∂²u/∂x²]

I am using finite differences, and I am having some trouble with the @threads macro in Julia. In particular, below are two versions of the same code: the first is single-threaded, while the other uses @threads (they are identical apart from the @threads macro).

function heatSecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)

    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        for i=2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

function heatParLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)
    
    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        Threads.@threads for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

The issue is that the sequential version is faster than the multi-threaded one. Here are the timings (after one run to compile):

julia> Threads.nthreads()
2

julia> @time heatParLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  5.417182 seconds (12.14 M allocations: 9.125 GiB, 6.59% gc time)

julia> @time heatSecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  3.892801 seconds (1.00 M allocations: 7.629 GiB, 8.06% gc time)

Of course, the heat equation is just an example of a more intricate problem. I also tried other approaches, such as SharedArrays together with Distributed, but with worse results.

Any help is appreciated.


1000 unknowns is probably too small. Your problem is not limited by computation but by the speed of memory access, so threads will not help here. You could perhaps try @avx from LoopVectorization.
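To see this, you can time one sweep of the stencil in isolation with Base's @elapsed (a sketch; `step!` is a made-up helper mirroring your inner loop):

```julia
# Time a single sweep of the stencil update. With ~1000 Float64s the
# whole state fits in L1 cache, so one sweep is very cheap compared
# with the overhead of spawning and joining tasks each time step.
function step!(u, u_old, ν, Δt, Δx)
    @inbounds for i in 2:length(u)-1
        u[i] = u_old[i] + ν * Δt/(Δx^2) * (u_old[i-1] - 2*u_old[i] + u_old[i+1])
    end
    return u
end

Nx = 1002                      # same size as Δx = 1e-3 above
u, u_old = zeros(Nx), zeros(Nx)
u_old[Nx ÷ 2] = 1.0

step!(u, u_old, 0.5, 1e-6, 1e-3)   # warm-up (compile)
t = minimum(@elapsed(step!(u, u_old, 0.5, 1e-6, 1e-3)) for _ in 1:10_000)
println("one sweep: ", round(t * 1e9, digits = 1), " ns")
```

On typical hardware this is well under a microsecond per sweep, while each `Threads.@threads` invocation costs roughly a microsecond or more to spawn and join its tasks, which is why two threads end up a net loss here.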


You should remove the dot, although it should have no impact.

Thanks! Will do.

In general, spawning threads takes some time, so it is usually better to put @threads on your outermost loop. This way you save thread-spawning time while still keeping all your CPU cores occupied.
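For a loop nest where either level could legally be threaded, a minimal sketch (with a made-up fill kernel) of what this means:

```julia
# Made-up example: threading the outermost loop spawns one batch of
# tasks per call; putting @threads on the inner loop instead would
# spawn a batch for every column j, paying the overhead Ny times.
function fill_grid!(A)
    Threads.@threads for j in axes(A, 2)   # outermost loop: one spawn per call
        for i in axes(A, 1)                # inner loop stays serial per task
            A[i, j] = sin(i) * cos(j)
        end
    end
    return A
end

A = fill_grid!(zeros(500, 500))
```

(The time loop in the code above cannot be threaded this way, since step t depends on step t-1; the advice applies to loops whose iterations are independent, like the spatial sweep.)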

@rveltz
I also tried Δt = 1e-8 and Δx = 1e-4, with similar results as far as @threads is concerned:

julia> @time heatParLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
 31.064373 seconds (11.19 M allocations: 1.496 GiB, 0.72% gc time)

julia> @time heatSecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
 21.752236 seconds (109 allocations: 162.859 KiB)

However, the LoopVectorization package and its @avx macro are awesome:

using LoopVectorization
function heatVecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx)
    
    println("starting loop")
    for t=1:Nt-1
        u_old .= u
        @avx for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end
julia> @time heatVecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  5.577220 seconds (109 allocations: 162.859 KiB)

Thanks!

As far as I can see, this copy is not required. It causes a large number of allocations and probably quite some execution time.

@lungben
That I knew. However, in this situation I don't see a way to do that, since the outermost loop cannot be parallelized: each time step requires the previous one to have completed.

You are somewhat right about the copy, though. It is strictly needed, but it is better to assign with .= into a preallocated variable to save time. However, all the benchmarks show similar results as before.
(edited the original code to avoid the copy)

FYI, the results are still much the same even when taking

 Δt = 1e-10
 Δx = 1e-5
@time heatSecLoop(T=0.00001)
 26.710780 seconds (109 allocations: 1.532 MiB)
@time heatParLoop(T=0.00001)
 23.646293 seconds (1.20 M allocations: 158.215 MiB, 0.18% gc time)
@time heatVecLoop(T=0.00001)
  9.190979 seconds (109 allocations: 1.532 MiB)

Now the @threads version is slightly faster than the single-threaded one (though not by much), and the LoopVectorization version still crushes them both.

I get better performance from this:

function heatFMLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4
    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)
    u[round(Int,Nx/2)] = 1
    u_old = zeros(Nx) # zeros rather than similar, so the boundary entries are defined after the swap
    println("starting loop")
    for t=1:Nt-1
        u_old, u = u, u_old
        @inbounds @fastmath for i=2:Nx-1
             u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2*u_old[i] + u_old[i+1])
        end
        if t % round(Int,Nt/10) == 0
            println("time = " * string(t*Δt) )
        end
    end
    println("done")
    return u
end

Results:

julia> @time heatVecLoop(T = 0.01); # after modifying it similarly
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  1.874391 seconds (104 allocations: 162.641 KiB)

julia> @time heatFMLoop(T = 0.01);
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
  1.756562 seconds (104 allocations: 162.641 KiB)

I’ll have to look into what LoopVectorization is doing wrong here.

FYI, I tried the same approach on the same problem (the heat equation) but in 2D. The results are exactly the same, except that here @threads is even more dreadful than in 1D (again, I think this is simply a memory-bandwidth problem).
At present I still haven't understood how to write efficient parallel Julia code for this type of PDE (or any other time-evolution process where you cannot parallelize the outermost loop).
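For what it's worth, the pattern I'd try first combines the suggestions above: keep the time loop serial, swap buffers instead of copying, and let @threads split the spatial sweep into a few big contiguous chunks. A sketch (`heat_threads` is a hypothetical name; parameters as in the earlier posts):

```julia
# Sketch: serial time loop, threaded spatial sweep, buffer swapping.
# Each @threads call hands every thread one large contiguous chunk,
# so the per-step spawn cost is amortized over ~Nx/nthreads() cell
# updates; this only pays off once Nx is large enough for that chunk
# of work to outweigh the microseconds of spawn/join overhead.
function heat_threads(; T::Float64, Δt = 1e-8, Δx = 1e-4, ν = 0.5, L = 1)
    Nt = ceil(Int, T / Δt)
    Nx = ceil(Int, L / Δx + 2)
    u = zeros(Nx)
    u[Nx ÷ 2] = 1.0
    u_old = zeros(Nx)          # zeros, not similar: boundaries stay defined after the swap
    for t in 1:Nt-1
        u_old, u = u, u_old    # swap references instead of copying
        Threads.@threads for i in 2:Nx-1
            @inbounds u[i] = u_old[i] +
                ν * Δt / Δx^2 * (u_old[i-1] - 2 * u_old[i] + u_old[i+1])
        end
    end
    return u
end

u = heat_threads(T = 1e-6)     # tiny horizon, just to exercise the code
```

Whether this beats the serial version still depends on Nx and on memory bandwidth; for the sizes in this thread it may well not.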

Hello, my code is facing similar issues. I'm iterating over a Tables.namedtupleiterator (after calling collect on it) whose rows are the bins of a histogram, and the multi-threaded code seems to perform worse than the single-threaded version. I've put Threads.@threads on the outermost loop. Has anybody had success with this so far?
