I am trying to numerically solve the 1D heat equation, ∂u/∂t = ν ∂²u/∂x², using finite differences, and I am having some trouble with the @threads macro in Julia. Below are two versions of the same code: the first is single-threaded, while the other uses @threads (they are identical apart from the @threads instruction).
function heatSecLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3
    Nt = ceil(Int, T/Δt)
    Nx = ceil(Int, L/Δx + 2)
    u = zeros(Nx)
    u[round(Int, Nx/2)] = 1
    u_old = zeros(Nx)
    println("starting loop")
    for t = 1:Nt-1
        u_old .= u
        for i = 2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1] - 2*u_old[i] + u_old[i+1])
        end
        if t % round(Int, Nt/10) == 0
            println("time = " * string(t*Δt))
        end
    end
    println("done")
    return u
end
function heatParLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3
    Nt = ceil(Int, T/Δt)
    Nx = ceil(Int, L/Δx + 2)
    u = zeros(Nx)
    u[round(Int, Nx/2)] = 1
    u_old = zeros(Nx)
    println("starting loop")
    for t = 1:Nt-1
        u_old .= u
        Threads.@threads for i = 2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1] - 2*u_old[i] + u_old[i+1])
        end
        if t % round(Int, Nt/10) == 0
            println("time = " * string(t*Δt))
        end
    end
    println("done")
    return u
end
The issue is that the sequential one is faster than the multi-threaded one. Here are the timings (after one warm-up run to compile):
julia> Threads.nthreads()
2
julia> @time heatParLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
5.417182 seconds (12.14 M allocations: 9.125 GiB, 6.59% gc time)
julia> @time heatSecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
3.892801 seconds (1.00 M allocations: 7.629 GiB, 8.06% gc time)
Of course the heat equation is just an example for a more intricate problem. I also tried using other libraries such as SharedArrays together with Distributed but with worse results.
1000 unknowns is probably too small. Your problem is not limited by computation but by the speed of memory access, so threads will not help here. You could try @avx from LoopVectorization.
In general, spawning threads takes some time, so it is usually better to put @threads on your outermost loop. That way you pay the spawning cost once, while all your CPU cores still stay occupied.
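To make that overhead concrete, here is a minimal sketch (mine, not code from this thread) of a serial and a threaded version of the same stencil step; with only ~1000 interior points, the cost of launching and synchronizing tasks on every call can rival the arithmetic itself.

```julia
using Base.Threads

# c stands for ν*Δt/Δx^2; boundaries are left untouched.
function step_serial!(u, u_old, c)
    @inbounds for i in 2:length(u)-1
        u[i] = u_old[i] + c * (u_old[i-1] - 2u_old[i] + u_old[i+1])
    end
    return u
end

# Same update, but every call spawns and joins a batch of tasks.
function step_threaded!(u, u_old, c)
    @threads for i in 2:length(u)-1
        @inbounds u[i] = u_old[i] + c * (u_old[i-1] - 2u_old[i] + u_old[i+1])
    end
    return u
end
```

Benchmarking both on a 1000-point grid with BenchmarkTools' @btime should show the per-call task overhead directly; only on much larger grids would the threaded step have a chance to win.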
@rveltz
I also tried Δt = 1e-8 and Δx = 1e-4, with similar results as far as @threads is concerned:
julia> @time heatParLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
31.064373 seconds (11.19 M allocations: 1.496 GiB, 0.72% gc time)
julia> @time heatSecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
21.752236 seconds (109 allocations: 162.859 KiB)
However, the LoopVectorization package and its @avx macro are awesome:
using LoopVectorization
function heatVecLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4
    Nt = ceil(Int, T/Δt)
    Nx = ceil(Int, L/Δx + 2)
    u = zeros(Nx)
    u[round(Int, Nx/2)] = 1
    u_old = zeros(Nx)
    println("starting loop")
    for t = 1:Nt-1
        u_old .= u
        @avx for i = 2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1] - 2*u_old[i] + u_old[i+1])
        end
        if t % round(Int, Nt/10) == 0
            println("time = " * string(t*Δt))
        end
    end
    println("done")
    return u
end
julia> @time heatVecLoop(T=0.01)
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
5.577220 seconds (109 allocations: 162.859 KiB)
@lungben
That I knew. However, in this situation I don't see a way to do that, since the outermost loop cannot be run asynchronously: each time step requires the previous one to complete.
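When the time loop must stay serial, the usual option is still to parallelize the spatial loop, but with explicit control over how coarse the tasks are. Here is a sketch of that pattern (my own code with hypothetical names, not benchmarked at this thread's problem sizes): one contiguous chunk of the grid per thread, so exactly nthreads() tasks are spawned per time step, and the per-step task cost is explicit and easy to experiment with.

```julia
using Base.Threads

# Serial time loop, chunked spatial loop. c stands for ν*Δt/Δx^2;
# boundary values are held fixed (zero here).
function heat_chunked!(u, u_old, c, nsteps)
    Nx = length(u)
    nth = nthreads()
    chunk = cld(Nx - 2, nth)            # interior points per task
    for t in 1:nsteps
        u_old, u = u, u_old             # swap buffers instead of copying
        @sync for k in 1:nth
            lo = 2 + (k - 1)*chunk
            hi = min(Nx - 1, lo + chunk - 1)
            lo > hi && continue
            Threads.@spawn begin
                @inbounds for i in lo:hi
                    u[i] = u_old[i] + c*(u_old[i-1] - 2u_old[i] + u_old[i+1])
                end
            end
        end
    end
    return u
end
```

This mirrors what @threads does internally, but making the chunking explicit lets you, for instance, fall back to a plain serial loop whenever the grid is small enough that task overhead would dominate.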
However, you are somewhat right about the copy instruction. It is not strictly needed: instead of copying with .= into a preallocated variable, the two buffers can simply be swapped. Still, all the benchmarks show results similar to before.
(edited original code to avoid the copy)
function heatFMLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-8
    Δx = 1e-4
    Nt = ceil(Int, T/Δt)
    Nx = ceil(Int, L/Δx + 2)
    u = zeros(Nx)
    u[round(Int, Nx/2)] = 1
    u_old = zeros(Nx)   # zeros, not similar(u): after the swap its boundary entries become the boundary values of u, so they must be initialized
    println("starting loop")
    for t = 1:Nt-1
        u_old, u = u, u_old   # swap buffers instead of copying
        @inbounds @fastmath for i = 2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1] - 2*u_old[i] + u_old[i+1])
        end
        if t % round(Int, Nt/10) == 0
            println("time = " * string(t*Δt))
        end
    end
    println("done")
    return u
end
Results:
julia> @time heatVecLoop(T = 0.01); # after modifying it similarly
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
1.874391 seconds (104 allocations: 162.641 KiB)
julia> @time heatFMLoop(T = 0.01);
start
starting loop
time = 0.001
time = 0.002
time = 0.003
time = 0.004
time = 0.005
time = 0.006
time = 0.007
time = 0.008
time = 0.009000000000000001
done
1.756562 seconds (104 allocations: 162.641 KiB)
I’ll have to look into what LoopVectorization is doing wrong here.
FYI, I tried the same approach on the same problem (the heat equation) but in 2D. The results are exactly the same, except that in this case @threads performs even more dreadfully than in 1D (here too I think it is simply a memory-bandwidth problem).
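For reference, a serial 2D step would look roughly like this (my sketch, not the code from this thread; c stands for ν*Δt/Δx², assuming equal spacing in both directions and zero boundaries). Note the loop order: j is outermost, so the inner loop walks down columns, matching Julia's column-major array layout; getting that order wrong typically costs far more than any threading decision.

```julia
# One explicit 2D heat step with a 5-point stencil.
function step2d!(u, u_old, c)
    Nx, Ny = size(u)
    @inbounds for j in 2:Ny-1, i in 2:Nx-1
        u[i, j] = u_old[i, j] + c * (u_old[i-1, j] + u_old[i+1, j] +
                                     u_old[i, j-1] + u_old[i, j+1] -
                                     4u_old[i, j])
    end
    return u
end
```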
At present I still haven't understood how to write efficient parallel Julia code for this type of PDE (or any other time-evolution problem where you cannot parallelize the outermost loop).
Hello, my code faces similar issues. I'm iterating over a Tables.namedtupleiterator (after calling collect on the iterator) whose rows are the bins of a histogram, and the multi-threaded code seems to perform worse than the single-threaded version. I've put Threads.@threads on the outermost loop. Has anybody had success with this so far?
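Without seeing the code it is hard to say, but two usual suspects are tasks that are too fine-grained relative to the work per bin, and accumulation into shared state. A hedged sketch (hypothetical row fields `weight` and `count`, not your actual schema) of the coarse-task, per-thread-accumulator pattern:

```julia
using Base.Threads

# One task per thread, each with its own accumulator; results are
# reduced at the end, so no shared mutable state during the loop.
function threaded_sum(rows)
    nth = nthreads()
    partial = zeros(nth)
    @threads for k in 1:nth
        s = 0.0
        for idx in k:nth:length(rows)    # strided partition of the rows
            s += rows[idx].weight * rows[idx].count
        end
        partial[k] = s
    end
    return sum(partial)
end
```

If the per-row work really is as cheap as a multiply-add, the serial version will still win unless the collected table is very large; threading only pays off once each task does substantial work.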