Thank you very much @goerch for your detailed comments and helpful information, which helped me to improve the performance significantly!
However, after further research, I decided to also give multithreading a try, as it seems better suited for this use case. An MWE slightly better adapted to my simulation looks like the following.
MWE Multithreading:

```julia
using Base.Threads, LinearAlgebra, Test, BenchmarkTools
abstract type AbstractModel end
struct Model <: AbstractModel
    idx::Vector{Tuple{Int,Int}}
    A::Matrix{Float64}
    B::Vector{Float64}
end
struct DistributedModel <: AbstractModel
    idx::Vector{Tuple{Int,Int}}     # index pairs
    idxdist::Vector{UnitRange{Int}} # range of indices for each thread
    A::Matrix{Float64}              # updated results
    Aw::Array{Float64,3}            # working results, each thread gets its own slice Aw[:,:,threadid()]
    B::Vector{Float64}              # part of pre-processing, will only be read
end
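# Serial reference version: for each index pair (i,j), compute the distance L
# between columns i and j of A and apply equal and opposite updates scaled by factor*B[i].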
function update2(m::AbstractModel, factor::Float64)
    for (i,j) in m.idx
        L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                 (m.A[2,i]-m.A[2,j])^2 +
                 (m.A[3,i]-m.A[3,j])^2)
        k = factor * m.B[i]
        m.A[1,i] += k*L
        m.A[2,i] += k*L
        m.A[3,i] += k*L
        m.A[1,j] -= k*L
        m.A[2,j] -= k*L
        m.A[3,j] -= k*L
    end
end
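# Threaded version: each thread walks its own chunk of m.idx (m.idxdist[tid]) and
# accumulates into its private slice m.Aw[:,:,tid], avoiding races on shared data;
# a second threaded pass then reduces the per-thread slices into m.A.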
function threads_update2(m::AbstractModel, factor::Float64)
    @threads for _ in 1:nthreads()
        tid = threadid()
        for cnt in m.idxdist[tid]
            i,j = m.idx[cnt]
            L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                     (m.A[2,i]-m.A[2,j])^2 +
                     (m.A[3,i]-m.A[3,j])^2)
            k = factor * m.B[i]
            m.Aw[1,i,tid] += k*L
            m.Aw[2,i,tid] += k*L
            m.Aw[3,i,tid] += k*L
            m.Aw[1,j,tid] -= k*L
            m.Aw[2,j,tid] -= k*L
            m.Aw[3,j,tid] -= k*L
        end
    end
    @threads for i in 1:size(m.A,2)
        m.A[1,i] = sum(@view m.Aw[1,i,:])
        m.A[2,i] = sum(@view m.Aw[2,i,:])
        m.A[3,i] = sum(@view m.Aw[3,i,:])
    end
end
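# Split the index range 1:sz into nc contiguous chunks of (nearly) equal length,
# one per thread; extended from the distribution scheme used in DistributedArrays.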
function defaultdist(sz::Int, nc::Int)
    if sz >= nc
        chunk_size = div(sz,nc)
        remainder = rem(sz,nc)
        sidx = zeros(Int64, nc+1)
        for i = 1:(nc+1)
            sidx[i] += (i-1)*chunk_size + 1
            if i <= remainder
                sidx[i] += i-1
            else
                sidx[i] += remainder
            end
        end
        grid = fill(0:0, nc)
        for i = 1:nc
            grid[i] = sidx[i]:sidx[i+1]-1
        end
        return grid
    else
        sidx = [1:(sz+1);]
        grid = fill(1:0, nc) # empty ranges for the threads that get no indices
        for i = 1:sz
            grid[i] = sidx[i]:sidx[i+1]-1
        end
        return grid
    end
end
function test_parallel()
    n = 100_000_000
    N = 1000
    idx = [(rand(1:N), rand(1:N)) for k = 1:n]
    idxdist = defaultdist(n, nthreads())
    A = hcat(([rand(-1000.:1000.); rand(-1000.:1000.); rand(-1000.:1000.)] for k = 1:N)...)
    Aw = zeros(3, N, nthreads())
    @threads for i = 1:nthreads()
        Aw[:,:,i] .= A
    end
    B = [rand(-1000.:1000.) for k = 1:N]
    model1 = Model(idx, A, B)
    model2 = DistributedModel(idx, idxdist, A, Aw, B)
    @btime update2($model1, 1e-9)
    @btime threads_update2($model2, 1e-9)
    @test model1.A ≈ model2.A
end
test_parallel()
```
Results (Julia 1.6.3, 8 threads on an Intel i9 MacBook Pro):

```
480.496 ms (0 allocations: 0 bytes)
108.942 ms (82 allocations: 6.36 KiB)
Test Passed
```
A few comments:

- The `defaultdist` function distributes the indices across all threads. It is an extended version of the function used in the `DistributedArrays` package (see the sketch right after this list).
- To prevent a race condition, each thread writes its results into its own slice `Aw[:,:,threadid()]` of the array `Aw`; the slices are later summed into `A`.
- With 8 threads, I now get a speedup of about a factor of 4.4 for the MWE. With my simulation code, only a factor of ~3.5 is achieved, since there is also a lot of non-parallelized file export.
- Even when varying the workload on my local machine, multithreading produced a better performance increase than the version with `SharedArrays` and `Distributed`.
- In the coming days I will start simulations on a cluster and am already curious if and how the scaling changes. I will also try the `Distributed` version to see how it scales with higher workloads.
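As a quick sanity check of the chunking (a hypothetical REPL session using the `defaultdist` defined above; the ranges are what it returns for 10 indices on 4 threads):

```julia
julia> defaultdist(10, 4)  # split 10 indices across 4 threads
4-element Vector{UnitRange{Int64}}:
 1:3
 4:6
 7:8
 9:10
```

Each thread iterates only over its own range of `m.idx`, so the chunks never overlap and no locking is needed.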
I am very grateful for any further comments and tips!