Performance issues with parallel Julia code

Thank you very much @goerch for your detailed comments and helpful information, which helped me to improve the performance significantly!
However, after further research, I decided to also give multithreading a try, as it seems better suited for this use case. An MWE that is slightly better adapted to my simulation looks like the following.

MWE Multithreading
using Base.Threads, LinearAlgebra, Test, BenchmarkTools
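# Start Julia with multiple threads to reproduce the timings below,
# e.g. via JULIA_NUM_THREADS=8 or julia --threads 8 (I use 8 threads).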

abstract type AbstractModel end

struct Model <: AbstractModel
    idx::Vector{Tuple{Int,Int}}
    A::Matrix{Float64}
    B::Vector{Float64}
end

struct DistributedModel <: AbstractModel
    idx::Vector{Tuple{Int,Int}} # indices
    idxdist::Vector{UnitRange{Int}} # range of indices for each thread
    A::Matrix{Float64} # updated results
    Aw::Array{Float64,3} # per-thread results, each thread writes only to its own Aw[:,:,threadid()]
    B::Vector{Float64} # part of pre-processing, will only be read
end

function update2(m::AbstractModel, factor::Float64)
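    # Single-threaded reference version of the update step (no work buffers needed).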
    for (i,j) in m.idx
        L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                 (m.A[2,i]-m.A[2,j])^2 +
                 (m.A[3,i]-m.A[3,j])^2)
        k = factor * m.B[i]
        m.A[1,i] += k*L
        m.A[2,i] += k*L
        m.A[3,i] += k*L
        m.A[1,j] -= k*L
        m.A[2,j] -= k*L
        m.A[3,j] -= k*L
    end
end

function threads_update2(m::AbstractModel, factor::Float64)
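    # Each thread handles one chunk of indices (m.idxdist[tid]) and accumulates
    # its updates into its own slice m.Aw[:,:,tid]; the slices are summed into
    # m.A in a second pass. Indexing the buffer by threadid() assumes one chunk
    # per thread, which holds with the static scheduling of @threads in Julia 1.6.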
    @threads for _ in 1:nthreads()
        tid = threadid()
        for cnt in m.idxdist[tid]
            i,j = m.idx[cnt]
            L = sqrt((m.A[1,i]-m.A[1,j])^2 +
                     (m.A[2,i]-m.A[2,j])^2 +
                     (m.A[3,i]-m.A[3,j])^2)
            k = factor * m.B[i]
            m.Aw[1,i,tid] += k*L
            m.Aw[2,i,tid] += k*L
            m.Aw[3,i,tid] += k*L
            m.Aw[1,j,tid] -= k*L
            m.Aw[2,j,tid] -= k*L
            m.Aw[3,j,tid] -= k*L
        end
    end
    @threads for i in 1:size(m.A,2)
        m.A[1,i] = sum(@view m.Aw[1,i,:])
        m.A[2,i] = sum(@view m.Aw[2,i,:])
        m.A[3,i] = sum(@view m.Aw[3,i,:])
    end
end

# Split 1:sz into nc contiguous, (nearly) equally sized chunks, one chunk per thread.
function defaultdist(sz::Int, nc::Int)
    if sz >= nc
        chunk_size = div(sz,nc)
        remainder = rem(sz,nc)
        sidx = zeros(Int64, nc+1)
        for i = 1:(nc+1)
            sidx[i] += (i-1)*chunk_size + 1
            if i<= remainder
                sidx[i] += i-1
            else
                sidx[i] += remainder
            end
        end
        grid = fill(0:0,nc)
        for i = 1:nc
            grid[i] = sidx[i]:sidx[i+1]-1
        end
        return grid
    else
        # fewer indices than threads: the surplus threads get an empty range
        sidx = [1:(sz+1);]
        grid = fill(1:0,nc) # 1:0 is empty; 0:0 would wrongly make idle threads access m.idx[0]
        for i = 1:sz
            grid[i] = sidx[i]:sidx[i+1]-1
        end
        return grid
    end
end
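
# For illustration, the resulting chunks look like this:
#   defaultdist(10, 4) == [1:3, 4:6, 7:8, 9:10]
#   defaultdist(3, 4)  == [1:1, 2:2, 3:3, 1:0]    # surplus thread gets an empty range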

function test_parallel()
    n = 100_000_000
    N = 1000
    idx = [(rand(1:N),rand(1:N)) for k=1:n]
    idxdist = defaultdist(n,nthreads())
    A = hcat(([rand(-1000.:1000.);rand(-1000.:1000.);rand(-1000.:1000.)] for k=1:N)...)
    Aw = zeros((3,N,nthreads()))
    @threads for i = 1:nthreads()
        Aw[:,:,i] .= A
    end
    B = [rand(-1000.:1000.) for k=1:N]

    model1 = Model(idx,A,B)
    model2 = DistributedModel(idx,idxdist,A,Aw,B)

    @btime update2($model1,1e-9)
    @btime threads_update2($model2,1e-9)
    @test model1.A ≈ model2.A
end
test_parallel()

Results (Julia 1.6.3, 8 threads on an Intel i9 MacBook Pro)

  480.496 ms (0 allocations: 0 bytes)
  108.942 ms (82 allocations: 6.36 KiB)
Test Passed

A few comments:

  • The defaultdist function distributes the indices across all threads. It is an extended version of the function used in the DistributedArrays package.
  • To prevent a race condition, each thread writes its updates to its own slice of the array Aw (one slice along the third dimension per thread), which is then summed into A afterwards.
  • With 8 threads, I now get a speedup of about 4.4× for the MWE. With my actual simulation code, only about 3.5× is achieved, since there is also a lot of non-parallelized file export (see the rough estimate below this list).
  • Even when varying the workload on my local machine, multithreading gave a better performance increase than the version with SharedArrays and Distributed.
  • In the coming days I will start simulations on a cluster, and I am curious whether and how the scaling changes there. I will also try the Distributed version to see how it scales with larger workloads.
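
As a rough plausibility check for the speedup factors above (my own back-of-the-envelope estimate, not a measurement): Amdahl's law relates the speedup S on n threads to the parallelized fraction p of the runtime via S = 1/((1-p) + p/n). Solving for p:

amdahl_fraction(S, n) = (1 - 1/S) / (1 - 1/n)  # hypothetical helper, not part of the MWE
amdahl_fraction(4.4, 8)  # ≈ 0.88 for the MWE
amdahl_fraction(3.5, 8)  # ≈ 0.82 for the full simulation, consistent with a sizeable serial file-export share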

I would be very grateful for any further comments and tips!