Multithreading very slow

Greetings,

I am translating a C++ particle simulation code into Julia. In each iteration, the algorithm computes forces given by pairwise interactions, which is an O(N^2) operation, with N the number of particles. I want to use multithreading to distribute the work (roughly) evenly across a few threads. In C++, for 1000 particles, I get close-to-linear speedups up to 6 or 7 threads.
In Julia, however, using two threads is twice as slow as using one thread.

Am I doing something fundamentally wrong here?

function compute_forces!(system::ParticleSystem)
    N   = system.N_particles
    box = system.boxlength

    # Zero out forces, energy, laplacian
    fill!(system.forces, 0.0)

    forces_per_thread = system.forces_per_thread
    n_threads         = system.n_threads

    fill!(forces_per_thread, 0.0)

    # Parallelize over threads, distribute i indices cyclically for better load balance.
    Threads.@threads for tid in 1:n_threads
        r_ij = MVector{2, Float64}(0.0, 0.0)

        @inbounds for i in tid:n_threads:N-1
            @inbounds for j in (i+1):N
                # Minimum image convention for 2D box
                r_ij[1] = mod(system.positions[1,i] - system.positions[1,j] + box/2, box) - box/2
                r_ij[2] = mod(system.positions[2,i] - system.positions[2,j] + box/2, box) - box/2

                # Compute force for this pair
                force = system.force_model(r_ij)

                # Accumulate forces using Newton's 3rd law into thread-local buffer
                @inbounds for k in 1:2
                    forces_per_thread[k, i, tid] += force[k]
                    forces_per_thread[k, j, tid] -= force[k]
                end

                # Accumulate pairwise energy and laplacian
                energies_per_thread[tid]  += energy
                laplaceU_per_thread[tid] += 2 * laplace_U  # Factor 2 for i and j.
            end
        end
    end

    # Reduction over threads: forces
    @inbounds for i in 1:N
        @inbounds for k in 1:2
            acc = 0.0
            @inbounds for t in 1:n_threads
                acc += forces_per_thread[k, i, t]
            end
            system.forces[k, i] = acc
        end
    end
end 

And the particle system struct:

mutable struct ParticleSystem{F <: AbstractForce}
    # HYPERPARAMETERS.   
    boxlength   :: Float64   # Length of the (square) simulation box [-L/2, L/2]^2.
    N_particles :: Int64     # Number of particles.

    # CONFIGURATION.
    positions ::Matrix{Float64}  # All matrices are of shape (2, N).
    momenta   ::Matrix{Float64}
    forces    ::Matrix{Float64}
    
    energy    ::Float64  # Potential energy in the system.
    laplace_U  ::Float64

    n_threads ::Int
    forces_per_thread   ::Array{Float64,3} # Buffers for multithreading.

    force_model ::F  # Specify pairwise interactions.

end



function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F) where {F<:AbstractForce}

    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta   = Matrix{Float64}(undef, 2, N_particles)
    forces    = Matrix{Float64}(undef, 2, N_particles)

    energy   = 0.0
    laplaceU = 0.0

    # Thread-local buffers: lay out as (2, N, n_threads) so each thread
    # writes to its own contiguous slice, avoiding false sharing.
    n_threads = Threads.nthreads()
    forces_per_thread   = zeros(Float64, 2, N_particles, n_threads)

    return ParticleSystem{F}(boxlength, N_particles, positions, momenta,
                             forces, energy, laplaceU,
                             n_threads, forces_per_thread,
                             force_model)

end

Give OhMyThreads.jl a try. It should simplify the implementation and help you avoid common traps that could cause the slowdown.
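For example, a reduction like the total pair energy can be written with tmapreduce, which splits the range over tasks and combines the per-task partial sums. A minimal sketch (not your code; pair_energy is a hypothetical helper standing in for your interaction):

using OhMyThreads: tmapreduce

function total_energy(system)
    N = system.N_particles
    tmapreduce(+, 1:N-1) do i                  # i-contributions computed in parallel
        acc = 0.0                              # task-local accumulator
        for j in (i+1):N
            acc += pair_energy(system, i, j)   # hypothetical pair-energy helper
        end
        acc                                    # per-i results are reduced with +
    end
end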

Are you running into a row major vs column major issue where the different array order means that in Julia you are round robin splitting rather than splitting in chunks?
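(For reference: Julia is column-major, so with the (2, N) layout the two coordinates of one particle sit next to each other in memory, and consecutive particles are consecutive in memory as well:)

A = [1 3 5;
     2 4 6]   # 2×3, like a (2, N) positions matrix
vec(A)        # returns [1, 2, 3, 4, 5, 6]: each column (one particle) is contiguous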

I don’t see anything obviously wrong there. How are you benchmarking the run? How long does it take?
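For example, with BenchmarkTools, interpolating the argument so you are not benchmarking access to a global (and assuming Julia was started with several threads, e.g. julia -t 6):

using BenchmarkTools

Threads.nthreads()                 # check how many threads are actually available
@btime compute_forces!($system)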

(ps: if you can use a cutoff for the interactions and this is an actual application, you probably want to use some kind of neighbor list. CellListMap.jl will take care of the parallelization for you and follows more or less the interface you are using; or, if you can handle explicit lists of neighbors, NearestNeighbors.jl is the most common alternative.)
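With CellListMap it would look roughly like this (a sketch only: L, cutoff, and pair_energy are placeholders, and the exact signatures should be double-checked against the CellListMap.jl docs):

using CellListMap

L, cutoff = 10.0, 2.5                  # placeholder box length and interaction cutoff
positions = L .* rand(2, 1000)         # (2, N) coordinates, as in your struct
box = Box([L, L], cutoff)              # periodic 2D box
cl  = CellList(positions, box)

# Sum a pairwise quantity over all pairs closer than the cutoff;
# CellListMap parallelizes this over threads internally.
u = map_pairwise!((x, y, i, j, d2, output) -> output + pair_energy(d2), 0.0, box, cl)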

ps2: The code does not run; there are definitions missing. If you can provide a working example, that would help.

What about this code? It appears to be a leftover. Where do energy and energies_per_thread come from? If they are globals, they will mess up your performance.

Otherwise, please fix the example such that it corresponds to the code you’re benchmarking.

This kind of code is also a red flag for false sharing (multiple threads writing to the same cache line in a loop), which will be much slower than single-threaded.

The typical pattern should be

Threads.@threads for thread in threads
    energy = 0.0
    ...
    for idx in thread_chunk
        energy += energy_for_idx
    end
    energy_for_thread[thread] += energy
end

i.e. during the hot loop, accumulate into a register, not into memory. After the hot loop, you do the memory write.
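Applied to the energy accumulation in your loop, that would look roughly like this (a sketch; pair_energy(i, j) is a hypothetical stand-in for whatever currently produces energy, and energies_per_thread would have to exist as a per-task buffer):

Threads.@threads for tid in 1:n_threads
    local_energy = 0.0                         # stays in a register in the hot loop
    @inbounds for i in tid:n_threads:N-1, j in (i+1):N
        local_energy += pair_energy(i, j)      # hypothetical per-pair energy
    end
    energies_per_thread[tid] = local_energy    # a single memory write per task
end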

People tend to sloppily write the “accumulate into memory” version and are initially happy with the performance because the compiler managed to turn that into “accumulate into a register”. But this is brittle!

Then 20 years down the line, their successors port this code from Fortran to C/C++, or from C++ to Julia, and performance collapses because the compiler can no longer do that, since restrict/aliasing rules are subtly different (or somebody changed compiler flags, or updated GCC, or tried to port from C11 to C23, or whatever).

The cyclic distribution of the i indices also has bad spatial locality. Use e.g.

Threads.@threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads))
    ...
    @inbounds for i in chunk
        ...
    end
end
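Dropped into compute_forces! in place of the current parallel loop, that would look roughly like this (a sketch under the same assumptions as your code, i.e. force_model(r_ij) returns a 2-component force; the energy/laplacian accumulation is left out since its definitions are missing from the example):

using ChunkSplitters: chunks
using StaticArrays: MVector

Threads.@threads for (tid, chunk) in enumerate(chunks(1:N-1; n=n_threads))
    r_ij = MVector{2, Float64}(0.0, 0.0)
    @inbounds for i in chunk, j in (i+1):N
        # Minimum image convention for the 2D periodic box.
        r_ij[1] = mod(system.positions[1,i] - system.positions[1,j] + box/2, box) - box/2
        r_ij[2] = mod(system.positions[2,i] - system.positions[2,j] + box/2, box) - box/2
        force = system.force_model(r_ij)
        # Newton's 3rd law, accumulated into this task's own buffer slice.
        for k in 1:2
            forces_per_thread[k, i, tid] += force[k]
            forces_per_thread[k, j, tid] -= force[k]
        end
    end
end

Note that with the O(N^2) pair loop, equal-length chunks of i do not carry equal work (early i values have more j partners), which is presumably why your code used the cyclic distribution in the first place; whether locality or load balance wins here is worth benchmarking.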