Multithreading very slow

Greetings,

I am translating a C++ particle simulation code into Julia. In each iteration, the algorithm computes forces given by pairwise interactions, which is an O(N^2) operation, with N the number of particles. I want to use multithreading to distribute the work (roughly) evenly across a few threads. In C++, for 1000 particles, I was getting close-to-linear speedups up to 6 or 7 threads.
In Julia, however, using two threads is twice as slow as using one thread.

Am I doing something fundamentally wrong here?

function compute_forces!(system::ParticleSystem)
    N   = system.N_particles
    box = system.boxlength

    # Zero out forces, energy, laplacian
    fill!(system.forces, 0.0)

    forces_per_thread    = system.forces_per_thread

    n_threads            = system.n_threads

    fill!(forces_per_thread, 0.0)

    # Parallelize over threads, distribute i indices cyclically for better load balance.
    Threads.@threads for tid in 1:n_threads
        r_ij = MVector{2, Float64}(0.0, 0.0)

        @inbounds for i in tid:n_threads:N-1
            @inbounds for j in (i+1):N
                # Minimum image convention for 2D box
                r_ij[1] = mod(system.positions[1,i] - system.positions[1,j] + box/2, box) - box/2
                r_ij[2] = mod(system.positions[2,i] - system.positions[2,j] + box/2, box) - box/2

                # Compute force for this pair
                force = system.force_model(r_ij)

                # Accumulate forces using Newton's 3rd law into thread-local buffer
                @inbounds for k in 1:2
                    forces_per_thread[k, i, tid] += force[k]
                    forces_per_thread[k, j, tid] -= force[k]
                end

                # Accumulate pairwise energy and laplacian
                energies_per_thread[tid]  += energy
                laplaceU_per_thread[tid] += 2 * laplace_U  # Factor 2 for i and j.
            end
        end
    end

    # Reduction over threads: forces
    @inbounds for i in 1:N
        @inbounds for k in 1:2
            acc = 0.0
            @inbounds for t in 1:n_threads
                acc += forces_per_thread[k, i, t]
            end
            system.forces[k, i] = acc
        end
    end
end 

And the particle system struct:

mutable struct ParticleSystem{F <: AbstractForce}
    # HYPERPARAMETERS.   
    boxlength   :: Float64   # Length of the (cubic) simulation box [-L/2, L/2]^2.
    N_particles :: Int64     # Number of particles.

    # CONFIGURATION.
    positions ::Matrix{Float64}  # All matrices are of shape (2, N).
    momenta   ::Matrix{Float64}
    forces    ::Matrix{Float64}
    
    energy    ::Float64  # Potential energy in the system.
    laplace_U  ::Float64

    n_threads ::Int
    forces_per_thread   ::Array{Float64,3} # Buffers for multithreading.

    force_model ::F  # Specify pairwise interactions.

end



function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F) where {F<:AbstractForce}

    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta   = Matrix{Float64}(undef, 2, N_particles)
    forces    = Matrix{Float64}(undef, 2, N_particles)

    energy   = 0.0
    laplaceU = 0.0

    # Thread-local buffers: lay out as (2, N, n_threads) so each thread
    # writes to its own contiguous slice, avoiding false sharing.
    n_threads = Threads.nthreads()
    forces_per_thread   = zeros(Float64, 2, N_particles, n_threads)

    return ParticleSystem{F}(boxlength, N_particles, positions, momenta,
                             forces, energy, laplaceU,
                             n_threads, forces_per_thread,
                             force_model)

end

Give OhMyThreads.jl a try. It should simplify the implementation and help you avoid common traps that could cause the slowdown.
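For example, a minimal threaded map-reduce sketch (tmapreduce is one of the package's exported functions; see the OhMyThreads docs for the exact options):

```julia
using OhMyThreads: tmapreduce

# Threaded map-reduce: squares each element on worker tasks, then sums.
# The package handles chunking and per-task reduction for you.
total = tmapreduce(x -> x^2, +, 1:1000)   # sum of squares = 1000*1001*2001/6
```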

1 Like

Are you running into a row-major vs column-major issue, where the different array order means that in Julia you are round-robin splitting rather than splitting in chunks?
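For context: Julia is column-major, so the first index should vary fastest. A quick self-contained illustration (sum_fast/sum_slow are just demo names):

```julia
# Julia stores matrices column-major: A[1,1], A[2,1], ... are contiguous.
# Looping with the first (row) index innermost walks memory sequentially.
function sum_fast(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)   # column by column: cache-friendly
        s += A[i, j]
    end
    return s
end

function sum_slow(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)   # row by row: strided accesses
        s += A[i, j]
    end
    return s
end

A = ones(500, 500)
sum_fast(A) == sum_slow(A)   # same result, very different memory traffic
```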

1 Like

I don’t see anything obviously wrong there. How are you benchmarking the run? How long does it take?

(ps: if you can use a cutoff for the interactions and this is an actual application, you probably want to use some kind of neighbor list. CellListMap.jl will take care of the parallelization for you and follows more or less the interface you are using; if you can handle explicit lists of neighbors, NearestNeighbors.jl is the most common alternative.)

ps2: The code does not run; there are definitions missing. If you can provide a working example, that would help.

2 Likes

What about this code? It appears to be a leftover. Where do energy and energies_per_thread come from? If they are globals, they will mess up your performance.

Otherwise, please fix the example such that it corresponds to the code you’re benchmarking.

This kind of code is also a red flag for false sharing (multiple threads writing to the same cache line in a loop), which can be much slower than single-threaded.

The typical pattern should be

Threads.@threads for thread in threads
    energy = 0.0
...
    for idx in thread_chunk
          energy += energy_for_idx
    end
    energy_for_thread[thread] += energy
end

i.e. during the hot loop, accumulate into a register, not into memory. After the hot loop, you do the memory write.

People tend to sloppily write the “accumulate into memory” version and are initially happy with performance, because the compiler managed to turn it into “accumulate into register”. But this is brittle!

Then 20 years down the line, their successors port this code from Fortran to C/C++, or from C++ to Julia, and performance collapses because the compiler can no longer do that: restrict / aliasing rules are subtly different. (Or somebody changed compiler flags, or updated gcc, or tried to port from C11 to C23, or whatever.)
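A self-contained sketch of that pattern (threaded_sum and the chunking are just for illustration):

```julia
using Base.Threads

# Accumulate into a local variable (which the compiler keeps in a register)
# inside the hot loop; write to shared memory exactly once per thread.
function threaded_sum(xs::Vector{Float64})
    n = nthreads()
    partial = zeros(n)
    bounds = round.(Int, range(0, length(xs); length = n + 1))
    @threads for t in 1:n
        acc = 0.0                           # register accumulator
        for i in (bounds[t] + 1):bounds[t + 1]
            acc += xs[i]                    # hot loop: no shared-memory write
        end
        partial[t] = acc                    # single memory write per thread
    end
    return sum(partial)
end

threaded_sum(collect(1.0:1000.0))   # == 500500.0
```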

This has bad spatial locality. Use e.g.

Threads.@threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads))
...
        @inbounds for i in chunk
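For reference, chunks yields contiguous, near-equal index ranges (a sketch assuming the chunks(itr; n) API used in this thread; ChunkSplitters supports enumerate(chunks(...)) inside Threads.@threads):

```julia
using ChunkSplitters

# Each chunk is a contiguous range of indices; enumerate pairs it with an id.
for (tid, chunk) in enumerate(chunks(1:10; n = 3))
    println(tid, " => ", chunk)   # e.g. 1 => 1:4, 2 => 5:7, 3 => 8:10
end
```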
4 Likes

Hi guys, sorry for the late response.

@langestefan @lmiq
Thanks for the recommendation of the packages, I’ll check them out. Although I wanted to implement it manually for my own understanding.

Yes, I only gave the most important parts of the code, as the force evaluation is the only multithreaded component and also the computational bottleneck of the simulation. I may post an MWE on the weekend.

I don’t use neighborlists for this as some of the potentials are rather long ranged.

Thanks @foobar_lv2 for the hints with register/memory and spatial locality, I will try it.

Btw. the AI was not able to fix this or tell me what’s wrong. Human engineers are still needed :grin:

1 Like

This is probably because you prompted it wrong. AI is good at optimizing code if you guide it properly.

Hi All,

I prepared a simplified version of my simulation in the code block below. I implemented the changes recommended by @foobar_lv2, but they do not seem to help.
My own quick benchmark showed that most of the compute happens in the Lennard-Jones function, which is entirely expected. That function is called in the hot loop, so multithreading should give close-to-linear speedup there, at least for a small number of threads.
I really don’t understand why it gets slower with more threads.

@liuyxpp The agent can see my code base, and I asked it to spot the bottleneck, just like I did here. If that’s not enough, then either it is not able to find the issue, or you need to be an expert to know what to prompt :slight_smile:

########### DEFINITIONS ########################################################
using StaticArrays
using Random
using ChunkSplitters 

abstract type AbstractForce end

mutable struct ParticleSystem{F <: AbstractForce}
    # HYPERPARAMETERS.   
    boxlength   :: Float64   # Length of the (cubic) simulation box [-L/2, L/2]^2.
    N_particles :: Int64     # Number of particles.

    # CONFIGURATION.
    positions ::Matrix{Float64}  # All matrices are of shape (2, N).
    momenta   ::Matrix{Float64}
    forces    ::Matrix{Float64}
    
    energy    ::Float64  # Potential energy in the system.
    laplace_U  ::Float64

    n_threads ::Int
    forces_per_thread   ::Array{Float64,3} # Buffers for multithreading.
    energies_per_thread ::Vector{Float64}
    laplaceU_per_thread ::Vector{Float64}

    force_model ::F  # Specify pairwise interactions.

end


function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F) where {F<:AbstractForce}

    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta   = Matrix{Float64}(undef, 2, N_particles)
    forces    = Matrix{Float64}(undef, 2, N_particles)

    energy   = 0.0
    laplaceU = 0.0

    # Thread-local buffers: lay out as (2, N, n_threads) so each thread
    # writes to its own contiguous slice, avoiding false sharing.
    n_threads = Threads.nthreads()
    forces_per_thread   = zeros(Float64, 2, N_particles, n_threads)
    energies_per_thread = zeros(Float64, n_threads)
    laplaceU_per_thread = zeros(Float64, n_threads)

    return ParticleSystem{F}(boxlength, N_particles, positions, momenta,
                             forces, energy, laplaceU,
                             n_threads, forces_per_thread,
                             energies_per_thread, laplaceU_per_thread,
                             force_model)

end


function initialize_system!(system::ParticleSystem)
    
    # INITIALIZE POSITIONS.
    N = system.N_particles
    L = system.boxlength

    # Put particles on a grid in [0, L]^2.
    Nx = round(Int, sqrt(N)) # Particle numbers per dimension as equal as possible.
    gapx = L/Nx 
    Ny = div(N, Nx)
    gapy = L/Ny
    N_rest = N - Nx*Ny

    n = 1
    for nx in 1:Nx
        
        x_pos = nx * gapx
        
        for ny in 1:Ny
        
            y_pos = ny * gapy
            
            system.positions[1,n] = x_pos
            system.positions[2,n] = y_pos
            
            n+=1

        end

    end
    
    # Put the remaining particles in the centers of the cubicles (where they have maximum space).
    for i in 1:N_rest
        system.positions[1,n] = system.positions[1,i] + gapx/2
        system.positions[2,n] = system.positions[2,i] + gapy/2
        n += 1
    end

    apply_periodic_boundaries!(system)  # To be safe.

    ## INITIALIZE MOMENTA.
    system.momenta .= 0   

    ## INITIALIZE FORCES.
    compute_forces!(system)

end


struct LennardJones <: AbstractForce
    
    epsilon  ::Float64  # LJ parameters.
    sigma    ::Float64

    r_cutoff ::Float64  # Switch function parameters.
    r_switch ::Float64

    four_epsilon ::Float64 # Help constants for computations.
    rc_rs_inv    ::Float64 

end


function LennardJones(epsilon::Real, sigma::Real)
    r_cutoff = 2.5*sigma
    r_switch = 0.9*r_cutoff
    four_eps = 4*epsilon
    rc_rs_inv =  1/(r_cutoff - r_switch)
    return LennardJones(epsilon, sigma, r_cutoff, r_switch, four_eps, rc_rs_inv)
end


function (lj::LennardJones)(r_ij::MVector{2, Float64})

    force = @MVector zeros(2)

    r2 =  r_ij[1]^2 + r_ij[2]^2 

    if r2 > lj.r_cutoff^2

        return force, 0, 0    # No interaction.
    
    end
    
    r = sqrt( r2 )

    # LJ terms.
    sig_r_6 = (lj.sigma / r)^6
    
    LJ_energy = lj.four_epsilon * sig_r_6 * (sig_r_6 - 1)

    LJ_deriv = -6/r * (LJ_energy + lj.four_epsilon * sig_r_6^2)

    LJ_deriv2 = 1/r2 * (42*LJ_energy + 114*lj.four_epsilon*sig_r_6*sig_r_6)
        
    # Switch terms.
    if r <= lj.r_switch
        S = 1
        S_deriv = 0
        S_deriv2 = 0
    else
        r_bar = (r-lj.r_switch) * lj.rc_rs_inv
        
        S = 1 - 6*r_bar^5 + 15*r_bar^4 -10*r_bar^3
        
        S_deriv = -30 * (r_bar^4 - 2*r_bar^3 + r_bar^2) * lj.rc_rs_inv

        S_deriv2 = -60 * (2*r_bar^3 -3*r_bar^2 + r_bar) * lj.rc_rs_inv^2
    end

    # Final results for force and energy.
    force_absolute = LJ_deriv * S  +  LJ_energy * S_deriv

    force =  -force_absolute / r  .* r_ij

    energy = LJ_energy * S

    laplace_U = S_deriv2*LJ_energy + 2*S_deriv*LJ_deriv + S*LJ_deriv2
    return force, energy, laplace_U

end


#### MULTITHREADED FORCE PART ####################################################

function compute_forces!(system::ParticleSystem)
    N   = system.N_particles
    box = system.boxlength

    # Zero out forces, energy, laplacian
    fill!(system.forces, 0.0)
    system.energy   = 0.0
    system.laplace_U = 0.0

    forces_per_thread    = system.forces_per_thread
    energies_per_thread  = system.energies_per_thread
    laplaceU_per_thread  = system.laplaceU_per_thread
    n_threads            = system.n_threads

    fill!(forces_per_thread, 0.0)
    fill!(energies_per_thread, 0.0)
    fill!(laplaceU_per_thread, 0.0)

    # Parallelize over threads; split the i indices into contiguous chunks.
    Threads.@threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads))
        r_ij = MVector{2, Float64}(0.0, 0.0)
        
        energy_sum::Float64 = 0
        laplace_U_sum::Float64 = 0 

        local_forces = @view forces_per_thread[:, :, tid]
        f_i = MVector{2, Float64}(0.0, 0.0)  # one accumulator per thread

        @inbounds for i in chunk
            f_i .= 0.0                   # thread-local accumulator for particle i

            @inbounds for j in (i+1):N
                # Minimum image convention for 2D box
                r_ij[1] = mod(system.positions[1,i] - system.positions[1,j] + box/2, box) - box/2
                r_ij[2] = mod(system.positions[2,i] - system.positions[2,j] + box/2, box) - box/2

                # Compute force, energy, laplacian for this pair
                force, energy, laplace_U = system.force_model(r_ij)

                # accumulate to i in registers
                f_i[1] += force[1]
                f_i[2] += force[2]

                # still must update j immediately
                local_forces[1, j] -= force[1]
                local_forces[2, j] -= force[2]

                # Accumulate pairwise energy and laplacian
                energy_sum += energy
                laplace_U_sum += laplace_U
            end
            # single write for particle i
            local_forces[1, i] += f_i[1]
            local_forces[2, i] += f_i[2]
        end

        energies_per_thread[tid] += energy_sum
        laplaceU_per_thread[tid] += 2 * laplace_U_sum  # Factor 2 for i and j.
    end

    # Reduction over threads: forces
    @inbounds for i in 1:N
        @inbounds for k in 1:2
            acc = 0.0
            @inbounds for t in 1:n_threads
                acc += forces_per_thread[k, i, t]
            end
            system.forces[k, i] = acc
        end
    end

    # Reduction over threads: energies and Laplacian
    system.energy    = sum(energies_per_thread)
    system.laplace_U = sum(laplaceU_per_thread)
end

#### END OF MULTITHREADED FORCE PART ##############################################

function apply_periodic_boundaries!(system::ParticleSystem)

   @inbounds for i in eachindex(system.positions)
        system.positions[i] = mod(system.positions[i], system.boxlength)
    end

    return

end



### INTEGRATOR TYPES ##################################################################################################################

abstract type Integrator end   # Base type.

### BAOAB ########################################################################################################################
struct BAOAB <: Integrator

    stepsize ::Float64        # Stepsize across a full integrator step.

    gamma  ::Float64          # Friction parameter in Langevin Dynamics.
    temperature ::Float64     # Temperature of Langevin Dynamics.

    O_const1 ::Float64        # Constants used in the O-steps.
    O_const2 ::Float64

end

function BAOAB(stepsize::Real, gamma::Real, temperature::Real)

    O_const1 = exp(-gamma*stepsize)
    O_const2 = sqrt(temperature*(1-O_const1^2))

    return BAOAB(stepsize, gamma, temperature, O_const1, O_const2)

end

function integrator_step!(system::ParticleSystem, integrator::BAOAB)

    stepsize_half = 0.5 * integrator.stepsize

    B_step!(stepsize_half, system)
    A_step!(stepsize_half, system)
    O_step!(system, integrator.O_const1, integrator.O_const2)
    A_step!(stepsize_half, system)
    apply_periodic_boundaries!(system)
    compute_forces!(system)
    B_step!(stepsize_half, system)

    return

end


function A_step!(stepsize::Float64, system::ParticleSystem)
    system.positions .+= stepsize * system.momenta
    return
end


function B_step!(stepsize::Float64, system::ParticleSystem)
    system.momenta .+= stepsize * system.forces
    return
end


function O_step!(system::ParticleSystem, O_const1::Float64, O_const2::Float64)
    system.momenta .= O_const1 * system.momenta  +  O_const2 * randn(size(system.momenta))
    return
end


### END OF INTEGRATOR TYPES.


function run_simulation!(system::ParticleSystem, 
                         integrator::Integrator=BAOAB; 
                         N_iteration::Int=2000)
   

    Random.seed!(1)  # Seed RNG.
                        

    start = time()

    # Main loop.
    for i in 1:N_iteration   

        # Full iteration of specified integrator. 
        integrator_step!(system, integrator)   
        
        if i % 1000 == 0
            println("Iteration $i done! \n")
        end

    end

    println("Simulation took ", time()-start, " seconds!\n")

    return nothing

end




#%%## MAIN. ###############################################################################
L = 15               # Boxlength.
N = 1000             # No. of particles.
h = 0.011            # Stepsize.
T = 1                # Set temperature.
gamma = 1            # Friction parameter.
N_iteration = 5000   # Number of simulation steps.



model = LennardJones(1,0.5)    # Create pairwise force model to be used.
system = ParticleSystem(L, N, model)         # Create particle system.
initialize_system!(system)                   # Initialize system.
integrator = BAOAB(h, gamma, T)       # Create integrator.

# Run Simulation.
results = run_simulation!(system, integrator; 
                          N_iteration=N_iteration)

The most important issue is the use of MVector instead of SVector in your inner functions, which is probably causing allocations and the slowdown. You don’t need these mutable objects there (the compiler could probably fix this automatically, but it does not currently). With that fixed I get a reasonable speedup with more threads:

julia> @time run(;n_threads=8)
Simulation took 1.7581839561462402 seconds!

  1.759786 seconds (48.10 k allocations: 64.407 MiB, 0.90% gc time)

julia> @time run(;n_threads=1)
Simulation took 5.662581920623779 seconds!

  5.668081 seconds (48.10 k allocations: 64.300 MiB, 0.20% gc time)

The other changes I made were, as you can see: using n_threads as a parameter, so I can control it on input; turning r_ij and the forces in the inner function into SVectors; and adding the FastPow package to accelerate the exponentiations in the LJ function.

These were the times with MVectors:

julia> @time run(;n_threads=1)
Simulation took 9.161661863327026 seconds!

  9.181331 seconds (745.42 M allocations: 22.277 GiB, 26.40% gc time)

julia> @time run(;n_threads=8)
Simulation took 4.376814126968384 seconds!

  4.379819 seconds (746.40 M allocations: 22.306 GiB, 46.70% gc time)

Here is the code:

########### DEFINITIONS ########################################################
using StaticArrays
using Random
using ChunkSplitters
using FastPow

abstract type AbstractForce end

mutable struct ParticleSystem{F<:AbstractForce}
    # HYPERPARAMETERS.   
    boxlength::Float64   # Length of the (cubic) simulation box [-L/2, L/2]^2.
    N_particles::Int64     # Number of particles.

    # CONFIGURATION.
    positions::Matrix{Float64}  # All matrices are of shape (2, N).
    momenta::Matrix{Float64}
    forces::Matrix{Float64}

    energy::Float64  # Potential energy in the system.
    laplace_U::Float64

    n_threads::Int
    forces_per_thread::Array{Float64,3} # Buffers for multithreading.
    energies_per_thread::Vector{Float64}
    laplaceU_per_thread::Vector{Float64}

    force_model::F  # Specify pairwise interactions.

end


function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F; n_threads=Threads.nthreads()) where {F<:AbstractForce}

    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta = Matrix{Float64}(undef, 2, N_particles)
    forces = Matrix{Float64}(undef, 2, N_particles)

    energy = 0.0
    laplaceU = 0.0

    # Thread-local buffers: lay out as (2, N, n_threads) so each thread
    # writes to its own contiguous slice, avoiding false sharing.
    forces_per_thread = zeros(Float64, 2, N_particles, n_threads)
    energies_per_thread = zeros(Float64, n_threads)
    laplaceU_per_thread = zeros(Float64, n_threads)

    return ParticleSystem{F}(boxlength, N_particles, positions, momenta,
        forces, energy, laplaceU,
        n_threads, forces_per_thread,
        energies_per_thread, laplaceU_per_thread,
        force_model)

end


function initialize_system!(system::ParticleSystem)

    # INITIALIZE POSITIONS.
    N = system.N_particles
    L = system.boxlength

    # Put particles on a grid in [0, L]^2.
    Nx = round(Int, sqrt(N)) # Particle numbers per dimension as equal as possible.
    gapx = L / Nx
    Ny = div(N, Nx)
    gapy = L / Ny
    N_rest = N - Nx * Ny

    n = 1
    for nx in 1:Nx

        x_pos = nx * gapx

        for ny in 1:Ny

            y_pos = ny * gapy

            system.positions[1, n] = x_pos
            system.positions[2, n] = y_pos

            n += 1

        end

    end

    # Put the remaining particles in the centers of the cubicles (where they have maximum space).
    for i in 1:N_rest
        system.positions[1, n] = system.positions[1, i] + gapx / 2
        system.positions[2, n] = system.positions[2, i] + gapy / 2
        n += 1
    end

    apply_periodic_boundaries!(system)  # To be safe.

    ## INITIALIZE MOMENTA.
    system.momenta .= 0

    ## INITIALIZE FORCES.
    compute_forces!(system)

end


struct LennardJones <: AbstractForce

    epsilon::Float64  # LJ parameters.
    sigma::Float64

    r_cutoff::Float64  # Switch function parameters.
    r_switch::Float64

    four_epsilon::Float64 # Help constants for computations.
    rc_rs_inv::Float64

end


function LennardJones(epsilon::Real, sigma::Real)
    r_cutoff = 2.5 * sigma
    r_switch = 0.9 * r_cutoff
    four_eps = 4 * epsilon
    rc_rs_inv = 1 / (r_cutoff - r_switch)
    return LennardJones(epsilon, sigma, r_cutoff, r_switch, four_eps, rc_rs_inv)
end


function (lj::LennardJones)(r_ij::SVector{2,Float64})

    force = zeros(SVector{2,Float64})

    r2 = sum(abs2, r_ij)

    if r2 > lj.r_cutoff^2

        return force, 0, 0    # No interaction.

    end

    r = sqrt(r2)

    # LJ terms.
    @fastpow sig_r_6 = (lj.sigma / r)^6

    LJ_energy = lj.four_epsilon * sig_r_6 * (sig_r_6 - 1)

    LJ_deriv = -6 / r * (LJ_energy + lj.four_epsilon * sig_r_6^2)

    LJ_deriv2 = 1 / r2 * (42 * LJ_energy + 114 * lj.four_epsilon * sig_r_6 * sig_r_6)

    # Switch terms.
    if r <= lj.r_switch
        S = 1
        S_deriv = 0
        S_deriv2 = 0
    else
        r_bar = (r - lj.r_switch) * lj.rc_rs_inv

        @fastpow S = 1 - 6 * r_bar^5 + 15 * r_bar^4 - 10 * r_bar^3

        @fastpow S_deriv = -30 * (r_bar^4 - 2 * r_bar^3 + r_bar^2) * lj.rc_rs_inv

        S_deriv2 = -60 * (2 * r_bar^3 - 3 * r_bar^2 + r_bar) * lj.rc_rs_inv^2
    end

    # Final results for force and energy.
    force_absolute = LJ_deriv * S + LJ_energy * S_deriv

    force = -force_absolute / r .* r_ij

    energy = LJ_energy * S

    laplace_U = S_deriv2 * LJ_energy + 2 * S_deriv * LJ_deriv + S * LJ_deriv2
    return force, energy, laplace_U

end


#### MULTITHREADED FORCE PART ####################################################

function compute_forces!(system::ParticleSystem)
    N = system.N_particles
    box = system.boxlength

    # Zero out forces, energy, laplacian
    fill!(system.forces, 0.0)
    system.energy = 0.0
    system.laplace_U = 0.0

    forces_per_thread = system.forces_per_thread
    energies_per_thread = system.energies_per_thread
    laplaceU_per_thread = system.laplaceU_per_thread
    n_threads = system.n_threads

    fill!(forces_per_thread, 0.0)
    fill!(energies_per_thread, 0.0)
    fill!(laplaceU_per_thread, 0.0)

    # Parallelize over threads; split the i indices into contiguous chunks.
    Threads.@threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads))

        energy_sum::Float64 = 0
        laplace_U_sum::Float64 = 0

        local_forces = @view forces_per_thread[:, :, tid]
        f_i = MVector{2,Float64}(0.0, 0.0)  # one accumulator per thread

         for i in chunk
            f_i .= 0.0                   # thread-local accumulator for particle i

             for j in (i+1):N
                # Minimum image convention for 2D box
                r_ij = SVector(
                    mod(system.positions[1, i] - system.positions[1, j] + box / 2, box) - box / 2,
                    mod(system.positions[2, i] - system.positions[2, j] + box / 2, box) - box / 2
                )

                # Compute force, energy, laplacian for this pair
                force, energy, laplace_U = system.force_model(r_ij)

                # accumulate to i in registers
                f_i[1] += force[1]
                f_i[2] += force[2]

                # still must update j immediately
                local_forces[1, j] -= force[1]
                local_forces[2, j] -= force[2]

                # Accumulate pairwise energy and laplacian
                energy_sum += energy
                laplace_U_sum += laplace_U
            end
            # single write for particle i
            local_forces[1, i] += f_i[1]
            local_forces[2, i] += f_i[2]
        end

        energies_per_thread[tid] += energy_sum
        laplaceU_per_thread[tid] += 2 * laplace_U_sum  # Factor 2 for i and j.
    end

    # Reduction over threads: forces
     for i in 1:N
         for k in 1:2
            acc = 0.0
             for t in 1:n_threads
                acc += forces_per_thread[k, i, t]
            end
            system.forces[k, i] = acc
        end
    end

    # Reduction over threads: energies and Laplacian
    system.energy = sum(energies_per_thread)
    system.laplace_U = sum(laplaceU_per_thread)
end

#### END OF MULTITHREADED FORCE PART ##############################################

function apply_periodic_boundaries!(system::ParticleSystem)

     for i in eachindex(system.positions)
        system.positions[i] = mod(system.positions[i], system.boxlength)
    end

    return

end



### INTEGRATOR TYPES ##################################################################################################################

abstract type Integrator end   # Base type.

### BAOAB ########################################################################################################################
struct BAOAB <: Integrator

    stepsize::Float64        # Stepsize across a full integrator step.

    gamma::Float64          # Friction parameter in Langevin Dynamics.
    temperature::Float64     # Temperature of Langevin Dynamics.

    O_const1::Float64        # Constants used in the O-steps.
    O_const2::Float64

end

function BAOAB(stepsize::Real, gamma::Real, temperature::Real)

    O_const1 = exp(-gamma * stepsize)
    O_const2 = sqrt(temperature * (1 - O_const1^2))

    return BAOAB(stepsize, gamma, temperature, O_const1, O_const2)

end

function integrator_step!(system::ParticleSystem, integrator::BAOAB)

    stepsize_half = 0.5 * integrator.stepsize

    B_step!(stepsize_half, system)
    A_step!(stepsize_half, system)
    O_step!(system, integrator.O_const1, integrator.O_const2)
    A_step!(stepsize_half, system)
    apply_periodic_boundaries!(system)
    compute_forces!(system)
    B_step!(stepsize_half, system)

    return

end


function A_step!(stepsize::Float64, system::ParticleSystem)
    system.positions .+= stepsize * system.momenta
    return
end


function B_step!(stepsize::Float64, system::ParticleSystem)
    system.momenta .+= stepsize * system.forces
    return
end


function O_step!(system::ParticleSystem, O_const1::Float64, O_const2::Float64)
    system.momenta .= O_const1 * system.momenta + O_const2 * randn(size(system.momenta))
    return
end


### END OF INTEGRATOR TYPES.


function run_simulation!(system::ParticleSystem,
    integrator::Integrator=BAOAB;
    N_iteration::Int=2000)


    Random.seed!(1)  # Seed RNG.


    start = time()

    # Main loop.
    for i in 1:N_iteration

        # Full iteration of specified integrator. 
        integrator_step!(system, integrator)

        if i % 1000 == 0
            println("Iteration $i done! \n")
        end

    end

    println("Simulation took ", time() - start, " seconds!\n")

    return nothing

end




function run(;n_threads=Threads.nthreads())
    #%%## MAIN. ###############################################################################
    L = 15                # Boxlength.
    N = 1000              # No. of particles.
    h = 0.011              # Stepsize
    T = 1                # Set temperature.
    gamma = 1            # Friction parameter.
    N_iteration = 500   # Number of simulation steps.



    model = LennardJones(1, 0.5)    # Create pairwise force model to be used.
    system = ParticleSystem(L, N, model; n_threads)         # Create particle system.
    initialize_system!(system)                   # Initialize system.
    integrator = BAOAB(h, gamma, T)       # Create integrator.

    # Run Simulation.
    run_simulation!(system, integrator; N_iteration=N_iteration)
end

If you want to see something similar but with cell lists, here is one example.

The following analysis is given by Claude Opus 4.6 (thinking) in Antigravity. It points out aspects similar to what @lmiq mentioned.

========== Claude Opus 4.6 (thinking) output ==========

Performance Review: Multithreading Efficiency in main.jl

Overview

The particle simulation uses a BAOAB Langevin dynamics integrator with Lennard-Jones pairwise interactions. The force computation compute_forces! is the hotspot (O(N²) complexity). Despite multithreading via Threads.@threads, parallel efficiency remains poor. This document identifies the root causes at both the algorithm and Julia language levels.


Root Cause Analysis

1. Severe Load Imbalance (Algorithm — Critical)

The code uses Newton’s 3rd Law (N3L): only pairs (i, j) with j > i are computed, and forces are applied symmetrically. The outer loop over i is split into contiguous chunks across threads via ChunkSplitters.chunks.

The problem: the inner loop length is N - i, so earlier i values do far more work:

Thread   i range (N=1000, 4 threads)   Total pairs   Share
1        1–250                         ~218,625      43.8%
2        251–500                       ~156,125      31.3%
3        501–750                       ~93,625       18.7%
4        751–1000                      ~31,125       6.2%

Thread 1 does ~7× more work than thread 4. The overall speedup is bounded by the slowest thread, so with 4 threads you get at best ~2.3× instead of the ideal 4×. This is the dominant bottleneck.
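The split can be checked directly with a few lines of Julia (assuming equal contiguous chunks):

```julia
# With contiguous chunks, thread t's work is the sum of inner-loop lengths
# (N - i) over its i-range; the triangular loop makes early chunks heaviest.
N, nchunks = 1000, 4
sz = N ÷ nchunks
pairs = [sum(N - i for i in ((t - 1) * sz + 1):(t * sz)) for t in 1:nchunks]
total = sum(pairs)                            # N*(N-1)/2 = 499_500
shares = round.(100 .* pairs ./ total; digits = 1)
```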

2. Write Conflicts from Newton’s 3rd Law (Algorithm)

With N3L, processing pair (i, j) requires writing to both particle i and particle j. Since j may belong to a different thread’s range, direct writes would cause data races. The current solution — per-thread force buffers forces_per_thread (2 × N × n_threads) — avoids races but introduces:

  • Extra memory: O(2 × N × T) additional storage.
  • Initialization cost: fill!(forces_per_thread, 0.0) every step.
  • Reduction cost: a serial O(2 × N × T) reduction loop after the parallel section.
  • Poor cache locality: each thread writes to scattered j indices across its entire (2, N) slice.

3. Heap Allocations in the Force Function (Julia)

# Line 130
force = @MVector zeros(2)   # Allocates on the HEAP every call

@MVector zeros(2) creates a mutable MVector{2, Float64} that is heap-allocated here because it escapes via the returned tuple. With N=1000, compute_forces! calls the force function ~500,000 times per step. Each call allocates on the heap, creating massive GC pressure. Additionally, since the returned tuple (force, energy, laplace_U) contains a mutable object, the compiler cannot fully stack-allocate it.

Fix: Return scalar values (fx, fy, energy, laplace_U) or use immutable SVector. Scalars are the fastest — zero allocation, register-friendly.
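The effect is easy to see with @allocated; the sketch below uses a plain Vector as a stand-in for the mutable MVector (toy functions for illustration, not the ones in main.jl):

```julia
# Returning a mutable container forces a heap allocation once it escapes the callee.
force_vec(dx, dy) = ([dx, dy], dx^2 + dy^2)

# Returning plain Float64 scalars produces an isbits tuple: no heap allocation.
force_scalar(dx, dy) = (dx, dy, dx^2 + dy^2)

force_vec(1.0, 2.0); force_scalar(1.0, 2.0)   # warm up (compile first)
println(@allocated force_vec(1.0, 2.0))       # nonzero
println(@allocated force_scalar(1.0, 2.0))    # 0
```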

4. Temporary Array Allocations in O_step! (Julia)

# Line 327
system.momenta .= O_const1 * system.momenta + O_const2 * randn(size(system.momenta))

This single line allocates 3 temporary matrices of size (2, N):

  1. O_const1 * system.momenta
  2. randn(size(system.momenta))
  3. The sum of the above

Fix: Use randn! to generate random numbers in-place, and loop element-wise:

function O_step!(system::ParticleSystem, c1::Float64, c2::Float64)
    @inbounds for i in eachindex(system.momenta)
        system.momenta[i] = c1 * system.momenta[i] + c2 * randn()
    end
end

5. Temporary Allocations in A_step! / B_step! (Julia)

system.positions .+= stepsize * system.momenta

The expression stepsize * system.momenta allocates a temporary (2, N) matrix before the in-place addition. This happens 4 times per integrator step (2× A_step + 2× B_step).

Fix: Use the @. broadcast macro or explicit loops:

function A_step!(dt::Float64, system::ParticleSystem)
    @inbounds @simd for i in eachindex(system.positions)
        system.positions[i] += dt * system.momenta[i]
    end
end

6. Expensive mod() for Periodic Boundaries (Julia)

# Lines 214–215
r_ij[1] = mod(system.positions[1,i] - system.positions[1,j] + box/2, box) - box/2

The mod function on Float64 is relatively expensive (involves division + floor). The minimum image convention can be done more cheaply with round:

@inline function wrap(dx, box)
    return dx - box * round(dx / box)
end

This produces identical results for well-behaved displacements and is more SIMD-friendly.
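The equivalence is easy to sanity-check with a standalone snippet (note that exactly at dx = ±box/2 the two forms may pick opposite but physically equivalent images):

```julia
wrap_mod(dx, box)   = mod(dx + box / 2, box) - box / 2   # mod-based (current)
wrap_round(dx, box) = dx - box * round(dx / box)         # round-based (proposed)

box = 15.0
for dx in (-22.4, -7.49, -0.1, 0.0, 3.2, 7.49, 14.9)
    @assert isapprox(wrap_mod(dx, box), wrap_round(dx, box); atol=1e-9)
end
```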


Recommended Optimization Strategy

Drop Newton’s 3rd Law — Use Full-Pair Computation

The most impactful change: compute all (i, j) pairs with j ≠ i, instead of only j > i.

Why this is better despite 2× more computation:

Aspect            With N3L (current)               Without N3L (proposed)
Pairs computed    N(N-1)/2                         N(N-1)
Load balance      ~7:1 imbalance                   Perfect
Write conflicts   Yes → needs per-thread buffers   None → direct writes
Extra memory      (2, N, T) buffer                 None
Reduction step    O(N × T) serial                  None
Cache locality    Poor (scattered j writes)        Excellent (contiguous i writes)
Scalability       Degrades with more threads       Near-linear

On 4+ threads, the perfect parallelism and zero-overhead synchronization more than compensate for the doubled computation.

Proposed compute_forces!:

function compute_forces!(system::ParticleSystem)
    N   = system.N_particles
    box = system.boxlength
    half_box = box / 2
    inv_box  = 1.0 / box

    fill!(system.forces, 0.0)

    energies_per_thread = system.energies_per_thread
    laplaceU_per_thread = system.laplaceU_per_thread
    fill!(energies_per_thread, 0.0)
    fill!(laplaceU_per_thread, 0.0)

    Threads.@threads for (tid, chunk) in enumerate(chunks(1:N; n=Threads.nthreads()))  # `chunks` from ChunkSplitters.jl
        energy_sum   = 0.0
        laplaceU_sum = 0.0

        @inbounds for i in chunk
            fx_i = 0.0
            fy_i = 0.0

            xi = system.positions[1, i]
            yi = system.positions[2, i]

            for j in 1:N
                j == i && continue

                dx = xi - system.positions[1, j]
                dy = yi - system.positions[2, j]

                # Minimum image with round
                dx -= box * round(dx * inv_box)
                dy -= box * round(dy * inv_box)

                fx, fy, energy, lapU = system.force_model(dx, dy)

                fx_i += fx
                fy_i += fy
                energy_sum  += energy
                laplaceU_sum += lapU
            end

            system.forces[1, i] = fx_i
            system.forces[2, i] = fy_i
        end

        energies_per_thread[tid] = energy_sum * 0.5   # Each pair counted twice
        laplaceU_per_thread[tid] = laplaceU_sum        # Already per-particle
    end

    system.energy    = sum(energies_per_thread)
    system.laplace_U = sum(laplaceU_per_thread)
end

Scalar Force Function

function (lj::LennardJones)(dx::Float64, dy::Float64)
    r2 = dx * dx + dy * dy

    if r2 > lj.r_cutoff^2
        return 0.0, 0.0, 0.0, 0.0
    end

    r = sqrt(r2)
    sig_r_6 = (lj.sigma / r)^6

    LJ_energy = lj.four_epsilon * sig_r_6 * (sig_r_6 - 1)
    LJ_deriv  = -6 / r * (LJ_energy + lj.four_epsilon * sig_r_6^2)
    LJ_deriv2 = 1 / r2 * (42 * LJ_energy + 114 * lj.four_epsilon * sig_r_6 * sig_r_6)

    if r <= lj.r_switch
        S = 1.0; S_deriv = 0.0; S_deriv2 = 0.0
    else
        r_bar = (r - lj.r_switch) * lj.rc_rs_inv
        S        = 1 - 6*r_bar^5 + 15*r_bar^4 - 10*r_bar^3
        S_deriv  = -30 * (r_bar^4 - 2*r_bar^3 + r_bar^2) * lj.rc_rs_inv
        S_deriv2 = -60 * (2*r_bar^3 - 3*r_bar^2 + r_bar) * lj.rc_rs_inv^2
    end

    force_abs = LJ_deriv * S + LJ_energy * S_deriv
    inv_r = 1.0 / r

    fx = -force_abs * inv_r * dx
    fy = -force_abs * inv_r * dy
    energy  = LJ_energy * S
    lapU    = S_deriv2 * LJ_energy + 2 * S_deriv * LJ_deriv + S * LJ_deriv2

    return fx, fy, energy, lapU
end

Expected Performance Gains

Optimization                            Expected impact
Drop N3L → perfect load balance         Multithreading efficiency ~40% → 90%+
Scalar force function (no heap alloc)   ~20–30% single-thread speedup + GC reduction
In-place integrator (no temp arrays)    ~5 fewer (2, N) allocations per step
round-based minimum image               ~5–10% force computation speedup

Overall estimate: 3–5× total speedup on 4 threads compared to current implementation.


The exact prompt I used is the following (sorry for the inconvenience; the original is in Chinese, my native language, and is translated here):

As an outstanding numerical-algorithms engineer and senior Julia performance-optimization expert, please carefully review main.jl, identify why the multithreading efficiency is low, and consider how to optimize it at both the algorithm level and the Julia language level.

Point (1) above can be improved: ChunkSplitters provides the RoundRobin() splitter strategy, which can produce a more homogeneous workload:


    # Parallelize over threads, distribute i indices cyclically for better load balance.
    @threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads, split=RoundRobin()))

With that, I get:

julia> @time run(;n_threads=8)
Simulation took 1.0903079509735107 seconds!

  1.091641 seconds (48.10 k allocations: 64.407 MiB, 0.63% gc time)

julia> @time run(;n_threads=1)
Simulation took 7.34706711769104 seconds!

  7.352670 seconds (48.10 k allocations: 64.300 MiB, 0.19% gc time)

(the 8-thread time was ~1.7 s before). Seems quite close to optimal now.
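For intuition, the cyclic assignment RoundRobin() performs can be sketched in base Julia; with the j > i inner loop, each cyclic chunk ends up with a nearly identical pair count, in contrast to the heavy skew of contiguous chunks:

```julia
# Thread t handles i = t, t+T, t+2T, ..., which mixes cheap (large i) and
# expensive (small i) iterations of the j in (i+1):N inner loop.
N, T = 1000, 4
cyclic = [t:T:N for t in 1:T]
pairs = [sum(N - i for i in c) for c in cyclic]
println(pairs)   # [125250, 125000, 124750, 124500] -- nearly uniform
```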

code
########### DEFINITIONS ########################################################
using StaticArrays
using Random
using ChunkSplitters
using FastPow
using Base.Threads: @threads

abstract type AbstractForce end

mutable struct ParticleSystem{F<:AbstractForce}
    # HYPERPARAMETERS.   
    boxlength::Float64   # Length of the (cubic) simulation box [-L/2, L/2]^2.
    N_particles::Int64     # Number of particles.

    # CONFIGURATION.
    positions::Matrix{Float64}  # All matrices are of shape (2, N).
    momenta::Matrix{Float64}
    forces::Matrix{Float64}

    energy::Float64  # Potential energy in the system.
    laplace_U::Float64

    n_threads::Int
    forces_per_thread::Array{Float64,3} # Buffers for multithreading.
    energies_per_thread::Vector{Float64}
    laplaceU_per_thread::Vector{Float64}

    force_model::F  # Specify pairwise interactions.

end


function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F; n_threads=Threads.nthreads()) where {F<:AbstractForce}

    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta = Matrix{Float64}(undef, 2, N_particles)
    forces = Matrix{Float64}(undef, 2, N_particles)

    energy = 0.0
    laplaceU = 0.0

    # Thread-local buffers: lay out as (2, N, n_threads) so each thread
    # writes to its own contiguous slice, avoiding false sharing.
    forces_per_thread = zeros(Float64, 2, N_particles, n_threads)
    energies_per_thread = zeros(Float64, n_threads)
    laplaceU_per_thread = zeros(Float64, n_threads)

    return ParticleSystem{F}(boxlength, N_particles, positions, momenta,
        forces, energy, laplaceU,
        n_threads, forces_per_thread,
        energies_per_thread, laplaceU_per_thread,
        force_model)

end


function initialize_system!(system::ParticleSystem)

    # INITIALIZE POSITIONS.
    N = system.N_particles
    L = system.boxlength

    # Put particles on a grid in [0, L]^2.
    Nx = round(Int, sqrt(N)) # Particle numbers per dimension as equal as possible.
    gapx = L / Nx
    Ny = div(N, Nx)
    gapy = L / Ny
    N_rest = N - Nx * Ny

    n = 1
    for nx in 1:Nx

        x_pos = nx * gapx

        for ny in 1:Ny

            y_pos = ny * gapy

            system.positions[1, n] = x_pos
            system.positions[2, n] = y_pos

            n += 1

        end

    end

    # Put the remaining particles in the centers of the grid cells (where they have maximum space).
    for i in 1:N_rest
        system.positions[1, n] = system.positions[1, i] + gapx / 2
        system.positions[2, n] = system.positions[2, i] + gapy / 2
        n += 1
    end

    apply_periodic_boundaries!(system)  # To be safe.

    ## INITIALIZE MOMENTA.
    system.momenta .= 0

    ## INITIALIZE FORCES.
    compute_forces!(system)

end


struct LennardJones <: AbstractForce

    epsilon::Float64  # LJ parameters.
    sigma::Float64

    r_cutoff::Float64  # Switch function parameters.
    r_switch::Float64

    four_epsilon::Float64 # Help constants for computations.
    rc_rs_inv::Float64

end


function LennardJones(epsilon::Real, sigma::Real)
    r_cutoff = 2.5 * sigma
    r_switch = 0.9 * r_cutoff
    four_eps = 4 * epsilon
    rc_rs_inv = 1 / (r_cutoff - r_switch)
    return LennardJones(epsilon, sigma, r_cutoff, r_switch, four_eps, rc_rs_inv)
end


function (lj::LennardJones)(r_ij::SVector{2,Float64})

    force = zeros(SVector{2,Float64})

    r2 = sum(abs2, r_ij)

    if r2 > lj.r_cutoff^2

        return force, 0.0, 0.0    # No interaction. Float64 zeros keep the return type stable.

    end

    r = sqrt(r2)

    # LJ terms.
    @fastpow sig_r_6 = (lj.sigma / r)^6

    LJ_energy = lj.four_epsilon * sig_r_6 * (sig_r_6 - 1)

    LJ_deriv = -6 / r * (LJ_energy + lj.four_epsilon * sig_r_6^2)

    LJ_deriv2 = 1 / r2 * (42 * LJ_energy + 114 * lj.four_epsilon * sig_r_6 * sig_r_6)

    # Switch terms.
    if r <= lj.r_switch
        S = 1.0       # Float64 literals keep S, S_deriv, S_deriv2 type-stable across branches.
        S_deriv = 0.0
        S_deriv2 = 0.0
    else
        r_bar = (r - lj.r_switch) * lj.rc_rs_inv

        @fastpow S = 1 - 6 * r_bar^5 + 15 * r_bar^4 - 10 * r_bar^3

        @fastpow S_deriv = -30 * (r_bar^4 - 2 * r_bar^3 + r_bar^2) * lj.rc_rs_inv

        S_deriv2 = -60 * (2 * r_bar^3 - 3 * r_bar^2 + r_bar) * lj.rc_rs_inv^2
    end

    # Final results for force and energy.
    force_absolute = LJ_deriv * S + LJ_energy * S_deriv

    force = -force_absolute / r .* r_ij

    energy = LJ_energy * S

    laplace_U = S_deriv2 * LJ_energy + 2 * S_deriv * LJ_deriv + S * LJ_deriv2
    return force, energy, laplace_U

end


#### MULTITHREADED FORCE PART ####################################################

function compute_forces!(system::ParticleSystem)
    N = system.N_particles
    box = system.boxlength

    # Zero out forces, energy, laplacian
    fill!(system.forces, 0.0)
    system.energy = 0.0
    system.laplace_U = 0.0

    forces_per_thread = system.forces_per_thread
    energies_per_thread = system.energies_per_thread
    laplaceU_per_thread = system.laplaceU_per_thread
    n_threads = system.n_threads

    fill!(forces_per_thread, 0.0)
    fill!(energies_per_thread, 0.0)
    fill!(laplaceU_per_thread, 0.0)

    # Parallelize over threads, distribute i indices cyclically for better load balance.
    @threads for (tid, chunk) in enumerate(ChunkSplitters.chunks(1:N; n=n_threads, split=RoundRobin()))

        energy_sum::Float64 = 0
        laplace_U_sum::Float64 = 0

        local_forces = @view forces_per_thread[:, :, tid]

        for i in chunk
            f_i = SVector(0.0, 0.0) # thread-local accumulator for particle i

            for j in (i+1):N
                # Minimum image convention for 2D box
                r_ij = SVector(
                    mod(system.positions[1, i] - system.positions[1, j] + box / 2, box) - box / 2,
                    mod(system.positions[2, i] - system.positions[2, j] + box / 2, box) - box / 2
                )

                # Compute force, energy, laplacian for this pair
                force, energy, laplace_U = system.force_model(r_ij)

                # accumulate to i in registers
                f_i += force

                # still must update j immediately
                local_forces[1, j] -= force[1]
                local_forces[2, j] -= force[2]

                # Accumulate pairwise energy and laplacian
                energy_sum += energy
                laplace_U_sum += laplace_U
            end
            # single write for particle i
            local_forces[1, i] += f_i[1]
            local_forces[2, i] += f_i[2]
        end

        energies_per_thread[tid] += energy_sum
        laplaceU_per_thread[tid] += 2 * laplace_U_sum  # Factor 2 for i and j.
    end

    # Reduction over threads: forces
    for i in 1:N
        for k in 1:2
            acc = 0.0
            for t in 1:n_threads
                acc += forces_per_thread[k, i, t]
            end
            system.forces[k, i] = acc
        end
    end

    # Reduction over threads: energies and Laplacian
    system.energy = sum(energies_per_thread)
    system.laplace_U = sum(laplaceU_per_thread)
end

#### END OF MULTITHREADED FORCE PART ##############################################

function apply_periodic_boundaries!(system::ParticleSystem)

    for i in eachindex(system.positions)
        system.positions[i] = mod(system.positions[i], system.boxlength)
    end

    return

end



### INTEGRATOR TYPES ##################################################################################################################

abstract type Integrator end   # Base type.

### BAOAB ########################################################################################################################
struct BAOAB <: Integrator

    stepsize::Float64        # Stepsize across a full integrator step.

    gamma::Float64          # Friction parameter in Langevin Dynamics.
    temperature::Float64     # Temperature of Langevin Dynamics.

    O_const1::Float64        # Constants used in the O-steps.
    O_const2::Float64

end

function BAOAB(stepsize::Real, gamma::Real, temperature::Real)

    O_const1 = exp(-gamma * stepsize)
    O_const2 = sqrt(temperature * (1 - O_const1^2))

    return BAOAB(stepsize, gamma, temperature, O_const1, O_const2)

end

function integrator_step!(system::ParticleSystem, integrator::BAOAB)

    stepsize_half = 0.5 * integrator.stepsize

    B_step!(stepsize_half, system)
    A_step!(stepsize_half, system)
    O_step!(system, integrator.O_const1, integrator.O_const2)
    A_step!(stepsize_half, system)
    apply_periodic_boundaries!(system)
    compute_forces!(system)
    B_step!(stepsize_half, system)

    return

end


function A_step!(stepsize::Float64, system::ParticleSystem)
    system.positions .+= stepsize * system.momenta
    return
end


function B_step!(stepsize::Float64, system::ParticleSystem)
    system.momenta .+= stepsize * system.forces
    return
end


function O_step!(system::ParticleSystem, O_const1::Float64, O_const2::Float64)
    system.momenta .= O_const1 * system.momenta + O_const2 * randn(size(system.momenta))
    return
end


### END OF INTEGRATOR TYPES.


function run_simulation!(system::ParticleSystem,
    integrator::Integrator;   # No default: `BAOAB` is a type, not an Integrator instance.
    N_iteration::Int=2000)


    Random.seed!(1)  # Seed RNG.


    start = time()

    # Main loop.
    for i in 1:N_iteration

        # Full iteration of specified integrator. 
        integrator_step!(system, integrator)

        if i % 1000 == 0
            println("Iteration $i done! \n")
        end

    end

    println("Simulation took ", time() - start, " seconds!\n")

    return nothing

end




function run(; n_threads=Threads.nthreads())
    #%%## MAIN. ###############################################################################
    L = 15                # Boxlength.
    N = 1000              # No. of particles.
    h = 0.011              # Stepsize
    T = 1                # Set temperature.
    gamma = 1            # Friction parameter.
    N_iteration = 500   # Number of simulation steps.



    model = LennardJones(1, 0.5)    # Create pairwise force model to be used.
    system = ParticleSystem(L, N, model; n_threads)         # Create particle system.
    initialize_system!(system)                   # Initialize system.
    integrator = BAOAB(h, gamma, T)       # Create integrator.

    # Run Simulation.
    run_simulation!(system, integrator; N_iteration=N_iteration)
end

Below is the version optimized by Opus 4.6 and the summary:

Particle Simulation: Optimization Summary & Benchmark Results

Original Code Issues

Six performance bottlenecks were identified in main.jl, grouped into two categories:

Algorithm Level

  1. Severe load imbalance — Newton’s 3rd Law (N3L) computes only pairs (i, j) with j > i, but chunks are split by contiguous i ranges. With 4 contiguous chunks, thread 1 (i = 1..250) processes ~219K pairs while thread 4 (i = 751..1000) processes only ~31K pairs, a 7:1 imbalance. This is the primary cause of poor multithreading efficiency.

  2. Write conflicts from N3L — Updating both force[i] and force[j] from the same pair requires per-thread force buffers (2 × N × T), adding memory overhead, cache pollution, and a serial reduction step.

Julia Language Level

  1. Heap allocation in force function@MVector zeros(2) on line 130 heap-allocates a mutable vector on every call (~500K calls per step), creating massive GC pressure.

  2. Temporary arrays in O_step!randn(size(system.momenta)) allocates 3 temporary (2, N) matrices per step.

  3. Temporary arrays in A_step! / B_step!stepsize * system.momenta allocates a temporary matrix (4 calls per step).

  4. Expensive mod() for periodic boundariesmod on Float64 is slower than the equivalent round-based wrap.


Optimizations Applied in main_opt.jl

Optimizations applied, with rationale:

  1. Full-pair computation (dropped N3L) — Perfect load balance; each thread writes only to its own particles; no per-thread force buffers; no reduction step. Although the pair count doubles, the elimination of synchronization overhead and the perfect scaling more than compensate.

  2. Scalar force returns — The force function returns (fx, fy, energy, lapU) as 4 Float64 scalars instead of (MVector, Float64, Float64). Zero heap allocation, fully register-friendly.

  3. Zero-allocation integrator — A_step! and B_step! use @inbounds @simd element-wise loops. O_step! calls randn() per element instead of allocating a temporary matrix.

  4. round()-based minimum image — Replaced mod(dx + L/2, L) - L/2 with dx - L * round(dx / L), which is faster and more SIMD-friendly.

  5. Precomputed r_cutoff_sq — Added r_cutoff^2 as a struct field to avoid recomputing it every call.

Benchmark Results

Setup: N = 1000 particles, 2000 iterations (after JIT warmup), Lennard-Jones potential, BAOAB integrator.

Version     Threads   Time (s)   Speedup vs Original/1T   Parallel efficiency
Original    1         40.21      1.00×
Original    8         11.36      3.54×                    44%
Optimized   1         27.77      1.45×
Optimized   8         3.99       10.08×                   87%

Analysis

  • Original multithreading efficiency: 44% — Confirms the severe load imbalance from N3L with contiguous chunking. With 8 threads, only a 3.54× speedup is achieved.

  • Optimized single-thread: 1.45× faster — Despite computing 2× more pairs (full-pair vs N3L half-pair), the optimized version is 31% faster on a single thread. This demonstrates that heap allocation overhead from MVector and temporary arrays was extremely significant.

  • Optimized multithreading efficiency: 87% — Full-pair computation gives near-perfect load balance. The 8-thread optimized version achieves 6.96× speedup over its single-threaded counterpart.

  • Overall speedup: 10.08× — From original single-threaded (40.21s) to optimized 8-threaded (3.99s).

Speedup Decomposition

Original/1T  →  Optimized/1T:  1.45×  (allocation elimination)
Optimized/1T →  Optimized/8T:  6.96×  (parallel scaling)
─────────────────────────────────────
Total:                        10.08×

source:

########### OPTIMIZED PARTICLE SIMULATION ########################################
#
# Key optimizations vs main.jl:
#   1. Dropped Newton's 3rd Law → full‑pair computation for perfect load balance
#   2. Scalar force returns (no heap‑allocated MVector)
#   3. In‑place integrator steps (no temporary array allocations)
#   4. round()‑based minimum image convention (faster than mod())
#   5. Removed per‑thread force buffers (no write conflicts, no reduction)
#
##############################################################################

using StaticArrays
using Random
using ChunkSplitters

abstract type AbstractForce end

mutable struct ParticleSystem{F <: AbstractForce}
    # HYPERPARAMETERS.
    boxlength   :: Float64
    N_particles :: Int64

    # CONFIGURATION.
    positions :: Matrix{Float64}   # (2, N)
    momenta   :: Matrix{Float64}
    forces    :: Matrix{Float64}

    energy     :: Float64
    laplace_U  :: Float64

    # Thread‑local scalar accumulators (only energy + laplacian, no force buffer).
    n_threads           :: Int
    energies_per_thread :: Vector{Float64}
    laplaceU_per_thread :: Vector{Float64}

    force_model :: F
end


function ParticleSystem(boxlength::Real, N_particles::Int, force_model::F) where {F<:AbstractForce}
    positions = Matrix{Float64}(undef, 2, N_particles)
    momenta   = Matrix{Float64}(undef, 2, N_particles)
    forces    = Matrix{Float64}(undef, 2, N_particles)

    n_threads = Threads.nthreads()
    energies_per_thread = zeros(Float64, n_threads)
    laplaceU_per_thread = zeros(Float64, n_threads)

    return ParticleSystem{F}(boxlength, N_particles,
                             positions, momenta, forces,
                             0.0, 0.0,
                             n_threads, energies_per_thread, laplaceU_per_thread,
                             force_model)
end


function initialize_system!(system::ParticleSystem)
    N = system.N_particles
    L = system.boxlength

    Nx   = round(Int, sqrt(N))
    gapx = L / Nx
    Ny   = div(N, Nx)
    gapy = L / Ny
    N_rest = N - Nx * Ny

    n = 1
    for nx in 1:Nx
        x_pos = nx * gapx
        for ny in 1:Ny
            y_pos = ny * gapy
            system.positions[1, n] = x_pos
            system.positions[2, n] = y_pos
            n += 1
        end
    end

    for i in 1:N_rest
        system.positions[1, n] = system.positions[1, i] + gapx / 2
        system.positions[2, n] = system.positions[2, i] + gapy / 2
        n += 1
    end

    apply_periodic_boundaries!(system)

    system.momenta .= 0.0

    compute_forces!(system)
end


########### LENNARD-JONES (scalar interface) ###################################

struct LennardJones <: AbstractForce
    epsilon      :: Float64
    sigma        :: Float64
    r_cutoff     :: Float64
    r_switch     :: Float64
    four_epsilon :: Float64
    rc_rs_inv    :: Float64
    r_cutoff_sq  :: Float64   # precomputed cutoff²
end


function LennardJones(epsilon::Real, sigma::Real)
    r_cutoff  = 2.5 * sigma
    r_switch  = 0.9 * r_cutoff
    four_eps  = 4 * epsilon
    rc_rs_inv = 1 / (r_cutoff - r_switch)
    return LennardJones(epsilon, sigma, r_cutoff, r_switch, four_eps, rc_rs_inv, r_cutoff * r_cutoff)
end


# Returns (fx, fy, energy, laplace_U) — all scalars, zero allocation.
@inline function (lj::LennardJones)(dx::Float64, dy::Float64)
    r2 = dx * dx + dy * dy

    if r2 > lj.r_cutoff_sq
        return 0.0, 0.0, 0.0, 0.0
    end

    r     = sqrt(r2)
    inv_r = 1.0 / r

    sig_r_6 = (lj.sigma * inv_r)^6

    LJ_energy = lj.four_epsilon * sig_r_6 * (sig_r_6 - 1)
    LJ_deriv  = -6 * inv_r * (LJ_energy + lj.four_epsilon * sig_r_6 * sig_r_6)
    LJ_deriv2 = (1.0 / r2) * (42 * LJ_energy + 114 * lj.four_epsilon * sig_r_6 * sig_r_6)

    # Switch function.
    if r <= lj.r_switch
        S        = 1.0
        S_deriv  = 0.0
        S_deriv2 = 0.0
    else
        r_bar    = (r - lj.r_switch) * lj.rc_rs_inv
        r_bar2   = r_bar * r_bar
        r_bar3   = r_bar2 * r_bar
        r_bar4   = r_bar3 * r_bar
        r_bar5   = r_bar4 * r_bar
        S        = 1.0 - 6*r_bar5 + 15*r_bar4 - 10*r_bar3
        S_deriv  = -30 * (r_bar4 - 2*r_bar3 + r_bar2) * lj.rc_rs_inv
        S_deriv2 = -60 * (2*r_bar3 - 3*r_bar2 + r_bar) * lj.rc_rs_inv^2
    end

    force_abs = LJ_deriv * S + LJ_energy * S_deriv

    fx = -force_abs * inv_r * dx
    fy = -force_abs * inv_r * dy

    energy  = LJ_energy * S
    lapU    = S_deriv2 * LJ_energy + 2 * S_deriv * LJ_deriv + S * LJ_deriv2

    return fx, fy, energy, lapU
end


########### FORCE COMPUTATION (full pair, perfectly balanced) ###################

function compute_forces!(system::ParticleSystem)
    N       = system.N_particles
    box     = system.boxlength
    inv_box = 1.0 / box

    fill!(system.forces, 0.0)

    energies_per_thread = system.energies_per_thread
    laplaceU_per_thread = system.laplaceU_per_thread
    n_threads           = system.n_threads

    fill!(energies_per_thread, 0.0)
    fill!(laplaceU_per_thread, 0.0)

    positions  = system.positions
    forces     = system.forces
    force_model = system.force_model

    Threads.@threads for (tid, chunk) in enumerate(chunks(1:N; n=n_threads))
        energy_local  = 0.0
        laplaceU_local = 0.0

        @inbounds for i in chunk
            fx_i = 0.0
            fy_i = 0.0

            xi = positions[1, i]
            yi = positions[2, i]

            for j in 1:N
                j == i && continue

                dx = xi - positions[1, j]
                dy = yi - positions[2, j]

                # Minimum image convention via round (faster than mod).
                dx -= box * round(dx * inv_box)
                dy -= box * round(dy * inv_box)

                fx, fy, energy, lapU = force_model(dx, dy)

                fx_i += fx
                fy_i += fy
                energy_local  += energy
                laplaceU_local += lapU
            end

            # Each thread writes ONLY to its own particle range — no conflicts.
            forces[1, i] = fx_i
            forces[2, i] = fy_i
        end

        # Factor 0.5: each pair (i,j) counted twice in full‑pair scheme.
        energies_per_thread[tid] = energy_local * 0.5
        laplaceU_per_thread[tid] = laplaceU_local
    end

    system.energy    = sum(energies_per_thread)
    system.laplace_U = sum(laplaceU_per_thread)
end


########### PERIODIC BOUNDARIES ################################################

function apply_periodic_boundaries!(system::ParticleSystem)
    box = system.boxlength
    @inbounds for i in eachindex(system.positions)
        system.positions[i] = mod(system.positions[i], box)
    end
end


########### INTEGRATOR — BAOAB (zero‑allocation) ###############################

abstract type Integrator end

struct BAOAB <: Integrator
    stepsize    :: Float64
    gamma       :: Float64
    temperature :: Float64
    O_const1    :: Float64
    O_const2    :: Float64
end

function BAOAB(stepsize::Real, gamma::Real, temperature::Real)
    O_const1 = exp(-gamma * stepsize)
    O_const2 = sqrt(temperature * (1 - O_const1^2))
    return BAOAB(stepsize, gamma, temperature, O_const1, O_const2)
end


function integrator_step!(system::ParticleSystem, integrator::BAOAB)
    dt_half = 0.5 * integrator.stepsize

    B_step!(dt_half, system)
    A_step!(dt_half, system)
    O_step!(system, integrator.O_const1, integrator.O_const2)
    A_step!(dt_half, system)
    apply_periodic_boundaries!(system)
    compute_forces!(system)
    B_step!(dt_half, system)

    return nothing
end


# All integrator sub‑steps: zero allocation via element‑wise loops.

function A_step!(dt::Float64, system::ParticleSystem)
    @inbounds @simd for i in eachindex(system.positions)
        system.positions[i] += dt * system.momenta[i]
    end
    return nothing
end

function B_step!(dt::Float64, system::ParticleSystem)
    @inbounds @simd for i in eachindex(system.momenta)
        system.momenta[i] += dt * system.forces[i]
    end
    return nothing
end

function O_step!(system::ParticleSystem, c1::Float64, c2::Float64)
    @inbounds for i in eachindex(system.momenta)
        system.momenta[i] = c1 * system.momenta[i] + c2 * randn()
    end
    return nothing
end


########### SIMULATION DRIVER #################################################

function run_simulation!(system::ParticleSystem,
                         integrator::Integrator;
                         N_iteration::Int=2000)

    Random.seed!(1)

    start = time()

    for i in 1:N_iteration
        integrator_step!(system, integrator)

        if i % 1000 == 0
            println("Iteration $i done!\n")
        end
    end

    elapsed = time() - start
    println("Simulation took ", elapsed, " seconds!\n")

    return elapsed
end


########### MAIN ###############################################################

L           = 15
N           = 1000
h           = 0.011
T           = 1
gamma       = 1
N_iteration = 5000

model      = LennardJones(1, 0.5)
system     = ParticleSystem(L, N, model)
initialize_system!(system)
integrator = BAOAB(h, gamma, T)

results = run_simulation!(system, integrator; N_iteration=N_iteration)

Thanks guys for the help. This improved speed by more than an order of magnitude!

There is an elegant strategy to remove the imbalance and improve scaling. May be worth considering.

  1. Sort particles in ascending order by X coordinate.
  2. In the force loop, consider only the right half-space for each particle (for j in i+1:i+npart, wrapping the index with mod1(j, npart)) and stop once the x-separation (r[mod1(j, npart)] - r[i])[1] exceeds r_cut (accounting for periodicity).
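A minimal sketch of that sweep, without periodicity for clarity (the function and variable names are illustrative, not from the thread's code):

```julia
# Sorted-sweep neighbor search: after sorting by x, the scan for particle i can
# stop as soon as the x-separation alone exceeds r_cut, since every later
# particle in the sorted order is at least that far away in x.
function sweep_pairs(x::Vector{Float64}, y::Vector{Float64}, r_cut::Float64)
    p = sortperm(x)                       # indices in ascending x order
    pairs = Tuple{Int,Int}[]
    for a in eachindex(p)
        for b in (a + 1):length(p)
            i, j = p[a], p[b]
            x[j] - x[i] > r_cut && break  # prune the rest of this inner loop
            if (x[j] - x[i])^2 + (y[j] - y[i])^2 <= r_cut^2
                push!(pairs, minmax(i, j))
            end
        end
    end
    return pairs
end

x = [0.1, 0.2, 0.9, 5.0]
y = [0.0, 0.0, 0.0, 0.0]
sweep_pairs(x, y, 1.0)   # [(1, 2), (1, 3), (2, 3)]
```

For local interactions this turns the O(N²) scan into roughly O(N·k) work per step; the periodic variant additionally wraps the j index around and shifts the wrapped coordinates by the box length.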