Code parallelization for Brownian Dynamics simulation of hard spheres

Hi All!

I am using Julia to do Brownian dynamics simulation of many hard spheres in 2D and 3D (employing the repulsive LJ interaction). My code is mostly based on SoftSquishymatter.jl package, but the main difference is that I am using SVectors to store positions/velocities/forces. I am using the cell-linked list method to avoid computing the unnecessary intractions and atom decomposition for calculating forces between particles. Using ```Threads.@threads’’’ I am getting speedup moving from 1 core to 2 cores and so on. But this speedup is saturating at some point. I expected, ideally, to get speedup with a factor of the number of cores that I am using. However, it turned out that, for example, using moving from 10 to 20 cores, I don’t get any detectable speedup. Does anybody have any advice regarding this issue(?)?

This the part that my code integrates the equations of motion:

function Update_Particles!(brownian::Brownian, periodicity::SVector)
	dim = size(periodicity,1)
    Threads.@threads for particle in brownian.particles
			particle.v = particle.f .+ sqrt(2.0 /brownian.dt) .* SRandns(dim)
			particle.r = particle.r .+ brownian.dt .* particle.v

	        PBC!(particle, periodicity)
	        particle.f = SZeros(dim)

This is the part that my code computes the interaction between particles:

function compute_interactions!(wca::WCA, periodicity::SVector)
	dim = size(periodicity,1)
	if dim == 2
	    Threads.@nthread for particle in wca.particles
	        i = trunc(Int64, particle.r[1] / wca.cell_list.cell_dr[1])
	        j = trunc(Int64, particle.r[2] / wca.cell_list.cell_dr[2])
	        f_x, f_y = 0.0, 0.0
	        @inbounds for dj = -1 : 1, di = -1 : 1
	            idi = mod(i + di, wca.cell_list.num_cells[1]) + 1
	            jdj = mod(j + dj, wca.cell_list.num_cells[2]) + 1
	            pid = wca.cell_list.start_pid[idi, jdj]
	            while pid > 0
	                neighbor = wca.cell_list.particles[pid]
	                Δx = wrap_displacement(particle.r[1] - neighbor.r[1]; period = periodicity[1])
	                Δy = wrap_displacement(particle.r[2] - neighbor.r[2]; period = periodicity[2])
	                Δr² = Δx^2 + Δy^2
                       σ² = 0.25*(particle.σ + neighbor.σ)^2
			rc² = wca.rc^2
	                if 0.0 < Δr² < rc²
	                    inv² = σ² / Δr²
	                    inv⁶ = inv² * inv² * inv²
	                    coef = wca.ϵ * (48.0 * inv⁶ - 24.0) * inv⁶ / Δr²

	                    f_x += (_f_x = coef * Δx)
	                    f_y += (_f_y = coef * Δy)
	                    if wca.use_newton_3rd
	                        neighbor.f = SVector{2,Float64}(neighbor.f .- [f_x,f_y])
	                pid = wca.cell_list.next_pid[pid]
			particle.f = SVector{2,Float64}(particle.f .+ [f_x,f_y])

Communication may take time too, so such “saturation” is expected to happen.
Consider an extreme case:
If you have 4 particles and you wanna transport them based on their positions and velocities.
It won’t help at all if you assign more than 4 threads to this process.

1 Like

Note that most of your code can be written in a dimension-independent manner. But, in particular, change


particle.f = particle.f + SVector{2,Float64}(f_x,f_y)

and that will remove the allocation of the [f_x,f_y] intermediate vector. Removing this allocations will speedup the code in general, and may have implications for the scaling at the end.

To improve scaling you will need to improve the distribution of tasks, for example by splitting the tasks in nthreads chuncks of similar size. I have done a cell list implementation that does that reasonably here: CellListMap.jl/CellListMap.jl at 1629b7ac8c621d002eeb6294385f65f9c8d4eb17 · m3g/CellListMap.jl · GitHub. But that is not ideal yet, probably.

@Skoffer gave a lecture on how to that properly here: Nbabel nbody integrator speed up - #32 by Skoffer


Thanks. Changing the part you mentioned, slightly (about 3-4%) speeds up the code.