Code parallelization for Brownian Dynamics simulation of hard spheres

Hi All!

I am using Julia to do Brownian dynamics simulation of many hard spheres in 2D and 3D (employing the repulsive LJ interaction). My code is mostly based on SoftSquishymatter.jl package, but the main difference is that I am using SVectors to store positions/velocities/forces. I am using the cell-linked list method to avoid computing the unnecessary intractions and atom decomposition for calculating forces between particles. Using ```Threads.@threads’’’ I am getting speedup moving from 1 core to 2 cores and so on. But this speedup is saturating at some point. I expected, ideally, to get speedup with a factor of the number of cores that I am using. However, it turned out that, for example, using moving from 10 to 20 cores, I don’t get any detectable speedup. Does anybody have any advice regarding this issue(?)?

This the part that my code integrates the equations of motion:

function Update_Particles!(brownian::Brownian, periodicity::SVector)
	dim = size(periodicity,1)
    Threads.@threads for particle in brownian.particles
			particle.v = particle.f .+ sqrt(2.0 /brownian.dt) .* SRandns(dim)
			particle.r = particle.r .+ brownian.dt .* particle.v

	        PBC!(particle, periodicity)
	        particle.f = SZeros(dim)
    end
end

This is the part that my code computes the interaction between particles:

function compute_interactions!(wca::WCA, periodicity::SVector)
	dim = size(periodicity,1)
	if dim == 2
	    Threads.@nthread for particle in wca.particles
	        i = trunc(Int64, particle.r[1] / wca.cell_list.cell_dr[1])
	        j = trunc(Int64, particle.r[2] / wca.cell_list.cell_dr[2])
	        f_x, f_y = 0.0, 0.0
	        @inbounds for dj = -1 : 1, di = -1 : 1
	            idi = mod(i + di, wca.cell_list.num_cells[1]) + 1
	            jdj = mod(j + dj, wca.cell_list.num_cells[2]) + 1
	            pid = wca.cell_list.start_pid[idi, jdj]
	            while pid > 0
	                neighbor = wca.cell_list.particles[pid]
	                Δx = wrap_displacement(particle.r[1] - neighbor.r[1]; period = periodicity[1])
	                Δy = wrap_displacement(particle.r[2] - neighbor.r[2]; period = periodicity[2])
	                Δr² = Δx^2 + Δy^2
                       σ² = 0.25*(particle.σ + neighbor.σ)^2
			rc² = wca.rc^2
	                if 0.0 < Δr² < rc²
	                    inv² = σ² / Δr²
	                    inv⁶ = inv² * inv² * inv²
	                    coef = wca.ϵ * (48.0 * inv⁶ - 24.0) * inv⁶ / Δr²

	                    f_x += (_f_x = coef * Δx)
	                    f_y += (_f_y = coef * Δy)
	                    if wca.use_newton_3rd
	                        neighbor.f = SVector{2,Float64}(neighbor.f .- [f_x,f_y])
	                    end
	                end
	                pid = wca.cell_list.next_pid[pid]
	            end
	        end
			particle.f = SVector{2,Float64}(particle.f .+ [f_x,f_y])
		end
       end
end

Communication may take time too, so such “saturation” is expected to happen.
Consider an extreme case:
If you have 4 particles and you wanna transport them based on their positions and velocities.
It won’t help at all if you assign more than 4 threads to this process.

1 Like

Note that most of your code can be written in a dimension-independent manner. But, in particular, change

to:

particle.f = particle.f + SVector{2,Float64}(f_x,f_y)

and that will remove the allocation of the [f_x,f_y] intermediate vector. Removing this allocations will speedup the code in general, and may have implications for the scaling at the end.

To improve scaling you will need to improve the distribution of tasks, for example by splitting the tasks in nthreads chuncks of similar size. I have done a cell list implementation that does that reasonably here: CellListMap.jl/CellListMap.jl at 1629b7ac8c621d002eeb6294385f65f9c8d4eb17 · m3g/CellListMap.jl · GitHub. But that is not ideal yet, probably.

@Skoffer gave a lecture on how to that properly here: Nbabel nbody integrator speed up - #32 by Skoffer

3 Likes

Thanks. Changing the part you mentioned, slightly (about 3-4%) speeds up the code.

Do you have any idea regarding GPU implementation of a MD/BD code? I am not familiar with it, but it seems that implementation of the cell-list is tricky part in doing so.

With the current code that I have it seems that more or less I will not be able to get much faster simulations and I am thinking of moving to GPU.

No that I’m aware of. I have implemented this pacakge, though: GitHub - m3g/CellListMap.jl: Flexible implementation of cell lists to map the calculations of particle-pai.

I think it provides a reasonably performant cell list implementation, which run in parallel in CPUs. Maybe that helps, it is optimized in many aspects relative to a “simple” cell list implementation. Taking care of preallocating the auxiliary vectors, it could be used to do simulations.

In simulations, however, it is usually the case that the cell lists are updated only periodically. That can be emulated with that package if you set the cutoff for building the cell lists to be greater than the actual cutoff you are using to compute the interactions, and add a conditional inside the evaluated function to compute the interactions only if the actual cutoff is satisfied.

All that said, this is not typically how a simulation is run, which is somewhat simpler. One uses cell lists to build lists of neighbours of each particle, and carry those lists of neighbours for the next steps. Given the lists of neighbours for each particle, computing the interactions is quite straightforward and, depending on the function being computed for each pair, very easy to vectorize, simd, etc.