Accelerate pairwise Lennard-Jones force computation

Right, I’d thought you tried NearestNeighbor already. I’ll have to code up the tree from the blogpost I linked to be sure, but I think BallTree is not taking advantage of the kind of data you’re storing, whereas an implementation that could take advantage would be significantly faster. I may be wrong on this one.

I’d use an Octree then.

To me it seems that building the neighbor lists is significantly faster than doing even a single force calculation (and you typically only rebuild the lists every ten or so timesteps). This means that it would be very beneficial to sacrifice some performance in building the lists using another type of data structure if that increases the performance of the force calculation. Am I wrong?

The use of cell lists (what I’m doing) has been pretty much the standard approach in all molecular dynamics simulation packages for decades, so my guess is that everything else has been tried. Of course the devil lives in the implementation details, and it is not impossible that you or someone else finds an implementation of trees that beats my implementation there. But if I were going to spend time on this (I am, actually :slight_smile: ), it would be on perfecting the cell list approach.

Note, however, that a lot of work and detail goes into supporting general periodic boundary conditions. The simple neighbourlist function there, which does not use them, has little extra work to perform because its problem has no PBCs, unlike my case.
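For the record, the two kinds of boxes look roughly like this in CellListMap.jl (a sketch based on the calls used below; the periodic case shown is the orthorhombic one):

using CellListMap, StaticArrays

x = [rand(SVector{3,Float64}) for _ in 1:10_000]
cutoff = 0.05

# non-periodic system: the box is built from the limits of the coordinates
box_nopbc = Box(limits(x), cutoff)

# periodic system (orthorhombic): pass the box side lengths instead
box_pbc = Box([1.0, 1.0, 1.0], cutoff)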

Just as a side note, I can also split the time required for building the cell lists from the time required for computing the neighbour list:

julia> @btime Box(limits($x,$y),0.05); # define box (limits create no-PBC boundaries)
  115.798 μs (120 allocations: 6.58 KiB)

julia> @btime CellList($x,$y,$box); # Build cell lists
  2.281 ms (13439 allocations: 3.63 MiB)

julia> @btime CellListMap.neighbourlist($box,$cl,parallel=true); # 4 cores
  2.321 ms (90 allocations: 5.26 MiB)

julia> @btime CellListMap.neighbourlist($box,$cl,parallel=false);
  7.358 ms (16 allocations: 3.00 MiB)

Also, note that building cell lists is not the ultimate goal of the package; the goal is to map pairwise-dependent functions onto the pairs that are within the desired cutoff. For that, one does not really need to build the list of pairs at all, and using the package to build a list of pairs is just one specific application of it.

You are completely right. There are two things here: my goal is not specifically to use the package to run MD simulations, but to use it to analyze trajectories and compute other distance-dependent functions in optimization problems. This means that the particle positions can change a lot at every “step”, so the list must be rebuilt at every step (unlike in an MD code, where one knows that the particles move only a little between steps). If I were actually implementing an MD code for production, I would rebuild the list less often, as MD packages do (to be clear, I am forcing NAMD to compute the list at every step in my comparisons). At the same time, I would like the cell lists package to be good enough to be used at those less frequent steps in an MD code, or at least I would like to know what it takes for it to be.


I like that benchmark. Thanks for sharing.

Maybe it is mentioned somewhere, but what is the number of particles in your application? That strongly influences which approach is fastest.

(Maybe off-topic, but I quite liked this book while writing my Bachelor’s thesis on that topic: Numerical Simulation in Molecular Dynamics | SpringerLink. They also discuss the neighbour-finding methods discussed here. Both of them :wink: )


It varies a lot, from 1k to maybe several million particles. Happily enough, for constant-density systems everything already scales perfectly linearly (the green and purple lines in the scaling benchmark plot).


Just an idea; I don’t know if your code allows this: what if instead of first creating the neighbor list and only then calculating the forces, you calculate the forces during the creation of the neighbor list? This should happen immediately after it is determined that particles i and j are within the cut-off distance from each other. That way, you avoid the need to get the positions of all particles from memory (or L2 cache) again (you should have already accessed them to compute d), since that seems to be the biggest bottleneck.


It is done that way already :wink: (the package can be used to just compute the cell lists, but if you compute something else, like a force, that computation is mapped into the pairs of particles without “saving” the pairs anywhere).

What is built in advance is the cell list, but that is O(n).

My initial point here is that even if the list is built (thus abstracting all the work involved in running only over the close particles), NAMD is still faster. I think the leads I have now are the padding and the locality.

There is one detail, however: I cannot avoid a branch. This is how the innermost loop looks now:

        for j in 1:n
            @inbounds pⱼ = cellⱼ.particles[pp[j].index_in_cell]
            xpⱼ = pⱼ.coordinates
            d2 = norm_sqr(xpᵢ - xpⱼ)
            if d2 <= cutoff_sq
                output = f(xpᵢ, xpⱼ, pᵢ.index, pⱼ.index, d2, output)
            end
        end

Here I’m sitting on particle i and running over particles j. In the first line inside the loop the coordinates of particle j are fetched, and then I compute the squared distance. The innermost function has to be computed only if d2 <= cutoff_sq. This branch, I think (?), prevents good vectorization from taking place.

I could avoid this branch using the c = (d2 <= cutoff) masking trick for some specific cases, but the idea of the package is to allow f(...) to be any user-defined function, so I cannot anticipate what will happen inside it.

For the LJ case, f(...) is the function that computes the pairwise forces.
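Just to illustrate what I mean by that masking trick for the LJ case in isolation (a simplified all-pairs sketch, not my actual cell-list loop, using the force expression that appears further down in this thread):

using StaticArrays, LinearAlgebra

# branch-free accumulation: every pair is computed, and the Bool mask
# zeroes the contribution of pairs beyond the cutoff
function lj_forces_masked!(forces::Vector{SVector{3,Float64}}, x, ε, σ, cutoff_sq)
    σ6 = σ^6
    n = length(x)
    for i in 1:n-1
        @inbounds for j in i+1:n
            r = x[i] - x[j]
            d2 = dot(r, r)
            mask = d2 <= cutoff_sq   # true/false, promotes to 1.0/0.0
            fij = (mask * (-12 * ε) * (σ6^2 / d2^7 - σ6 / d2^4)) * r
            forces[i] += fij
            forces[j] -= fij
        end
    end
    return forces
end

Whether this actually vectorizes better than the branch is something I would have to benchmark.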

(by the way, I think now that in terms of locality my code is already pretty fine, because of the way I structured it with some other things in mind).

Add-on: I am now structuring the coordinates, internally, as:

struct ParticleWithIndex{N,T}
    index::Int
    coordinates::SVector{N,T}
    real::Bool
end

because I need the coordinates, the index, and the real attribute. Maybe this makes it harder to align computations? Would it be better if these properties were in separate arrays?

I used a struct like yours for a long time, but I swapped to multiple arrays (or, to be precise, HybridArrays).
Main reasons to swap:

  • Data analysis/plots often need something like [ p.coordinates[k] for p in particles]. Doing stuff like this too often counteracts the speed-up of the core simulation. So I ended up using a representation which also fits the needs of my data analysis/plotting pipeline.
  • If the number of timesteps is known, you can pre-allocate a big array for all the data you want to store.
  • With more complicated models, that struct might also grow, or you end up with a mixture.

Btw, you are probably aware of this: GitHub - JuliaArrays/StructArrays.jl: Efficient implementation of struct arrays in Julia
(Anyway, those are just my two cents. I’m sure you know very well what you are doing :wink: )

EDIT: HybridArrays is actually essential here, since it allows you to keep a few dimensions of your arrays static and others not. Hence, you can align your data nicely, and it usually gives me quite solid speed-ups.
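For illustration, this is roughly what the struct-of-arrays view looks like with StructArrays.jl, reusing your ParticleWithIndex layout (just a sketch, not how either of our packages stores things internally):

using StaticArrays, StructArrays

struct ParticleWithIndex{N,T}
    index::Int
    coordinates::SVector{N,T}
    real::Bool
end

n = 1_000
particles = StructArray{ParticleWithIndex{3,Float64}}((
    collect(1:n),                    # index column, stored contiguously
    rand(SVector{3,Float64}, n),     # coordinates column
    fill(true, n),                   # real column
))

particles[1].coordinates   # element access, as with a Vector of structs
particles.coordinates      # the whole coordinates column, without copying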


That is completely internal to my package, so the “utility” reason is of less relevance here. The reason I took this approach was effectively to reduce the number of memory accesses (and for code cleanliness). For instance, a particle p can be retrieved with a single fetch using p = particles[i], and that gives me “for free” its index and its real attribute as fields of that immutable type.

On one hand, I guess that is good. On the other hand, I don’t know if it may be causing SIMD/alignment trouble, since the size of the structure is not a multiple of 64 bits.

If ParticleWithIndex{N,T} had exactly 256 bits (32 bytes), maybe that would be better? For instance, if index and real were both 32 bits. (I noticed now that the order of the fields affects the size… if I change index to 32 bits and change the order of the fields, I get a 32-byte struct… I’ll try that.)

edit: “padding” those structs (that is, making them 32 bytes) improves the performance a little. Padding the input coordinate and output vectors is more complicated given the generality of what the user may provide, but at least I should implement that alternative as a tunable option.
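Concretely, the 32-byte layout I am referring to looks something like this (just a sketch; the field order matters because the compiler aligns the widest field):

using StaticArrays

# 24 + 4 + 1 = 29 bytes of payload; the compiler pads the struct to 32 bytes,
# a multiple of its 8-byte alignment, when the coordinates come first
struct PaddedParticle{T}
    coordinates::SVector{3,T}  # 24 bytes for T = Float64
    index::Int32               # 4 bytes
    real::Bool                 # 1 byte, plus 3 bytes of trailing padding
end

sizeof(PaddedParticle{Float64})  # 32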


FWIW, the NAMD paper describes their parallelization strategy:

The parallel decomposition strategy used by NAMD is to treat
the simulation cell (the volume of space containing the atoms) as
a three-dimensional patchwork quilt, with each patch of sufficient
size that only the 26 nearest-neighboring patches are involved in
bonded, van der Waals, and short-range electrostatic interactions.
More precisely, the patches fill the simulation space in a regular
grid and atoms in any pair of non-neighboring patches are separated
by at least the cutoff distance at all times during the simulation. Each
hydrogen atom is stored on the same patch as the atom
to which it is bonded, and atoms are reassigned to patches at
regular intervals. The number of patches varies from one to several
hundred and is determined by the size of the simulation independently
of the number of processors. Additional parallelism may be
generated through options that double the number (and halve the
size) of patches in one or more dimensions.

When NAMD is run, patches are distributed as evenly as
possible, keeping nearby patches on the same processor when there
are more patches than processors, or spreading them across the
machine when there are not. Then, a (roughly 14 times) larger
number of compute objects responsible for calculating atomic
interactions either within a single patch or between neighboring
patches is distributed across the processors, minimizing
communication by grouping compute objects responsible for the same
patch together on the same processors. At the beginning of the
simulation, the actual processor time consumed by each compute
object is measured, and this data is used to redistribute compute
objects to balance the workload between processors. This
measurement-based load balancing contributes greatly to the
parallel efficiency of NAMD.

Using forces calculated by compute objects, each patch is
responsible for integrating the equations of motion for the atoms it
contains. This can be done independently of other patches, but
occasionally requires global data affecting all atoms, such as a
change in the size of the periodic cell due to a constant pressure
algorithm. Although the integration algorithm is the clearly visible
“outer loop” in a serial program, NAMD’s message-driven design
could have resulted in much obfuscation (as was experienced even
in the simpler NAMD 1.X [4]). This was averted by using Charm++
threads to write a top-level function that is called once for each
patch at program start and does not complete until the end of the
simulation. This design has allowed pressure and temperature
control methods and even a conjugate gradient minimizer to be
implemented in NAMD without writing any new code for parallel
communication.

Posting it here as I think this might give someone an idea on how to speed up the Julia code, hopefully? (Or show whether it is possible to achieve the same speedup in Julia at all.)


What I’m trying to do now is to get serial performance similar to that of NAMD. Their parallelization strategy is more sophisticated, but truth be told, that is not where my package is falling short; for shared memory the scaling is not bad. The part they describe about the patches etc. is the same thing I’m doing with the cell lists.

To actually get an MD code for production, what we need is to port everything to the GPU.

I’m pretty sure it is. But if I had been able to do that in 2 months of development, Julia would have been miraculous :rofl:. These packages (NAMD, Gromacs, etc.) have been optimized to the last bit, over literally decades and by dozens of developers, and for multiple platforms. It is not realistic to expect something really competitive from one person developing alone over a short period of time.


I don’t have anything to add to the conversation, just wanted to say that I am following this thread closely and it is really nice to see you and others pushing Julia to the limit :slight_smile:

Really looking forward to using CellListMap for my own tool. I am currently using the NearestNeighbours package and really want to see what difference it would make.

Kind regards


If:

  1. The distances are Euclidean distances in 2D or 3D, and
  2. you have to compute some property that depends on a pairwise function evaluated within a cutoff that is significantly smaller than the system size,

then CellListMap.jl is a good alternative. Even more so if you don’t really need the list of pairs itself, but want to compute something (like a potential energy, or whatever) that depends on those pairs.

edit: If you have periodic boundary conditions, CellListMap handles them as well.

For neighbour lists there is one specific helper function: https://m3g.github.io/CellListMap.jl/stable/examples/#Neighbour-list, which performs well (faster than NearestNeighbours.jl) in cases 1 and 2 above.
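In case it helps, putting together the calls from the benchmark above, basic usage for a single set of points without PBCs looks roughly like this (the exact format of the returned pair list is described in the linked docs):

using CellListMap, StaticArrays

x = [rand(SVector{3,Float64}) for _ in 1:50_000]
cutoff = 0.05

box = Box(limits(x), cutoff)   # non-periodic box from the coordinate limits
cl = CellList(x, box)          # build the cell lists (this part is O(n))
pairs = CellListMap.neighbourlist(box, cl, parallel=true)  # pairs within the cutoff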

Having users try it, share their experiences, and report issues is important for me, so if you want to try it and need any assistance, let me know. The package is still in its “0.X” state, so I can still improve the interface.

Answering your points:

  1. Indeed they are.

  2. Yep, my case looks a lot like this picture:

[figure: a particle at the centre of a grey cutoff circle, interacting only with its neighbouring cells in a much larger set of points]

I look at a particle, in this case at the middle of the grey circle, and it interacts only with the 8 neighbouring elements in the much larger data set - so this seems to fit perfectly with what you write.

As you mention, indices of neighbours are irrelevant for me, if I can do all the calculations at the right time / in the most efficient manner. The functions I would commonly run look like this:

[equation: the kernel function W(q), with normalization constant alpha_D]

Where q is the Euclidean distance divided by a constant “h”. The alpha_D is a constant. So this fits perfectly with what you describe.

In some cases I would need to run this more advanced function:

[equation: a kernel-gradient term built from dWdq, the constant h, and the vector difference of the two particle positions divided by the Euclidean distance]

Where dWdq is the first function differentiated w.r.t. q, “h” is a constant, the denominator is the Euclidean distance, and the numerator is the location of one particle minus the location of the other. So if you have two vectors, p = [x y z], it would be [x1-x2 y1-y2 z1-z2] - sorry for the bad notation, just trying to explain briefly.

This also seems doable with your approach, if I can get it to return this information? Otherwise I will have to use the neighbour-indices approach as before.

Kind regards

Yes, that is exactly the type of problem the package is designed for. You just have to implement the function for one pair, in the form:

function f(x, y, i, j, d2, output)
    # `pair_contribution` stands for your own function of the coordinates x and y,
    # the indices i and j, and the distance sqrt(d2)
    output += pair_contribution(x, y, i, j, sqrt(d2))
    return output
end

and pass this function to the map_pairwise! function of CellListMap.jl. This is exactly what it is meant for. The examples in the link I sent previously show different situations in which I do the same kind of thing you are doing, just with other computations inside f.
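For your kernel case, a rough serial sketch could look like the following (the W below is only a stand-in for your actual kernel, alpha_D is left out, and the cutoff should be the support radius of your kernel):

using CellListMap, StaticArrays

# stand-in for your kernel W(q); replace with your own definition
W(q) = q < 2 ? (2 - q)^3 : 0.0

x = [rand(SVector{3,Float64}) for _ in 1:10_000]
h = 0.01
box = Box(limits(x), 2h)   # here the cutoff is the kernel support, 2h
cl = CellList(x, box)

ρ = zeros(length(x))       # one accumulator per particle
map_pairwise!(
    (xi, xj, i, j, d2, ρ) -> begin
        w = W(sqrt(d2) / h)
        ρ[i] += w          # add the pair contribution to both particles
        ρ[j] += w
        return ρ
    end,
    ρ, box, cl,
    parallel=false         # serial here for simplicity; see the docs for parallel runs
)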

I see, I think this settles it then - I must try to give it a go and return with my feedback :slight_smile:


Could it be a saving if we computed only the long-range part of the force above a certain cut-off?

(sorry if it has already been discussed - I didn’t follow the complete discussion)

No long-range forces here. (For those the methods are different, e.g. Ewald sums in MD simulations, but that is another story.)

I mean the “longer range” part of the expression:
-12*ε*(σ^12/d2^7 - σ^6/d2^4)*r ≈ 12*ε*(σ^6/d2^4)*r for larger d2

Ah, well, that would not conform to what any other package does (so it cannot explain any difference here), and its implications for the long-time dynamics of the particles would have to be validated.