 # Accelerate pairwise Lennard-Jones force computation

Someone can see a chance of significantly accelerating this function?

``````julia> using StaticArrays, FastPow, BenchmarkTools

julia> function forces(x,y,d2,ε,σ,f)
r = y - x
@fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r
f = f + dudr
f = f - dudr
return f
end
forces (generic function with 2 methods)

julia> setup = (
x = rand(SVector{3,Float64});
y = rand(SVector{3,Float64});
d2 = rand();
ε = rand();
σ = rand();
f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
);

julia> @benchmark forces(x,y,d2,ε,σ,f) setup=setup
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max):  21.672 ns … 72.733 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     21.889 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   23.080 ns ±  3.105 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

██    ▃▄ ▁▆▂  ▁      ▁▁   ▁   ▁    ▂         ▂▃    ▁▂       ▂
██▁▆█▁██▁███▃██▄▃▇▇▃▁██▃▁██▁▄▃█▇▁▁██▅▁▃▇█▆▁▃▁██▁▁▁▁██▁▁▁▃▁▃ █
21.7 ns      Histogram: log(frequency) by time      32.5 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

``````

Sigma and epsilon are constants, so we can pass sigma them already elevated to its powers, but that is not accelerating the function too much.

But any other low level trick is accepted.

this is already `O(ns)`, maybe you can optimize at one upper level?

1 Like

maybe using:

``````δ2= one(d2)/d2
@fastpow dudr = -12*ε*(σ^12*δ2^7 - σ^6*δ2^4)*r
``````

it could save you one division.
also, factorizing could help somehow, as i doubt `@fastpow` also factorizes.

``````julia> function forces_3(x,y,d2,ε,σ,f)
r = y - x
δ2= one(d2)/d2
σ3 = σ*σ*σ
σ6 = σ*σ*σ
δ22 = δ2*δ2
δ24 = δ22*δ22
σ6δ24 = σ6*δ24
dudr = -12*ε*(σ6δ24*(σ6δ24*d2   - one(σ6δ24 ))*r
f = f + dudr
f = f - dudr
return f
end
``````
1 Like

I see that the trick I was going to suggest is already what `@fastpow` does, so yeah, I doubt there’s much to be done here. The only thing I can think of is you might be able to use some kind of loop unrolling and avx or simd instructions at a higher level (I assume you’re going to pairwise LJ across multiple particles).

1 Like

Thanks. The upper loop where that is inserted is optimized in many ways. I am sad that I could not use simd instructions, because in practice the indexes of the particles are discontinuous in memory (because I’m using cell lists). On that perspective the problem is more similar to this one (I simplified a little bit the problem removing some constants):

``````using StaticArrays, FastPow, BenchmarkTools

function forces!(x,f,d,idxs)
@inbounds for id in axes(idxs,1)
i = idxs[id,1]
j = idxs[id,2]
r = x[j] - x[i]
@fastpow dudr = -12*(1/d^7 - 1/d^4)*r
f[i] = f[i] + dudr
f[j] = f[j] - dudr
end
return f
end

@benchmark forces!(x,f,d,idxs) setup = (
x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
d = rand();
idxs = rand(1:1000,20,2)
)

``````

I don’t get any speedup from simd, turbo, etc, because of the discontinuity of the memory positions accessed. I have tried to copy the data before doing that computation, but was never worth the effort (and the size of the loop shown there, 20, is typical, so also limited in terms of taking advantage of simd).

It sounds like you’re pretty close to optimal then. For some purposes, the next step would be to move to an even higher level, that is use a multiscale technique where you compute the dynamics of a continuum for some number of timesteps, then generate a detailed particle set consistent with the continuum, do some detailed calculations to ensure that your continuum behavior captures the effect of the details, and then transport the solution back to the continuum level.

Depending on what you want, that’s a very powerful technique that can vastly speed up the simulation of long times (large total time / molecular interaction time) without losing the effect of details (like crack formation, or surface chemistry or whatever)

1 Like

Yes, that is all great. Perhaps we arrive at the point of building a simulation package with all those capabilities.

For the time being, I am writing this package to compute short-ranged pairwise interactions (that is, those that within a cutoff below which no approximation is accepted). It is part of a more general solution that must deal with long-range interactions with different approaches (my specific short-term use don’t need those, though). In the molecular dynamics field what is done is to build lists of neighbours for each particle within a range greater than the cutoff by some margin (the Verlet lists) and update the lists less frequently with some guarantee that no particle moved more than the margin distance.

I will be posting something more complete soon, but what I’m doing now is to see how good, or bad, is my cell list implementation in that package. It is fast to compute neighbour lists compared to other implementations I have tested, and I was now comparing how fast is it in comparison to some important molecular dynamics simulations packages do (specifically, with NAMD). What I’m getting is that a text-book MD code for a simulation of a noble-gas (with only short-range, LJ, interactions), written on top of my package to compute the pairwise interactions, without any further optimization, is 2-3x slower than NAMD (which is a highly specialized code). Not bad IMO, but I was trying to see if there are optimizations to be done that reduce this margin without much effort.

1 Like

You seem not to be using `id`, is that a typo?

Easier to write `rand(SVector{3,Float64}, 1000)`and `zeros(SVector{3,Float64}, 1000)`.

No performance tips so far, but it looks like you could simplify the expression symbolically, by pulling out common factors, is there no gain in that? Can really `@fastpow` achieve that?

1 Like

Yes, that can be done, but in my tests the differences were very small (hard to notice in the middle of benchmarking noise), so I kept it the closer as possible to a text-book for now.

I was thinking if there was some low-level implementation of specific powers, some “fastmath” type of thing that could make a significant difference there (just thought that asking is never bad) - since, for instance, there are variations of trigonometric functions that have very different speeds.

Yes, sorry, will fix now.

I think that level of performance is quite amazing honestly. But yes, it means there’s potentially more! Where does the current code spend its time according to profile? Is it all in these pairwise calcs?

It turns out `@fastpow` is not as sophisticated as I was thinking it might be:

``````julia> @macroexpand @fastpow x^6-x^12
:(begin
var"##279" = x * x
var"##280" = var"##279" * x
var"##281" = var"##280" * var"##280"
var"##281"
end - begin
var"##282" = x * x
var"##283" = var"##282" * x
var"##284" = var"##283" * var"##283"
var"##285" = var"##284" * var"##284"
var"##285"
end)
``````

It recalculates things that were previously calculated. So for your code for example it’d be better to do:

``````           @fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r
``````

do something like:

``````@fastpow sig6 = sig^6
sig12 = sig6*sig6
@fastpow d24 = d2^4
@fastpow d27 = d24*d2^3
dudr = -12 *eps*(sig12/d27 - sig6/d24)*r
``````
1 Like

Or using CommonSubexpressions.jl you could do:

``````@macroexpand@cse -12*eps*(x^6*x^6/(d2^4*d2^3) - x^6/d2^4)*r
quote
var"##365" = x ^ 6
var"##366" = var"##365" * var"##365"
var"##367" = d2 ^ 4
var"##368" = d2 ^ 3
var"##369" = var"##367" * var"##368"
var"##370" = var"##366" / var"##369"
var"##371" = var"##365" / var"##367"
var"##372" = var"##370" - var"##371"
var"##373" = -12 * eps * var"##372" * r
var"##373"
end
``````

I can’t figure out an easy way to apply `@fastpow` to the results of `@cse` but you could do that as well if you figure out the appropriate method.

``````setup = (
x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
d = rand();
idxs = rand(1:1000,20,2)
);
``````

what `setup` actually contains is this

``````jl> setup
20×2 Matrix{Int64}:
344  366
503  271
424  566
498   26
668  147
236   39
42  817
497  205
41  938
909  295
933  432
26  234
954  143
291   58
919  961
809  615
350  735
993  639
519  931
823  366
``````

That is, it’s just an expression that returns the output of `rand(1:1000, 20, 2)`. You have to give the `@benchmark` macro access to the actual code, not just the variable `setup`, which doesn’t contain anything but a matrix.

``````jl> @benchmark forces!(x,f,d,idxs) setup=setup
BenchmarkTools.Trial: 10000 samples with 929 evaluations.
Range (min … max):   97.094 ns … 535.953 ns  ┊ GC (min … max): 0.00% … 0.00%
``````

vs

``````jl> @benchmark forces!(x,f,d,idxs)  setup = (
x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
d = rand();
idxs = rand(1:1000,20,2)
)
BenchmarkTools.Trial: 10000 samples with 956 evaluations.
Range (min … max):  78.975 ns … 306.799 ns  ┊ GC (min … max): 0.00% … 0.00%
``````
1 Like

The difference is even more obvious for the smaller example:

``````jl> function forces(x,y,d2,ε,σ,f)
r = y - x
@fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r
f = f + dudr
f = f - dudr
return f
end
forces (generic function with 1 method)

jl> setup = (
x = rand(SVector{3,Float64});
y = rand(SVector{3,Float64});
d2 = rand();
ε = rand();
σ = rand();
f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
);

jl> @benchmark forces(x,y,d2,ε,σ,f) setup=setup
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max):  26.788 ns … 202.618 ns  ┊ GC (min … max): 0.00% … 0.00%
``````

vs

``````jl> @benchmark forces(x,y,d2,ε,σ,f)  setup = (
x = rand(SVector{3,Float64});
y = rand(SVector{3,Float64});
d2 = rand();
ε = rand();
σ = rand();
f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max):  5.906 ns … 49.850 ns  ┊ GC (min … max): 0.00% … 0.00%
``````

Thanks for the tips, I will be adding them to the “fast” version of the MD testing code to see where it gets.

That specific computation takes about half of the time. The other half mostly is time that is spent trying to avoid reaching the point of having to compute the interactions. (building cell lists, and a bunch of other tricks).

Sincerely I think that the main difference is somewhere on the fact that I was not able to find a proper point where copying the data to allow simd to take place was worthwhile.

Just so that no one thinks that I’m hiding anything, the “simulation” code is below. I will when appropriate put it in a well organized package, which will part of my test set of of the CellListMap.jl code.

This runs a simulation of 10k Neon atoms using the velocity-verlet integration scheme. It is almost a copy/paste from the basic algorithm one can find in text books (except for the calls to `map_pairwise!` to compute the forces, which is the function that CellListMap.jl provides).

md code using CellListMap.jl
``````import Chemfiles
using CellListMap
using FastPow
using StaticArrays
using Printf
using Parameters
using Statistics: mean
using LinearAlgebra: norm_sqr

#
# Simulation setup
#
@with_kw struct Params{V,N,T,M,UnitCellType}
x0::V = getcoor("ne10k.pdb")
temperature::T = 300.
nsteps::Int = 10_000
dt::T = 1.0 # fs
ibath::Int = 10
print_energy::Int = 50
print_traj::Int = 100
trajfile::String = "traj.xyz"
cutoff::T = 10.
box::Box{UnitCellType,N,T,M} = Box([ 80., 80., 80. ], cutoff)
# Parameters for Neon
mass::T = 20.17900 # g/mol
ε::T = 0.0441795 # kcal/mol
σ::T = 2*1.64009 # Å
kB::T = 0.001985875 # Boltzmann constant kcal / mol K
end

# Structure that carries both the energy and force vector
@with_kw struct UF{T,V}
u::T
f::V
end

# function that computes energy and forces
@fastpow function energy_and_force(x,y,i,j,d2,ε,σ,uf::UF{T,V}) where {T,V}
@unpack u, f = uf
d = sqrt(d2)
u += ε*( σ^12/d^12 - 2*σ^6/d^6 )
r = y - x
dudr = -12*ε*(σ^12/d^13 - σ^6/d^7)*(r/d)
f[i] = f[i] + dudr
f[j] = f[j] - dudr
return UF(u,f)
end

# reduction rule for the (u,f) tuple (u, and f are zeroed outside)
@unpack u, f = uf
end
return UF(u,f)
end

# Kinetic energy and temperature
compute_kinetic(v::AbstractVector,m) = (m/2)*sum(x -> norm_sqr(x), v)
compute_temp(kinetic,kB,n) = 2*kinetic/(3*kB*n)
compute_temp(v::AbstractVector,m,kB) = 2*compute_kinetic(v,m)/(3*kB*length(v))

# Remove drift from velocities
function remove_drift!(v)
vmean = mean(v)
v .= v .- Ref(vmean)
end

# Function to print output data
function print_data(istep,x,params,u,kinetic,trajfile)
@unpack print_energy, print_traj, kB = params
if istep%print_energy == 0
temp = compute_temp(kinetic,kB,length(x))
@printf(
"STEP = %8i U = %12.5f K = %12.5f TOT = %12.5f TEMP = %12.5f\n",
istep, u, kinetic, u+kinetic, temp
)
end
if istep%print_traj == 0 && istep > 0
println(trajfile,length(x))
println(trajfile," step = ", istep)
for i in 1:length(x)
@printf(trajfile,"Ne %12.5f %12.5f %12.5f\n", ntuple(j -> x[i][j], 3)...)
end
end
return nothing
end

# Read coordinates from NAMD-DCD file
function getcoor(file)
traj = redirect_stdout(() -> Chemfiles.Trajectory(file), devnull)
Chemfiles.close(traj)
return copy(reinterpret(reshape,SVector{3,Float64},Chemfiles.positions(frame)))
end

#
# Simulation
#
function simulate(params::Params{V,N,T,UnitCellType}) where {V,N,T,UnitCellType}
@unpack x0, temperature, nsteps, box, dt, ε, σ, mass, kB = params
trajfile = open(params.trajfile,"w")

# To use coordinates in Angstroms, dt must be in 10ps. Usually packages
# use ps and nm internally (thus multiply coordinates by 10 and divide
# the timestep given in fs by 1000)
dt = dt/100

# Initial arrays
x = copy(x0)
f = similar(x)
flast = similar(x)

# Initial velocities
v = randn(eltype(x),size(x))
remove_drift!(v)
# Adjust average to desidred temperature
t0 = compute_temp(v,mass,kB)
@. v = v * sqrt(temperature/t0)
# Remove drift

# Build cell lists for the first time
cl = CellList(x,box)

# preallocate threaded output, since it contains the forces vector
f .= Ref(zeros(eltype(f)))

# Compute energy and forces at initial point
uf = UF(0.,f)
uf = map_pairwise!(
(x,y,i,j,d2,output) -> energy_and_force(x,y,i,j,d2,ε,σ,output),
uf, box, cl, parallel=true,
reduce=reduceuf,
)
u = uf.u
kinetic = compute_kinetic(v,mass)
print_data(0,x,params,u,kinetic,trajfile)

# Simulate
for istep in 1:nsteps

# Update positions (velocity-verlet)
@. x = x + v*dt + 0.5*(f/mass)*dt^2

# Reset energy and forces
flast .= f
u = 0.
f .= Ref(zeros(eltype(f)))
uf = UF(u,f)
ft .= Ref(zeros(eltype(f)))
end

# Compute energy and forces
uf = map_pairwise!(
(x,y,i,j,d2,output) -> energy_and_force(x,y,i,j,d2,ε,σ,output),
uf, box, cl, parallel=true,
reduce=reduceuf,
)
u = uf.u
if u/length(x) > 1e10
println("Simulation is unstable. ")
return nothing
end

# Update velocities
@. v = v + 0.5*((flast + f)/mass)*dt

# Print data and output file
kinetic = compute_kinetic(v,mass)
print_data(istep,x,params,u,kinetic,trajfile)

# Isokinetic bath
if istep%params.ibath == 0
temp = compute_temp(kinetic,kB,length(v))
remove_drift!(v)
@. v = v * sqrt(temperature/temp)
end

# Update cell lists
cl = UpdateCellList!(x,box,cl,aux)

end
close(trajfile)

end

``````

To run it, if someone wants to play with it, put this file in the same directory: https://raw.githubusercontent.com/lmiq/PairVelocities/main/namd_benchmark/ne10k.pdb

just do (with these steps it takes ~5s with 4 cores, in my laptop),

``````p = Params(nsteps=200)
simulate(p)
``````

Uhm, well, good to know. That makes it somewhat cumbersome to write the setup. I didn’t realize that. (this does not affect the overall benchmarks I’m talking about of my program, just to be clear, I just wrote that for that MWE).

@DNF , fixed above the call to the benchmark. Thanks again for noticing that.

It looks like you could do:

``````macro fpcse(ex)
FastPow.fastpow(CommonSubexpressions.cse(ex))
end

@macroexpand @fpcse -12*eps*((x^6/d2^4)*(x^6/d2^3) - (x^6/d2^4))*r
quote
var"#25###540" = begin
var"#26###548" = Main.x * Main.x
var"#27###549" = var"#26###548" * Main.x
var"#28###550" = var"#27###549" * var"#27###549"
var"#28###550"
end
var"#29###541" = begin
var"#30###551" = Main.d2 * Main.d2
var"#31###552" = var"#30###551" * var"#30###551"
var"#31###552"
end
var"#32###542" = var"#25###540" / var"#29###541"
var"#33###543" = begin
var"#34###553" = Main.d2 * Main.d2
var"#35###554" = var"#34###553" * Main.d2
var"#35###554"
end
var"#36###544" = var"#25###540" / var"#33###543"
var"#37###545" = var"#32###542" * var"#36###544"
var"#38###546" = var"#37###545" - var"#32###542"
var"#39###547" = -12 * Main.eps * var"#38###546" * Main.r
var"#39###547"
end

``````

God I love Julia!

1 Like

Sure, but

``````σ6d24 = (σ^3/d2^2)^2
dudr = -12ε * σ6d24 * (σ6d24 * d2 - 1) * r
``````

should have similar performance, and look cleaner, with no macro call.

1 Like

fastpow will give you a big speedup vs calling `^` 3 times I believe. You could still do:

``````sig6d24 = @fastpow (sig^3/d2^2)^2
``````

though

`^` has fast-paths for squares and cubes. Also, ^(::Float, ::Integer) by oscardssmith · Pull Request #42031 · JuliaLang/julia · GitHub will hopefully significantly improve performance for small powers.

1 Like
``````jl> @btime (σ^3/d2^2)^2 setup=(σ=rand()+1; d2=rand()+1);
2.200 ns (0 allocations: 0 bytes)

jl> @btime @fastpow((σ^3/d2^2)^2) setup=(σ=rand()+1; d2=rand()+1);
2.200 ns (0 allocations: 0 bytes)
``````
2 Likes