Accelerate pairwise Lennard-Jones force computation

lmiq · August 30, 2021, 1:34am

Someone can see a chance of significantly accelerating this function?

julia> using StaticArrays, FastPow, BenchmarkTools

julia> function forces(x,y,d2,ε,σ,f)
           r = y - x
           @fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r
           f[1] = f[1] + dudr
           f[2] = f[2] - dudr
           return f
       end
forces (generic function with 2 methods)

julia> setup = (
           x = rand(SVector{3,Float64});
           y = rand(SVector{3,Float64});
           d2 = rand();
           ε = rand();
           σ = rand();
           f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
       );

julia> @benchmark forces(x,y,d2,ε,σ,f) setup=setup
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.672 ns … 72.733 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.889 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.080 ns ±  3.105 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██    ▃▄ ▁▆▂  ▁      ▁▁   ▁   ▁    ▂         ▂▃    ▁▂       ▂
  ██▁▆█▁██▁███▃██▄▃▇▇▃▁██▃▁██▁▄▃█▇▁▁██▅▁▃▇█▆▁▃▁██▁▁▁▁██▁▁▁▃▁▃ █
  21.7 ns      Histogram: log(frequency) by time      32.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Sigma and epsilon are constants, so we can pass sigma them already elevated to its powers, but that is not accelerating the function too much.

But any other low level trick is accepted.

jling · August 30, 2021, 1:43am

this is already O(ns), maybe you can optimize at one upper level?

longemen3000 · August 30, 2021, 1:54am

maybe using:

δ2= one(d2)/d2
@fastpow dudr = -12*ε*(σ^12*δ2^7 - σ^6*δ2^4)*r

it could save you one division.
also, factorizing could help somehow, as i doubt @fastpow also factorizes.

julia> function forces_3(x,y,d2,ε,σ,f)
           r = y - x
           δ2= one(d2)/d2
           σ3 = σ*σ*σ
           σ6 = σ*σ*σ
           δ22 = δ2*δ2
           δ24 = δ22*δ22
           σ6δ24 = σ6*δ24
          dudr = -12*ε*(σ6δ24*(σ6δ24*d2   - one(σ6δ24 ))*r
           f[1] = f[1] + dudr
           f[2] = f[2] - dudr
           return f
       end

dlakelan · August 30, 2021, 2:12am

I see that the trick I was going to suggest is already what @fastpow does, so yeah, I doubt there’s much to be done here. The only thing I can think of is you might be able to use some kind of loop unrolling and avx or simd instructions at a higher level (I assume you’re going to pairwise LJ across multiple particles).

lmiq · August 30, 2021, 10:16am

Thanks. The upper loop where that is inserted is optimized in many ways. I am sad that I could not use simd instructions, because in practice the indexes of the particles are discontinuous in memory (because I’m using cell lists). On that perspective the problem is more similar to this one (I simplified a little bit the problem removing some constants):

using StaticArrays, FastPow, BenchmarkTools

function forces!(x,f,d,idxs)
    @inbounds for id in axes(idxs,1)
        i = idxs[id,1]
        j = idxs[id,2]
        r = x[j] - x[i] 
        @fastpow dudr = -12*(1/d^7 - 1/d^4)*r
        f[i] = f[i] + dudr
        f[j] = f[j] - dudr
   end
   return f
end

@benchmark forces!(x,f,d,idxs) setup = ( 
    x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
    f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
    d = rand();
    idxs = rand(1:1000,20,2)
)

I don’t get any speedup from simd, turbo, etc, because of the discontinuity of the memory positions accessed. I have tried to copy the data before doing that computation, but was never worth the effort (and the size of the loop shown there, 20, is typical, so also limited in terms of taking advantage of simd).

dlakelan · August 30, 2021, 5:25pm

It sounds like you’re pretty close to optimal then. For some purposes, the next step would be to move to an even higher level, that is use a multiscale technique where you compute the dynamics of a continuum for some number of timesteps, then generate a detailed particle set consistent with the continuum, do some detailed calculations to ensure that your continuum behavior captures the effect of the details, and then transport the solution back to the continuum level.

Depending on what you want, that’s a very powerful technique that can vastly speed up the simulation of long times (large total time / molecular interaction time) without losing the effect of details (like crack formation, or surface chemistry or whatever)

lmiq · August 30, 2021, 5:43pm

Yes, that is all great. Perhaps we arrive at the point of building a simulation package with all those capabilities.

For the time being, I am writing this package to compute short-ranged pairwise interactions (that is, those that within a cutoff below which no approximation is accepted). It is part of a more general solution that must deal with long-range interactions with different approaches (my specific short-term use don’t need those, though). In the molecular dynamics field what is done is to build lists of neighbours for each particle within a range greater than the cutoff by some margin (the Verlet lists) and update the lists less frequently with some guarantee that no particle moved more than the margin distance.

I will be posting something more complete soon, but what I’m doing now is to see how good, or bad, is my cell list implementation in that package. It is fast to compute neighbour lists compared to other implementations I have tested, and I was now comparing how fast is it in comparison to some important molecular dynamics simulations packages do (specifically, with NAMD). What I’m getting is that a text-book MD code for a simulation of a noble-gas (with only short-range, LJ, interactions), written on top of my package to compute the pairwise interactions, without any further optimization, is 2-3x slower than NAMD (which is a highly specialized code). Not bad IMO, but I was trying to see if there are optimizations to be done that reduce this margin without much effort.

DNF · August 30, 2021, 5:47pm

You seem not to be using id, is that a typo?

Easier to write rand(SVector{3,Float64}, 1000)and zeros(SVector{3,Float64}, 1000).

No performance tips so far, but it looks like you could simplify the expression symbolically, by pulling out common factors, is there no gain in that? Can really @fastpow achieve that?

lmiq · August 30, 2021, 5:52pm

Yes, that can be done, but in my tests the differences were very small (hard to notice in the middle of benchmarking noise), so I kept it the closer as possible to a text-book for now.

I was thinking if there was some low-level implementation of specific powers, some “fastmath” type of thing that could make a significant difference there (just thought that asking is never bad) - since, for instance, there are variations of trigonometric functions that have very different speeds.

Yes, sorry, will fix now.

dlakelan · August 30, 2021, 6:04pm

I think that level of performance is quite amazing honestly. But yes, it means there’s potentially more! Where does the current code spend its time according to profile? Is it all in these pairwise calcs?

It turns out @fastpow is not as sophisticated as I was thinking it might be:

julia> @macroexpand @fastpow x^6-x^12
:(begin
          var"##279" = x * x
          var"##280" = var"##279" * x
          var"##281" = var"##280" * var"##280"
          var"##281"
      end - begin
          var"##282" = x * x
          var"##283" = var"##282" * x
          var"##284" = var"##283" * var"##283"
          var"##285" = var"##284" * var"##284"
          var"##285"
      end)

It recalculates things that were previously calculated. So for your code for example it’d be better to do:

instead of:

           @fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r

do something like:

@fastpow sig6 = sig^6
sig12 = sig6*sig6
@fastpow d24 = d2^4
@fastpow d27 = d24*d2^3
dudr = -12 *eps*(sig12/d27 - sig6/d24)*r

dlakelan · August 30, 2021, 6:32pm

Or using CommonSubexpressions.jl you could do:

@macroexpand@cse -12*eps*(x^6*x^6/(d2^4*d2^3) - x^6/d2^4)*r
quote
    var"##365" = x ^ 6
    var"##366" = var"##365" * var"##365"
    var"##367" = d2 ^ 4
    var"##368" = d2 ^ 3
    var"##369" = var"##367" * var"##368"
    var"##370" = var"##366" / var"##369"
    var"##371" = var"##365" / var"##367"
    var"##372" = var"##370" - var"##371"
    var"##373" = -12 * eps * var"##372" * r
    var"##373"
end

I can’t figure out an easy way to apply @fastpow to the results of @cse but you could do that as well if you figure out the appropriate method.

DNF · August 30, 2021, 6:43pm

I’m not sure about this benchmarking approach. When you do this:

setup = ( 
    x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
    f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
    d = rand();
    idxs = rand(1:1000,20,2)
);

what setup actually contains is this

jl> setup
20×2 Matrix{Int64}:
 344  366
 503  271
 424  566
 498   26
 668  147
 236   39
  42  817
 497  205
  41  938
 909  295
 933  432
  26  234
 954  143
 291   58
 919  961
 809  615
 350  735
 993  639
 519  931
 823  366

That is, it’s just an expression that returns the output of rand(1:1000, 20, 2). You have to give the @benchmark macro access to the actual code, not just the variable setup, which doesn’t contain anything but a matrix.

jl> @benchmark forces!(x,f,d,idxs) setup=setup
BenchmarkTools.Trial: 10000 samples with 929 evaluations.
 Range (min … max):   97.094 ns … 535.953 ns  ┊ GC (min … max): 0.00% … 0.00%

vs

jl> @benchmark forces!(x,f,d,idxs)  setup = (
            x = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
            f = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
            d = rand();
            idxs = rand(1:1000,20,2)
        )
BenchmarkTools.Trial: 10000 samples with 956 evaluations.
 Range (min … max):  78.975 ns … 306.799 ns  ┊ GC (min … max): 0.00% … 0.00%

DNF · August 30, 2021, 6:46pm

The difference is even more obvious for the smaller example:

jl> function forces(x,y,d2,ε,σ,f)
               r = y - x
               @fastpow dudr = -12*ε*(σ^12/d2^7 - σ^6/d2^4)*r
               f[1] = f[1] + dudr
               f[2] = f[2] - dudr
               return f
           end
forces (generic function with 1 method)

jl> setup = (
               x = rand(SVector{3,Float64});
               y = rand(SVector{3,Float64});
               d2 = rand();
               ε = rand();
               σ = rand();
               f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
           );

jl> @benchmark forces(x,y,d2,ε,σ,f) setup=setup
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  26.788 ns … 202.618 ns  ┊ GC (min … max): 0.00% … 0.00%

vs

jl> @benchmark forces(x,y,d2,ε,σ,f)  setup = (
                   x = rand(SVector{3,Float64});
                   y = rand(SVector{3,Float64});
                   d2 = rand();
                   ε = rand();
                   σ = rand();
                   f = [ zeros(SVector{3,Float64}) for i in 1:2 ]
               )
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  5.906 ns … 49.850 ns  ┊ GC (min … max): 0.00% … 0.00%

lmiq · August 30, 2021, 6:50pm

Thanks for the tips, I will be adding them to the “fast” version of the MD testing code to see where it gets.

That specific computation takes about half of the time. The other half mostly is time that is spent trying to avoid reaching the point of having to compute the interactions. (building cell lists, and a bunch of other tricks).

Sincerely I think that the main difference is somewhere on the fact that I was not able to find a proper point where copying the data to allow simd to take place was worthwhile.

Just so that no one thinks that I’m hiding anything, the “simulation” code is below. I will when appropriate put it in a well organized package, which will part of my test set of of the CellListMap.jl code.

This runs a simulation of 10k Neon atoms using the velocity-verlet integration scheme. It is almost a copy/paste from the basic algorithm one can find in text books (except for the calls to map_pairwise! to compute the forces, which is the function that CellListMap.jl provides).

md code using CellListMap.jl

import Chemfiles
using CellListMap
using FastPow
using StaticArrays
using Printf
using Base.Threads
using Parameters
using Statistics: mean
using LinearAlgebra: norm_sqr

#
# Simulation setup
#
@with_kw struct Params{V,N,T,M,UnitCellType}
    x0::V = getcoor("ne10k.pdb")
    temperature::T = 300.
    nsteps::Int = 10_000
    dt::T = 1.0 # fs
    ibath::Int = 10
    print_energy::Int = 50 
    print_traj::Int = 100
    trajfile::String = "traj.xyz"
    cutoff::T = 10.
    box::Box{UnitCellType,N,T,M} = Box([ 80., 80., 80. ], cutoff)
    # Parameters for Neon
    mass::T = 20.17900 # g/mol 
    ε::T = 0.0441795 # kcal/mol
    σ::T = 2*1.64009 # Å
    kB::T = 0.001985875 # Boltzmann constant kcal / mol K
end

# Structure that carries both the energy and force vector
@with_kw struct UF{T,V}
    u::T
    f::V
end

# function that computes energy and forces
@fastpow function energy_and_force(x,y,i,j,d2,ε,σ,uf::UF{T,V}) where {T,V}
    @unpack u, f = uf
    d = sqrt(d2)
    u += ε*( σ^12/d^12 - 2*σ^6/d^6 )
    r = y - x
    dudr = -12*ε*(σ^12/d^13 - σ^6/d^7)*(r/d)
    f[i] = f[i] + dudr
    f[j] = f[j] - dudr
    return UF(u,f)
end

# reduction rule for the (u,f) tuple (u, and f are zeroed outside)
function reduceuf(uf,uf_threaded)
    @unpack u, f = uf
    for it in 1:nthreads()
        u += uf_threaded[it].u
        @. f += uf_threaded[it].f
    end
    return UF(u,f)
end

# Kinetic energy and temperature 
compute_kinetic(v::AbstractVector,m) = (m/2)*sum(x -> norm_sqr(x), v)
compute_temp(kinetic,kB,n) = 2*kinetic/(3*kB*n)
compute_temp(v::AbstractVector,m,kB) = 2*compute_kinetic(v,m)/(3*kB*length(v))

# Remove drift from velocities
function remove_drift!(v)
    vmean = mean(v)
    v .= v .- Ref(vmean)
end

# Function to print output data
function print_data(istep,x,params,u,kinetic,trajfile)
    @unpack print_energy, print_traj, kB = params
    if istep%print_energy == 0
        temp = compute_temp(kinetic,kB,length(x))
        @printf(
            "STEP = %8i U = %12.5f K = %12.5f TOT = %12.5f TEMP = %12.5f\n", 
            istep, u, kinetic, u+kinetic, temp
        )
    end
    if istep%print_traj == 0 && istep > 0
        println(trajfile,length(x))
        println(trajfile," step = ", istep)
        for i in 1:length(x)
           @printf(trajfile,"Ne %12.5f %12.5f %12.5f\n", ntuple(j -> x[i][j], 3)...)
        end
    end
    return nothing
end

# Read coordinates from NAMD-DCD file
function getcoor(file)
    traj = redirect_stdout(() -> Chemfiles.Trajectory(file), devnull)
    frame = Chemfiles.read_step(traj,0)
    Chemfiles.close(traj)
    return copy(reinterpret(reshape,SVector{3,Float64},Chemfiles.positions(frame)))
end

#
# Simulation
#
function simulate(params::Params{V,N,T,UnitCellType}) where {V,N,T,UnitCellType}
    @unpack x0, temperature, nsteps, box, dt, ε, σ, mass, kB = params
    trajfile = open(params.trajfile,"w")

    # To use coordinates in Angstroms, dt must be in 10ps. Usually packages
    # use ps and nm internally (thus multiply coordinates by 10 and divide
    # the timestep given in fs by 1000)
    dt = dt/100

    # Initial arrays
    x = copy(x0)
    f = similar(x)
    flast = similar(x)

    # Initial velocities
    v = randn(eltype(x),size(x))
    remove_drift!(v)
    # Adjust average to desidred temperature
    t0 = compute_temp(v,mass,kB) 
    @. v = v * sqrt(temperature/t0)
    # Remove drift

    # Build cell lists for the first time
    cl = CellList(x,box)

    # preallocate threaded output, since it contains the forces vector
    f .= Ref(zeros(eltype(f)))
    uf_threaded = [ UF(0.,deepcopy(f)) for _ in 1:nthreads() ]
    aux = CellListMap.AuxThreaded(cl)

    # Compute energy and forces at initial point
    uf = UF(0.,f)
    uf = map_pairwise!( 
        (x,y,i,j,d2,output) -> energy_and_force(x,y,i,j,d2,ε,σ,output),
        uf, box, cl, parallel=true,
        reduce=reduceuf,
        output_threaded=uf_threaded
    ) 
    u = uf.u
    kinetic = compute_kinetic(v,mass)
    print_data(0,x,params,u,kinetic,trajfile)

    # Simulate
    for istep in 1:nsteps

        # Update positions (velocity-verlet)
        @. x = x + v*dt + 0.5*(f/mass)*dt^2

        # Reset energy and forces
        flast .= f
        u = 0.
        f .= Ref(zeros(eltype(f)))
        uf = UF(u,f)
        @threads for it in 1:nthreads()
           ft = uf_threaded[it].f 
           ft .= Ref(zeros(eltype(f)))
           uf_threaded[it] = UF(0.,ft)
        end

        # Compute energy and forces
        uf = map_pairwise!( 
            (x,y,i,j,d2,output) -> energy_and_force(x,y,i,j,d2,ε,σ,output),
            uf, box, cl, parallel=true,
            reduce=reduceuf,
            output_threaded=uf_threaded
        ) 
        u = uf.u
        if u/length(x) > 1e10
            println("Simulation is unstable. ")
            return nothing
        end
         
        # Update velocities
        @. v = v + 0.5*((flast + f)/mass)*dt 

        # Print data and output file
        kinetic = compute_kinetic(v,mass)
        print_data(istep,x,params,u,kinetic,trajfile)

        # Isokinetic bath
        if istep%params.ibath == 0
            temp = compute_temp(kinetic,kB,length(v))
            remove_drift!(v)
            @. v = v * sqrt(temperature/temp)
        end

        # Update cell lists
        cl = UpdateCellList!(x,box,cl,aux)

   end
    close(trajfile)

end

To run it, if someone wants to play with it, put this file in the same directory: https://raw.githubusercontent.com/lmiq/PairVelocities/main/namd_benchmark/ne10k.pdb

just do (with these steps it takes ~5s with 4 cores, in my laptop),

p = Params(nsteps=200)
simulate(p)

lmiq · August 30, 2021, 6:52pm

Uhm, well, good to know. That makes it somewhat cumbersome to write the setup. I didn’t realize that. (this does not affect the overall benchmarks I’m talking about of my program, just to be clear, I just wrote that for that MWE).

@DNF , fixed above the call to the benchmark. Thanks again for noticing that.

dlakelan · August 30, 2021, 6:56pm

It looks like you could do:

macro fpcse(ex)
   FastPow.fastpow(CommonSubexpressions.cse(ex))
end

@macroexpand @fpcse -12*eps*((x^6/d2^4)*(x^6/d2^3) - (x^6/d2^4))*r
quote
    var"#25###540" = begin
            var"#26###548" = Main.x * Main.x
            var"#27###549" = var"#26###548" * Main.x
            var"#28###550" = var"#27###549" * var"#27###549"
            var"#28###550"
        end
    var"#29###541" = begin
            var"#30###551" = Main.d2 * Main.d2
            var"#31###552" = var"#30###551" * var"#30###551"
            var"#31###552"
        end
    var"#32###542" = var"#25###540" / var"#29###541"
    var"#33###543" = begin
            var"#34###553" = Main.d2 * Main.d2
            var"#35###554" = var"#34###553" * Main.d2
            var"#35###554"
        end
    var"#36###544" = var"#25###540" / var"#33###543"
    var"#37###545" = var"#32###542" * var"#36###544"
    var"#38###546" = var"#37###545" - var"#32###542"
    var"#39###547" = -12 * Main.eps * var"#38###546" * Main.r
    var"#39###547"
end

God I love Julia!

DNF · August 30, 2021, 7:12pm

Sure, but

σ6d24 = (σ^3/d2^2)^2
dudr = -12ε * σ6d24 * (σ6d24 * d2 - 1) * r

should have similar performance, and look cleaner, with no macro call.

dlakelan · August 30, 2021, 7:38pm

fastpow will give you a big speedup vs calling ^ 3 times I believe. You could still do:

sig6d24 = @fastpow (sig^3/d2^2)^2

though

Oscar_Smith · August 30, 2021, 7:47pm

^ has fast-paths for squares and cubes. Also, https://github.com/JuliaLang/julia/pull/42031 will hopefully significantly improve performance for small powers.

DNF · August 30, 2021, 7:57pm

jl> @btime (σ^3/d2^2)^2 setup=(σ=rand()+1; d2=rand()+1);
  2.200 ns (0 allocations: 0 bytes)

jl> @btime @fastpow((σ^3/d2^2)^2) setup=(σ=rand()+1; d2=rand()+1);
  2.200 ns (0 allocations: 0 bytes)