Help make my ideal gas simulation ideally fast

I have been meaning to speed up some of the non-bonded force calculation in Molly.jl and I had a play with it today, but found I hit a ceiling.

I have extracted a simplistic molecular dynamics simulation of a Lennard-Jones fluid as an MWE and would appreciate any tips on how to make it faster. The code and current benchmark results are below. I’m sure there’s some SIMD magic and the like that I am missing. I know there are no periodic boundary conditions or thermostat - it’s meant to be a minimal example rather than a meaningful simulation.

I am not looking for algorithmic improvements like neighbour lists here; that would be a separate optimisation.

Bonus points would go to anyone who could make it run fast on the GPU and/or with Zygote. I have got these to work, but with severe performance hits compared to the numbers below. Bonus points also for keeping the ljforce! function relatively easy to read; the idea is that people can define their own arbitrary functions of this type.

using StaticArrays
using BenchmarkTools

natoms = 100
nsteps = 1000

mutable struct Vec3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

# The force due to a Lennard-Jones potential
@fastmath function ljforce!(accs, coords, i, j)
    ϵ = 0.2
    σ = 1.0
    dx = coords[j].x - coords[i].x
    dy = coords[j].y - coords[i].y
    dz = coords[j].z - coords[i].z
    r2 = dx * dx + dy * dy + dz * dz
    six_term = (σ ^ 2 / r2) ^ 3
    f = min(((24 * ϵ) / r2) * (2 * six_term ^ 2 - six_term), 100.0)
    accs[i].x -= f * dx
    accs[i].y -= f * dy
    accs[i].z -= f * dz
    accs[j].x += f * dx
    accs[j].y += f * dy
    accs[j].z += f * dz
    return accs
end

# Calculate all against all forces
function forces(coords)
    natoms = size(coords, 1)
    accs = [Vec3D{Float64}(0.0, 0.0, 0.0) for _ in 1:natoms]
    for i in 1:natoms
        for j in 1:(i - 1)
            ljforce!(accs, coords, i, j)
        end
    end
    return accs # Assume mass is 1 so F = a
end

# Velocity-free Verlet integration
function simulate(natoms, nsteps)
    coords = [Vec3D{Float64}(rand(3)...) for _ in 1:natoms] .* 10.0
    # Storing previous coordinates imitates velocity
    coords_last = coords .+ [Vec3D{Float64}(rand(3)...) for _ in 1:natoms] .* 0.01
    for i in 1:nsteps
        accs = forces(coords)
        coords_next = 2 .* coords .- coords_last .+ accs .* 0.01
        coords_last = coords
        coords = coords_next
    end
    return coords
end

On my machine:

accs = [Vec3D{Float64}(0.0, 0.0, 0.0) for _ in 1:natoms]
coords = [Vec3D{Float64}(rand(3)...) for _ in 1:natoms]
@benchmark ljforce!($accs, $coords, 10, 20)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.099 ns (0.00% GC)
  median time:      10.200 ns (0.00% GC)
  mean time:        10.250 ns (0.00% GC)
  maximum time:     81.601 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

and

@benchmark simulate($natoms, $nsteps)
BenchmarkTools.Trial: 
  memory estimate:  7.86 MiB
  allocs estimate:  203204
  --------------
  minimum time:     43.903 ms (0.00% GC)
  median time:      45.272 ms (0.00% GC)
  mean time:        45.331 ms (0.99% GC)
  maximum time:     48.995 ms (5.27% GC)
  --------------
  samples:          111
  evals/sample:     1

[Edited to interpolate the variables during benchmarking.]

1 Like

You could store 1/r2.
Maybe use GFlops.jl to see how fast you already are.
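
A minimal sketch of the 1/r2 suggestion applied to the force-magnitude part of ljforce! (lj_magnitude is an illustrative name, not from the thread): the reciprocal is computed once and the divisions become multiplications.

function lj_magnitude(r2, ϵ = 0.2, σ = 1.0)
    invr2 = inv(r2)  # store 1/r2 once
    six_term = (σ^2 * invr2)^3
    return min(24ϵ * invr2 * (2 * six_term^2 - six_term), 100.0)
end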

The first thing I would do here is make your vectors behave more like numbers. Specifically, I would define arithmetic operations on them and make them immutable.

For example

import Base: +, -, *
import LinearAlgebra: dot
function +(x::Vec3D{T}, y::Vec3D{T}) where T
    Vec3D{T}(x.x+y.x, x.y+y.y, x.z+y.z)
end
function -(x::Vec3D{T}, y::Vec3D{T}) where T
    Vec3D{T}(x.x-y.x, x.y-y.y, x.z-y.z)
end
function *(x::Vec3D{T}, c::Real) where T
    Vec3D{T}(x.x*c, x.y*c, x.z*c)
end
function dot(x::Vec3D{T}, y::Vec3D{T}) where T
    return x.x*y.x + x.y*y.y + x.z*y.z
end

This lets you rewrite ljforce! as:

@fastmath function ljforce!(accs, coords, i, j)
    ϵ = 0.2
    σ = 1.0
    d = coords[j] - coords[i]
    r2 = dot(d,d)
    six_term = (σ ^ 2 / r2) ^ 3
    f = min(((24 * ϵ) / r2) * (2 * six_term ^ 2 - six_term), 100.0)
    accs[i] -= f*d
    accs[j] += f * d
    return accs
end
1 Like

Making them immutable should certainly help performance, as otherwise an array of Vec3D is an array of pointers to objects rather than an array of “inline” structs.
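
For concreteness, a minimal sketch of the immutable version (same fields as the original, just without mutable; redefining a struct needs a fresh session):

using StaticArrays

# Immutable Vec3D: now isbits, so a Vector of these stores the values
# inline rather than as pointers to heap-allocated objects.
struct Vec3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end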

But no need to re-invent the wheel here. Replace your Vec3D type with SVector from StaticArrays, which does everything you could want. Then you’ll be able to do:

function ljforce!(accs, coords, i, j)
    ϵ = 0.2
    σ = 1.0
    dr = coords[j] - coords[i]
    invr2 = inv(sum(abs2, dr))
    six_term = (σ ^ 2 * invr2) ^ 3
    f = min((24ϵ * invr2) * (2 * six_term ^ 2 - six_term), 100.0)
    fdr = f * dr
    accs[i] -= fdr
    accs[j] += fdr
    return accs
end
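
The rest of the MWE then only needs its arrays built from SVectors, for example (a sketch reusing natoms from the original post):

using StaticArrays

natoms = 100
coords = [10.0 * rand(SVector{3, Float64}) for _ in 1:natoms]
accs = [zero(SVector{3, Float64}) for _ in 1:natoms]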

In your case I would tend to just inline the whole ljforce! function into the forces function, and add @inbounds to the for j loop. That will give the compiler more opportunities for SIMD, and also allow you to hoist calculations like 24ϵ out of the loop.
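
As a sketch of that suggestion (not the original post's code; it assumes coords is a Vector{SVector{3, Float64}}):

using StaticArrays

function forces(coords)
    natoms = length(coords)
    accs = [zero(SVector{3, Float64}) for _ in 1:natoms]
    ϵ = 0.2
    σ = 1.0
    twentyfourϵ = 24ϵ  # hoisted out of the loop
    σ2 = σ^2
    @inbounds for i in 1:natoms
        for j in 1:(i - 1)
            dr = coords[j] - coords[i]
            invr2 = inv(sum(abs2, dr))
            six_term = (σ2 * invr2)^3
            f = min(twentyfourϵ * invr2 * (2 * six_term^2 - six_term), 100.0)
            fdr = f * dr
            accs[i] -= fdr
            accs[j] += fdr
        end
    end
    return accs
end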

10 Likes

Isn’t FieldVector an SVector?

I would separate the Lennard-Jones force definition from the force update by passing the function itself:

function lennard_jones(r2)
    ϵ = 0.2
    σ = 1.0
    six_term = (σ ^ 2 / r2) ^ 3
    f = min(((24 * ϵ) / r2) * (2 * six_term ^ 2 - six_term), 100.0)
    return f
end

Then pass the function as an argument to ljforce!:

function ljforce!(potential, accs, coords, i, j)
    ....
    f = potential(r2)
    ...
end
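
Filled in, that pattern could look something like the sketch below (assuming the coordinates are SVectors or similar, so vector arithmetic works); the generic pair loop stays in the library and any scalar function of r2 can be plugged in.

function ljforce!(potential, accs, coords, i, j)
    dr = coords[j] - coords[i]
    fdr = potential(sum(abs2, dr)) * dr  # potential returns the force magnitude
    accs[i] -= fdr
    accs[j] += fdr
    return accs
end

# usage: ljforce!(lennard_jones, accs, coords, 10, 20)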

Thanks for the comments so far everyone.

The simulate time goes from 45 ms to 36 ms if you use SVector rather than Vec3D, which is a subtype of FieldVector.

That makes sense in the case of the MWE. I had hoped I could keep it as a separate function though, as I want users to be able to write custom force functions without worrying about the logic of looping over atoms (which will change due to the neighbour list later).

It might be possible to inline even in that case, by specializing on the function type and telling users to mark their force functions @inline. First I would check whether it makes any difference.
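
A minimal sketch of the user side of that idea (whether it actually helps would need benchmarking):

# Julia compiles a specialised method for each concrete function type passed
# in, so an @inline hint on the user's force function lets the compiler
# inline it into the pair loop without knowing its definition in advance.
@inline function lennard_jones(r2)
    ϵ = 0.2
    σ = 1.0
    six_term = (σ^2 / r2)^3
    return min((24ϵ / r2) * (2 * six_term^2 - six_term), 100.0)
end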

1 Like
  1. When benchmarking, interpolate the variables used with ‘$’ like so:
@benchmark ljforce!($accs, $coords, 10, 20)

  2. simulate(natoms, nsteps) fails with “type SArray has no field x” on my machine and I can’t figure out why, but that’s probably unimportant when moving to SVectors anyway.

  3. Your outer force for loop should start at 2 :wink:

1 Like

I edited the original post with this.

I was getting this too but put it down to re-evaluating blocks of code without updating everything relevant. It went away when I evaluated from the top.

Hi! I thought why not make it really fast, and wrote a GPU kernel for all-to-all force calculations. I more or less copied the implementation from here, where you will also find an explanation of how the code works. You can find the result, including a visualization, here. There are some optimizations left, and the symmetry Fij = Fji is unused, but that’s probably hard to exploit on a GPU. I tested against the CPU code for correctness but cannot make any guarantees. On my GTX 980 I reach around 70 FPS (~13 ms per step) with 16384 particles. Note that the number of atoms must currently be a multiple of the number of threads. I think you have implemented a GPU kernel for non-bonded forces as well, so you can compare now :slight_smile:

EDIT: I benchmarked against my Ryzen 1700X (3.4 GHz) on a single core.

[Image: gpu_scaling benchmark plot]
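
For readers who don't open the gist, a rough sketch of what a naive all-to-all kernel can look like in CUDA.jl (this is not the gist's implementation; pairwise_forces_kernel! is an illustrative name and it assumes accs and coords are CuArrays of 3-element SVectors):

using CUDA, StaticArrays

# One thread per atom i; each thread loops over every j and accumulates the
# acceleration on i. The Fij = Fji symmetry is deliberately not exploited.
function pairwise_forces_kernel!(accs, coords, potential, natoms)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= natoms
        acc = zero(eltype(accs))
        ci = coords[i]
        for j in 1:natoms
            if j != i
                dr = coords[j] - ci
                acc -= potential(sum(abs2, dr)) * dr
            end
        end
        accs[i] = acc
    end
    return nothing
end

# Possible launch with 256 threads per block:
# @cuda threads=256 blocks=cld(natoms, 256) pairwise_forces_kernel!(accs, coords, lennard_jones, natoms)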

9 Likes

Great work, I’ll take a look.

I updated the gist; it now includes a kernel which can cope with an arbitrary number of atoms. Furthermore, it now accepts the force function as an argument. The speed is about the same as for the other kernel.

2 Likes

Nice!