Accelerate pairwise Lennard-Jones force computation

Hehe, doing this made everything work with the padded vectors (hell Julia is cool):

I replaced using StaticArrays with:

import StaticArrays.@SVector
import StaticArrays.SMatrix
# 4-field struct, so a "3D" vector of Float64 occupies a full 256 bits (w is padding)
struct SVector{N,T} <: StaticArrays.FieldVector{N,T}
    x::T
    y::T
    z::T
    w::T
end
# the 3-argument constructor fills the padding field with zero
SVector{N,T}(x,y,z) where {N,T} = SVector{N,T}(x,y,z,zero(T))

Unfortunately it got slower :joy: (it was a long shot that this would work right away for everything I do in the package…)

julia> @btime CellListMap.florpi()
  115.328 ms (1455535 allocations: 62.46 MiB)
(true, [0.0007883474482653071, -0.00356623718786355, -0.0004074200882398694, 0.0003623877989466285, -0.001044133453861484])

You should make sure it does arithmetic with w as well, so that it can SIMD.
I don't have a ready-made solution for this.


A padded version for the three-dimensional case in StaticArrays would be a nice thing to have. Three-dimensional particles are very common :slight_smile: If only the hardware developers had known that in advance, they could have designed SIMD instructions ideal for 192 bits.


Just to add that defining the arithmetic for the "new" type didn't make it faster either:

# arithmetic is defined including the padding field w, so that all four lanes can SIMD
norm_sqr(p :: SVector{3,Float64}) = p.x^2 + p.y^2 + p.z^2 + p.w^2
norm(p :: SVector{3,Float64}) = sqrt(norm_sqr(p))
import Base.+, Base.-, Base.*
-(p1::SVector{3,Float64},p2::SVector{3,Float64}) = SVector{3,Float64}(p1.x-p2.x,p1.y-p2.y,p1.z-p2.z,p1.w-p2.w)
+(p1::SVector{3,Float64},p2::SVector{3,Float64}) = SVector{3,Float64}(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w)
*(c::Float64,p1::SVector{3,Float64}) = SVector{3,Float64}(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
*(p1::SVector{3,Float64},c::Float64) = SVector{3,Float64}(c*p1.x,c*p1.y,c*p1.z,c*p1.w)

Trying to do this properly, and checking whether it pays off for the complete set of computations, will probably take some more work.

My understanding and personal experience so far is that for anything short-ranged and analytic like LJ, or basic 3-body or EAM potentials, the neighbour list will be the bottleneck. Even when you buffer it, you still need to go through it and compute all the (squared) distances. Once you have those in a long list you can SIMD the assembly of the forces, and that will be very fast. But it won't help much, because of the cost of the neighbour list?


Yes, absolutely. One needs to avoid, as much as possible, having to compute the squared distance between particles. What I have implemented so far allows defining the size of the cells (cutoff, cutoff/2, cutoff/3), which reduces the number of distances computed at the expense of running over more cells. I have also implemented a scheme in which one projects the coordinates of the particles onto the vector connecting the cell centers, sorts the particles along that direction, and computes the distances only for those pairs whose projections are closer than the cutoff. All of this significantly reduces the number of distances that have to be computed.
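For concreteness, here is a minimal toy sketch of that projection idea (not the actual code in CellListMap; xs and ys are the particles of two neighboring cells, cx and cy their centers):

using StaticArrays, LinearAlgebra

function close_pairs(xs, ys, cx, cy, cutoff)
    axis = normalize(cy - cx)                      # direction joining the two cell centers
    px = sort!([(dot(p, axis), i) for (i, p) in enumerate(xs)])
    py = sort!([(dot(p, axis), j) for (j, p) in enumerate(ys)])
    pairs = Tuple{Int,Int}[]
    for (sx, i) in px
        for (sy, j) in py
            sy - sx > cutoff && break              # later projections can only be farther away
            sx - sy > cutoff && continue
            # full distance check only for pairs that survive the projection test
            norm(ys[j] - xs[i]) <= cutoff && push!(pairs, (i, j))
        end
    end
    return pairs
end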

With that implemented, it now seems that my package spends 60% of the time computing the remaining forces. Therefore, I would have to remove everything else to get closer to what NAMD obtains. For now, I can only imagine that NAMD is able to align the force computation such that SIMD instructions take place very effectively. I will now try to build lists of particles, pad the vectors as suggested by Elrod, and "turbo" the force computation to see how much faster that gets, but all the attempts I made so far didn't pay off the price of having to build the list.

On a shared-memory computer the CPU scaling is not much worse than NAMD's (I bet that on multiple machines it will be much, much worse, since NAMD is specialized for that).

Of course, for practical speedup, there is the path of putting everything on the GPU. Initially I thought that would be very difficult, but the package already does not allocate anything and the kernels are all custom kernels anyway, so maybe it is easier than I initially anticipated.


It isn’t just about memory

(This requires ArrayInterface >= 3.1.30 and LoopVectorization >= 0.12.69, which as of writing this are the latest versions [for things like properly propagating static size information of the reinterpreted array to LV].)
Benchmarks of SVector{3,Float64} vs SVector{4,Float64}:

julia> @benchmark forces!($x3,$f3,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  41.045 ns … 69.736 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     44.280 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   44.333 ns Β±  0.524 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                        β–β–ˆβ–ˆβ–†β–‡β–ˆ          ▁▁▂▁▁ β–‚
  β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ƒβ–„β–ƒβ–ƒβ–β–β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆ
  41 ns        Histogram: log(frequency) by time      45.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forces_turbo!($x3,$f3,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  40.308 ns … 62.864 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     40.906 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   40.992 ns Β±  0.532 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–„β–‡        β–†β–ˆ         ▂▁
  β–‚β–‚β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ƒβ–‚β–‚β–β–β–‚β–…β–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–…β–‡β–†β–…β–ˆβ–ˆβ–ƒβ–‚β–ƒβ–„β–ƒβ–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  40.3 ns         Histogram: frequency by time        42.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forces!($x4,$f4,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.011 ns … 48.468 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     28.552 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   28.554 ns Β±  0.454 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   ▁▃          ▄▁     β–‚β–ˆβ–„
  β–‚β–ˆβ–ˆβ–…β–„β–„β–„β–„β–„β–†β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–†β–†β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  28 ns           Histogram: frequency by time        29.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forces_turbo!($x4,$f4,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  27.562 ns … 52.618 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     27.753 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   27.768 ns Β±  0.405 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–‡β–ˆ    β–…β–ˆ
  β–‚β–†β–ˆβ–ˆβ–‡β–ƒβ–ƒβ–„β–ˆβ–ˆβ–‡β–ƒβ–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  27.6 ns         Histogram: frequency by time          29 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Code:
using StaticArrays, FastPow, BenchmarkTools, LoopVectorization
function forces_turbo!(x::Vector{SVector{N,T}},f::Vector{SVector{N,T}},d,idxs) where {N,T}
  X = reinterpret(reshape, T, x)
  F = reinterpret(reshape, T, f)
  forces_turbo!(X,F,d,idxs)
  return f
end
@inline function forces_turbo!(X,F,d,idxs)
  @turbo for id in axes(idxs,1)
    i = idxs[id,1]
    j = idxs[id,2]
    for k in axes(X,1)
      r = X[k,j] - X[k,i]
      dudr = -12*(1/d^7 - 1/d^4)*r
      F[k,i] = F[k,i] + dudr
      F[k,j] = F[k,j] - dudr
    end
  end
end

function forces!(x,f,d,idxs)
    @inbounds for id in axes(idxs,1)
        i = idxs[id,1]
        j = idxs[id,2]
        r = x[j] - x[i] 
        @fastpow dudr = -12*(1/d^7 - 1/d^4)*r
        f[i] = f[i] + dudr
        f[j] = f[j] - dudr
   end
   return f
end

x3 = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f3 = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
x4 = [ rand(SVector{4,Float64}) for _ in 1:1000 ];
f4 = [ zeros(SVector{4,Float64}) for _ in 1:1000 ];
d = rand();
idxs = rand(1:1000,20,2);

f3_2 = copy(f3); f4_2 = copy(f4);
forces!(x3,f3,d,idxs) ≈ forces_turbo!(x3,f3_2,d,idxs)
forces!(x4,f4,d,idxs) ≈ forces_turbo!(x4,f4_2,d,idxs)

@benchmark forces!($x3,$f3,$d,$idxs)
@benchmark forces_turbo!($x3,$f3,$d,$idxs)

@benchmark forces!($x4,$f4,$d,$idxs)
@benchmark forces_turbo!($x4,$f4,$d,$idxs)

Here is the asm of the @turbo loop with SVector{3,Float64}.
Here is the asm of the @turbo loop with SVector{4,Float64}.
The above benchmarks were done on a Cascadelake CPU, so you can click the Cascadelake box (under Microarchitecture), make sure uiCA is checked among the tools (it is the most accurate), and then click Run! to get an estimated throughput (average number of clock cycles per iteration completed when doing many iterations under certain memory assumptions) and analysis of the assembly and port use.

You can use @code_native syntax=:intel debuginfo=:none to get the assembly of the loops for your architecture to take a look as well.
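For example, with the x4, f4, d, idxs arrays defined in the code block above (just a usage sketch):

using InteractiveUtils   # needed outside the REPL; loaded by default in the REPL
@code_native syntax=:intel debuginfo=:none forces!(x4, f4, d, idxs)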

My original point here was to say that LoopVectorization can SIMD 192-bit vectors just fine.
However, the estimated throughput is worse: all the masked operations require 2 ports instead of 1.

However, the performance difference we see here is much larger than the estimated 5 vs 4.28 cycles for this architecture.
I suspect that difference comes from many of the loads and stores in the SVector{3,Float64} case not being aligned, crossing 64 byte boundaries. Loads and stores crossing such a boundary count double.

Thus, the memory accesses are much faster in the SVector{4,Float64} case than in the SVector{3,Float64} case.
Note that in both cases, memory accesses dominated the cost of the loop. E.g., in the SVector{4,Float64} case, the two ports experiencing the heaviest load were ports 2 and 3, both of which only handled memory.
In the SVector{3,Float64} case, port 5 was hit hardest, and also restricted entirely to memory related operations.

Note that uiCA assumes all memory is in the L1 cache; by saying this is memory dominated, I mean the CPU’s execution resources in the best case scenario are mostly spent rearranging memory (which is pretty normal).
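To see the cache-line point in numbers, here is a small back-of-the-envelope check (my addition; it assumes the array data happens to start on a 64-byte boundary):

# does an access of sz bytes starting at byte offset off cross a 64-byte boundary?
crosses(off, sz) = off ÷ 64 != (off + sz - 1) ÷ 64
count(k -> crosses(24k, 24), 0:999)   # 24-byte SVector{3,Float64}: 250 of 1000 elements straddle a line
count(k -> crosses(32k, 32), 0:999)   # 32-byte SVector{4,Float64}: 0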

EDIT:
You could also speed it up slightly by first calculating dinv = inv(d), and then using dinv^4 and dinv^7.


The issue with these benchmarks is that, in the code, the interparticle distance d is taken to be constant for all particle pairs. Using that information, the Julia compiler hoists the entire LJ-force computation out of the loop, as far as I can tell. Therefore, the results are slightly misleading. Still, your solution gives a noticeable speed-up if I assume that the neighbor pair distances are stored in the array d. (I also added a version where the force calculation is implemented slightly more efficiently, in the way that you suggested, which gives an additional ~10% speed-up.)

using StaticArrays, FastPow, BenchmarkTools, LoopVectorization
function forces_turbo!(x::Vector{SVector{N,T}},f::Vector{SVector{N,T}},d,idxs) where {N,T}
  X = reinterpret(reshape, T, x)
  F = reinterpret(reshape, T, f)
  forces_turbo!(X,F,d,idxs)
  return f
end

function forces_turbo2!(x::Vector{SVector{N,T}},f::Vector{SVector{N,T}},d,idxs) where {N,T}
    X = reinterpret(reshape, T, x)
    F = reinterpret(reshape, T, f)
    forces_turbo2!(X,F,d,idxs)
    return f
end

@inline function forces_turbo!(X,F,d,idxs)
  @turbo for id in axes(idxs,1)
    i = idxs[id,1]
    j = idxs[id,2]
    dij = d[id] 
    for k in axes(X,1)
      r = X[k,j] - X[k,i]
      dudr = -12*(1/dij^7 - 1/dij^4)*r
      F[k,i] = F[k,i] + dudr
      F[k,j] = F[k,j] - dudr
    end
  end
end

@inline function forces_turbo2!(X,F,d,idxs)
    @turbo for id in axes(idxs,1)
      i = idxs[id,1]
      j = idxs[id,2]
      dij = d[id] 
      inv_dij = 1.0/dij
      inv_dij4 = inv_dij^4
      inv_dij7 = inv_dij^7
      dudr1 = (-12.0*(inv_dij7 - inv_dij4))
      for k in axes(X,1)
        r = X[k,j] - X[k,i]
        dudr = dudr1*r
        F[k,i] = F[k,i] + dudr
        F[k,j] = F[k,j] - dudr
      end
    end
end

function forces!(x,f,d,idxs)
    @inbounds for id in axes(idxs,1)
        i = idxs[id,1]
        j = idxs[id,2]
        r = x[j] - x[i]
        dij = d[id] 
        @fastpow dudr = -12*(1/dij^7 - 1/dij^4)*r
        f[i] = f[i] + dudr
        f[j] = f[j] - dudr
   end
   return f
end

function forces2!(x,f,d,idxs)
    @inbounds @fastmath @simd for id in axes(idxs,1)
        i = idxs[id,1]
        j = idxs[id,2]
        r = x[j] - x[i]
        dij = d[id] 
        inv_dij = 1.0/dij
        inv_dij2 = inv_dij*inv_dij
        inv_dij4 = inv_dij2*inv_dij2
        inv_dij7 = inv_dij*inv_dij2*inv_dij4
        dudr = (-12.0*(inv_dij7 - inv_dij4))*r
        f[i] += dudr
        f[j] -= dudr
   end
   return f
end

x3 = [ rand(SVector{3,Float64}) for _ in 1:1000 ];
f3 = [ zeros(SVector{3,Float64}) for _ in 1:1000 ];
x4 = [ rand(SVector{4,Float64}) for _ in 1:1000 ];
f4 = [ zeros(SVector{4,Float64}) for _ in 1:1000 ];
d = rand(20);
idxs = rand(1:1000,20,2);

f3_2 = copy(f3); f4_2 = copy(f4);

display(@benchmark forces!($x3,$f3,$d,$idxs))   #68ns
display(@benchmark forces2!($x3,$f3,$d,$idxs))  #66ns
display(@benchmark forces_turbo!($x3,$f3,$d,$idxs)) #66ns
display(@benchmark forces_turbo2!($x3,$f3,$d,$idxs)) #61ns  

display(@benchmark forces!($x4,$f4,$d,$idxs))   #53ns
display(@benchmark forces2!($x4,$f4,$d,$idxs))  #48ns
display(@benchmark forces_turbo!($x4,$f4,$d,$idxs)) #55ns
display(@benchmark forces_turbo2!($x4,$f4,$d,$idxs)) #50ns  

The results might of course be different on your machine…


LoopVectorization seems to be unrolling too much here. @turbo unroll=1 improves performance to match the equivalent forces! versions for me.
The unrolling itself probably isn’t that problematic, but there is some severe misoptimization going on in the unrolled code vs what you’d get if you just copy/pasted 4 instances of the non-unrolled version.
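For reference, a sketch of what that flag looks like applied to the turbo kernel from the previous post (forces_turbo2_u1! is just a name made up here; the body is otherwise the same):

@inline function forces_turbo2_u1!(X, F, d, idxs)
    @turbo unroll=1 for id in axes(idxs,1)   # cap LoopVectorization's unrolling at 1
        i = idxs[id,1]
        j = idxs[id,2]
        inv_dij = 1.0/d[id]
        dudr1 = -12.0*(inv_dij^7 - inv_dij^4)
        for k in axes(X,1)
            dudr = dudr1*(X[k,j] - X[k,i])
            F[k,i] = F[k,i] + dudr
            F[k,j] = F[k,j] - dudr
        end
    end
end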

EDIT:
And yes, I was aware that the power was getting hoisted out.
One of the changes I made to LoopVectorization was to allow it to optimize the power despite it getting hoisted out. Older versions of LoopVectorization would only optimize non-hoisted powers.

EDIT:
I’ve fixed the codegen on LV master. Should be more or less the same now.


By the way, if I artificially change the implementation (with the new version of LoopVectorization) such that all memory accesses/writes are sequential, it still takes 35ns on my machine to run, instead of roughly 48 ns in the realistic case. To me this indicates that if NAMD still is a factor of 2 faster, they must have done some clever higher level optimization, since the difference cannot simply be explained by SIMD. Or am I missing something?

Code:

using StaticArrays, LoopVectorization, BenchmarkTools

function forces_aligned_turbo!(force_array::Vector{SVector{N,T}},r_array::Vector{SVector{N,T}},d_array) where {N,T}
    X = reinterpret(reshape, T, r_array)
    F = reinterpret(reshape, T, force_array)
    forces_aligned_turbo!(F,X,d_array)
    return force_array
end

@inline function forces_aligned_turbo!(F,X,d_array)
    @turbo for id in axes(X,2)
        inv_dij = 1.0/d_array[id]
        dudrtemp = -12*(inv_dij^7 - inv_dij^4)
        for k in axes(X,1)
            r = X[k, id]
            dudr = dudrtemp*r
            F[k, id] = dudr
        end
    end
end

force_array = [zeros(SVector{4,Float64}) for _ in 1:20]
r_array = [rand(SVector{4,Float64}) for _ in 1:20]
d_array = rand(20)
@benchmark forces_aligned_turbo!($force_array, $r_array, $d_array) # 35 ns

I have written a more complete example now.

Here I compute the forces between 10_000 particles, and I checked that in a typical scenario this involves computing the forces between 8_000_000 pairs of particles. Thus, the test below does that.

In the test, the computation takes ~90 ms. That is worse than what happens inside my code: 1000 such computations would take 90 s, which is about the time of my complete simulation of 1k steps, including cell-list building and trajectory propagation. From profiling, a typical computation of these 8 million forces takes on average 60 ms inside my code.

NAMD does the same thing my code does in roughly half the time. Thus it is also doing what the MWE below does much faster, both relative to my code and to the MWE. The room for improvement is in how the data is accessed, because I am already counting only the computations that are effectively performed.

That means that what happens in this example is an upper bound on the cost of the bad distribution of the data (in memory?) that leads to the slower performance.

How can one (anything is valid now) improve the performance of the code below, probably by a factor of 4x, while computing the same 8_000_000 force pairs?

using BenchmarkTools
using StaticArrays

function force_pair!(x,y,i,j,d2,ε,σ6,f)
    r = y - x
    d8 = (d2^2)^2
    σ6d8 = σ6*inv(d8)
    d23inv = inv(d2^3)
    dudr = -12ε * σ6d8 * (σ6*d23inv - 1)*r
    @inbounds f[i] += dudr
    @inbounds f[j] -= dudr
    return f
end

function compute_force!(f,x,y,pairlist,distances)
    σ = 3.28018
    ε = 0.0441795
    σ6 = σ^6
    @inbounds for ipair in 1:length(pairlist)
        i = pairlist[ipair][1]
        j = pairlist[ipair][2]
        d2 = distances[ipair]
        f = force_pair!(x[i],y[j],i,j,d2,ε,σ6,f)
    end
    return f
end

# data
n = 10_000 # number of particles
npairs = 8026946 # number of pairs within cutoff
pairlist = [ SVector{2,Int}(rand(1:n),rand(1:n)) for _ in 1:npairs ]
distances = [ 144*rand() for _ in 1:npairs ]
x = [ 12*rand(SVector{3,Float64}) for _ in 1:n ]
y = [ 12*rand(SVector{3,Float64}) for _ in 1:n ]

# Initialize force
f = zeros(SVector{3,Float64},length(x))

# Benchmark
@benchmark compute_force!($f,$x,$y,$pairlist,$distances)

# Here, I get: (Samsung Laptop i7 8th Gen)
#
# BenchmarkTools.Trial: 57 samples with 1 evaluation.
#  Range (min … max):  87.758 ms … 102.155 ms  β”Š GC (min … max): 0.00% … 0.00%
#  Time  (median):     88.581 ms               β”Š GC (median):    0.00%
#  Time  (mean Β± Οƒ):   88.972 ms Β±   1.902 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%
# 
#          β–ƒβ–ˆβ–†β–ƒ ▁▃ ▁ ▁ ▁  β–ƒ                                       
#   β–„β–β–„β–β–‡β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–„β–ˆβ–ˆβ–„β–ˆβ–‡β–ˆβ–„β–ˆβ–β–‡β–ˆβ–„β–„β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–„ ▁
#   87.8 ms         Histogram: frequency by time         91.5 ms <
# 
#  Memory estimate: 0 bytes, allocs estimate: 0.
# 
#  >>> This means that 1000 computations of this would take 90 s.
#  >>> NAMD does a simulation of 1k steps in 44 s on my machine. 
#  >>> My simulation code does 1k steps in 87 s. 
# 
#  (therefore the loop above is *worse* than what happens in my code, 
#  because, for the same number of computations, my code takes 60% of the time,
#  according to the profile)

On my machine putting @inline before the definition of force_pair! reduces median time from 81 ms to 66 ms.


EDIT: I added both @inline and @fastmath, as well as used SVector{4,Float64}.
Original time without any of these changes was about 60 ms per call.

julia> @benchmark compute_force!($f,$x,$y,$pairlist,$distances)
BenchmarkTools.Trial: 207 samples with 1 evaluation.
 Range (min … max):  23.495 ms …  26.798 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     24.147 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   24.222 ms Β± 566.488 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–„β–„β–ˆβ–‡β–‚             ▁▂▄▂▂
  β–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–β–ˆβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–β–β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ƒβ–ƒβ–ƒβ–„β–ƒβ–ƒβ–β–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–ƒ β–ƒ
  23.5 ms         Histogram: frequency by time         26.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         for i in 1:10; compute_force!(f,x,y,pairlist,distances); end
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.04e+09  100.0%  #  4.5 cycles per ns
└ task-clock               2.33e+08  100.0%  # 232.7 ms
┌ instructions             2.09e+09  100.0%  #  2.0 insns per cycle
│ branch-instructions      8.03e+07  100.0%  #  3.8% of instructions
└ branch-misses            1.05e+02  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    3.35e+08  100.0%  # 59.7% of dcache loads
│ L1-dcache-loads          5.62e+08  100.0%
│ cache-misses             1.18e+07  100.0%  # 19.6% of cache references
└ cache-references         6.05e+07  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

10 calls took around 230 ms, or about 23 ms per call.
Average of 2.0 instructions per cycle.

If I reduce n and npairs

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         for i in 1:1_000_000; compute_force!(f,x,y,pairlist,distances); end
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               9.32e+08  100.0%  #  4.6 cycles per ns
└ task-clock               2.05e+08  100.0%  # 204.6 ms
┌ instructions             2.82e+09  100.0%  #  3.0 insns per cycle
│ branch-instructions      1.39e+08  100.0%  #  4.9% of instructions
└ branch-misses            9.42e+05  100.0%  #  0.7% of branch instructions
┌ L1-dcache-load-misses    4.16e+03  100.0%  #  0.0% of dcache loads
│ L1-dcache-loads          7.76e+08  100.0%
│ cache-misses             3.29e+02  100.0%  # 62.2% of cache references
└ cache-references         5.29e+02  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> n, npairs
(100, 100)

We have 3 instructions per cycle.

So, if you can implement your problem in some way such that you evaluate as you generate pairs, or in any other way that evaluates points with better locality, you could potentially get another 50% performance increase through better IPC.

With the full-size problem, I'm now at about 3x faster than the original version.
If you're able to improve locality by changing your algorithm to work on hot blocks somehow, you could get another 50%, to about 4.5x.


Just so I really understand, better locality here means that the coordinates involved are closer to each other in the array of coordinates, correct?

That is not "natural", but it can be done by reordering the particles according to their positions from time to time.
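A sketch of what that periodic reordering could look like (an illustration only, not the package's code; it assumes coordinates in a box of ncells cells of side cell_side per dimension):

using StaticArrays

function spatial_permutation(x::Vector{SVector{3,Float64}}, cell_side, ncells)
    # linear index of the cell a particle falls in; after permuting with this ordering,
    # particles in the same cell become contiguous in memory
    cell_of(p) = begin
        c = ntuple(k -> clamp(floor(Int, p[k]/cell_side) + 1, 1, ncells), 3)
        (c[3]-1)*ncells^2 + (c[2]-1)*ncells + c[1]
    end
    return sortperm(x, by = cell_of)
end

# done only every few steps, so the O(N log N) sortperm is not critical:
# perm = spatial_permutation(x, cutoff, ncells); x = x[perm]  (and the same for f, v, ...)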

Here (for my records):

@inline @fastmath:
90ms (no change)

SVector{4} only:
105ms

SVector{4} + @inline:
74 ms

SVector{4} + @inline @fastmath:
74 ms

SVector{4} + @inline
sort!(pairlist, by = el -> el[1]) # actually this is closer to what I get 

39ms

SVector{4} + @inline
sort!(pairlist)

34 ms (sorting takes 275 ms)

SVector{3} + @inline
sort!(pairlist, by = el -> el[1])

51 ms (probably something like this is what I already have)

SVector{4} + @inline
ilist = rand(1:n,npairs)
pairlist = [ SVector{2,Int}(ilist[i],rand(max(1,ilist[i]-20):min(ilist[i]+20,n))) for i in 1:npairs ]
sort!(pairlist, by = el -> el[1])

31 ms (simulating locality)

My brief conclusions here: the padding, properly implemented, makes the greatest difference. The access order of the first index is important, but that is already almost the case in my code. One needs to inline so that the padding works as expected.

Well, it is O(n) and can be done while building the cell lists. May be worthwhile. But I think my code already does some of that (not with that purpose, but it came naturally).

Indeed (as I tested above). This is what I essentially already have, and explains why the MWE is worse than my code.


Is that expensive?

Your data should all fit in the L2 cache, but the random accesses mean the prefetcher can't fetch ahead of time, so presumably L2 latency gets you.

So, basically, you want all your x, y, and f memory accesses to be in the L1 cache.
Two ways are:

  1. accessing with a pattern (e.g., sequentially) to trigger the prefetchers to fetch the memory.
  2. repeatedly accessing the same memory before it leaves the L1 cache. After you use memory, it’ll be in the L1 cache until it gets evicted. So if you can cluster the pairlist somehow so that repeated accesses to the same memory are together (and hence those after the first will occur before it leaves the L1 cache), that should help.

Sorting by only the first element is obviously not ideal, but it is an easy way to test and yields a 10% improvement:

julia> pairlist_sorted = sort(pairlist, by = first);

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         for i in 1:10; compute_force!(f,x,y,pairlist_sorted,distances); end
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               9.38e+08  100.0%  #  4.6 cycles per ns
└ task-clock               2.05e+08  100.0%  # 205.3 ms
┌ instructions             2.09e+09  100.0%  #  2.2 insns per cycle
│ branch-instructions      8.03e+07  100.0%  #  3.8% of instructions
└ branch-misses            1.31e+02  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    1.82e+08  100.0%  # 32.4% of dcache loads
│ L1-dcache-loads          5.62e+08  100.0%
│ cache-misses             1.30e+07  100.0%  # 39.7% of cache references
└ cache-references         3.28e+07  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         for i in 1:10; compute_force!(f,x,y,pairlist,distances); end
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.06e+09  100.0%  #  4.6 cycles per ns
└ task-clock               2.30e+08  100.0%  # 230.3 ms
┌ instructions             2.09e+09  100.0%  #  2.0 insns per cycle
│ branch-instructions      8.03e+07  100.0%  #  3.8% of instructions
└ branch-misses            9.60e+01  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    3.35e+08  100.0%  # 59.7% of dcache loads
│ L1-dcache-loads          5.62e+08  100.0%
│ cache-misses             1.22e+07  100.0%  # 20.0% of cache references
└ cache-references         6.10e+07  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Also, you don’t want to actually ever call sort. No sense turning an O(N) algorithm into an O(N*log(N)) algorithm.


That is out of the question. But what I do is this: Fastest way to partition array, given a condition

This takes quite a significant part of the time (but it produces some of that locality and avoids a lot of force computations, and it turns out to be very effective). Still, it is the most time-consuming operation apart from the calculation of the forces for the pairs.
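For readers who do not follow the link, the kind of operation referred to is an O(N) in-place partition along these lines (a generic sketch, not the code from that thread):

function partition!(pred, v)
    k = 0
    @inbounds for i in eachindex(v)
        if pred(v[i])
            k += 1
            v[i], v[k] = v[k], v[i]   # move elements satisfying pred to the front
        end
    end
    return k   # elements 1:k satisfy pred
end

# usage sketch (pairbuffer and its d2 field are hypothetical names):
# nkeep = partition!(pair -> pair.d2 <= cutoff^2, pairbuffer)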

If I understood the problem correctly, you have a 2D grid where some positions are occupied by particles and you want to calculate some forces between close neighbors (a neighbor being a particle within some cutoff distance) and you’re currently indexing linearly over an array, right?

If so, this sounds like an application for a quadtree of particles instead of a list. It comes at the cost of building the tree (where you have to take care to build a "good" tree that doesn't devolve into a linked list), but once you have that, you should be able to query relevant neighbors much faster. It's theoretically possible to build this tree as an abstraction over a regular array (i.e., all data stays close together), though I haven't done that yet. I only know of its practical application in computer graphics, where such quadtrees are often used for querying 2D data like pixels, or for intersecting line segments with arbitrary objects, where calculating the true intersection is expensive and it's desirable to check quickly whether an object has to be considered for intersection at all. E.g., there's an implementation of a cache-friendly, fast insertion/removal quadtree here, though it does come with some caveats in regard to how it's implemented (not sure if they're useful to/adaptable for you).
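To make the suggestion concrete, here is a very small 2D point-quadtree sketch (toy code only, untuned, and not the cache-friendly implementation linked above): leaves hold up to CAP points, and a range query prunes whole squares farther than the cutoff from the query point.

const CAP = 8

mutable struct Quad
    x0::Float64; y0::Float64; side::Float64
    points::Vector{NTuple{2,Float64}}
    children::Union{Nothing,NTuple{4,Quad}}
end
Quad(x0, y0, side) = Quad(x0, y0, side, NTuple{2,Float64}[], nothing)

# which of the four children a point belongs to
child_index(q, p) = 1 + (p[1] >= q.x0 + q.side/2) + 2*(p[2] >= q.y0 + q.side/2)

function addpoint!(q::Quad, p::NTuple{2,Float64})
    if q.children === nothing
        push!(q.points, p)
        if length(q.points) > CAP            # split an overfull leaf
            h = q.side/2
            q.children = (Quad(q.x0, q.y0, h), Quad(q.x0 + h, q.y0, h),
                          Quad(q.x0, q.y0 + h, h), Quad(q.x0 + h, q.y0 + h, h))
            pts = q.points
            q.points = NTuple{2,Float64}[]
            foreach(pt -> addpoint!(q.children[child_index(q, pt)], pt), pts)
        end
    else
        addpoint!(q.children[child_index(q, p)], p)
    end
    return q
end

# squared distance from point p to the square covered by node q (0 if p is inside)
sqdist(q, p) = (clamp(p[1], q.x0, q.x0 + q.side) - p[1])^2 +
               (clamp(p[2], q.y0, q.y0 + q.side) - p[2])^2

function inrange(q::Quad, p::NTuple{2,Float64}, r, out = NTuple{2,Float64}[])
    sqdist(q, p) > r^2 && return out         # the whole square is out of range
    if q.children === nothing
        for pt in q.points
            (pt[1]-p[1])^2 + (pt[2]-p[2])^2 <= r^2 && push!(out, pt)
        end
    else
        foreach(c -> inrange(c, p, r, out), q.children)
    end
    return out
end

# root = Quad(0.0, 0.0, 1.0)
# for _ in 1:10_000; addpoint!(root, (rand(), rand())); end
# neighbors = inrange(root, (0.5, 0.5), 0.05)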


Indeed, I think the main optimization is in the data structure. If I change the array of SArrays to another data structure I get a major (from 60ms to 30ms) speed-up, without(!) doing any simd/inbounds/fastmath/turbo.

I created a 1d-array of length n_pairs, called neig_list, in which the neighbor indices are stored. In another (2×n) array, called neigh_list_info, I store, for each particle, the starting index of its list of neighbors and its number of neighbors. Right now I give each particle the same number of neighbors, but that is easily changed.

Apologies for the messy code; I didn’t have much time.

using BenchmarkTools
using StaticArrays, LoopVectorization

@inline function force_pair!(x,y,i,j,d2,ε,σ6,f)
    r = y - x
    d8 = (d2^2)^2
    σ6d8 = σ6*inv(d8)
    d23inv = inv(d2^3)
    dudr = -12ε * σ6d8 * (σ6*d23inv - 1)*r
    f[i] = f[i] + dudr
    f[j] = f[j] - dudr
end

function compute_force!(f,x,y,pairlist,distances)
    σ = 3.28018
    ε = 0.0441795
    σ6 = σ^6
    for ipair in 1:length(pairlist)
        i = pairlist[ipair][1]
        j = pairlist[ipair][2]
        d2 = distances[ipair]
        force_pair!(x[i],y[j],i,j,d2,ε,σ6,f)
    end
end

function create_neighbour_list(n, npairs_per_neighbour)
    neigh_list_info = zeros(Int64, 2, n)
    neig_list = zeros(Int64, n*npairs_per_neighbour)
    index = 1
    for i = 1:n
        neigh_list_info[1, i] = index
        neigh_list_info[2, i] = npairs_per_neighbour
        for j = 1:npairs_per_neighbour
            while true
                neigh_index = rand(1:n)
                if neigh_index != i # we don't want a particle to be its own neighbour, so reject that case via the while-loop
                    neig_list[index] = neigh_index
                    break
                end
            end
            index += 1
        end
    end
    return neigh_list_info, neig_list
end


# data
n = 10_000 # number of particles
npairs = 8e6 # number of pairs within cutoff
npairs_per_neighbour = 800
pairlist = [ SVector{2,Int}(rand(1:n),rand(1:n)) for _ in 1:npairs ]
distances = [ 144*rand() for _ in 1:npairs ]
x = [ 12*rand(SVector{4,Float64}) for _ in 1:n ]
y = [ 12*rand(SVector{4,Float64}) for _ in 1:n ]
# Initialize force
f = zeros(SVector{4,Float64},length(x))



function compute_force2!(f,x,y,neig_list, neigh_list_info,distances)
    σ = 3.28018
    ε = 0.0441795
    σ6 = σ^6
    n = size(neigh_list_info)[2]
    for particle_i = 1:n
        first_index = neigh_list_info[1, particle_i]
        lastindex = first_index + neigh_list_info[2, particle_i] - 1
        xi = x[particle_i]
        for i = first_index:lastindex
            particle_j = neig_list[i]
            d2 = distances[i]
            xj = x[particle_j]
            force_pair!(xi,xj,particle_i,particle_j,d2,ε,σ6,f)
        end
    end
end

display(@benchmark compute_force!($f,$x,$y,$pairlist,$distances)) # 61 ms
neigh_list_info, neig_list = create_neighbour_list(n, npairs_per_neighbour)
display(@benchmark compute_force2!($f,$x,$y,$neig_list, $neigh_list_info, $distances)) # 33ms

Yes, nice, that is roughly equivalent to what I get sorting the pairs by the first index. That is the idea. Apparently the best improvement I can do now is to couple that with the padding of the coordinate vectors.

@Sukera, indeed. What I do is something of the sort. I am excluding from these comparisons the part that corresponds to building the neighbour list, because it seems that, given the list, there are still some improvements to be made to reach the performance of the "famous" package. For the specific type of data we have (which is 2D or, most commonly, 3D), my package already provides a neighbourlist function which retrieves the pairs faster than all the tree-based algorithms I compared to (taking into account the time for building the list/tree).

Example:

julia> using CellListMap, NearestNeighbors, StaticArrays, BenchmarkTools

julia> x = rand(SVector{3,Float64},10_000); y = rand(SVector{3,Float64},10_000);

julia> ball_tree = NearestNeighbors.BallTree(x);

julia> NearestNeighbors.inrange(ball_tree,y,0.05,true);

julia> @btime NearestNeighbors.BallTree($x);
  3.102 ms (10 allocations: 453.55 KiB)

julia> @btime NearestNeighbors.inrange($ball_tree,$y,0.05,true);
  18.620 ms (26097 allocations: 1.81 MiB)

julia> @btime CellListMap.neighbourlist($x,$y,0.05);
  10.077 ms (13459 allocations: 6.35 MiB)

julia> @btime CellListMap.neighbourlist($x,$y,0.05,parallel=true);
  4.870 ms (13651 allocations: 8.89 MiB)

Normally it’s a 3D space
