Accelerate pairwise Lennard-Jones force computation

I will mark this as a solution. It is probably the best alternative. The speedup for the whole code is small, so I have to move on to the other bottlenecks to get closer to NAMD. :grinning:

Thanks all for the nice tips and discussion.

I’m not totally convinced that the indirection prevents all vectorization over packs of particles. I think the gather/scatter ops tend to become more efficient on recent architectures. @Elrod surely knows more about that.

1 Like

I’m confused by this and what @lmiq said: β€œSigma and epsilon are constants, so we can pass … them already elevated to its powers, but that is not accelerating the function too much.”

Why should Οƒ^3 be computed here and not before the call? Is it correct to think that Οƒ^3 gets constant-propagated by the compiler and therefore is not actually computed at run time? So there are only two square operations left?

1 Like

I only skimmed the previous few comments, so I don’t really know the context of where we may have gather/scatter.
On x86 CPUs, scatter is still AVX512-only. However, among those CPUs it has been getting faster, down from a reciprocal throughput of 11 cycles for scattering 8x Float64 on Skylake-X to 7 cycles on Ice/Tiger/Rocket Lake.
Hopefully Golden Cove continues the trend.

512-bit gathers for 8x Float64, on the other hand, have remained steady at 5 cycles.

The 256-bit AVX2 gather instructions, however, have sped up across generations, on both Intel and AMD Zen.
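For context, a minimal sketch (the function and variable names are made up, not from the thread’s actual kernel) of the kind of indirect access where gather/scatter comes into play: the pair list forces non-contiguous loads from x and non-contiguous stores into f, so vectorizing across pairs needs gathers for the loads and scatters for the stores.

# Hypothetical illustration, not the benchmarked code.
function accumulate_pairs!(f, x, idxs)
    @inbounds for (i, j) in idxs
        dx = x[j] - x[i]   # non-contiguous loads: a gather if vectorized across pairs
        f[i] -= dx         # non-contiguous stores: a scatter if vectorized across pairs
        f[j] += dx
    end
    return f
end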

1 Like

I think it is just that that computation is fast enough so that it does not really matter for the overall benchmark here.

The MWE benchmark for that is this one. But, anyway, it doesn’t change the fact that the package I’m comparing to is 2x faster on the same machine. Thus, processors being better or not, that does not by itself explain the difference. But it is good to know that this changes significantly from machine to machine and over time, so I may need to benchmark on a few different architectures before reaching definitive conclusions.

Just to know: does NAMD use Float32 or Float64?

NAMD uses Float64 for propagating the trajectories (afaik; I should probably check that in the newest version I downloaded a few days ago). But it may be that in some intermediate computations the data is converted to Float32 to compute auxiliary arrays. I’ve been thinking about that today. That is one possible line of research.

FWIW, I got

julia> @benchmark forces_turbo!($x,$f,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  43.041 ns … 77.720 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     43.939 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   43.931 ns Β±  0.797 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–‚β–ˆβ–„β–ƒ                    ▂▄▃▂▁▂▂
  β–‚β–ƒβ–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–β–β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–„
  43 ns           Histogram: frequency by time        45.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark forces!($x,$f,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  41.420 ns … 68.561 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     44.527 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   44.579 ns Β±  0.556 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                   β–†β–‡β–†β–„β–ˆβ–‡β–„      ▁  β–‚β–‚     ▁▂▂ β–‚
  β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ƒβ–ƒβ–ƒβ–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–†β–ˆβ–ˆβ–ˆ β–ˆ
  41.4 ns      Histogram: log(frequency) by time      46.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So I don’t see much difference here either.
Given these inputs, I don’t think you could reach 20ns on this machine.

Exactly, I do not think that with that access pattern to the elements I can take advantage of SIMD instructions. I need to change something of another order in the algorithm, but for that I need to copy the data into contiguous order at some point (and probably accumulate more operations, since with ~20 of them I guess SIMD won’t help much either).

They are using SIMD instructions, because the SVectors give you contiguous blocks of length 3.
This means it doesn’t need gather/scatter.
It’s also why it isn’t clear to me how code like this could get a 2x speedup.

I would guess there is some algorithmic difference, or some other difference somewhere else.

You could try padding the SVectors (i.e. SVector{4,Float64}). That may help.
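A minimal sketch of the padding idea (the helper name is illustrative, not from the benchmark code): store each coordinate as an SVector{4,Float64} with a zero in the fourth slot, so every particle occupies a full 32-byte block and loads/stores map onto whole SIMD registers.

# Hypothetical helper, not from the thread's code.
using StaticArrays

pad(v::SVector{3,T}) where {T} = SVector{4,T}(v[1], v[2], v[3], zero(T))

x3 = [rand(SVector{3,Float64}) for _ in 1:1000]
x4 = pad.(x3)   # padded copies; the fourth component just carries zeros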

julia> @benchmark forces!($x,$f,$d,$idxs)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  25.277 ns … 50.061 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     25.336 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   25.375 ns Β±  0.372 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–†β–ˆβ–ˆβ–†β–…β–„β–‚                                                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–…β–†β–‡β–‡β–ˆβ–‡β–ˆβ–ˆ β–ˆ
  25.3 ns      Histogram: log(frequency) by time      26.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
2 Likes

Isn’t the cube a significant part of this?

Οƒ6d24 = (Οƒ^3/d2^2)^2

There is a cube, two squares, and a divide, which I suppose can be computed as four multiplies and a divide. Doesn’t it save two multiplies by computing the cube beforehand?
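Counting it out (assuming, as the LLVM output further below confirms, that the small integer powers are lowered to multiplies):

Οƒ^3            -> 2 multiplies
d2^2           -> 1 multiply
Οƒ^3 / d2^2     -> 1 divide
(Οƒ^3/d2^2)^2   -> 1 multiply

so passing Οƒ^3 in precomputed should drop the first two multiplies, leaving two multiplies and a divide.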
I have benchmarked and agree that the time is about the same even if you take the cube out:

julia> @btime (Οƒ^3/d2^2)^2 setup=(Οƒ=rand()+1;d2=rand()+1);
  1.267 ns (0 allocations: 0 bytes)

julia> @btime (Οƒ3/d2^2)^2 setup=(Οƒ3=(rand()+1)^3;d2=rand()+1);
  1.272 ns (0 allocations: 0 bytes)

But something is strange. Looking at the code, it seems that the cube is indeed computed by multiplication:

julia> f(Οƒ,d2) = (Οƒ^3/d2^2)^2
f (generic function with 1 method)

julia> @code_llvm f(1.1,1.5)
;  @ REPL[40]:1 within `f'
define double @julia_f_1719(double %0, double %1) {
top:
; β”Œ @ intfuncs.jl:313 within `literal_pow'
; β”‚β”Œ @ operators.jl:560 within `*' @ float.jl:332
    %2 = fmul double %0, %0
    %3 = fmul double %2, %0
; β””β””
; β”Œ @ intfuncs.jl:312 within `literal_pow'
; β”‚β”Œ @ float.jl:332 within `*'
    %4 = fmul double %1, %1
; β””β””
; β”Œ @ float.jl:335 within `/'
   %5 = fdiv double %3, %4
; β””
; β”Œ @ intfuncs.jl:312 within `literal_pow'
; β”‚β”Œ @ float.jl:332 within `*'
    %6 = fmul double %5, %5
; β””β””
  ret double %6
}

How is it that those 2 out of 5 multiplies get performed for free?

1 Like

Ah, the fact that the memory access is not sequential doesn’t matter? I may be pursuing the wrong lead.

(I’m sure that is not the only difference; we are talking about something I implemented in the last two months vs. a package with decades of development by a first-class group at Urbana-Champaign. I’m actually pretty happy with the result for now, but I’d like to reduce that gap if that can be done while keeping the package generic enough.)

I’m afraid that these benchmarks can be misleading. These small benchmarks still confuse me: How to benchmark properly? Should defaults change?

My benchmark with <26 ns (vs the 44 ns from earlier) was using SVector{4,Float64}, so we got almost the 2x performance boost out of that.
We can still do SIMD loads/stores without gather/scatter, because the SVectors are contiguous in memory.

The access order being random means it is unfriendly to the hardware prefetcher, so it matters in that regard.
For CPUs with wider vectors (e.g. AVX512) it also matters, because it means we can’t benefit from them (can’t load across multiple iterations).

3 Likes

Should we, on modern hardware, always expect a performance gain when switching from 64-bit to 32-bit floats?

I can make my code run with 32-bit floats, and the types appear to be propagating nicely everywhere. But I do not see any speedup at all.

32-bit floats use half the memory, and you can do vectorized operations on twice as many per clock.
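A quick way to see the memory halving (assuming the padded 4-component layout discussed above; the numbers are just sizeof outputs):

julia> using StaticArrays

julia> sizeof(SVector{4,Float64})  # 32 bytes: one particle fills a 256-bit register
32

julia> sizeof(SVector{4,Float32})  # 16 bytes: two particles per 256-bit register
16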

1 Like

Yes, the concept is clear. But, in practice, should we expect speedups in general (I mean, except in those applications where there is a clear benefit from loop vectorization)?

In my code I get, for example:

julia> @btime CellListMap.florpi(Float32)
  63.527 ms (58588 allocations: 19.42 MiB)
(true, Float32[0.00078833575, -0.0035662432, -0.00040742423, 0.00036238355, -0.0010439187])

julia> @btime CellListMap.florpi(Float64)
  62.663 ms (58588 allocations: 26.50 MiB)
(true, [0.0007883474482653071, -0.00356623718786355, -0.0004074200882398694, 0.0003623877989466285, -0.001044133453861484])

Thus, it uses much less memory, as expected, but I don’t see any speedup. I did check that the type is properly propagated everywhere (so it is not that things are internally promoted to 64 bits by mistake).

In this example, there are only 3x Float64 at a time.
You should pad these to 4x Float64, which will increase the memory by 33%, but align all the memory accesses, which will increase performance.

With 4x Float64, CPUs with AVX(2)+ won’t benefit from switching to 4x Float32, but the Apple M1 (for example) probably will.

Of course, you can fit more Float32 in cache, but that doesn’t seem to be helping/relevant here based on timings.
The GC also wouldn’t have to trigger as often, but we’d need to see the mean/time distribution to get an idea if that matters.

2 Likes

Do you already have a solution for padding static arrays in general, without changing their behavior, including for dispatch?

Meaning, I have a lot of functions which rely on the size of the array, such as:

translation_image(x::SVector{N,T},unit_cell_matrix,indices::CartesianIndex{N}) where {N,T} =
    x + unit_cell_matrix*SVector{N,T}(ntuple(i -> indices[i],N))

to which I cannot simply pass coordinates of the wrong (padded) dimension with zeros in the extra components.

It may be worth the effort to rethink how to write these functions, but maybe there is already a workaround.

Maybe this works?

julia> using StaticArrays, LinearAlgebra

julia> struct Point{N,T} <: FieldVector{N,T}
           x::T
           y::T
           z::T
           w::T
       end

julia> x = Point{3,Float64}(1,1,1,10)
3-element Point{3, Float64} with indices SOneTo(3):
 1.0
 1.0
 1.0

julia> norm(x)
1.7320508075688772

Perhaps just tabulate this interaction and then, in the code, use 4-point Lagrange interpolation to evaluate it?
Typically we do this for the central part of the nucleon-nucleon interaction and the correlation part of the trial wave function.
For the L-J force, though, it seems everything can basically be calculated analytically.
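A minimal sketch of that idea (the grid choice and function names are illustrative, not from any of the codes discussed here): precompute the scalar force factor F(d)/d = 24Ξ΅(2Οƒ^12/d^14 - Οƒ^6/d^8) on a uniform grid in the squared distance d2, then evaluate it with a 4-point Lagrange stencil.

# Hypothetical sketch, not taken from CellListMap or NAMD.
function lj_force_table(Ξ΅, Οƒ, d2min, d2max, n)
    d2s = range(d2min, d2max; length=n)
    vals = [24Ξ΅*(2Οƒ^12/d2^7 - Οƒ^6/d2^4) for d2 in d2s]  # F(d)/d written in terms of d2 = d^2
    return d2s, vals
end

function lagrange4(d2s, vals, d2)
    h = step(d2s)
    i = clamp(floor(Int, (d2 - first(d2s))/h) + 1, 2, length(d2s) - 2)  # left-center node of the stencil
    t = (d2 - d2s[i])/h   # fractional position relative to node i
    # Lagrange weights for the equally spaced nodes i-1, i, i+1, i+2
    w1 = -t*(t - 1)*(t - 2)/6
    w2 =  (t + 1)*(t - 1)*(t - 2)/2
    w3 = -(t + 1)*t*(t - 2)/2
    w4 =  (t + 1)*t*(t - 1)/6
    return w1*vals[i-1] + w2*vals[i] + w3*vals[i+1] + w4*vals[i+2]
end

d2s, vals = lj_force_table(1.0, 1.0, 0.5, 16.0, 1000)
lagrange4(d2s, vals, 1.3)   # β‰ˆ the analytic force factor at d2 = 1.3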

1 Like