# Speeding up Ewald summation

Hi Julia!

I am writing a small molecular dynamics engine in Julia to teach myself the language and I noticed that 67% of the time spent running the following Ewald summation function (to calculate coulomb energy of an infinite crystal lattice in reciprocal space) is spent in garbage collection. Besides algorithmic improvements (neighbor lists, etc.), which can certainly be made here, what can I do to make it faster in terms of Julia tricks / good practices? Also, is there realistically any benefit from type annotations in code like here? I want my code to be as fast as possible of course, that’s why I like Julia.

For your information, the Topology type is a struct with some integers, floats, and arrays describing the molecular system. I also asked this question on Code Review on Stack Overflow, link here: performance - Ewald summation in Julia - Code Review Stack Exchange

Thank you! BTW I’m loving Julia so far. So clean and fast (mostly).

``````function getEwaldEnergy(top::Topology, xyz::Array{Float64})::Float64
# cubic box only for now
tolerance::Float64 = 1.0e-24
tryUntil::Int64 = 4  # 5 box sizes gets it to five decimal places
L::Float64 = top.box
rc::Float64 = L / 2.0

α::Float64 = 3.5 / (top.box / 2)
σ::Float64 = 1 / (sqrt(2) * α)

λ::Float64 = 1.0
ns::Array{Int64} = getCombinations(-tryUntil, tryUntil, 3)

# self-energy corrections
selfEnergy::Float64 = 0.0
for i::Int64 in 1:top.nAtoms
selfEnergy += (1 / (sqrt(2 * pi) * σ)) * (top.charges[i] * top.charges[i])
end

# short-range energy
shortRangeEnergy::Float64 = 0.0
shortRangeEnergyBefore::Float64 = 0.0
for l::Int64 in 1:size(ns)
shortRangeEnergyBefore = shortRangeEnergy
for i in 1:top.nAtoms
for j::Int64 in 1:top.nAtoms
ci::Float64 = top.charges[i]
cj::Float64 = top.charges[j]

idx::Array{Array{Int64}} = top.bondIdx
bonded::Bool = false

# check if i/j pair is bonded
for k::Int64 in 1:length(idx)
if (i == idx[k] && j == idx[k]) || (j == idx[k] && i == idx[k])
bonded = true
break
end
end

if bonded == false && !(i == j && ns[l, :] == [0.0, 0.0, 0.0])
r::Float64 = norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])
if r < rc - λ
shortRangeEnergy += 0.5 * ((ci * cj) / r) * erfc(r / (sqrt(2) * σ))
elseif r < rc && r > rc - λ
# switching function turns on
shortRangeEnergy += 0.5 * ((ci * cj) * (rc - r)^2 * (-2 * rc + 2 * r + 3 * λ) * erfc(r / (sqrt(2) * σ))) / (2 * r * λ^3)
end
end
end
end
end

# long-range energy
longRangeEnergy::Float64 = 0.0
longRangeEnergyBefore::Float64 = 0.0
volume::Float64 = L^3
ms = ns

longRangeEnergyBefore = copy(longRangeEnergy)
for l in 1:size(ms)  # no k = [0, 0, 0]
if ms[l, :] == [0.0, 0.0, 0.0]
continue
end

# define reciprocal space vectors
kVector::Array{Float64} = (2 * pi) * (ms[l, :] / L)
k::Float64 = norm(kVector)

# check this equal 1 good
# println(exp(-im * dot(kVector, L * [ms[l, 1], ms[l, 2], ms[l, 3]])))
strucFactor::ComplexF64 = 0.0
for i in 1:top.nAtoms
strucFactor += top.charges[i] * exp(im * dot(kVector, xyz[i, :]))
end

longRangeEnergy += ((4 * pi) / (volume)) * ((exp(-σ^2 * k^2 / 2) / k^2) * (abs(strucFactor)^2))
end

# this roughly cancels out long range energy due to bonded interactions (Tuckerman pg. 663)
bondEnergy::Float64 = 0.0
idx = top.bondIdx
for b::Int64 in 1:size(idx)
for i in 1:top.nAtoms
for j in 1:top.nAtoms
if ((idx[b] == i && idx[b] == j) || (idx[b] == j && idx[b] == i)) # && !(i in isdone)
# println("firing j: ", j)
ci::Float64 = top.charges[i]
cj::Float64 = top.charges[j]
# rx::Float64 = xyz[i, 1] - xyz[j, 1]
# ry::Float64 = xyz[i, 2] - xyz[j, 2]
# rz::Float64 = xyz[i, 3] - xyz[j, 3]
rij::Float64 = norm(xyz[i, :] - xyz[j, :])
bondEnergy += (ci * cj * erf(rij / (sqrt(2) * σ))) / rij
end
end
end
end

ewaldEnergy = shortRangeEnergy + longRangeEnergy - selfEnergy - bondEnergy

return ewaldEnergy
end
``````
2 Likes

style things:

1. your type annotations aren’t helping
2. `size(x,1)` is nicer than `size(x)`

for performance the biggest thing (I think) is that `x[i, :]` creates a copy, and if you want views (which you almost certainly do), you can put `@views` before the for loops.

5 Likes

Note that this makes quite a few arrays, not just the slices, but `L * ns[l, :]` is another, and `-`, and `+`. You can save a few by broadcasting these, `.+` `.*` etc, or save all of them by not making slices. Maybe just `sqrt(sum(abs2(xyz[i, k] - xyz[j, k] + L * ns[l, k]) for k in axes(xyz,2)))`.

4 Likes
• When you remove the type annotations be sure to replace

``````strucFactor::ComplexF64 = 0.0
``````

with

``````strucFactor = zero(ComplexF64)
``````

to avoid type instability.

• `exp(im * ...)` can be replaced with `cis(...)`

• `(abs(strucFactor)^2)` cab be replaced with `abs2(strucFactor)`

2 Likes

It is easier to store coordinates as vectors of static vectors (from StaticArrays.jl). That will make the code cleaner and solve allocations that occur in operations like `norm(x[i] - x[j])` where `x[i or j]` vectors of coordinates.

ps. Take a look at http://github.com/m3g/CellListMap.jl for short range interactions.

2 Likes

Am I guessing correctly that this is an `nAtoms` × 3 storing the xyz coordinates of atoms? Not only is this slice allocating a copy (slow), and is in the wrong order for locality (slow), but it’s also a lot faster to unroll such tiny loops over 3 components. All three of these problems are fixed by using StaticArrays.

That is, your `xyz` array should instead be a length-`nAtoms` 1d array whose elements are `SVector{3,Float64}`.

I wouldn’t be surprised if you gained an order of magnitude in speed by switch to an array of `SVector` here (and everywhere in your code — it’s really the right data structure for coordinate arrays).

4 Likes

I previously thought this was always the right answer, but Chris Elrod showed me how life is more complicated at times

``````n = 256
a = randn(3, n)
b = randn(n, 3)

function optimal(a::Array{T}) where T
s = zero(T)
for i in axes(a, 2)
s += a[1, i]^2 + a[2, i]^2 + a[3, i]^2
end
s
end

function less_optimal(b::Array{T}) where T
s = zero(T)
for i in axes(b, 1)
s += b[i, 1]^2 + b[i, 2]^2 + b[i, 3]^2
end
s
end

@btime optimal(\$a)      # 242.270 ns (0 allocations: 0 bytes)
@btime less_optimal(\$b) # 185.330 ns (0 allocations: 0 bytes)
``````

Increase `n` to make the arrays larger than cache and your statement holds, but for small `n` and the other side length being 3, the prefetcher can sometimes make the “less optimal” memory layout more favorable.

7 Likes

In MD simulations one generally needs to compute only the distances of particles that are close to each other, and these are essentially random pairs of triplets of coordinates of the vectors. Thus the only locality that is guaranteed is that of the three coordinates of each particle, and the 3xN (or static arrays) layout is better, summing all the distances is not a realistic application in this case.

1 Like

Thank you everyone! I will try your suggestions.

Ha, I am actually thinking of writing Ewald summation in julia myself. Perhaps a bit off-topic - is there any existing Ewald summation implementation in Julia already?

As for the coding style, the previously answers are already very good. The key is to avoid slicing inside the loop, since julia defaults to making copies. Even with views you will have some kind of memory allocation, which should be avoided inside nested loop.

I just want to add that writing out the operation explicitly, e.g. replacing:

``````norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])
``````

with

``````rij = 0.
for k=1:3
rij +=  (xyz[i, k] - xyz[j, k] + L * ns[l, k]) ^ 2
end
``````

should also recover the full speed of julia.
The speed should be equivalent to:

``````norm(xyz[i] - xyz[j] + L * ns[l])
``````

where `xyz` and `ns` are `Vector` of static vectors storing the coordinates (`SVector{3, Float64}`).
So probably worth switching to StaticArrays.jl for storing positions/lattice vectors as it make code more concise.

Another trick I found is that you can add `@time` prefix to the `for` loop, this allows the number of allocations to be printed. You should aim for zero allocations inside the `for` loops.

It’s not equivalent, because (a) the loops/arrays you wrote are still in the wrong order for locality, (b) they may not be unrolled the way `SVector` operations are, and (c) the `norm` function is more complicated than `sqrt(x^2 + y^2 + z^2)` because `norm` avoids spurious overflow/underflow.

1 Like

I don’t think so. If someone puts up a working version I will be happy to help with optimizations, etc.

We have one here DFTK.jl/ewald.jl at master · JuliaMolSim/DFTK.jl · GitHub although that code is quite old and not particularly optimized

2 Likes

do you mean that it should be norm(xyz[:, i] - xyz[:, j] + L * ns[:, l]) or something to take advantage of column-major ordering, which I think Julia uses? (here I am assuming xyz becomes transpose(xyz). How much of a speedup does that give in your experience?

What I meant by 'equivalent’ was the column major version

``````rij = 0.
for k=1:3
rij +=  (xyz[k, i] - xyz[k, j] + L * ns[k, l]) ^ 2
end
``````

vs

``````norm(xyz[i] - xyz[j] + L * ns[l])
``````

but thanks for @stevengj pointing out they are not actually equivalent.

So to test it:

``````julia> using LinearAlgebra

julia> using BenchmarkTools

julia> using StaticArrays

julia> x = rand(3, 10000);

julia> x1 = [SVector{3, Float64}(a) for a in eachcol(x)]

julia> function with_loop(x)
out = zeros(10000)
for i=1:10000, j=1:10000
tmp = 0.
for k=1:3
@inbounds tmp += (x[k, i] + x[k, j]) ^ 2
end
@inbounds out[i] += sqrt(tmp)
end
out
end

julia> function with_norm(x)
out = zeros(10000)
for i = 1:10000
for j = 1:10000
@inbounds out[i] += norm(x[i] + x[j])
end
end
out
end

julia> all(with_loop(x) .≈ with_norm(x1))
true
``````

then I got:

``````julia> @btime with_loop(x);
180.568 ms (2 allocations: 78.17 KiB)

julia> @btime with_norm(x1);
163.951 ms (2 allocations: 78.17 KiB)
``````

As you can see, the second version using `norm` is about 10% faster, and has fewer lines of code/less easy to go wrong.

(on julia-1.7.2 intel laptop 8750H)