Speeding up Ewald summation

dkonstan · May 10, 2022, 4:48pm

Hi Julia!

I am writing a small molecular dynamics engine in Julia to teach myself the language and I noticed that 67% of the time spent running the following Ewald summation function (to calculate coulomb energy of an infinite crystal lattice in reciprocal space) is spent in garbage collection. Besides algorithmic improvements (neighbor lists, etc.), which can certainly be made here, what can I do to make it faster in terms of Julia tricks / good practices? Also, is there realistically any benefit from type annotations in code like here? I want my code to be as fast as possible of course, that’s why I like Julia.

For your information, the Topology type is a struct with some integers, floats, and arrays describing the molecular system. I also asked this question on Code Review on Stack Overflow, link here: performance - Ewald summation in Julia - Code Review Stack Exchange

Thank you! BTW I’m loving Julia so far. So clean and fast (mostly).

function getEwaldEnergy(top::Topology, xyz::Array{Float64})::Float64
    # cubic box only for now
    tolerance::Float64 = 1.0e-24
    tryUntil::Int64 = 4  # 5 box sizes gets it to five decimal places
    L::Float64 = top.box[1]
    rc::Float64 = L / 2.0

    α::Float64 = 3.5 / (top.box[1] / 2)
    σ::Float64 = 1 / (sqrt(2) * α)

    λ::Float64 = 1.0
    ns::Array{Int64} = getCombinations(-tryUntil, tryUntil, 3)

    # self-energy corrections
    selfEnergy::Float64 = 0.0
    for i::Int64 in 1:top.nAtoms
        selfEnergy += (1 / (sqrt(2 * pi) * σ)) * (top.charges[i] * top.charges[i])
    end

    # short-range energy
    shortRangeEnergy::Float64 = 0.0
    shortRangeEnergyBefore::Float64 = 0.0
    for l::Int64 in 1:size(ns)[1]
        shortRangeEnergyBefore = shortRangeEnergy
        for i in 1:top.nAtoms
            for j::Int64 in 1:top.nAtoms
                ci::Float64 = top.charges[i]
                cj::Float64 = top.charges[j]

                idx::Array{Array{Int64}} = top.bondIdx
                bonded::Bool = false

                # check if i/j pair is bonded
                for k::Int64 in 1:length(idx)
                    if (i == idx[k][1] && j == idx[k][2]) || (j == idx[k][1] && i == idx[k][2])
                        bonded = true
                        break
                    end
                end

                if bonded == false && !(i == j && ns[l, :] == [0.0, 0.0, 0.0])
                    r::Float64 = norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])
                    if r < rc - λ
                        shortRangeEnergy += 0.5 * ((ci * cj) / r) * erfc(r / (sqrt(2) * σ))
                    elseif r < rc && r > rc - λ
                        # switching function turns on
                        shortRangeEnergy += 0.5 * ((ci * cj) * (rc - r)^2 * (-2 * rc + 2 * r + 3 * λ) * erfc(r / (sqrt(2) * σ))) / (2 * r * λ^3)
                    end
                end
            end
        end
    end


    # long-range energy
    longRangeEnergy::Float64 = 0.0
    longRangeEnergyBefore::Float64 = 0.0
    volume::Float64 = L^3
    ms = ns

    longRangeEnergyBefore = copy(longRangeEnergy)
    for l in 1:size(ms)[1]  # no k = [0, 0, 0]
        if ms[l, :] == [0.0, 0.0, 0.0]
            continue
        end

        # define reciprocal space vectors
        kVector::Array{Float64} = (2 * pi) * (ms[l, :] / L)
        k::Float64 = norm(kVector)

        # check this equal 1 good
        # println(exp(-im * dot(kVector, L * [ms[l, 1], ms[l, 2], ms[l, 3]])))
        strucFactor::ComplexF64 = 0.0
        for i in 1:top.nAtoms
            strucFactor += top.charges[i] * exp(im * dot(kVector, xyz[i, :]))
        end

        longRangeEnergy += ((4 * pi) / (volume)) * ((exp(-σ^2 * k^2 / 2) / k^2) * (abs(strucFactor)^2))
    end


    # this roughly cancels out long range energy due to bonded interactions (Tuckerman pg. 663)
    bondEnergy::Float64 = 0.0
    idx = top.bondIdx
    for b::Int64 in 1:size(idx)[1]
        for i in 1:top.nAtoms
            for j in 1:top.nAtoms
                if ((idx[b][1] == i && idx[b][2] == j) || (idx[b][1] == j && idx[b][2] == i)) # && !(i in isdone)
                    # println("firing j: ", j)
                    ci::Float64 = top.charges[i]
                    cj::Float64 = top.charges[j]
                    # rx::Float64 = xyz[i, 1] - xyz[j, 1]
                    # ry::Float64 = xyz[i, 2] - xyz[j, 2]
                    # rz::Float64 = xyz[i, 3] - xyz[j, 3]
                    rij::Float64 = norm(xyz[i, :] - xyz[j, :])
                    bondEnergy += (ci * cj * erf(rij / (sqrt(2) * σ))) / rij
                end
            end
        end
    end

    ewaldEnergy = shortRangeEnergy + longRangeEnergy - selfEnergy - bondEnergy

    return ewaldEnergy
end

Oscar_Smith · May 10, 2022, 4:58pm

style things:

your type annotations aren’t helping
size(x,1) is nicer than size(x)[1]

for performance the biggest thing (I think) is that x[i, :] creates a copy, and if you want views (which you almost certainly do), you can put @views before the for loops.

mcabbott · May 10, 2022, 5:14pm

Note that this makes quite a few arrays, not just the slices, but L * ns[l, :] is another, and -, and +. You can save a few by broadcasting these, .+ .* etc, or save all of them by not making slices. Maybe just sqrt(sum(abs2(xyz[i, k] - xyz[j, k] + L * ns[l, k]) for k in axes(xyz,2))).

PeterSimon · May 10, 2022, 6:19pm

When you remove the type annotations be sure to replace
```
strucFactor::ComplexF64 = 0.0
```
with
```
strucFactor = zero(ComplexF64)
```
to avoid type instability.
exp(im * ...) can be replaced with cis(...)
(abs(strucFactor)^2) cab be replaced with abs2(strucFactor)

lmiq · May 10, 2022, 6:32pm

It is easier to store coordinates as vectors of static vectors (from StaticArrays.jl). That will make the code cleaner and solve allocations that occur in operations like norm(x[i] - x[j]) where x[i or j] vectors of coordinates.

ps. Take a look at http://github.com/m3g/CellListMap.jl for short range interactions.

stevengj · May 10, 2022, 6:36pm

Am I guessing correctly that this is an nAtoms × 3 storing the xyz coordinates of atoms? Not only is this slice allocating a copy (slow), and is in the wrong order for locality (slow), but it’s also a lot faster to unroll such tiny loops over 3 components. All three of these problems are fixed by using StaticArrays.

That is, your xyz array should instead be a length-nAtoms 1d array whose elements are SVector{3,Float64}.

I wouldn’t be surprised if you gained an order of magnitude in speed by switch to an array of SVector here (and everywhere in your code — it’s really the right data structure for coordinate arrays).

baggepinnen · May 10, 2022, 6:47pm

I previously thought this was always the right answer, but Chris Elrod showed me how life is more complicated at times

n = 256
a = randn(3, n)
b = randn(n, 3)

function optimal(a::Array{T}) where T
    s = zero(T)
    for i in axes(a, 2)
        s += a[1, i]^2 + a[2, i]^2 + a[3, i]^2
    end
    s
end

function less_optimal(b::Array{T}) where T
    s = zero(T)
    for i in axes(b, 1)
        s += b[i, 1]^2 + b[i, 2]^2 + b[i, 3]^2
    end
    s
end

@btime optimal($a)      # 242.270 ns (0 allocations: 0 bytes)
@btime less_optimal($b) # 185.330 ns (0 allocations: 0 bytes)

Increase n to make the arrays larger than cache and your statement holds, but for small n and the other side length being 3, the prefetcher can sometimes make the “less optimal” memory layout more favorable.

lmiq · May 10, 2022, 7:42pm

In MD simulations one generally needs to compute only the distances of particles that are close to each other, and these are essentially random pairs of triplets of coordinates of the vectors. Thus the only locality that is guaranteed is that of the three coordinates of each particle, and the 3xN (or static arrays) layout is better, summing all the distances is not a realistic application in this case.

dkonstan · May 11, 2022, 12:04pm

Thank you everyone! I will try your suggestions.

zhubonan · May 24, 2022, 11:05am

Ha, I am actually thinking of writing Ewald summation in julia myself. Perhaps a bit off-topic - is there any existing Ewald summation implementation in Julia already?

As for the coding style, the previously answers are already very good. The key is to avoid slicing inside the loop, since julia defaults to making copies. Even with views you will have some kind of memory allocation, which should be avoided inside nested loop.

I just want to add that writing out the operation explicitly, e.g. replacing:

norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])

with

rij = 0.
for k=1:3
    rij +=  (xyz[i, k] - xyz[j, k] + L * ns[l, k]) ^ 2
end

should also recover the full speed of julia.
The speed should be equivalent to:

norm(xyz[i] - xyz[j] + L * ns[l])

where xyz and ns are Vector of static vectors storing the coordinates (SVector{3, Float64}).
So probably worth switching to StaticArrays.jl for storing positions/lattice vectors as it make code more concise.

Another trick I found is that you can add @time prefix to the for loop, this allows the number of allocations to be printed. You should aim for zero allocations inside the for loops.

stevengj · May 24, 2022, 11:19am

It’s not equivalent, because (a) the loops/arrays you wrote are still in the wrong order for locality, (b) they may not be unrolled the way SVector operations are, and (c) the norm function is more complicated than sqrt(x^2 + y^2 + z^2) because norm avoids spurious overflow/underflow.

lmiq · May 24, 2022, 12:18pm

I don’t think so. If someone puts up a working version I will be happy to help with optimizations, etc.

antoine-levitt · May 24, 2022, 12:29pm

We have one here DFTK.jl/ewald.jl at master · JuliaMolSim/DFTK.jl · GitHub although that code is quite old and not particularly optimized

dkonstan · May 24, 2022, 4:10pm

do you mean that it should be norm(xyz[:, i] - xyz[:, j] + L * ns[:, l]) or something to take advantage of column-major ordering, which I think Julia uses? (here I am assuming xyz becomes transpose(xyz). How much of a speedup does that give in your experience?

zhubonan · May 25, 2022, 5:03pm

What I meant by 'equivalent’ was the column major version

rij = 0.
for k=1:3
    rij +=  (xyz[k, i] - xyz[k, j] + L * ns[k, l]) ^ 2
end

vs

norm(xyz[i] - xyz[j] + L * ns[l])

but thanks for @stevengj pointing out they are not actually equivalent.

So to test it:

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> using StaticArrays

julia> x = rand(3, 10000);

julia> x1 = [SVector{3, Float64}(a) for a in eachcol(x)]

julia> function with_loop(x)
          out = zeros(10000)
          for i=1:10000, j=1:10000
              tmp = 0.
              for k=1:3
                  @inbounds tmp += (x[k, i] + x[k, j]) ^ 2
              end
              @inbounds out[i] += sqrt(tmp)
          end
          out
       end

julia> function with_norm(x)
          out = zeros(10000)
          for i = 1:10000
              for j = 1:10000
              @inbounds out[i] += norm(x[i] + x[j])
              end
          end
          out
       end

julia> all(with_loop(x) .≈ with_norm(x1))
true

then I got:

julia> @btime with_loop(x);
  180.568 ms (2 allocations: 78.17 KiB)

julia> @btime with_norm(x1);
  163.951 ms (2 allocations: 78.17 KiB)

As you can see, the second version using norm is about 10% faster, and has fewer lines of code/less easy to go wrong.

(on julia-1.7.2 intel laptop 8750H)

Topic		Replies	Views
Why I love Julia Community announcement	34	4760	October 11, 2019
Speed up very elemental functions Performance	6	334	November 16, 2022
Compare julia sum to a cpp implementation - julia is extremely slow?! Performance question	35	1760	October 7, 2019
Assistance in speeding up Array operations and summations New to Julia performance , arrays , sum	12	534	August 23, 2023
How to improve performance of sum() Performance question , fast-math , tullio , loopvectorization , sum	19	4742	July 19, 2021

Speeding up Ewald summation

Related topics