Speeding up Ewald summation

Hi Julia!

I am writing a small molecular dynamics engine in Julia to teach myself the language and I noticed that 67% of the time spent running the following Ewald summation function (to calculate coulomb energy of an infinite crystal lattice in reciprocal space) is spent in garbage collection. Besides algorithmic improvements (neighbor lists, etc.), which can certainly be made here, what can I do to make it faster in terms of Julia tricks / good practices? Also, is there realistically any benefit from type annotations in code like here? I want my code to be as fast as possible of course, that’s why I like Julia.

For your information, the Topology type is a struct with some integers, floats, and arrays describing the molecular system. I also asked this question on Code Review on Stack Overflow, link here: performance - Ewald summation in Julia - Code Review Stack Exchange

Thank you! BTW I’m loving Julia so far. So clean and fast (mostly).

function getEwaldEnergy(top::Topology, xyz::Array{Float64})::Float64
    # cubic box only for now
    tolerance::Float64 = 1.0e-24
    tryUntil::Int64 = 4  # 5 box sizes gets it to five decimal places
    L::Float64 = top.box[1]
    rc::Float64 = L / 2.0

    α::Float64 = 3.5 / (top.box[1] / 2)
    σ::Float64 = 1 / (sqrt(2) * α)

    λ::Float64 = 1.0
    ns::Array{Int64} = getCombinations(-tryUntil, tryUntil, 3)

    # self-energy corrections
    selfEnergy::Float64 = 0.0
    for i::Int64 in 1:top.nAtoms
        selfEnergy += (1 / (sqrt(2 * pi) * σ)) * (top.charges[i] * top.charges[i])
    end

    # short-range energy
    shortRangeEnergy::Float64 = 0.0
    shortRangeEnergyBefore::Float64 = 0.0
    for l::Int64 in 1:size(ns)[1]
        shortRangeEnergyBefore = shortRangeEnergy
        for i in 1:top.nAtoms
            for j::Int64 in 1:top.nAtoms
                ci::Float64 = top.charges[i]
                cj::Float64 = top.charges[j]

                idx::Array{Array{Int64}} = top.bondIdx
                bonded::Bool = false

                # check if i/j pair is bonded
                for k::Int64 in 1:length(idx)
                    if (i == idx[k][1] && j == idx[k][2]) || (j == idx[k][1] && i == idx[k][2])
                        bonded = true
                        break
                    end
                end

                if bonded == false && !(i == j && ns[l, :] == [0.0, 0.0, 0.0])
                    r::Float64 = norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])
                    if r < rc - λ
                        shortRangeEnergy += 0.5 * ((ci * cj) / r) * erfc(r / (sqrt(2) * σ))
                    elseif r < rc && r > rc - λ
                        # switching function turns on
                        shortRangeEnergy += 0.5 * ((ci * cj) * (rc - r)^2 * (-2 * rc + 2 * r + 3 * λ) * erfc(r / (sqrt(2) * σ))) / (2 * r * λ^3)
                    end
                end
            end
        end
    end


    # long-range energy
    longRangeEnergy::Float64 = 0.0
    longRangeEnergyBefore::Float64 = 0.0
    volume::Float64 = L^3
    ms = ns

    longRangeEnergyBefore = copy(longRangeEnergy)
    for l in 1:size(ms)[1]  # no k = [0, 0, 0]
        if ms[l, :] == [0.0, 0.0, 0.0]
            continue
        end

        # define reciprocal space vectors
        kVector::Array{Float64} = (2 * pi) * (ms[l, :] / L)
        k::Float64 = norm(kVector)

        # check this equal 1 good
        # println(exp(-im * dot(kVector, L * [ms[l, 1], ms[l, 2], ms[l, 3]])))
        strucFactor::ComplexF64 = 0.0
        for i in 1:top.nAtoms
            strucFactor += top.charges[i] * exp(im * dot(kVector, xyz[i, :]))
        end

        longRangeEnergy += ((4 * pi) / (volume)) * ((exp(-σ^2 * k^2 / 2) / k^2) * (abs(strucFactor)^2))
    end


    # this roughly cancels out long range energy due to bonded interactions (Tuckerman pg. 663)
    bondEnergy::Float64 = 0.0
    idx = top.bondIdx
    for b::Int64 in 1:size(idx)[1]
        for i in 1:top.nAtoms
            for j in 1:top.nAtoms
                if ((idx[b][1] == i && idx[b][2] == j) || (idx[b][1] == j && idx[b][2] == i)) # && !(i in isdone)
                    # println("firing j: ", j)
                    ci::Float64 = top.charges[i]
                    cj::Float64 = top.charges[j]
                    # rx::Float64 = xyz[i, 1] - xyz[j, 1]
                    # ry::Float64 = xyz[i, 2] - xyz[j, 2]
                    # rz::Float64 = xyz[i, 3] - xyz[j, 3]
                    rij::Float64 = norm(xyz[i, :] - xyz[j, :])
                    bondEnergy += (ci * cj * erf(rij / (sqrt(2) * σ))) / rij
                end
            end
        end
    end

    ewaldEnergy = shortRangeEnergy + longRangeEnergy - selfEnergy - bondEnergy

    return ewaldEnergy
end
2 Likes

style things:

  1. your type annotations aren’t helping
  2. size(x,1) is nicer than size(x)[1]

for performance the biggest thing (I think) is that x[i, :] creates a copy, and if you want views (which you almost certainly do), you can put @views before the for loops.

5 Likes

Note that this makes quite a few arrays, not just the slices, but L * ns[l, :] is another, and -, and +. You can save a few by broadcasting these, .+ .* etc, or save all of them by not making slices. Maybe just sqrt(sum(abs2(xyz[i, k] - xyz[j, k] + L * ns[l, k]) for k in axes(xyz,2))).

4 Likes
  • When you remove the type annotations be sure to replace

    strucFactor::ComplexF64 = 0.0
    

    with

    strucFactor = zero(ComplexF64)
    

    to avoid type instability.

  • exp(im * ...) can be replaced with cis(...)

  • (abs(strucFactor)^2) cab be replaced with abs2(strucFactor)

2 Likes

It is easier to store coordinates as vectors of static vectors (from StaticArrays.jl). That will make the code cleaner and solve allocations that occur in operations like norm(x[i] - x[j]) where x[i or j] vectors of coordinates.

ps. Take a look at http://github.com/m3g/CellListMap.jl for short range interactions.

2 Likes

Am I guessing correctly that this is an nAtoms × 3 storing the xyz coordinates of atoms? Not only is this slice allocating a copy (slow), and is in the wrong order for locality (slow), but it’s also a lot faster to unroll such tiny loops over 3 components. All three of these problems are fixed by using StaticArrays.

That is, your xyz array should instead be a length-nAtoms 1d array whose elements are SVector{3,Float64}.

I wouldn’t be surprised if you gained an order of magnitude in speed by switch to an array of SVector here (and everywhere in your code — it’s really the right data structure for coordinate arrays).

4 Likes

I previously thought this was always the right answer, but Chris Elrod showed me how life is more complicated at times

n = 256
a = randn(3, n)
b = randn(n, 3)

function optimal(a::Array{T}) where T
    s = zero(T)
    for i in axes(a, 2)
        s += a[1, i]^2 + a[2, i]^2 + a[3, i]^2
    end
    s
end

function less_optimal(b::Array{T}) where T
    s = zero(T)
    for i in axes(b, 1)
        s += b[i, 1]^2 + b[i, 2]^2 + b[i, 3]^2
    end
    s
end

@btime optimal($a)      # 242.270 ns (0 allocations: 0 bytes)
@btime less_optimal($b) # 185.330 ns (0 allocations: 0 bytes)

Increase n to make the arrays larger than cache and your statement holds, but for small n and the other side length being 3, the prefetcher can sometimes make the “less optimal” memory layout more favorable.

7 Likes

In MD simulations one generally needs to compute only the distances of particles that are close to each other, and these are essentially random pairs of triplets of coordinates of the vectors. Thus the only locality that is guaranteed is that of the three coordinates of each particle, and the 3xN (or static arrays) layout is better, summing all the distances is not a realistic application in this case.

1 Like

Thank you everyone! I will try your suggestions.

Ha, I am actually thinking of writing Ewald summation in julia myself. Perhaps a bit off-topic - is there any existing Ewald summation implementation in Julia already?

As for the coding style, the previously answers are already very good. The key is to avoid slicing inside the loop, since julia defaults to making copies. Even with views you will have some kind of memory allocation, which should be avoided inside nested loop.

I just want to add that writing out the operation explicitly, e.g. replacing:

norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])

with

rij = 0.
for k=1:3
    rij +=  (xyz[i, k] - xyz[j, k] + L * ns[l, k]) ^ 2
end

should also recover the full speed of julia.
The speed should be equivalent to:

norm(xyz[i] - xyz[j] + L * ns[l])

where xyz and ns are Vector of static vectors storing the coordinates (SVector{3, Float64}).
So probably worth switching to StaticArrays.jl for storing positions/lattice vectors as it make code more concise.

Another trick I found is that you can add @time prefix to the for loop, this allows the number of allocations to be printed. You should aim for zero allocations inside the for loops.

It’s not equivalent, because (a) the loops/arrays you wrote are still in the wrong order for locality, (b) they may not be unrolled the way SVector operations are, and (c) the norm function is more complicated than sqrt(x^2 + y^2 + z^2) because norm avoids spurious overflow/underflow.

1 Like

I don’t think so. If someone puts up a working version I will be happy to help with optimizations, etc.

We have one here DFTK.jl/ewald.jl at master · JuliaMolSim/DFTK.jl · GitHub although that code is quite old and not particularly optimized

2 Likes

do you mean that it should be norm(xyz[:, i] - xyz[:, j] + L * ns[:, l]) or something to take advantage of column-major ordering, which I think Julia uses? (here I am assuming xyz becomes transpose(xyz). How much of a speedup does that give in your experience?

What I meant by 'equivalent’ was the column major version

rij = 0.
for k=1:3
    rij +=  (xyz[k, i] - xyz[k, j] + L * ns[k, l]) ^ 2
end

vs

norm(xyz[i] - xyz[j] + L * ns[l])

but thanks for @stevengj pointing out they are not actually equivalent.

So to test it:

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> using StaticArrays

julia> x = rand(3, 10000);

julia> x1 = [SVector{3, Float64}(a) for a in eachcol(x)]

julia> function with_loop(x)
          out = zeros(10000)
          for i=1:10000, j=1:10000
              tmp = 0.
              for k=1:3
                  @inbounds tmp += (x[k, i] + x[k, j]) ^ 2
              end
              @inbounds out[i] += sqrt(tmp)
          end
          out
       end

julia> function with_norm(x)
          out = zeros(10000)
          for i = 1:10000
              for j = 1:10000
              @inbounds out[i] += norm(x[i] + x[j])
              end
          end
          out
       end

julia> all(with_loop(x) .≈ with_norm(x1))
true

then I got:

julia> @btime with_loop(x);
  180.568 ms (2 allocations: 78.17 KiB)

julia> @btime with_norm(x1);
  163.951 ms (2 allocations: 78.17 KiB)

As you can see, the second version using norm is about 10% faster, and has fewer lines of code/less easy to go wrong.

(on julia-1.7.2 intel laptop 8750H)