Speeding up Ewald summation

Hi Julia!

I am writing a small molecular dynamics engine in Julia to teach myself the language and I noticed that 67% of the time spent running the following Ewald summation function (to calculate coulomb energy of an infinite crystal lattice in reciprocal space) is spent in garbage collection. Besides algorithmic improvements (neighbor lists, etc.), which can certainly be made here, what can I do to make it faster in terms of Julia tricks / good practices? Also, is there realistically any benefit from type annotations in code like here? I want my code to be as fast as possible of course, that’s why I like Julia.

For your information, the Topology type is a struct with some integers, floats, and arrays describing the molecular system. I also asked this question on Code Review on Stack Overflow, link here: performance - Ewald summation in Julia - Code Review Stack Exchange

Thank you! BTW I’m loving Julia so far. So clean and fast (mostly).

function getEwaldEnergy(top::Topology, xyz::Array{Float64})::Float64
    # cubic box only for now
    tolerance::Float64 = 1.0e-24
    tryUntil::Int64 = 4  # 5 box sizes gets it to five decimal places
    L::Float64 = top.box[1]
    rc::Float64 = L / 2.0

    α::Float64 = 3.5 / (top.box[1] / 2)
    σ::Float64 = 1 / (sqrt(2) * α)

    λ::Float64 = 1.0
    ns::Array{Int64} = getCombinations(-tryUntil, tryUntil, 3)

    # self-energy corrections
    selfEnergy::Float64 = 0.0
    for i::Int64 in 1:top.nAtoms
        selfEnergy += (1 / (sqrt(2 * pi) * σ)) * (top.charges[i] * top.charges[i])
    end

    # short-range energy
    shortRangeEnergy::Float64 = 0.0
    shortRangeEnergyBefore::Float64 = 0.0
    for l::Int64 in 1:size(ns)[1]
        shortRangeEnergyBefore = shortRangeEnergy
        for i in 1:top.nAtoms
            for j::Int64 in 1:top.nAtoms
                ci::Float64 = top.charges[i]
                cj::Float64 = top.charges[j]

                idx::Array{Array{Int64}} = top.bondIdx
                bonded::Bool = false

                # check if i/j pair is bonded
                for k::Int64 in 1:length(idx)
                    if (i == idx[k][1] && j == idx[k][2]) || (j == idx[k][1] && i == idx[k][2])
                        bonded = true
                        break
                    end
                end

                if bonded == false && !(i == j && ns[l, :] == [0.0, 0.0, 0.0])
                    r::Float64 = norm(xyz[i, :] - xyz[j, :] + L * ns[l, :])
                    if r < rc - λ
                        shortRangeEnergy += 0.5 * ((ci * cj) / r) * erfc(r / (sqrt(2) * σ))
                    elseif r < rc && r > rc - λ
                        # switching function turns on
                        shortRangeEnergy += 0.5 * ((ci * cj) * (rc - r)^2 * (-2 * rc + 2 * r + 3 * λ) * erfc(r / (sqrt(2) * σ))) / (2 * r * λ^3)
                    end
                end
            end
        end
    end


    # long-range energy
    longRangeEnergy::Float64 = 0.0
    longRangeEnergyBefore::Float64 = 0.0
    volume::Float64 = L^3
    ms = ns

    longRangeEnergyBefore = copy(longRangeEnergy)
    for l in 1:size(ms)[1]  # no k = [0, 0, 0]
        if ms[l, :] == [0.0, 0.0, 0.0]
            continue
        end

        # define reciprocal space vectors
        kVector::Array{Float64} = (2 * pi) * (ms[l, :] / L)
        k::Float64 = norm(kVector)

        # check this equal 1 good
        # println(exp(-im * dot(kVector, L * [ms[l, 1], ms[l, 2], ms[l, 3]])))
        strucFactor::ComplexF64 = 0.0
        for i in 1:top.nAtoms
            strucFactor += top.charges[i] * exp(im * dot(kVector, xyz[i, :]))
        end

        longRangeEnergy += ((4 * pi) / (volume)) * ((exp(-σ^2 * k^2 / 2) / k^2) * (abs(strucFactor)^2))
    end


    # this roughly cancels out long range energy due to bonded interactions (Tuckerman pg. 663)
    bondEnergy::Float64 = 0.0
    idx = top.bondIdx
    for b::Int64 in 1:size(idx)[1]
        for i in 1:top.nAtoms
            for j in 1:top.nAtoms
                if ((idx[b][1] == i && idx[b][2] == j) || (idx[b][1] == j && idx[b][2] == i)) # && !(i in isdone)
                    # println("firing j: ", j)
                    ci::Float64 = top.charges[i]
                    cj::Float64 = top.charges[j]
                    # rx::Float64 = xyz[i, 1] - xyz[j, 1]
                    # ry::Float64 = xyz[i, 2] - xyz[j, 2]
                    # rz::Float64 = xyz[i, 3] - xyz[j, 3]
                    rij::Float64 = norm(xyz[i, :] - xyz[j, :])
                    bondEnergy += (ci * cj * erf(rij / (sqrt(2) * σ))) / rij
                end
            end
        end
    end

    ewaldEnergy = shortRangeEnergy + longRangeEnergy - selfEnergy - bondEnergy

    return ewaldEnergy
end
2 Likes

style things:

  1. your type annotations aren’t helping
  2. size(x,1) is nicer than size(x)[1]

for performance the biggest thing (I think) is that x[i, :] creates a copy, and if you want views (which you almost certainly do), you can put @views before the for loops.

5 Likes

Note that this makes quite a few arrays, not just the slices, but L * ns[l, :] is another, and -, and +. You can save a few by broadcasting these, .+ .* etc, or save all of them by not making slices. Maybe just sqrt(sum(abs2(xyz[i, k] - xyz[j, k] + L * ns[l, k]) for k in axes(xyz,2))).

3 Likes
  • When you remove the type annotations be sure to replace

    strucFactor::ComplexF64 = 0.0
    

    with

    strucFactor = zero(ComplexF64)
    

    to avoid type instability.

  • exp(im * ...) can be replaced with cis(...)

  • (abs(strucFactor)^2) cab be replaced with abs2(strucFactor)

1 Like

It is easier to store coordinates as vectors of static vectors (from StaticArrays.jl). That will make the code cleaner and solve allocations that occur in operations like norm(x[i] - x[j]) where x[i or j] vectors of coordinates.

ps. Take a look at http://github.com/m3g/CellListMap.jl for short range interactions.

1 Like

Am I guessing correctly that this is an nAtoms × 3 storing the xyz coordinates of atoms? Not only is this slice allocating a copy (slow), and is in the wrong order for locality (slow), but it’s also a lot faster to unroll such tiny loops over 3 components. All three of these problems are fixed by using StaticArrays.

That is, your xyz array should instead be a length-nAtoms 1d array whose elements are SVector{3,Float64}.

I wouldn’t be surprised if you gained an order of magnitude in speed by switch to an array of SVector here (and everywhere in your code — it’s really the right data structure for coordinate arrays).

4 Likes

I previously thought this was always the right answer, but Chris Elrod showed me how life is more complicated at times

n = 256
a = randn(3, n)
b = randn(n, 3)

function optimal(a::Array{T}) where T
    s = zero(T)
    for i in axes(a, 2)
        s += a[1, i]^2 + a[2, i]^2 + a[3, i]^2
    end
    s
end

function less_optimal(b::Array{T}) where T
    s = zero(T)
    for i in axes(b, 1)
        s += b[i, 1]^2 + b[i, 2]^2 + b[i, 3]^2
    end
    s
end

@btime optimal($a)      # 242.270 ns (0 allocations: 0 bytes)
@btime less_optimal($b) # 185.330 ns (0 allocations: 0 bytes)

Increase n to make the arrays larger than cache and your statement holds, but for small n and the other side length being 3, the prefetcher can sometimes make the “less optimal” memory layout more favorable.

5 Likes

In MD simulations one generally needs to compute only the distances of particles that are close to each other, and these are essentially random pairs of triplets of coordinates of the vectors. Thus the only locality that is guaranteed is that of the three coordinates of each particle, and the 3xN (or static arrays) layout is better, summing all the distances is not a realistic application in this case.

1 Like

Thank you everyone! I will try your suggestions.