Can I make my code faster with parallelism, or just plain better coding?

MrRobot · February 9, 2019, 7:57pm

I am trying to optimize a lennard-jones force calculation. When I first wrote it, It used millions of allocations and over 1 giB of memory, and these both increased if I looped over the algorithm. I have gotten it down to 2 allocations and 224 bytes of memory. The biggest factor was wrapping everything (functions and variable initialization) in a big function. However, it is not fast enough yet. I need another order of magnitude or else the code will be too long to execute. It will take weeks to do what Fortran or C can do in a day.

I tried adding @threads and using 4 threads but it speeds it up by less than 10% and the results are highly variable. I do not know if I am using it properly, and I am not knowledgeable enough on the other parallel options.

The code below is actually in two parts since I am also experimenting with Array of Structs vs. Struct of Arrays. They both seem about the same speed… I have labelled all parts clearly. I leave both in (sorry it makes the code seem long) since one method may be preferable.

Any advice on how to either parallelize it, or write the code better would be appreciated! An order of magnitude speedup would be amazing…

Each test has 2 functions, the main begins with TotalEnergy and the inner begins with Single. These are the only two functions that can be played with, everything else just sets up the necessary structs and default values.

The first 95 lines can be skipped, sorry for the length, but they set it up… The issue is only the functions that start with either Single or TotalEnergy/

i.e.,

function SingleLJ2!(f1::Vector{Float64},f2::Vector{Float64},
        coord1::XYZ,coord2::XYZ, ϵ::Float64, σ::Float64, box_size)

    dx = vector1D(coord1.x, coord2.x, box_size[1])
    dy = vector1D(coord1.y, coord2.y, box_size[2])
    dz = vector1D(coord1.z, coord2.z, box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz

    if  rij_sq > 1.0
        fill!(f1,0.0)
        fill!(f2,0.0)
        return
    else
        sr2 = σ^2 / rij_sq
        rmag = sqrt(rij_sq)
        sr6 = sr2 ^ 3
        sr12 = sr6 ^ 2
        f = ((24 * ϵ) / (rij_sq)) * (2 * sr12 - sr6)
        f1[1] = -f * dx
        f1[2] = -f * dy
        f1[3] = -f * dz
        f2 .= .-f1
    end
    nothing
end
function TotalEnergyStructOfArrays(array::StructOfArrays,vdwTable::Table,
                        n::Int64,box_size::Vector{Float64})

    f1 = Vector{Float64}(undef,3)
    f2 = Vector{Float64}(undef,3)
    for k=1:1 # this loop is redundant now, it used to increase memory and allocations
    @threads for i = 1:(n-1)
        for j = (i+1):n
            ti = array.type[i]
            tj = array.type[j]
            SingleLJ2!(f1,f2,array.r[i],array.r[i],vdwTable.ϵij[ti,tj],
                    vdwTable.σij[ti,tj],box_size)
            array.f[i].x += f1[1]
            array.f[i].y += f1[2]
            array.f[i].z += f1[3]
            array.f[j].x += f2[1]
            array.f[j].y += f2[2]
            array.f[j].z += f2[3]
        end
    end
    end
    return
end

Here is the full code for if anyone wished to run it, but the above segment of code is all that I am interested in optimizing (and the equivalent Array of Structures version) and also parallelizing.

#############################################################################
#
#    A general struct (Table) and two functions vector1D and vector.
#        - vector1D and vector are taken from Molly
#     https://github.com/jgreener64/Molly.jl/blob/master/src/md.jl
################################################################################
using Distributed
using Base.Threads
struct Table  # given dummy values below
    ϵij::Array{Float64,2}
    σij::Array{Float64,2}
end
mutable struct XYZ
    x::Float64
    y::Float64
    z::Float64
end
mutable struct ArrayOfStructs
    r::XYZ
    v::XYZ
    f::XYZ
    type::Int64
    cgnr::Int64
    name::String
    mass::Float64
    qq::Float64
    mol::String
end

mutable struct StructOfArrays
    r::Vector{XYZ}
    v::Vector{XYZ}
    f::Vector{XYZ}
    type::Vector{Int64}
    cgnr::Vector{Int64}
    name::Vector{String}
    mass::Vector{Float64}
    qq::Vector{Float64}
    mol::Vector{String}
end


"Vector between two coordinate values, accounting for mirror image seperation"
function vector1D(c1::Float64, c2::Float64, box_size::Float64)
    if c1 < c2
        return (c2 - c1) < (c1 - c2 + box_size) ? (c2 - c1) : (c2 - c1 - box_size)
    else
        return (c1 - c2) < (c2 - c1 + box_size) ? (c2 - c1) : (c2 - c1 + box_size)
    end
end

vector(coords_one::XYZ, coords_two::XYZ, box_size::Float64) =
         [vector1D(coords_one.x, coords_two.x, box_size),
        vector1D(coords_one.y, coords_two.y, box_size),
        vector1D(coords_one.z, coords_two.z, box_size)]

################################################################################
#                  Some general parameters used by all trials
################################################################################
function main()
box_size = [2.0,2.0,2.0]
x = 3011 # dummy number of atoms
m = 4    # dummy number of atom types (for table of parameters...)
println("This program is using ", Threads.nthreads(), " threads")
vdwTable = Table([rand() for i=1:m,j=1:m],
            [rand() for i=1:m,j=1:m]) #,
################################################################################
#                   Start of trials
################################################################################

arrayOfStructs = Vector{ArrayOfStructs}(undef,x)

for i=1:x
    arrayOfStructs[i] = ArrayOfStructs( XYZ(box_size[1] * rand(), box_size[1] * rand(), box_size[1] * rand() ),
                                        XYZ(0.0, 0.0, 0.0),
                                        XYZ(0.0, 0.0, 0.0),
                                        ceil(rand()*m),15,"hi",14.1,1.2,"mol" )
end

structOfArrays = StructOfArrays([XYZ(box_size[1] * rand(), box_size[1] * rand(), box_size[1] * rand()) for i=1:x],
                [XYZ(0.0, 0.0, 0.0) for i=1:x ],
                [XYZ(0.0, 0.0, 0.0) for i=1:x ],
                [ceil(rand())*m for i=1:x],
                [12 for i=1:x],
                ["hi" for i=1:x],
                [rand()*5.1 for i=1:x],
                [rand()*1.3 for i=1:x],
                ["string" for i=1:x])

################################################################################
#
#         Testing Array of Structs
#
################################################################################
function SingleLJ!(f1::Vector{Float64},f2::Vector{Float64},
        coord1::XYZ,coord2::XYZ, ϵ::Float64, σ::Float64, box_size)

    # method 1 # 9.99 M allocations: 685.959 MiB
        #diff = Vector{Float64}(undef,3)
        #diff[1] = vector1D(coord1.x, coord2.x, box_size[1])
        #diff[2] = vector1D(coord1.y, coord2.y, box_size[2])
        #diff[3] = vector1D(coord1.z, coord2.z, box_size[3])
    # method 2 # 9.99 M allocations: 685.959 MiB

    #    diff[1] = coord1.x - coord2.x
    #    diff[2] = coord1.y - coord2.y
    #    diff[3] = coord1.z - coord2.z
    #    diff .= diff .- round.(diff./box_size) .* box_size    # technically a bit more rigorous than method 1 and 3 # mirror image seperation
    #method 3   29.97 M allocations: 1.414 GiB
        #diff = vector(coord1, coord2, box_size[1]) # THis is very heavy on allocations

    #rij_sq = sum(diff.^2) # use with methods 1,2,3
            
    # method 4 2 allocations: 224 bytes
    dx = vector1D(coord1.x, coord2.x, box_size[1])
    dy = vector1D(coord1.y, coord2.y, box_size[2])
    dz = vector1D(coord1.z, coord2.z, box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz  # use only with method 4

    if  rij_sq > 1.0
        fill!(f1,0.0)
        fill!(f2,0.0)
        return
    else
        sr2 = σ^2 / rij_sq
        rmag = sqrt(rij_sq)
        sr6 = sr2 ^ 3
        sr12 = sr6 ^ 2
        f = ((24 * ϵ) / (rij_sq)) * (2 * sr12 - sr6)
        #f1[1] = -f * diff[1]
        #f1[2] = -f * diff[2]
        #f1[3] = -f * diff[3]
        #f2 .= .-f1

        f1[1] = -f * dx
        f1[2] = -f * dy
        f1[3] = -f * dz
        f2 .= .-f1
    end
    nothing
end
function TotalEnergyArrayOfStructs(array::Vector{ArrayOfStructs},vdwTable::Table,
                        n::Int64,box_size::Vector{Float64})

    f1 = Vector{Float64}(undef,3)
    f2 = Vector{Float64}(undef,3)
    for k=1:1 # this loop is redundant... I do not understand why allocations increase linearly with this loop...
    @threads for i = 1:(n-1)
        for j = (i+1):n
            ti = array[i].type
            tj = array[j].type
            SingleLJ!(f1,f2,array[i].r,array[j].r,vdwTable.ϵij[ti,tj],
                    vdwTable.σij[ti,tj],box_size)
            array[i].f.x += f1[1]
            array[i].f.y += f1[2]
            array[i].f.z += f1[3]
            array[j].f.x += f2[1]
            array[j].f.y += f2[2]
            array[j].f.z += f2[3]
        end
    end
    end
    return
end
TotalEnergyArrayOfStructs(arrayOfStructs,vdwTable,x,box_size)  # precompile
println("------------------------Start of Testing AoS------------------------------")
print("array of structs: ")
@time TotalEnergyArrayOfStructs(arrayOfStructs,vdwTable,x,box_size)
println("--------------------------End of Testing AoS------------------------------")

################################################################################
#
#                           Testing Struct of Arrays
#
################################################################################

function SingleLJ2!(f1::Vector{Float64},f2::Vector{Float64},
        coord1::XYZ,coord2::XYZ, ϵ::Float64, σ::Float64, box_size)

    dx = vector1D(coord1.x, coord2.x, box_size[1])
    dy = vector1D(coord1.y, coord2.y, box_size[2])
    dz = vector1D(coord1.z, coord2.z, box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz

    if  rij_sq > 1.0
        fill!(f1,0.0)
        fill!(f2,0.0)
        return
    else
        sr2 = σ^2 / rij_sq
        rmag = sqrt(rij_sq)
        sr6 = sr2 ^ 3
        sr12 = sr6 ^ 2
        f = ((24 * ϵ) / (rij_sq)) * (2 * sr12 - sr6)
        f1[1] = -f * dx
        f1[2] = -f * dy
        f1[3] = -f * dz
        f2 .= .-f1
    end
    nothing
end
function TotalEnergyStructOfArrays(array::StructOfArrays,vdwTable::Table,
                        n::Int64,box_size::Vector{Float64})

    f1 = Vector{Float64}(undef,3)
    f2 = Vector{Float64}(undef,3)
    for k=1:1 # this loop is redundant now, it used to increase memory and allocations
    @threads for i = 1:(n-1)
        for j = (i+1):n
            ti = array.type[i]
            tj = array.type[j]
            SingleLJ2!(f1,f2,array.r[i],array.r[i],vdwTable.ϵij[ti,tj],
                    vdwTable.σij[ti,tj],box_size)
            array.f[i].x += f1[1]
            array.f[i].y += f1[2]
            array.f[i].z += f1[3]
            array.f[j].x += f2[1]
            array.f[j].y += f2[2]
            array.f[j].z += f2[3]
        end
    end
    end
    return
end

TotalEnergyStructOfArrays(structOfArrays,vdwTable,x,box_size)  # precompile
println("------------------------Start of Testing SoA------------------------------")
print("Struct of Arrays: ")
@time TotalEnergyStructOfArrays(structOfArrays,vdwTable,x,box_size)
println("--------------------------End of Testing SoA------------------------------")

end

main()

mauro3 · February 9, 2019, 8:30pm

That is a lot of code you expect someone to review. You might get lucky but if not, try to focus your questions to much simpler things.

Have you read the performance tips section of the manual?
are the Fortran codes you’re comparing to threaded as well? I would expect that you can get Julia to perform within a factor of 2, probably less. But to be able to judge you need to compare single threaded to single threaded.

antoine-levitt · February 9, 2019, 9:03pm

Here are a few scattershot things to try. Try GitHub - JuliaArrays/StaticArrays.jl: Statically sized arrays for Julia for your size-3 arrays. Use @code_warntype to make sure there are no funny business going on. Profile to see whether the bottleneck is in the actual computation or if it is just moving memory around. Avoid mutable structs.

edit: more advanced micro-optimizations: x*x*x might be faster than x^3 (x^3 is not allowed to round intermediate results, while x*x*x is). @inbounds might help. Saving floating point divisions (which are pretty expensive) by computing things like 1/rij_sq might be faster.

MrRobot · February 9, 2019, 9:23pm

it is alot… However, there are only 2 functions that can be editted, and they are each around 15-20 lines, so not big… there is unfortunately a lot of stuff not part of the problem in the first 95 lines of the code.

MrRobot · February 9, 2019, 9:23pm

Very little actually needs to be optimized, only the functions that start with Single or TotalEnergy. Everything else is unfortunately long, but background noise.

ffevotte · February 9, 2019, 9:32pm

Here is a version of the SoA code which works faster:

struct Table  # given dummy values below
    ϵij::Array{Float64,2}
    σij::Array{Float64,2}
end

mutable struct XYZ
    x::Float64
    y::Float64
    z::Float64
end

struct StructOfArrays
    r::Vector{XYZ}
    v::Vector{XYZ}
    f::Vector{XYZ}
    typ::Vector{Int64}
    cgnr::Vector{Int64}
    name::Vector{String}
    mass::Vector{Float64}
    qq::Vector{Float64}
    mol::Vector{String}
end

function vector1D(c1::Float64, c2::Float64, box_size::Float64)
    if c1 < c2
        return (c2 - c1) < (c1 - c2 + box_size) ? (c2 - c1) : (c2 - c1 - box_size)
    else
        return (c1 - c2) < (c2 - c1 + box_size) ? (c2 - c1) : (c2 - c1 + box_size)
    end
end

function SingleLJ2!(f1::XYZ, f2::XYZ,
                    coord1::XYZ,coord2::XYZ, ϵ::Float64, σ::Float64, box_size)

    dx = vector1D(coord1.x, coord2.x, box_size[1])
    dy = vector1D(coord1.y, coord2.y, box_size[2])
    dz = vector1D(coord1.z, coord2.z, box_size[3])

    rij_sq = dx*dx + dy*dy + dz*dz

    if  rij_sq > 1.0
        return
    else
        sr2 = σ^2 / rij_sq
        rmag = sqrt(rij_sq)
        sr6 = sr2 ^ 3
        sr12 = sr6 ^ 2
        f = ((24 * ϵ) / rij_sq) * (2 * sr12 - sr6)

        x = -f * dx
        y = -f * dy
        z = -f * dz

        f1.x += x
        f1.y += y
        f1.z += z
        f2.x -= x
        f2.y -= y
        f2.z -= z
    end
    nothing
end

function TotalEnergyStructOfArrays(array::StructOfArrays,vdwTable::Table,
                                   n::Int64,box_size)

    @inbounds for i = 1:(n-1)
        ti = array.typ[i]
        for j = (i+1):n
            tj = array.typ[j]
            SingleLJ2!(array.f[i], array.f[j],
                       array.r[i], array.r[i],
                       vdwTable.ϵij[ti,tj],
                       vdwTable.σij[ti,tj],
                       box_size)
        end
    end
    return
end


################################################################################
#                  Some general parameters used by all trials
################################################################################
box_size = (2.0, 2.0, 2.0)
x = 3011 # dummy number of atoms
m = 4    # dummy number of atom types (for table of parameters...)
vdwTable = Table([rand() for i=1:m,j=1:m],
                 [rand() for i=1:m,j=1:m]) #,


################################################################################
#                   Start of trials
################################################################################

structOfArrays = StructOfArrays([XYZ(box_size[1] * rand(), box_size[1] * rand(), box_size[1] * rand()) for i=1:x],
                                [XYZ(0.0, 0.0, 0.0) for i=1:x ],
                                [XYZ(0.0, 0.0, 0.0) for i=1:x ],
                                [ceil(rand())*m for i=1:x],
                                [12 for i=1:x],
                                ["hi" for i=1:x],
                                [rand()*5.1 for i=1:x],
                                [rand()*1.3 for i=1:x],
                                ["string" for i=1:x])


################################################################################
#
#                           Testing Struct of Arrays
#
################################################################################

using BenchmarkTools
println("------------------------Start of Testing SoA------------------------------")
print("Struct of Arrays: ")
@btime TotalEnergyStructOfArrays($structOfArrays,$vdwTable,$x,$box_size)
println("--------------------------End of Testing SoA------------------------------")

This version benchmarks at 64ms (no allocations) on my machine, as compared to your original code which took around 210ms (3 allocations).

Some of the most important points to take into account (some of which were mentioned in previous answers):

avoid using general Arrays for fixed-size arrays (such as box_size)
in your case the 3-element arrays in SingleLJ2! can even be avoided altogether
@inbounds helps some more

IIRC the first two points above took me all the way down to 80ms. @inbounds saves an additional 15ms or so in this case. Hopefully the results are not perturbed, but I could not test them easily (it would have been good practice to have a way to easily and automatically check that).

Profiling shows that the bottleneck is in arithmetic operations in SingleLJ2!. I did not find any easy way to reduce the cost of those (but did not think about it for long either)…

aaowens · February 9, 2019, 9:35pm

It probably would be faster and cleaner with StaticArrays. You can avoid using a mutable struct and instead have a regular (immutable) struct, which should be more efficient. For example, you could replace

            array.f[i].x += f1[1]
            array.f[i].y += f1[2]
            array.f[i].z += f1[3]
            array.f[j].x += f2[1]
            array.f[j].y += f2[2]
            array.f[j].z += f2[3]

with

array.f[i] += f1
array.f[j] += f2

once you make the appropriate conversions.

bennedich · February 9, 2019, 9:48pm

Let’s look at it from this angle: Your inner loop is executed 3011*3010÷2 = 4531555 times. Your struct of arrays version runs in ~157 ms on my machine. That’s 157e6 / 4531555 = 34.6 ns per iteration, or 34.6 * 2.9 = 100.3 clock cycles per iteration on my 2.9 GHz machine.

Your inner loop contains several conditional branches, a bunch of memory accesses, division, square root (which seems unused? so probably optimized away by compiler), and quite a bit of other logic. It seems like a bit of a stretch to me to speed that up by an order of magnitude to reach ~10 clock cycles per iteration. Perhaps if you can get the loop to vectorize (SIMD), but with that vector1D I don’t know how feasible that is.

Then again, if you have C and Fortran versions of this code running an order of magnitude faster, it is of course possible. (Is that the case? If so, what’s the performance difference, and could you paste that code here?)

A few quick ideas (struct of arrays version):

Put @inbounds on the outer for loop.
– 157 ms → 142 ms
Replace f2 .= .-f1 with f2[1] = -f1[1]; f2[2] = -f1[2]; f2[3] = -f1[3]
– 142 ms → 114 ms
Put @inline infront of function SingleLJ2!
– 114 ms → 84 ms

Not an order of magnitude, but a factor of 2 speedup, with minimal changes.

To speed it up further, I’d experiment with StaticArrays, or deal with tuples instead of vectors, or extract all vector elements to individual variables (there’s a small overhead each time you access a vector). I’d also manually inline the functions and see if the code could be simplified/sped up that way.

This is all just micro-optimization, of course you should also look at the big picture, to see if you need to do all of these 4 million iterations, if you can reuse results, etc.

Forget about multi-threading for now. Your code is not thread-safe. Modifying it to be thread-safe might either make it slower (e.g. due to locking), or make it a lot more complicated, which increases the risk of bugs and makes it harder to spot more effective optimizations. Multi-threading is usually the very last optimization step I do (if I do it at all). If this thing is going to run for days, it might just be easier to start several processes of it to run in parallel, with different start parameters, and achieve multi-threading that way.

Btw, as long as you don’t use global variables, I don’t think there’s a need to wrap everything in a main function?

bennedich · February 9, 2019, 9:54pm

How would that be more efficient? Wouldn’t that allocate a new XYZ struct for each addition?

bennedich · February 9, 2019, 10:12pm

Is there a bug in your code btw? Shouldn’t this:

SingleLJ2!(f1, f2, array.r[i], array.r[i], ...

be:

SingleLJ2!(f1, f2, array.r[i], array.r[j], ...

This change makes it ~20 % slower…

I’d strongly recommend adding a simple test that makes sure the results are correct. It’s so easy otherwise to make mistakes like this and think that the code is suddenly faster, when in reality it no longer does the same thing.

MrRobot · February 9, 2019, 10:26pm

Yep you got me there. The results from this aren’t really testable(everything is generated as random numbers, so no physical system comes of this) hence why I didn’t spot it(edit: for real systems, yes, I do have many checks, a significant one being that it reproduces the required thermodynamic properties on test molecules that are very well documented). Also as I will mention in my next message which I will tag you in, I spent my time on the Array of Structures version, it did not have this bug ^^

MrRobot · February 9, 2019, 10:32pm

@bennedich @ffevotte Thanks for all of your help, the speedup is impressive! And, more importantly, I learned a bunch too

As a new user I can only tag two people… I also appreciate the comments from "antoine-levitt, mauro3 and aaowens!

I have a question though for bennedich and ffevotte… you both worked on the Structure of Arrays version, is there a reason? I personally much prefer working with Array of Structures, but I know people say that SoA is better. However, interestingly, when I add your combined recommendations, the AoS is about 4% faster. Not much, but still… Now, should I be wary of possible future expansions of my code where using AoS may come to bite me?

Thanks

aaowens · February 9, 2019, 10:42pm

I think this works as an example:

using StaticArrays, BenchmarkTools

mutable struct XYZ
  x::Float64
  y::Float64
  z::Float64
end
A = [XYZ(rand(), rand(), rand()) for i = 1:100000]
B = deepcopy(A)
As = [SVector{3}(rand(), rand(), rand()) for i = 1:100000]
Bs = deepcopy(As)
function test1(A, B)
  for i in eachindex(A)
    A[i].x += B[i].x
    A[i].y += B[i].y
    A[i].z += B[i].z
  end
end
function test2(A, B)
  for i in eachindex(A)
    A[i] += B[i]
  end
end
julia> @btime test1($A, $B)
  446.745 μs (0 allocations: 0 bytes)

julia> @btime test2($As, $Bs)
  210.350 μs (0 allocations: 0 bytes)

A struct won’t be allocated like a traditional array. It temporarily lives on the stack. The vector of SVector, As does get allocated, but the memory layout is just 3 floats in a row for each element. In contrast, I’m pretty sure the vector of mutable struct XYZ allocates a vector of pointers to separate XYZ objects . Possibly the compiler optimizes some of this away, I don’t know.

Also, operations on StaticArrays get SIMD vectorization naturally. If I look at the @code_llvm for test2, I see SIMD operations, but not in test1.

MrRobot · February 9, 2019, 10:59pm

I have been wondering about if using the XYZ struct is the way to go… it makes it hard to vectorize things, but variable.x is easier to read and understand… I will try using static arrays on my test set and see how it goes.

stevengj · February 9, 2019, 11:37pm

Have you thought about using an algorithm that is not O(n^2)? Say a fast-multipole algorithm or similar?

MrRobot · February 10, 2019, 12:51am

There are a lot of things I want to try, multiple time steps (RESPA) is one of them as well, especially for the electrostatics… I am not looking forward to being crushed by Ewalds, but first I must master the basics, and then I can move on.

Also in simulation we cheat and make neighborlists so we don’t bother interacting with the particles far far away, but alot of time, is still spent on the ones in range… too much time

I believe people have been playing around with fast multipole for lennard jones, I know they use them for electrostatics… I should revisit what their progress has been.

bennedich · February 10, 2019, 6:48am

When optimizing or refactoring code like this, I always employ a simple form of Golden Master Testing. Basically, all you’re testing then, is that your working version is equal to the original version. Whether it was correct or not to begin with, who knows, but at least it works the same.

This is usually trivial to set up. Example:

Random.seed!(0)          # fix random seed, so you always get the same data

A = rand(100, 100)        # create input data
B = complex_algorithm(A)  # output data returned as a matrix

# Now calculate some hash of the output (problem specific)
hash = sum(sum(B))

# And test against what you got with the original version of your code
expected = 8425.2178752625
hash ≈ expected || println("Expected $expected, got $hash")

Of course, a sum is not a perfect hash (and if you have NaNs it’s worthless), but I don’t think you need to go overboard in creating a perfect hash. A simple sum will almost always catch refactoring errors (like the one you had), and the above takes a minute or so to implement.

(Note: Just pay attention that if you rearrange floating point operations, floats may differ slightly due to rounding artifacts, so don’t do exact comparisons.)

There are many variations of this. A more robust option is to keep a copy of you original code, and call both that and your new code with the given input, and make sure that the results are equal.

MrRobot · February 15, 2019, 4:29am

@aaowens

How did you get the SVector to add inplace in a for loop?

When I use your code, it works on my computer, and the timing I get is the same as yours.

But, if I try to update a SVector in my SingleLJ2! function, it fails, I get ERROR: LoadError: setindex! ::SArray{Tuple{3},Float64,1,3}, value, ::Int) is not defined.

In this case I am passing a SVector{3} to a function (SingleLJ2!), and in that function I am trying to add to each of its 3 indices in a for loop, just like in your example. You example works… mine does not :S

I get that SVectors are immutable, but why does your example seem to break this? If I add to my SVector a Vector of size 3, that works, but I get a ton of allocations and memory usage when I do this inside my function!

Actually, all of the extra allocations are because I have to fill the SVector manually as f1 = [-fdx,-fdy,-f*dz]… i then fill the second SVector as f2 = -f1, and this generates no allocations… So I also am curious about this too… how do I avoid allocations if I fill the SVector with an array…

mauro3 · February 15, 2019, 8:35am

I think you misread @aaowens’s code, this A[i] += abc means “copy abc to location i of vector A”. This works because A is mutable (even though its elements are not). Thus A[i][1] += B[i][1] would not work because A[i][1] is a field of a SVector and thus cannot be updated.

To fill an SVector, see Constructing SVector with a loop - #7 by tkf. But for your example just do SVector(-fdx, -fdy, f*dz).

kristoffer.carlsson · February 15, 2019, 8:38am

Not all allocations are equal. “Allocations” of isbitstype structs are on the stack and since they are created from scratch the compiler knows that nothing holds a reference to them and can optimize based on that. Mutable structs can have other references to them so the compiler must defensive.

For things like 3-dimensional vectors in this case I would recommend always using immutable struct and writing the code in a more functional manner (similar as how you would write it in math / physics).

Topic		Replies	Views
Speeding up force calculations and mutable structs Performance	20	1270	August 5, 2020
Performance improvements for simple molecular surface caclulation Performance	15	680	September 14, 2020
Reducing allocations to optimize (to match Java speed) Performance	22	1405	December 10, 2018
Trying to understand how to write good code (example) New to Julia	34	3087	February 16, 2017
Learning to optimize Performance	12	1266	March 15, 2021

Can I make my code faster with parallelism, or just plain better coding?

Related topics