# optimising for loop performance

The following code (molecular dynamics simulation) was ported more or less directly from Fortran, but being new to Julia programming the performance isn’t as good. I followed the performance tips as much as I could, but I’m guessing there are ways to squeeze out additional speedups. It would be a good test-bed for me to learn how to use benchmarking tools in Julia, unfortunately I wasn’t able to interpret the results to find out possible bottlenecks (JuliaPro is giving me lots of errors so I run everything from the terminal, and ProfileView didn’t install cleanly).

Ay advice on how to improve this code for efficiency and/or to learn how to identify and optimise slow bits of code would be appreciated.

``````using BenchmarkTools, Compat

function potential(r::Array{Float64, 2},a::Array{Float64, 2})

Epot::Float64 = 0.0

fill!(a, 0.0)

rij = Array{Float64, 1}(3)
n::Int = size(r,2)

for i = 1:n-1
for j = i+1:n

@views rij = r[:,i] - r[:,j]
dist2::Float64 = dot(rij,rij)
r6::Float64 = (1.0/dist2)^3.0
r12::Float64 = r6*r6

Epot += r12 - r6

for k=1:3
tmp::Float64  = (r6 - 2.0*r12)/dist2 * rij[k]
a[k,i] -= tmp
a[k,j] += tmp
end

end
end

return Epot

end

n = 55

r=rand(3,n)
a=zeros(3,n)

function myfunc()

for i = 1:1000
Epot = potential(r,a)
end

end

@time myfunc()

@benchmark potential(r,a)
# @profile potential(r,a)
# @profile myfunc()
# Profile.print(format=:flat, sortedby=:count)
``````
1 Like

In Julia, you normally do not have to annotate types.
Annotations can be nice as documentation, but one of the strengths of Julia is how easy it makes writing generic code. Julia uses type inference to infer all the types at compile time, and compiles separate versions of the function for each set of input types so you can get the best of both worlds – flexibility, without giving up any runtime performance.

Here is a version focusing on being more flexible:

``````function potential!(a::AbstractMatrix, r::AbstractMatrix{T}) where T

m, n = size(r)
@boundscheck begin # following advice of tkoolen
if ( m > size(a,1) || n > size(a,2) )
throw(BoundsError("Dimensions of a: \$(size(a)); dimensions of r: \$(size(r))"))
end
end

Epot = zero(T)

fill!(a, 0)
rij = Vector{T}(undef, m)
# equivalent to
# rij = Array{T, 1}(undef, m)

@inbounds for i = 1:n-1, j = i+1:n
# alternatively, you can use views and `@.`
# @views @. rij = r[:,i] - r[:,j]
for k = 1:m
rij[k] = r[k,i] - r[k,j]
end
dist2 = sum(abs2, rij) # following DNF
r6 = (1/dist2)^3
r12 = r6*r6

Epot += r12 - r6

for k=1:m
tmp  = (r6 - 2*r12)/dist2 * rij[k]
a[k,i] -= tmp
a[k,j] += tmp
end
end

return Epot

end
``````

Note, if I hadn’t changed the name above, you would have to restart Julia to use it. When you create multiple functions with the same name but different specified argument types, Julia will choose the most concrete version it can for the inputs. That means, because you specifically specified `potential(r::Array{Float64, 2},a::Array{Float64, 2})`, your version will always be preferred whenever calling `potential(::Array{Float64, 2},::Array{Float64, 2})`.
I originally did not rename it, but following DNF’s post I did. It is standard Julia convention to end function names that mutate inputs with a bang “!”, and to also list the mutated arguments first (so I also swapped the order of the inputs).

There are three performance improvements in the above version:

1. You were allocating a new `rij` vector every time. You can avoid this via `@.` to broadcast, manually adding dots so the line reads `@views rij .= r[:,i] .- r[:,j]`, or just expanding it into a for loop. Unfortunately, the for loop is the fastest, because array views in Julia currently allocate memory. That will probably be fixed in most cases.
2. `@inbounds`. Fortran does not check array bounds at runtime, while Julia does. We can turn this off in code blocks within functions via `@inbounds`.
3. As pointed out by DNF, `sum(abs2, x)` is much faster for `x <: Vector`, when `x` is short. Speeds are similar for long `x` (eg, 1000).
4. Integer exponents are much faster:
``````julia> @benchmark (1.0/rand())^3.0
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     49.376 ns (0.00% GC)
median time:      50.876 ns (0.00% GC)
mean time:        51.771 ns (0.00% GC)
maximum time:     79.993 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     987

julia> @benchmark (1/rand())^3
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.610 ns (0.00% GC)
median time:      3.138 ns (0.00% GC)
mean time:        3.169 ns (0.00% GC)
maximum time:     20.509 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

One final comment before moving on, when benchmarking use `\$` to interpolate non-constant variables into the expression you’re benchmarking.

``````julia> @benchmark potential!(\$a, \$r)
BenchmarkTools.Trial:
memory estimate:  112 bytes
allocs estimate:  1
--------------
minimum time:     18.338 μs (0.00% GC)
median time:      18.474 μs (0.00% GC)
mean time:        18.752 μs (0.00% GC)
maximum time:     36.766 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

We’re now only allocating memory for the one time we’re creating `rij`, before the for loop.

If you want even more performance, you can use StaticArrays.

``````using StaticArrays
function potential!(a::AbstractMatrix, r::Vector{SVector{M,T}}) where {M,T}

n = length(r)
@boundscheck begin # following advice of tkoolen
if ( M > size(a,1) || n > size(a,2) )
throw(BoundsError("Dimensions of a: \$(size(a)); dimensions of r: (\$M, \$n)"))
end
end

Epot = zero(T)

fill!(a, 0)

@inbounds for i = 1:n-1, j = i+1:n

rij = r[i] - r[j]
dist2 = rij' * rij
r6 = (1/dist2)^3
r12 = r6*r6

Epot += r12 - r6

for k=1:M
tmp  = (r6 - 2*r12)/dist2 * rij[k]
a[k,i] -= tmp
a[k,j] += tmp
end
end

return Epot

end
rv = [@SVector randn(3) for i = 1:n];
``````

This yields

``````julia> @benchmark potential!(\$a, \$rv)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     10.855 μs (0.00% GC)
median time:      11.107 μs (0.00% GC)
mean time:        11.193 μs (0.00% GC)
maximum time:     44.236 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

Julia will compile a different version of the function for each length of StaticArray you pass it, because the length is known at compile time. For this reason, Julia can also knows how much memory will be required, and just puts everything on the stack, avoiding any runtime heap memory allocations.

For reference, this is the initial version:

``````julia> @benchmark potential(\$r, \$a)
BenchmarkTools.Trial:
memory estimate:  301.75 KiB
allocs estimate:  4456
--------------
minimum time:     186.831 μs (0.00% GC)
median time:      190.487 μs (0.00% GC)
mean time:        217.959 μs (11.18% GC)
maximum time:     46.160 ms (99.44% GC)
--------------
samples:          10000
evals/sample:     1
``````
12 Likes

Great answer. The only thing I would change is to add a bounds check on the dimensions of `a` before the `@inbounds` annotation.

2 Likes

On my computer at least, `sum(abs2, rij)` is much faster than `rij' * rij`.

Also, maybe I missed it, but I cannot see what purpose the last loop fills:

``````for k=1:M
tmp  = (r6 - 2*r12)/dist2 * rij[k]
a[k,i] -= tmp
a[k,j] += tmp
end
``````

Should this just be deleted? In that case, `rij`, is also unneeded, you can just directly accumulate `dist2` without going through `rij`.

Edit: OK, if you’re using `a` outside the function, just modifying it in-place, then `potential` should be called `potential!`, with a `!`.

Excellent points @tkoolen and @DNF. I edited my comment following your suggestions.

I strongly agree with this. @learnjulia: there is no reason to limit your code to standard arrays and `Float64`s. If you do, it won’t work with other types of floats like `Float32` or `BigFloat`s, or with `Dual` numbers, so then automatic differentiation won’t work. Furthermore, Julia has lots of special kinds of arrays (see e.g. here, though this is just a tiny fraction of them), you have GPU arrays, and shared arrays for parallel processing, symmetrical array types, etc. etc. etc. None of them will work with your code if you limit it to standard `Array`s.

In general, don’t over-specify your types, and definitely don’t write things like `dist2::Float64 = dot(rij,rij)` where `::Float64` is just visual noise.

In my view, the best code has no type specifications at all! 3 Likes

Thanks for the detailed response, I’ll learn a lot from this. One quick question: I’m getting an error with “undef”,

ERROR: LoadError: UndefVarError: undef not defined

is this in a package or a more recent julia?

Indeed, I should have used `potential!`

It’s new syntax for Julia 0.7. `using Compat` will let it work on Julia 0.6, too.

Are you sure? I have `using Compat` at the top, yet still getting this error.

Makes sense, I’ll be more aware of this going forward. Being new to Julia I just follow examples found online, and I had this vague notion that specific types were better because they’d let the compiler optimise storage requirements etc. for each variable.

That must have been in reference to something else, I think.

I was a bit too general though. For example, when declaring new types (using e.g. `struct`), you should specify concrete or parametric types, for performance reasons.

What version of Compat are you on?
`Pkg.status("Compat")`?
You could try `Pkg.update()`.
If that doesn’t work, just delete the `undef`. It isn’t needed with Julia 0.6.
Starting in Julia 0.7, you do need the `undef` when allocating an Array without setting its contents, to warn that they’re arbitrary.

To expand a little more on type inference. You’re right that a compiler has to know the types of variables to generate efficient code. However, it’s rare that you actually have to specify the types in Julia, because the compiler normally figures out all the types on its own based on the input arguments. Julia compiles a separate version for each set of input types, letting the same high level code you write represent all sorts of different highly efficient typed low level code.

``````foo(a, b, c, d) = a' * b + c
@code_typed foo(2, 3, 4)
@code_typed foo(2., 3., 4.)
@code_typed foo(2f0, 3f0, 4f0,)
@code_typed foo(2, 3f0, 4.)
using StaticArrays
A = @SMatrix randn(8,16); B = @SMatrix randn(16,8); C = @SMatrix randn(8,8);
@code_typed foo(A, B, C)
a32 = @SVector randn(Float32, 16); b32 = @SVector randn(Float32, 16);
@code_typed foo(a32, b32, 1f0)
``````

We didn’t say anything about types at all in `foo`, so Julia did it all for us!

Sometimes type inference fails. `@code_warntype` is especially helpful for that, since it will highlight problems in red.

If you want to see low level representations of your code, there’s also `@code_llvm` to see the LLVM intermediate representation. and finally `@code_native` to see assembly. These will tend to vary between computers, since the compiler will optimize code for your specific architecture.
They’re mostly useful for checking if Julia is using SIMD and fma instructions.

The major caveats on inference is that Julia has to be able to deduce all the types from the input types, and that types should not change (ie, should remain stable) within a function.
Regarding deduction:

``````struct NoTypes
x
end
struct Parametric{T}
x::T
end
struct Static
x::Float64
end
t1 = NoTypes(2.3)
t2 = Parametric(2.3)
t3 = Static(2.3)
foo(a, b) = a.x + b
@code_warntype foo(t1, 7) # from a NoTypes, Julia can't figure out what a.x will be.
@code_warntype foo(t2, 7) # From Parametric{Float64}, Julia knows a.x must be a Float64
@code_warntype foo(t3, 7) # Also works.
``````

Accidentally letting types change can be an easy mistake to make.

``````function dot_product(x, y)
@boundscheck length(x) == length(y) || throw(BoundsError())
out = 0 # Created as an Int64
@inbounds @simd for i = eachindex(x)
out += x[i] * y[i] # If out + x[i] * y[i] is not an Int64, out's type will change!
end
out
end
@code_warntype dot_product(3, 8)
@code_warntype dot_product(Cint(3), Cint(8)) # promotes to Int64, still type stable
@code_warntype dot_product(1:4,5:8)
@code_warntype dot_product(collect(1:4),collect(5:8))
@code_warntype dot_product(randn(400), randn(400)) #oops! Promotes to Float64 -> type changes
``````

To be flexible here, we’d need

``````function dot_product(x, y)
@boundscheck length(x) == length(y) || throw(BoundsError())
out = zero(promote_type(eltype(x), eltype(y))) # Create a zero of the correct type
@inbounds @simd for i = eachindex(x)
out += x[i] * y[i]
end
out
end
``````

This is an example where you might just want to say `out = 0.0` if you’re writing your own code for a quick analysis. `zero(promote_type(eltype(x), eltype(y)))` is a bit verbose.

5 Likes

Thanks for the additional tips; very useful to see these macros in use (I read about them, but it’s another thing to use them). I’m running Version 0.6.2 and Compat 0.39.0, but removing undef worked fine for now. I really need to do a fresh reinstall of everything; I’m hoping the next version of JuliaPro will be more usable on my machine because a good IDE is quite useful when discovering a new language.

I have found that it’s easier to just install Atom (with the package uber-juno) and Julia separately. You can upgrade Julia whenever you want and just point Atom to whatever Julia version you want.

A last note that I don’t think anyone has mentioned. Julia is not automatically multithreaded. As a result, for lengthy parallelizeable code, you can often gain substantial speed through explicit shared or separate-memory parallelism. If I am reading your code correctly, then it looks like it can definitely be parallelized.

2 Likes