optimising for loop performance


#1

The following code (molecular dynamics simulation) was ported more or less directly from Fortran, but being new to Julia programming the performance isn’t as good. I followed the performance tips as much as I could, but I’m guessing there are ways to squeeze out additional speedups. It would be a good test-bed for me to learn how to use benchmarking tools in Julia, unfortunately I wasn’t able to interpret the results to find out possible bottlenecks (JuliaPro is giving me lots of errors so I run everything from the terminal, and ProfileView didn’t install cleanly).

Ay advice on how to improve this code for efficiency and/or to learn how to identify and optimise slow bits of code would be appreciated.

using BenchmarkTools, Compat


function potential(r::Array{Float64, 2},a::Array{Float64, 2})

    Epot::Float64 = 0.0

    fill!(a, 0.0)

    rij = Array{Float64, 1}(3)
    n::Int = size(r,2)

    for i = 1:n-1
        for j = i+1:n

            @views rij = r[:,i] - r[:,j]
            dist2::Float64 = dot(rij,rij)
            r6::Float64 = (1.0/dist2)^3.0
            r12::Float64 = r6*r6

            Epot += r12 - r6

            for k=1:3
                tmp::Float64  = (r6 - 2.0*r12)/dist2 * rij[k]
                a[k,i] -= tmp
                a[k,j] += tmp
            end

            end
    end

    return Epot

end

n = 55

r=rand(3,n)
a=zeros(3,n)


function myfunc()

for i = 1:1000
    Epot = potential(r,a)
end

end

@time myfunc()

@benchmark potential(r,a)
# @profile potential(r,a)
 # @profile myfunc()
# Profile.print(format=:flat, sortedby=:count)

#2

In Julia, you normally do not have to annotate types.
Annotations can be nice as documentation, but one of the strengths of Julia is how easy it makes writing generic code. Julia uses type inference to infer all the types at compile time, and compiles separate versions of the function for each set of input types so you can get the best of both worlds – flexibility, without giving up any runtime performance.

Here is a version focusing on being more flexible:

function potential!(a::AbstractMatrix, r::AbstractMatrix{T}) where T
    
    m, n = size(r)
    @boundscheck begin # following advice of tkoolen
        if ( m > size(a,1) || n > size(a,2) )
            throw(BoundsError("Dimensions of a: $(size(a)); dimensions of r: $(size(r))"))
       end
    end

    Epot = zero(T)

    fill!(a, 0)
    rij = Vector{T}(undef, m)
    # equivalent to
    # rij = Array{T, 1}(undef, m)

    @inbounds for i = 1:n-1, j = i+1:n
        # alternatively, you can use views and `@.`
        # @views @. rij = r[:,i] - r[:,j]
        for k = 1:m
            rij[k] = r[k,i] - r[k,j]
        end
        dist2 = sum(abs2, rij) # following DNF
        r6 = (1/dist2)^3
        r12 = r6*r6

        Epot += r12 - r6

        for k=1:m
            tmp  = (r6 - 2*r12)/dist2 * rij[k]
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end

    return Epot

end

Note, if I hadn’t changed the name above, you would have to restart Julia to use it. When you create multiple functions with the same name but different specified argument types, Julia will choose the most concrete version it can for the inputs. That means, because you specifically specified potential(r::Array{Float64, 2},a::Array{Float64, 2}), your version will always be preferred whenever calling potential(::Array{Float64, 2},::Array{Float64, 2}).
I originally did not rename it, but following DNF’s post I did. It is standard Julia convention to end function names that mutate inputs with a bang “!”, and to also list the mutated arguments first (so I also swapped the order of the inputs).

There are three performance improvements in the above version:

  1. You were allocating a new rij vector every time. You can avoid this via @. to broadcast, manually adding dots so the line reads @views rij .= r[:,i] .- r[:,j], or just expanding it into a for loop. Unfortunately, the for loop is the fastest, because array views in Julia currently allocate memory. That will probably be fixed in most cases.
  2. @inbounds. Fortran does not check array bounds at runtime, while Julia does. We can turn this off in code blocks within functions via @inbounds.
  3. As pointed out by DNF, sum(abs2, x) is much faster for x <: Vector, when x is short. Speeds are similar for long x (eg, 1000).
  4. Integer exponents are much faster:
julia> @benchmark (1.0/rand())^3.0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.376 ns (0.00% GC)
  median time:      50.876 ns (0.00% GC)
  mean time:        51.771 ns (0.00% GC)
  maximum time:     79.993 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark (1/rand())^3
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.610 ns (0.00% GC)
  median time:      3.138 ns (0.00% GC)
  mean time:        3.169 ns (0.00% GC)
  maximum time:     20.509 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

One final comment before moving on, when benchmarking use $ to interpolate non-constant variables into the expression you’re benchmarking.

julia> @benchmark potential!($a, $r)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     18.338 μs (0.00% GC)
  median time:      18.474 μs (0.00% GC)
  mean time:        18.752 μs (0.00% GC)
  maximum time:     36.766 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

We’re now only allocating memory for the one time we’re creating rij, before the for loop.

If you want even more performance, you can use StaticArrays.

using StaticArrays
function potential!(a::AbstractMatrix, r::Vector{SVector{M,T}}) where {M,T}

    n = length(r)
    @boundscheck begin # following advice of tkoolen
        if ( M > size(a,1) || n > size(a,2) )
            throw(BoundsError("Dimensions of a: $(size(a)); dimensions of r: ($M, $n)"))
       end
    end

    Epot = zero(T)

    fill!(a, 0)


    @inbounds for i = 1:n-1, j = i+1:n

        rij = r[i] - r[j]
        dist2 = rij' * rij
        r6 = (1/dist2)^3
        r12 = r6*r6

        Epot += r12 - r6

        for k=1:M
            tmp  = (r6 - 2*r12)/dist2 * rij[k]
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end

    return Epot

end
rv = [@SVector randn(3) for i = 1:n];

This yields

julia> @benchmark potential!($a, $rv)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.855 μs (0.00% GC)
  median time:      11.107 μs (0.00% GC)
  mean time:        11.193 μs (0.00% GC)
  maximum time:     44.236 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Julia will compile a different version of the function for each length of StaticArray you pass it, because the length is known at compile time. For this reason, Julia can also knows how much memory will be required, and just puts everything on the stack, avoiding any runtime heap memory allocations.

For reference, this is the initial version:

julia> @benchmark potential($r, $a)
BenchmarkTools.Trial: 
  memory estimate:  301.75 KiB
  allocs estimate:  4456
  --------------
  minimum time:     186.831 μs (0.00% GC)
  median time:      190.487 μs (0.00% GC)
  mean time:        217.959 μs (11.18% GC)
  maximum time:     46.160 ms (99.44% GC)
  --------------
  samples:          10000
  evals/sample:     1

#3

Great answer. The only thing I would change is to add a bounds check on the dimensions of a before the @inbounds annotation.


#4

On my computer at least, sum(abs2, rij) is much faster than rij' * rij.

Also, maybe I missed it, but I cannot see what purpose the last loop fills:

for k=1:M
     tmp  = (r6 - 2*r12)/dist2 * rij[k]
     a[k,i] -= tmp
     a[k,j] += tmp
end

Should this just be deleted? In that case, rij, is also unneeded, you can just directly accumulate dist2 without going through rij.

Edit: OK, if you’re using a outside the function, just modifying it in-place, then potential should be called potential!, with a !.


#5

Excellent points @tkoolen and @DNF. I edited my comment following your suggestions.


#6

I strongly agree with this. @learnjulia: there is no reason to limit your code to standard arrays and Float64s. If you do, it won’t work with other types of floats like Float32 or BigFloats, or with Dual numbers, so then automatic differentiation won’t work. Furthermore, Julia has lots of special kinds of arrays (see e.g. here, though this is just a tiny fraction of them), you have GPU arrays, and shared arrays for parallel processing, symmetrical array types, etc. etc. etc. None of them will work with your code if you limit it to standard Arrays.

In general, don’t over-specify your types, and definitely don’t write things like dist2::Float64 = dot(rij,rij) where ::Float64 is just visual noise.

In my view, the best code has no type specifications at all! :smiley:


#7

Thanks for the detailed response, I’ll learn a lot from this. One quick question: I’m getting an error with “undef”,

ERROR: LoadError: UndefVarError: undef not defined

is this in a package or a more recent julia?


#8

Indeed, I should have used potential!


#9

It’s new syntax for Julia 0.7. using Compat will let it work on Julia 0.6, too.


#10

Are you sure? I have using Compat at the top, yet still getting this error.


#11

Makes sense, I’ll be more aware of this going forward. Being new to Julia I just follow examples found online, and I had this vague notion that specific types were better because they’d let the compiler optimise storage requirements etc. for each variable.


#12

That must have been in reference to something else, I think.

I was a bit too general though. For example, when declaring new types (using e.g. struct), you should specify concrete or parametric types, for performance reasons.


#13

What version of Compat are you on?
Pkg.status("Compat")?
You could try Pkg.update().
If that doesn’t work, just delete the undef. It isn’t needed with Julia 0.6.
Starting in Julia 0.7, you do need the undef when allocating an Array without setting its contents, to warn that they’re arbitrary.

To expand a little more on type inference. You’re right that a compiler has to know the types of variables to generate efficient code. However, it’s rare that you actually have to specify the types in Julia, because the compiler normally figures out all the types on its own based on the input arguments. Julia compiles a separate version for each set of input types, letting the same high level code you write represent all sorts of different highly efficient typed low level code.

foo(a, b, c, d) = a' * b + c
@code_typed foo(2, 3, 4)
@code_typed foo(2., 3., 4.)
@code_typed foo(2f0, 3f0, 4f0,)
@code_typed foo(2, 3f0, 4.)
using StaticArrays
A = @SMatrix randn(8,16); B = @SMatrix randn(16,8); C = @SMatrix randn(8,8);
@code_typed foo(A, B, C)
a32 = @SVector randn(Float32, 16); b32 = @SVector randn(Float32, 16);
@code_typed foo(a32, b32, 1f0)

We didn’t say anything about types at all in foo, so Julia did it all for us!

Sometimes type inference fails. @code_warntype is especially helpful for that, since it will highlight problems in red.

If you want to see low level representations of your code, there’s also @code_llvm to see the LLVM intermediate representation. and finally @code_native to see assembly. These will tend to vary between computers, since the compiler will optimize code for your specific architecture.
They’re mostly useful for checking if Julia is using SIMD and fma instructions.

The major caveats on inference is that Julia has to be able to deduce all the types from the input types, and that types should not change (ie, should remain stable) within a function.
Regarding deduction:

struct NoTypes
    x
end
struct Parametric{T}
    x::T 
end
struct Static
    x::Float64
end
t1 = NoTypes(2.3)
t2 = Parametric(2.3)
t3 = Static(2.3)
foo(a, b) = a.x + b
@code_warntype foo(t1, 7) # from a NoTypes, Julia can't figure out what a.x will be.
@code_warntype foo(t2, 7) # From Parametric{Float64}, Julia knows a.x must be a Float64
@code_warntype foo(t3, 7) # Also works.

Accidentally letting types change can be an easy mistake to make.

function dot_product(x, y)
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = 0 # Created as an Int64
    @inbounds @simd for i = eachindex(x)
        out += x[i] * y[i] # If out + x[i] * y[i] is not an Int64, out's type will change!
    end
    out
end
@code_warntype dot_product(3, 8)
@code_warntype dot_product(Cint(3), Cint(8)) # promotes to Int64, still type stable
@code_warntype dot_product(1:4,5:8)
@code_warntype dot_product(collect(1:4),collect(5:8))
@code_warntype dot_product(randn(400), randn(400)) #oops! Promotes to Float64 -> type changes

To be flexible here, we’d need

function dot_product(x, y)
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(promote_type(eltype(x), eltype(y))) # Create a zero of the correct type
    @inbounds @simd for i = eachindex(x)
        out += x[i] * y[i]
    end
    out
end

This is an example where you might just want to say out = 0.0 if you’re writing your own code for a quick analysis. zero(promote_type(eltype(x), eltype(y))) is a bit verbose.


#14

Thanks for the additional tips; very useful to see these macros in use (I read about them, but it’s another thing to use them). I’m running Version 0.6.2 and Compat 0.39.0, but removing undef worked fine for now. I really need to do a fresh reinstall of everything; I’m hoping the next version of JuliaPro will be more usable on my machine because a good IDE is quite useful when discovering a new language.


#15

I have found that it’s easier to just install Atom (with the package uber-juno) and Julia separately. You can upgrade Julia whenever you want and just point Atom to whatever Julia version you want.

A last note that I don’t think anyone has mentioned. Julia is not automatically multithreaded. As a result, for lengthy parallelizeable code, you can often gain substantial speed through explicit shared or separate-memory parallelism. If I am reading your code correctly, then it looks like it can definitely be parallelized.