The following code (a molecular dynamics simulation) was ported more or less directly from Fortran, but being new to Julia programming I haven’t been able to get comparable performance. I followed the performance tips as much as I could, but I’m guessing there are ways to squeeze out additional speedups. It would also be a good test-bed for me to learn how to use the benchmarking tools in Julia; unfortunately, I wasn’t able to interpret the results well enough to find the likely bottlenecks (JuliaPro is giving me lots of errors, so I run everything from the terminal, and ProfileView didn’t install cleanly).
Any advice on how to improve the efficiency of this code, and/or on how to identify and optimise the slow bits, would be appreciated.
using BenchmarkTools, Compat
function potential(r::Array{Float64, 2}, a::Array{Float64, 2})
    Epot::Float64 = 0.0
    fill!(a, 0.0)
    rij = Array{Float64, 1}(3)
    n::Int = size(r, 2)
    for i = 1:n-1
        for j = i+1:n
            @views rij = r[:,i] - r[:,j]
            dist2::Float64 = dot(rij, rij)
            r6::Float64 = (1.0/dist2)^3.0
            r12::Float64 = r6*r6
            Epot += r12 - r6
            for k = 1:3
                tmp::Float64 = (r6 - 2.0*r12)/dist2 * rij[k]
                a[k,i] -= tmp
                a[k,j] += tmp
            end
        end
    end
    return Epot
end
n = 55
r=rand(3,n)
a=zeros(3,n)
function myfunc()
    for i = 1:1000
        Epot = potential(r, a)
    end
end
@time myfunc()
@benchmark potential(r,a)
# @profile potential(r,a)
# @profile myfunc()
# Profile.print(format=:flat, sortedby=:count)
In Julia, you normally do not have to annotate types.
Annotations can be nice as documentation, but one of the strengths of Julia is how easy it makes writing generic code. Julia uses type inference to infer all the types at compile time, and compiles separate versions of the function for each set of input types so you can get the best of both worlds – flexibility, without giving up any runtime performance.
Here is a version focusing on being more flexible:
function potential!(a::AbstractMatrix, r::AbstractMatrix{T}) where T
    m, n = size(r)
    @boundscheck begin # following advice of tkoolen
        if m > size(a, 1) || n > size(a, 2)
            throw(BoundsError("Dimensions of a: $(size(a)); dimensions of r: $(size(r))"))
        end
    end
    Epot = zero(T)
    fill!(a, 0)
    rij = Vector{T}(undef, m)
    # equivalent to
    # rij = Array{T, 1}(undef, m)
    @inbounds for i = 1:n-1, j = i+1:n
        # alternatively, you can use views and `@.`
        # @views @. rij = r[:,i] - r[:,j]
        for k = 1:m
            rij[k] = r[k,i] - r[k,j]
        end
        dist2 = sum(abs2, rij) # following DNF
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:m
            tmp = (r6 - 2*r12)/dist2 * rij[k]
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end
Note that if I hadn’t changed the name above, you would have had to restart Julia for the new version to be used. When you create multiple methods with the same name but different argument types, Julia chooses the most specific version that matches the inputs. That means that, because you specifically wrote potential(r::Array{Float64, 2}, a::Array{Float64, 2}), your original method would always be preferred whenever calling potential with two Array{Float64, 2} arguments.
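As a toy illustration of that dispatch rule (g is just a made-up name here):
g(x::AbstractMatrix) = "generic method"
g(x::Matrix{Float64}) = "Float64-specific method"
g(rand(3, 3))      # returns "Float64-specific method"
g(rand(Int, 3, 3)) # returns "generic method"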
I originally did not rename it, but following DNF’s post I did. It is standard Julia convention to end function names that mutate inputs with a bang “!”, and to also list the mutated arguments first (so I also swapped the order of the inputs).
There are three performance improvements in the above version:
You were allocating a new rij vector for every pair. You can avoid this by broadcasting with @. (so the line reads @views @. rij = r[:,i] - r[:,j]), by adding the dots manually (@views rij .= r[:,i] .- r[:,j]), or by expanding it into a for loop. Unfortunately, the for loop is currently the fastest, because array views in Julia still allocate memory; that will probably be fixed in most cases.
@inbounds. Fortran does not check array bounds at runtime, while Julia does. We can turn this off in code blocks within functions via @inbounds.
As pointed out by DNF, sum(abs2, x) is much faster than dot(x, x) when x is a short Vector; the speeds are similar for long x (e.g., length 1000). See the quick benchmark sketch below.
We’re now only allocating memory for the one time we’re creating rij, before the for loop.
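If you want to compare the two versions yourself (and check the sum(abs2, x) point), something along these lines with BenchmarkTools should do it; the exact timings will of course depend on your machine:
using BenchmarkTools
r = rand(3, 55); a = zeros(3, 55)
@btime potential($r, $a)  # original version
@btime potential!($a, $r) # version above
rij = randn(3)
@btime dot($rij, $rij)
@btime sum(abs2, $rij)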
If you want even more performance, you can use StaticArrays.
using StaticArrays
function potential!(a::AbstractMatrix, r::Vector{SVector{M,T}}) where {M,T}
    n = length(r)
    @boundscheck begin # following advice of tkoolen
        if M > size(a, 1) || n > size(a, 2)
            throw(BoundsError("Dimensions of a: $(size(a)); dimensions of r: ($M, $n)"))
        end
    end
    Epot = zero(T)
    fill!(a, 0)
    @inbounds for i = 1:n-1, j = i+1:n
        rij = r[i] - r[j]
        dist2 = rij' * rij
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:M
            tmp = (r6 - 2*r12)/dist2 * rij[k]
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end
rv = [@SVector randn(3) for i = 1:n];
Julia will compile a different version of the function for each length of StaticArray you pass it, because the length is part of the type and therefore known at compile time. For this reason, Julia also knows exactly how much memory will be required and can keep everything on the stack, avoiding any runtime heap allocations.
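A quick, rough way to check that (after the function has been compiled once) is to look at the allocation count reported by BenchmarkTools, which should be zero, or very close to it:
a = zeros(3, length(rv))
@btime potential!($a, $rv) # the report should show 0 allocations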
I strongly agree with this. @learnjulia: there is no reason to limit your code to standard Arrays and Float64s. If you do, it won’t work with other kinds of floats like Float32 or BigFloat, or with Dual numbers, so automatic differentiation won’t work. Furthermore, Julia has lots of special array types (see e.g. here, though that is just a tiny fraction of them): GPU arrays, shared arrays for parallel processing, symmetric array types, and so on. None of them will work with your code if you restrict it to standard Arrays.
In general, don’t over-specify your types, and definitely don’t write things like dist2::Float64 = dot(rij,rij) where ::Float64 is just visual noise.
In my view, the best code has no type specifications at all!
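For instance, because the revised potential! above only asks for AbstractMatrix, it should work as-is with other element types; Float32 is just one small example:
r32 = rand(Float32, 3, 55)
a32 = zeros(Float32, 3, 55)
potential!(a32, r32)  # works, and the returned energy is a Float32
# potential(r32, a32) # the original Float64-only signature would throw a MethodError here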
Makes sense, I’ll be more aware of this going forward. Being new to Julia I just follow examples found online, and I had this vague notion that specific types were better because they’d let the compiler optimise storage requirements etc. for each variable.
That must have been in reference to something else, I think.
I was a bit too general though. For example, when declaring new types (using e.g. struct), you should specify concrete or parametric types, for performance reasons.
What version of Compat are you on? Pkg.status("Compat")?
You could try Pkg.update().
If that doesn’t work, just delete the undef. It isn’t needed with Julia 0.6.
Starting in Julia 0.7, you do need the undef when allocating an Array without setting its contents, to warn that they’re arbitrary.
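Just to spell out the two syntaxes side by side:
rij = Vector{Float64}(3)        # Julia 0.6: an uninitialized length-3 vector
rij = Vector{Float64}(undef, 3) # Julia 0.7+ (or 0.6 with a recent Compat): the same, with undef made explicit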
To expand a little more on type inference. You’re right that a compiler has to know the types of variables to generate efficient code. However, it’s rare that you actually have to specify the types in Julia, because the compiler normally figures out all the types on its own based on the input arguments. Julia compiles a separate version for each set of input types, letting the same high level code you write represent all sorts of different highly efficient typed low level code.
foo(a, b, c) = a' * b + c
@code_typed foo(2, 3, 4)
@code_typed foo(2., 3., 4.)
@code_typed foo(2f0, 3f0, 4f0,)
@code_typed foo(2, 3f0, 4.)
using StaticArrays
A = @SMatrix randn(16,8); B = @SMatrix randn(16,8); C = @SMatrix randn(8,8);
@code_typed foo(A, B, C)
a32 = @SVector randn(Float32, 16); b32 = @SVector randn(Float32, 16);
@code_typed foo(a32, b32, 1f0)
We didn’t say anything about types at all in foo, so Julia did it all for us!
Sometimes type inference fails. @code_warntype is especially helpful for that, since it will highlight problems in red.
If you want to see low level representations of your code, there’s also @code_llvm to see the LLVM intermediate representation. and finally @code_native to see assembly. These will tend to vary between computers, since the compiler will optimize code for your specific architecture.
They’re mostly useful for checking if Julia is using SIMD and fma instructions.
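For example, with the foo defined above:
@code_llvm foo(2.0, 3.0, 4.0)   # LLVM intermediate representation
@code_native foo(2.0, 3.0, 4.0) # assembly for your machine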
The major caveats on inference are that Julia has to be able to deduce all the types from the input types, and that a variable’s type should not change (i.e., it should remain stable) within a function.
Regarding deduction:
struct NoTypes
    x
end
struct Parametric{T}
    x::T
end
struct Static
    x::Float64
end
t1 = NoTypes(2.3)
t2 = Parametric(2.3)
t3 = Static(2.3)
foo(a, b) = a.x + b
@code_warntype foo(t1, 7) # from a NoTypes, Julia can't figure out what a.x will be.
@code_warntype foo(t2, 7) # From Parametric{Float64}, Julia knows a.x must be a Float64
@code_warntype foo(t3, 7) # Also works.
Accidentally letting types change can be an easy mistake to make.
function dot_product(x, y)
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = 0 # Created as an Int64
    @inbounds @simd for i = eachindex(x)
        out += x[i] * y[i] # If out + x[i] * y[i] is not an Int64, out's type will change!
    end
    out
end
@code_warntype dot_product(3, 8)
@code_warntype dot_product(Cint(3), Cint(8)) # promotes to Int64, still type stable
@code_warntype dot_product(1:4,5:8)
@code_warntype dot_product(collect(1:4),collect(5:8))
@code_warntype dot_product(randn(400), randn(400)) #oops! Promotes to Float64 -> type changes
To be flexible here, we’d need
function dot_product(x, y)
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(promote_type(eltype(x), eltype(y))) # Create a zero of the correct type
    @inbounds @simd for i = eachindex(x)
        out += x[i] * y[i]
    end
    out
end
This is an example where you might just want to say out = 0.0 if you’re writing your own code for a quick analysis. zero(promote_type(eltype(x), eltype(y))) is a bit verbose.
Thanks for the additional tips; very useful to see these macros in use (I read about them, but it’s another thing to use them). I’m running Version 0.6.2 and Compat 0.39.0, but removing undef worked fine for now. I really need to do a fresh reinstall of everything; I’m hoping the next version of JuliaPro will be more usable on my machine because a good IDE is quite useful when discovering a new language.
I have found that it’s easier to just install Atom (with the package uber-juno) and Julia separately. You can upgrade Julia whenever you want and just point Atom to whatever Julia version you want.
A last note that I don’t think anyone has mentioned: Julia is not automatically multithreaded. As a result, for lengthy parallelizable code you can often gain substantial speed through explicit shared- or separate-memory parallelism. If I am reading your code correctly, it looks like it can definitely be parallelized.
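For example, here is a rough, untested sketch of how the pair loop could be threaded with Threads.@threads (start Julia with JULIA_NUM_THREADS > 1). The per-thread buffers are my own addition to avoid data races on a and Epot, and the work is not evenly balanced over i, so treat it only as a starting point:
function potential_threaded!(a::AbstractMatrix, r::AbstractMatrix{T}) where T
    m, n = size(r)
    nt = Threads.nthreads()
    abuf = [zeros(T, m, n) for _ in 1:nt] # one force accumulator per thread
    Ebuf = zeros(T, nt)                   # one energy accumulator per thread
    Threads.@threads for i = 1:n-1
        id = Threads.threadid()
        ai = abuf[id]
        @inbounds for j = i+1:n
            dist2 = zero(T)
            for k = 1:m
                dist2 += (r[k,i] - r[k,j])^2
            end
            r6 = (1/dist2)^3
            r12 = r6*r6
            Ebuf[id] += r12 - r6
            for k = 1:m
                tmp = (r6 - 2*r12)/dist2 * (r[k,i] - r[k,j])
                ai[k,i] -= tmp
                ai[k,j] += tmp
            end
        end
    end
    # combine the per-thread results
    fill!(a, 0)
    for buf in abuf
        a .+= buf
    end
    return sum(Ebuf)
end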