In order to better understand the performance differences, and to learn a little bit of Python and pythran, I have simplified the code to a single “acceleration”-like double loop. I will discuss some points that are decisive for obtaining good performance in Julia:
- The most naive Julia implementation of the loop is (I compute the sum of the components of the forces in all cases, just to be sure that I am doing the same thing every time):
```julia
function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            dx = x[i] - x[j] # <-- allocates a new vector on every iteration
            f[i] = f[i] + dx # <-- also allocates
            f[j] = f[j] - dx # <-- also allocates
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end
N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
```
This code runs in ~40s here (all times are measured using `time julia test.jl`, in bash).
I have marked one critical line, which allocates a new vector at every iteration. This has to be managed by hand, because the lengths of the vectors in `x` are not known until runtime.
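One detail worth spelling out: since `f[i] .= 0.` sits inside the outer loop, the `f[j] = f[j] - dx` updates made in earlier iterations are wiped out when the loop later reaches `i = j`, so what `fsum` actually accumulates is the sum, over pairs with `i < j`, of the components of `x[i] - x[j]`. That is fine for the benchmark, since every version here does the same thing. A minimal pure-numpy mirror of the loop (the function name is mine), just to make those semantics concrete:

```python
import numpy as np

def forces_naive(x):
    # Line-by-line mirror of the naive Julia loop, including the f[i]
    # reset inside the outer loop.
    n = len(x)
    f = np.zeros_like(x)
    for i in range(n):
        f[i] = 0.0                # reset inside the loop, as in the Julia code
        for j in range(i + 1, n):
            dx = x[i] - x[j]      # allocates a small temporary every iteration
            f[i] += dx
            f[j] -= dx
    return f.sum()

n = 10
x = np.array([[i, i, i] for i in range(1, n + 1)], dtype=np.float64)
# the resets discard the f[j] updates, so only the i < j terms survive:
expected = sum((x[i] - x[j]).sum() for i in range(n) for j in range(i + 1, n))
assert forces_naive(x) == expected
```

All arithmetic here is on small integers stored as doubles, so the comparison is exact rather than approximate.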
- The fix for that line is to preallocate the `dx` vector, meaning:
```julia
function forces!(x,f)
    n = length(x)
    dx = zeros(3) # <-- preallocate dx
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            dx .= x[i] .- x[j] # <-- broadcast assignment
            f[i] .= f[i] .+ dx # <-- broadcast (this could have been done in the previous version as well)
            f[j] .= f[j] .- dx # <-- broadcast
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end
N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
```
Now execution takes ~4s. A 10x speedup just because of that.
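The same distinction exists on the numpy side: `dx = x[i] - x[j]` creates a new array on every iteration, while `np.subtract(x[i], x[j], out=dx)` writes into a preallocated buffer, the analogue of Julia's `dx .= x[i] .- x[j]`. A sketch of that idea (the function name is mine):

```python
import numpy as np

def forces_prealloc(x, f):
    # numpy analogue of the preallocated/broadcast Julia version:
    # dx is allocated once and reused through the out= argument.
    n = x.shape[0]
    dx = np.empty(3)                         # like `dx = zeros(3)` in Julia
    for i in range(n):
        f[i] = 0.0
        for j in range(i + 1, n):
            np.subtract(x[i], x[j], out=dx)  # writes into dx, no new array
            f[i] += dx                       # in place, like f[i] .= f[i] .+ dx
            f[j] -= dx
    return f.sum()

n = 20
x = np.array([[i, i, i] for i in range(1, n + 1)], dtype=np.float64)
f = np.zeros_like(x)
fsum = forces_prealloc(x, f)
# same result as the allocating double loop (exact, integer-valued doubles):
assert fsum == sum((x[i] - x[j]).sum() for i in range(n) for j in range(i + 1, n))
```

In plain CPython the interpreter overhead of the inner loop dominates either way, so this mainly pays off under a compiler such as pythran or numba, much as the broadcast dot-calls pay off in compiled Julia code.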
- Still, the allocation of that vector and the assignment of its elements can be avoided altogether if one uses tuples, static vectors or, more simply, just numbers:
```julia
function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            for k in 1:3 # <-- now we do a loop here
                dxₖ = x[i][k] - x[j][k]
                f[i][k] = f[i][k] + dxₖ
                f[j][k] = f[j][k] - dxₖ
            end
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end
N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
```
Execution time is reduced to ~2s! A further 2x speedup from avoiding that assignment. This is the huge benefit of working with statically sized variables.
- Finally, the actual good Julian way to do all this is to use `StaticArrays` for `x` and `f`. The code is much cleaner, and `dx` is naturally static in this case:
```julia
using StaticArrays
function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] = zeros(eltype(f))
        for j in i+1:n
            dx = x[i] - x[j]
            f[i] = f[i] + dx
            f[j] = f[j] - dx
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end
N = 20_000
x = [ @SVector(Float64[i,i,i]) for i in 1:N ]
f = [ @SVector(zeros(3)) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
```
Execution time goes to slightly less than ~2s now.
Thus, by taking care of allocations and by using static arrays correctly, one can see differences of 20x in execution time relative to a first “naive” implementation.
I have tried to write a python-pythran version of the same code, which I post below. I edited the code to avoid allocating or assigning to the critical vector, as in the “third” Julia version. It now takes an impressive 1.34s, of course not including compilation (as is the case for the Julia versions).
Compilation and test with this Makefile:
```make
CC=gcc
CXX=g++

test: __pythran__/test.py
	time python test.py

__pythran__/test.py: test.py
	transonic test.py -af "-march=native -Ofast"
```
and the code is:
```python
import numpy as np
from transonic import boost

@boost
def forces(x: "float[:,:]", f: "float[:,:]"):
    n = x.shape[0]
    for i in range(n):
        for k in range(3):
            f[i][k] = 0.
        for j in range(i + 1, n):
            for k in range(3):  # <-- loop as in the non-allocating Julia version
                vk = x[i][k] - x[j][k]
                f[i][k] = f[i][k] + vk
                f[j][k] = f[j][k] - vk
    fsum = 0.
    for i in range(n):
        for j in range(3):
            fsum += f[i][j]
    return fsum

if __name__ == "__main__":
    N = 20000
    x = np.array([[i, i, i] for i in range(N)], np.float64)
    f = np.array([np.zeros(3) for i in range(N)], np.float64)
    print("sum of forces = ", forces(x, f))
```
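Since all the versions share the same loop structure, their common result can be cross-checked against a closed form: with the reset inside the outer loop, the returned forces are f[i] = Σ_{j>i} (x[i] - x[j]), which can be evaluated in O(n) with suffix sums. A sketch of that consistency check (function names are mine; this changes the algorithm, so it is a correctness check, not a fair benchmark entry):

```python
import numpy as np

def forces_closed_form(x):
    # What the benchmarked loop computes, in O(n) per component:
    #   f[i] = (number of j > i) * x[i] - sum_{j > i} x[j]
    n = x.shape[0]
    counts = np.arange(n - 1, -1, -1, dtype=np.float64)[:, None]  # n - 1 - i
    # suffix[i] = sum of x[i+1:], per component
    suffix = x[::-1].cumsum(axis=0)[::-1] - x
    return counts * x - suffix

def forces_loop(x):
    # Brute-force O(n^2) reference with the same semantics as the benchmark.
    n = x.shape[0]
    f = np.zeros_like(x)
    for i in range(n):
        f[i] = 0.0
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            f[i] += dx
            f[j] -= dx
    return f

n = 100
x = np.array([[i, i, i] for i in range(n)], dtype=np.float64)
# integer-valued doubles, so the two results match exactly:
assert np.array_equal(forces_closed_form(x), forces_loop(x))
```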
This may be of interest to you, @paugier .
Since in the end the times are similar, and comparable to the compilation time, I ran the python-pythran implementation and the static-arrays Julia version with 100_000 points. I get 20s for python-pythran and 16s for julia-staticarrays.
edit: I noticed that I was initializing `x` as a vector of integers in all cases. Fixing that reduced the times of all Julia implementations.
new edit: fixed some assignments in the second version, indicated above. The problem was that `f[i] = f[i] + dx` allocates, while `f[i] .= f[i] .+ dx` does not. The second version is then much improved.
Meanwhile, `@simd`, `@inbounds`, and the trick of using 4-element vectors instead of 3-element vectors did not make any difference worth mentioning.
New edit:
The Fortran equivalent takes 0.5s for the 20k case and 11s for the 100k case, thus beating everyone here.
Fortran code, compiled with `gfortran -O3`:
```fortran
program main
  implicit none
  integer, parameter :: n = 100000
  integer :: i, j, k
  double precision :: x(n,3), f(n,3), dx(3), fsum
  do i = 1,n
    do j = 1,3
      x(i,j) = i
    end do
  end do
  do i = 1,n
    do j = 1,3
      f(i,j) = 0.d0
    end do
    do j = i+1,n
      do k = 1,3
        dx(k) = x(i,k) - x(j,k)
        f(i,k) = f(i,k) + dx(k)
        f(j,k) = f(j,k) - dx(k)
      end do
    end do
  end do
  fsum = 0.d0
  do i = 1,n
    do j = 1,3
      fsum = fsum + f(i,j)
    end do
  end do
  write(*,*) " sum of forces = ", fsum
end program main
```