In order to better understand the differences in performance and to learn a little bit of python and pythran, I have simplified the code to a single “acceleration”-like double loop. I will discuss some important points that are decisive for obtaining good performance in Julia:
- The most naive Julia implementation of the loop is (I compute the sum of the components of the forces in all cases just to be sure that I am doing the same thing every time):
function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            dx = x[i] - x[j] # <-- Allocates a new vector on every iteration
            f[i] = f[i] + dx # <-- also allocates
            f[j] = f[j] - dx # <-- also allocates
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
This code runs in ~40s here (all times are measured using time julia test.jl, in bash).
I have marked one critical line, which allocates a new vector at every iteration. This has to be managed by hand, because the lengths of the vectors in x are not known except at runtime.
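A quick way to see those per-iteration allocations is to time the function from within a Julia session; a minimal sketch, assuming the forces!, x and f defined above:

# Assumes forces!, x and f defined as in the snippet above.
@time forces!(x, f)   # first call also includes compilation
@time forces!(x, f)   # second call: note the very large number of allocations
# For a more careful measurement one could use BenchmarkTools:
# using BenchmarkTools
# @btime forces!($x, $f)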
- The fix for that line is to preallocate the dx vector, meaning:
function forces!(x,f)
    n = length(x)
    dx = zeros(3) # <-- preallocate dx
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            dx .= x[i] .- x[j] # <-- broadcast assignment
            f[i] .= f[i] .+ dx # <-- broadcast (this could have been done in the previous version as well)
            f[j] .= f[j] .- dx # <-- broadcast
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
Now execution takes ~4s. A 10x speedup just because of that.
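As an aside, a more compact and, as far as I know, equivalent spelling of those dotted lines uses the @. macro from Base, which inserts the dots automatically (a sketch, assuming dx, x, f and the loop indices as in the function above):

# @. turns the operators and the assignment into their broadcast, in-place
# forms; these lines are equivalent to the dotted ones above.
@. dx = x[i] - x[j]
@. f[i] = f[i] + dx
@. f[j] = f[j] - dx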
- Still, the allocation of that vector and the assignment of its elements can be avoided altogether if one uses tuples, static vectors or, more simply, just numbers:
function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.
        for j in i+1:n
            for k in 1:3 # <-- now we do a loop here
                dxₖ = x[i][k] - x[j][k]
                f[i][k] = f[i][k] + dxₖ
                f[j][k] = f[j][k] - dxₖ
            end
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
Execution time is reduced to ~2s! A further 2x speedup for avoiding that assignment. This is the huge benefit of working with static variables.
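For completeness, here is a minimal sketch of the “just numbers” variant mentioned above, unrolling the three components by hand (the forces_scalar! name is just illustrative, not from the versions above):

# Hypothetical variant of the third version using plain scalars
# for the three components instead of the inner k loop.
function forces_scalar!(x, f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.0
        for j in i+1:n
            dx1 = x[i][1] - x[j][1]
            dx2 = x[i][2] - x[j][2]
            dx3 = x[i][3] - x[j][3]
            f[i][1] += dx1; f[i][2] += dx2; f[i][3] += dx3
            f[j][1] -= dx1; f[j][2] -= dx2; f[j][3] -= dx3
        end
    end
    fsum = 0.0
    for i in 1:n, j in 1:3
        fsum += f[i][j]
    end
    return fsum
end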
- Finally, the actual good Julian way to do all this is to use StaticArrays for x and f. The code is much cleaner, and dx is naturally static in this case:
using StaticArrays

function forces!(x,f)
    n = length(x)
    for i in 1:n
        f[i] = zeros(eltype(f))
        for j in i+1:n
            dx = x[i] - x[j]
            f[i] = f[i] + dx
            f[j] = f[j] - dx
        end
    end
    fsum = 0.
    for i in 1:n
        for j in 1:3
            fsum += f[i][j]
        end
    end
    return fsum
end

N = 20_000
x = [ @SVector Float64[i,i,i] for i in 1:N ]
f = [ @SVector zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
Execution time goes to slightly less than ~2s now.
Thus, by taking care of allocations and using static arrays correctly, one can see differences of 20 times in execution time relative to the first “naive” implementation.
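The reason the StaticArrays version tolerates the plain dx = x[i] - x[j] syntax is that SVectors are immutable and stack-allocated, so the arithmetic produces new values without touching the heap. A minimal illustration (the variable and function names are just examples):

using StaticArrays
a = SVector(1.0, 2.0, 3.0)
b = SVector(0.5, 0.5, 0.5)
c = a - b                    # a brand-new SVector, but no heap allocation
g(u, v) = @allocated(u - v)  # measure allocations inside a function
println(g(a, b))             # expected: 0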
I have tried to write a python-pythran version of the same code, which I post below. I edited the code to avoid allocating or assigning to the critical vector, as in the “third” Julia version. It now takes an impressive 1.34s. Of course, this does not include compilation time, unlike the Julia timings above.
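For a compilation-free number on the Julia side, one could time only a second, already-compiled call inside the script; a minimal sketch, assuming the forces!, x and f defined above:

forces!(x, f)               # warm-up call: triggers compilation
t = @elapsed forces!(x, f)  # time of the compiled call only
println("time without compilation: ", t, " s")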
Compilation and test with this Makefile:
CC=gcc
CXX=g++

test: __pythran__/test.py
	time python test.py

__pythran__/test.py: test.py
	transonic test.py -af "-march=native -Ofast"
and the code is:
import numpy as np
from transonic import boost

@boost
def forces(x: "float[:,:]", f: "float[:,:]"):
    n = x.shape[0]
    for i in range(n):
        for k in range(3):
            f[i][k] = 0.
        for j in range(i + 1, n):
            for k in range(3):  # <-- Loop as in the non-allocating Julia version
                vk = x[i][k] - x[j][k]
                f[i][k] = f[i][k] + vk
                f[j][k] = f[j][k] - vk
    fsum = 0.
    for i in range(n):
        for j in range(3):
            fsum += f[i][j]
    return fsum

if __name__ == "__main__":
    import sys
    N = 20000
    x = np.array([[i, i, i] for i in range(N)], np.float64)
    f = np.array([np.zeros(3) for i in range(N)], np.float64)
    print("sum of forces = ", forces(x, f))
This may be of interest to you, @paugier.
Since in the end the times are similar to each other and comparable to the compilation time, I ran the python-pythran implementation and the static-arrays Julia version with 100_000 points. I get 20s for python-pythran and 16s for julia-staticarrays.
edit: I noticed that I was initializing x as a vector of integers in all cases. Fixing that reduced the times of all Julia implementations.
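For reference, a minimal illustration of that fix: an untyped literal built from an integer index produces a Vector{Int}, while the typed literal used above produces the intended Vector{Float64}:

i = 1
typeof([i, i, i])          # Vector{Int64} (on a 64-bit machine)
typeof(Float64[i, i, i])   # Vector{Float64}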
new edit: fixed some assignments in the second version, indicated above. The problem was that f[i] = f[i] + dx allocates, while f[i] .= f[i] .+ dx does not. The second version is then much improved.
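A minimal way to see that difference (a sketch using 3-element Float64 vectors like the ones above; the helper names are just illustrative):

# f[i] = f[i] + dx builds a new vector on the right-hand side and rebinds the
# slot, so it allocates; f[i] .= f[i] .+ dx writes into the existing vector.
h1!(f, dx, i) = @allocated(f[i] = f[i] + dx)
h2!(f, dx, i) = @allocated(f[i] .= f[i] .+ dx)
f  = [zeros(3) for _ in 1:2]
dx = ones(3)
println(h1!(f, dx, 1))   # > 0 bytes
println(h2!(f, dx, 1))   # expected: 0 bytes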
Now, @simd, @inbounds or the trick of using 4-element vectors instead of 3-element vectors did not make any difference that was worth mentioning.
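For concreteness, a minimal sketch of how @inbounds and @simd can be attached to the innermost loop of the third version (the forces_simd! name is illustrative; as said, these annotations did not change the timings noticeably here):

# Third version with @inbounds/@simd hints on the innermost loop.
function forces_simd!(x, f)
    n = length(x)
    for i in 1:n
        f[i] .= 0.0
        for j in i+1:n
            @inbounds @simd for k in 1:3
                dxₖ = x[i][k] - x[j][k]
                f[i][k] += dxₖ
                f[j][k] -= dxₖ
            end
        end
    end
    fsum = 0.0
    for i in 1:n, j in 1:3
        @inbounds fsum += f[i][j]
    end
    return fsum
end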
New edit:
The Fortran equivalent takes 0.5s for the 20k case and 11s for the 100k case, thus beating everyone here:
Fortran Code
Compiled with: gfortran -O3
program main
  integer, parameter :: n = 100000
  double precision :: x(n,3), f(n,3), dx(3), fsum
  do i = 1,n
    do j = 1,3
      x(i,j) = i
    end do
  end do
  do i = 1,n
    do j = 1,3
      f(i,j) = 0.
    end do
    do j = i+1,n
      do k = 1,3
        dx(k) = x(i,k) - x(j,k)
        f(i,k) = f(i,k) + dx(k)
        f(j,k) = f(j,k) - dx(k)
      end do
    end do
  end do
  fsum = 0.d0
  do i = 1,n
    do j = 1,3
      fsum = fsum + f(i,j)
    end do
  end do
  write(*,*) " sum of forces = ", fsum
end