Adding to the second answer: consider this typical scenario, in which I am computing the sum of the squared distances between 1000 particles in three dimensions.
I will define three functions and particle representations: one in which the coordinates of each particle are stored in a column of a 3×N matrix, one in which the coordinates of each particle are stored in a row of an N×3 matrix (so the long list of particles runs down the columns), and a third using a vector of static 3D vectors.
In the first test, I compute the sum of the squared distances over all pairs of the 1000 particles generated, so that all matrices and vectors are accessed contiguously in memory:
Code
using BenchmarkTools
using StaticArrays
# M is 3×N: each particle occupies one column, so each coordinate runs along a long row
function sumd_longline(M)
    N = size(M,2)
    s = 0.0
    @inbounds for i in 1:N-1
        @simd for j in i+1:N
            s += (M[1,i]-M[1,j])^2 + (M[2,i]-M[2,j])^2 + (M[3,i]-M[3,j])^2
        end
    end
    s
end
# M is N×3: each particle occupies one row, so each coordinate runs down a long column
function sumd_longcol(M)
    N = size(M,1)
    s = 0.0
    @inbounds for i in 1:N-1
        @simd for j in i+1:N
            s += (M[i,1]-M[j,1])^2 + (M[i,2]-M[j,2])^2 + (M[i,3]-M[j,3])^2
        end
    end
    s
end
# x is a vector of static 3D vectors, one per particle
function sumd_staticvectors(x)
    N = length(x)
    s = 0.0
    @inbounds for i in 1:N-1
        for j in i+1:N
            s += (x[i][1]-x[j][1])^2 + (x[i][2]-x[j][2])^2 + (x[i][3]-x[j][3])^2
        end
    end
    s
end
M1 = rand(3,1000);
print(" long line: ");@btime sumd_longline($M1)
M2 = rand(1000,3);
print(" long col: ");@btime sumd_longcol($M2)
x = [ rand(SVector{3,Float64}) for i in 1:1000 ];
print(" static vectors: ");@btime sumd_staticvectors($x)
The result is:
long line: 666.828 μs (0 allocations: 0 bytes)
long col: 230.356 μs (0 allocations: 0 bytes)
static vectors: 663.359 μs (0 allocations: 0 bytes)
As expected, there is a huge benefit in using a storage layout with long columns that are traversed in the inner loop, taking advantage of SIMD operations.
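To see why the column-wise layout vectorizes well, one can inspect the memory strides of the two matrices (this is only an illustrative check, assuming the same M1 and M2 defined above, and is not part of the benchmark):
Code
strides(M1)  # (1, 3): in the 3×1000 matrix, the same coordinate of consecutive particles is 3 elements apart
strides(M2)  # (1, 1000): in the 1000×3 matrix, M2[j,1] and M2[j+1,1] are adjacent, so the inner loop over j reads contiguous memory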
However, as I mentioned, this is not the most typical situation in a particle simulation. Most likely you will compute the distances (and interactions) between particles that have no reason to be contiguous in memory, if you are using any algorithm that allows you to skip the computations for pairs of particles farther apart than some cutoff.
In that case, a more realistic scenario is computing the same sums, but between particle coordinates that are not contiguous in memory. The code below emulates that by generating 10,000 particles and computing the sum of squared distances over a random subset of 1000 of them:
Code
# Same as sumd_longline, but accessing only the particles listed in indexes
function sumd_longline(M,indexes)
    N = length(indexes)
    s = 0.0
    @inbounds for i1 in 1:N-1
        @simd for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (M[1,i]-M[1,j])^2 + (M[2,i]-M[2,j])^2 + (M[3,i]-M[3,j])^2
        end
    end
    s
end
# Same as sumd_longcol, but accessing only the particles listed in indexes
function sumd_longcol(M,indexes)
    N = length(indexes)
    s = 0.0
    @inbounds for i1 in 1:N-1
        @simd for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (M[i,1]-M[j,1])^2 + (M[i,2]-M[j,2])^2 + (M[i,3]-M[j,3])^2
        end
    end
    s
end
# Same as sumd_staticvectors, but accessing only the particles listed in indexes
function sumd_staticvectors(x,indexes)
    N = length(indexes)
    s = 0.0
    @inbounds for i1 in 1:N-1
        for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (x[i][1]-x[j][1])^2 + (x[i][2]-x[j][2])^2 + (x[i][3]-x[j][3])^2
        end
    end
    s
end
indexes = rand(1:10000,1000)
M1 = rand(3,10000)
print(" long line: ");@btime sumd_longline($M1,$indexes)
M2 = rand(10000,3)
print(" long col: ");@btime sumd_longcol($M2,$indexes)
x = [ rand(SVector{3,Float64}) for i in 1:10000 ];
print(" static vectors: ");@btime sumd_staticvectors($x,$indexes)
The result is:
long line: 962.355 μs (0 allocations: 0 bytes)
long col: 1.001 ms (0 allocations: 0 bytes)
static vectors: 785.297 μs (0 allocations: 0 bytes)
Since the array accesses are no longer contiguous in memory in any of the cases, SIMD operations do not provide any performance gain. And now the vector of static vectors is the best structure. One could consider copying the subsets into contiguous arrays before doing the computations (a sketch of that idea follows below), but I think that is not worth the effort in most cases.
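For completeness, here is a minimal sketch of that "copy first" idea. The helper name sumd_copy_first is just illustrative; it gathers the selected particles into a fresh contiguous vector and reuses the one-argument sumd_staticvectors from the first test, at the cost of one allocation per call:
Code
function sumd_copy_first(x,indexes)
    xsub = x[indexes]   # allocates a new, contiguous Vector{SVector{3,Float64}}
    return sumd_staticvectors(xsub)
end
print(" copy first: ");@btime sumd_copy_first($x,$indexes)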
I am very interested in any deeper discussion on these possibilities, if anyone has anything to add.