Thank you all again. Let me specify the problem a little further, with another example. The sets of atomic coordinates for which the distances have to be computed are subsets of the total set of atomic coordinates, and essentially random ones: I compute the distances between atoms that are closer to each other than a specified cutoff, and which atoms those are is essentially random given the output of a simulation. Thus, the sets of coordinates actually considered for the distance calculations are non-contiguous subsets of the total set of atoms. For this reason the column-based approach is faster than the row-based one: we get more memory locality by guaranteeing that at least the three coordinates of each atom are contiguous in memory. (In the example below I do not show the benchmark of the row-based approach, which is worse than that of the column-based one.)
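To illustrate the layout point (a minimal sketch, not part of my benchmark): Julia arrays are column-major, so in a `3×n` coordinate matrix the three coordinates of atom `i` form one contiguous column, and extracting a random subset of columns still yields one contiguous triple per atom.

```
# Column-major layout: in a 3×n matrix, the 3 coordinates of atom i
# (column i) are adjacent in memory, so x[:, idx] gathers each atom's
# coordinates as a contiguous triple even when idx is a random subset.
n = 10_000
xall = rand(3, n)       # 3×n: one column per atom
idx  = rand(1:n, 500)   # non-contiguous random subset of atom indices
x    = xall[:, idx]     # copies the subset; each column stays contiguous
# strides are in elements: rows of a column are 1 apart, columns 3 apart
strides(xall)           # (1, 3): one atom's coordinates are contiguous
```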

I have rewritten the example in a more realistic way. Both `@fastmath` and `SVector`s accelerate the calculations (thank you for the suggestions). Restructuring the code to use StaticArrays wherever possible will take some time, but I will try that.

I still wonder whether the use of BLAS within the Distances package plays any role here. I couldn't get any performance improvement from using its routines yet.
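For reference, the Distances.jl equivalent of the double loop below is a single `pairwise` call (a sketch assuming Distances.jl with the `dims` keyword, i.e. v0.8+). Note that it allocates the full distance matrix, which may be part of why it doesn't win for this reduction:

```
using Distances

# pairwise with dims=2 treats each column as one observation (one atom),
# matching the 3×n column-based layout used above.
x = rand(3, 500)
y = rand(3, 1000)
D = pairwise(Euclidean(), x, y; dims=2)  # 500×1000 matrix (allocates)
s = sum(D)                               # same quantity as the double loop
```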

The current set of tests is the following.

EDIT: Here I just copy the subsets to get contiguous sets in memory; I will post the results without copying in a minute.

```
using Test
using BenchmarkTools
using StaticArrays
# Data and size of sets
n = 10_000
xall = rand(3,n)
yall = rand(3,n)
# random subsets of 500 x points and 1000 y points will be considered
nset_x = 500
nset_y = 1000
println("----------------\n COLUMN BASED:\n ---------------- ")
x = copy(xall[:,rand(1:n,nset_x)]) # copying here to be "fair" with the next test
y = copy(yall[:,rand(1:n,nset_y)])
function sumd(x,y)
    total = 0.0
    for i in 1:size(x,2)
        for j in 1:size(y,2)
            total += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
        end
    end
    return total
end
function sumd2(x,y)
    total = 0.0
    @inbounds @fastmath for i in 1:size(x,2)
        for j in 1:size(y,2)
            total += sqrt((x[1,i]-y[1,j])^2+(x[2,i]-y[2,j])^2+(x[3,i]-y[3,j])^2)
        end
    end
    return total
end
s1 = sumd(x,y)
s2 = sumd2(x,y)
println(" simple loop: ")
@btime sumd($x,$y)
println(" inbounds, fast-math: ")
@btime sumd2($x,$y)
println("----------------\n Using StaticArrays:\n ---------------- ")
sx = [ SVector{3,Float64}(x[1,i],x[2,i],x[3,i]) for i in 1:size(x,2) ]
sy = [ SVector{3,Float64}(y[1,i],y[2,i],y[3,i]) for i in 1:size(y,2) ]
function sumd_SA(x,y)
    total = 0.0
    for i in 1:length(x)
        for j in 1:length(y)
            total += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
        end
    end
    return total
end
function sumd2_SA(x,y)
    total = 0.0
    @inbounds @fastmath for i in 1:length(x)
        for j in 1:length(y)
            total += sqrt((x[i][1]-y[j][1])^2+(x[i][2]-y[j][2])^2+(x[i][3]-y[j][3])^2)
        end
    end
    return total
end
s3 = sumd_SA(sx,sy)
s4 = sumd2_SA(sx,sy)
@test s1 ≈ s2 ≈ s3 ≈ s4
println(" simple loop: ")
@btime sumd_SA($sx,$sy)
println(" inbounds, fast-math: ")
@btime sumd2_SA($sx,$sy)
```

Results:

```
COLUMN BASED:
----------------
simple loop:
1.006 ms (0 allocations: 0 bytes)
fast-math:
751.763 μs (0 allocations: 0 bytes)
----------------
Using StaticArrays:
----------------
simple loop:
752.988 μs (0 allocations: 0 bytes)
fast-math:
551.326 μs (0 allocations: 0 bytes)
```

It seems that I might get a ~50% speedup by using these features appropriately. It is worth a try.