First of all:
energy += 1.0
end
end
end
return total /= N
I’ll assume energy
is supposed to be total
and move on.
Your second example contains:
xij = view(a, :, i)  view(a, :, j)
rij = sqrt(sum(abs2.(xij)))
You’re allocating two new vectors: xij
and abs2.(xij)
.
The second is easy to skip. Instead of sqrt(sum(abs2.(xij)))
, you can:
sqrt(sum(abs2, xij))
sqrt(xij' * xij)

norm(xij)
(you would need using LinearAlgebra
)
The first (allocation of xij
) is harder. One strategy is to just replace the views, sum, sqrt, etc with loops:
function loop_loops(a::AbstractMatrix)
K, N = size(a)
total = 0.0
for i = 1:N1
for j = (i + 1):N
rij = 0
for k in 1:K
xijk = a[k, i]  a[k, j]
rij += abs2(xijk)
end
rij = sqrt(rij)
if rij < 0.5
total += 1.0
end
end
end
return total / N
end
Another strategy would be to use StaticArray
s, and reinterpret you matrix as a vector of SVector{3,Float64}
.
But, to return the the focus of the thread, you aren’t actually changing between column and row order in your examples. In both, you are accessing them in column major order, but only 3 elements – which is not enough for SIMD.
If you replaced the views with slices that copied the data, then you would be doing that.
To truly benefit, you need more than 3 elements. The easiest way to do that is to transpose your inputs:
bt = copy(b'); # Copy is necessary, otherwise underlying data is not rearranged!
function loop_fast(a::AbstractMatrix)
N, K = size(a)
@assert K == 3
x = view(a, :, 1)
y = view(a, :, 2)
z = view(a, :, 3)
total = 0.0
for i = 1:N1
@inbounds @fastmath for j = (i + 1):N
xij = x[i]  x[j]
yij = y[i]  y[j]
zij = z[i]  z[j]
rij = sqrt(xij*xij + yij*yij + zij*zij)
if rij < 0.5
total += 1.0
end
end
end
return total / N
end
Now…
julia> @btime loop_rows($b)
21.113 μs (0 allocations: 0 bytes)
14.18
julia> @btime loop_cols($b)
397.699 μs (19800 allocations: 1.51 MiB)
14.18
julia> @btime loop_fast($bt)
12.066 μs (0 allocations: 0 bytes)
14.18
Note that the @inbounds @fastmath
only helps when the columnmajor loop is long (like the 100). When the columnmajor loop is only 3 elements, it’ll have negligible impact:
julia> function loop_rows(a::AbstractArray)
N = size(a, 2)
total = 0.0
x = view(a, 1, :)
y = view(a, 2, :)
z = view(a, 3, :)
for i = 1:N1
@inbounds @fastmath for j = (i + 1):N
xij = x[i]  x[j]
yij = y[i]  y[j]
zij = z[i]  z[j]
rij = sqrt(xij*xij + yij*yij + zij*zij)
if rij < 0.5
total += 1.0
end
end
end
return total /= N
end
loop_rows (generic function with 1 method)
julia> @btime loop_rows($b)
21.225 μs (0 allocations: 0 bytes)
14.18
Also worth pointing out that square roots are slow:
function loop_fast_noroot(a::AbstractMatrix)
N, K = size(a)
@assert K == 3
x = view(a, :, 1)
y = view(a, :, 2)
z = view(a, :, 3)
total = 0.0
for i = 1:N1
@inbounds @fastmath for j = (i + 1):N
xij = x[i]  x[j]
yij = y[i]  y[j]
zij = z[i]  z[j]
rij2 = xij*xij + yij*yij + zij*zij
if rij2 < 0.25
total += 1.0
end
end
end
return total / N
end
This yields:
julia> @btime loop_fast_noroot($bt)
3.611 μs (0 allocations: 0 bytes)
14.18