# Looping over rows and columns: what am I doing wrong?

Following some of the Performance Tips in the main Julia documentation I was a trying the following code, timing it with BenchmarkTools.jl:

``````using Random
using BenchmarkTools

b = rand(3, 100)

function loop_rows(a::AbstractArray)
N = size(a, 2)
total = 0.0

x = view(a, 1, :)
y = view(a, 2, :)
z = view(a, 3, :)

for i = 1:N-1
for j = (i + 1):N
xij = x[i] - x[j]
yij = y[i] - y[j]
zij = z[i] - z[j]
rij = hypot(xij, yij, zij)
if rij < 0.5
total += 1.0
end
end
end
end

function loop_cols(a::AbstractArray)
N = size(a, 2)
total = 0.0

for i = 1:N-1
for j = (i + 1):N
xij = view(a, :, i) - view(a, :, j)
rij = sqrt(sum(abs2.(xij)))
if rij < 0.5
total += 1.0
end
end
end
end

@btime loop_rows(\$b)
# 14.265 μs (1 allocation: 16 bytes)
@btime loop_cols(\$b)
# 462.785 μs (19801 allocations: 1.51 MiB)

``````

As I understand the array layout in Julia, the second function should perform faster and be more efficient as I am looping over the columns of the array, whereas the first function is using views of the rows from the array.

So, two questions:

• Why is the first one faster even though I’m using the rows?
• Why does the second one has a lot more allocations even though I’m using views?

First of all:

``````                energy += 1.0
end
end
end
``````

I’ll assume `energy` is supposed to be `total` and move on.

``````    xij = view(a, :, i) - view(a, :, j)
rij = sqrt(sum(abs2.(xij)))
``````

You’re allocating two new vectors: `xij` and `abs2.(xij)`.
The second is easy to skip. Instead of `sqrt(sum(abs2.(xij)))`, you can:

• `sqrt(sum(abs2, xij))`
• `sqrt(xij' * xij)`
• `norm(xij)` (you would need `using LinearAlgebra`)

The first (allocation of `xij`) is harder. One strategy is to just replace the views, sum, sqrt, etc with loops:

``````function loop_loops(a::AbstractMatrix)
K, N = size(a)
total = 0.0

for i = 1:N-1
for j = (i + 1):N
rij = 0
for k in 1:K
xijk = a[k, i] - a[k, j]
rij += abs2(xijk)
end
rij = sqrt(rij)
if rij < 0.5
total += 1.0
end
end
end
end
``````

Another strategy would be to use `StaticArray`s, and reinterpret you matrix as a vector of `SVector{3,Float64}`.

But, to return the the focus of the thread, you aren’t actually changing between column and row order in your examples. In both, you are accessing them in column major order, but only 3 elements – which is not enough for SIMD.
If you replaced the views with slices that copied the data, then you would be doing that.

To truly benefit, you need more than 3 elements. The easiest way to do that is to transpose your inputs:

``````bt = copy(b'); # Copy is necessary, otherwise underlying data is not rearranged!
function loop_fast(a::AbstractMatrix)
N, K = size(a)
@assert K == 3
x = view(a, :, 1)
y = view(a, :, 2)
z = view(a, :, 3)
total = 0.0
for i = 1:N-1
@inbounds @fastmath for j = (i + 1):N
xij = x[i] - x[j]
yij = y[i] - y[j]
zij = z[i] - z[j]
rij = sqrt(xij*xij + yij*yij + zij*zij)
if rij < 0.5
total += 1.0
end
end
end
end
``````

Now…

``````julia> @btime loop_rows(\$b)
21.113 μs (0 allocations: 0 bytes)
14.18

julia> @btime loop_cols(\$b)
397.699 μs (19800 allocations: 1.51 MiB)
14.18

julia> @btime loop_fast(\$bt)
12.066 μs (0 allocations: 0 bytes)
14.18
``````

Note that the `@inbounds @fastmath` only helps when the column-major loop is long (like the 100). When the column-major loop is only 3 elements, it’ll have negligible impact:

``````julia> function loop_rows(a::AbstractArray)
N = size(a, 2)
total = 0.0

x = view(a, 1, :)
y = view(a, 2, :)
z = view(a, 3, :)

for i = 1:N-1
@inbounds @fastmath for j = (i + 1):N
xij = x[i] - x[j]
yij = y[i] - y[j]
zij = z[i] - z[j]
rij = sqrt(xij*xij + yij*yij + zij*zij)
if rij < 0.5
total += 1.0
end
end
end
end
loop_rows (generic function with 1 method)

julia> @btime loop_rows(\$b)
21.225 μs (0 allocations: 0 bytes)
14.18
``````

Also worth pointing out that square roots are slow:

``````function loop_fast_noroot(a::AbstractMatrix)
N, K = size(a)
@assert K == 3
x = view(a, :, 1)
y = view(a, :, 2)
z = view(a, :, 3)
total = 0.0
for i = 1:N-1
@inbounds @fastmath for j = (i + 1):N
xij = x[i] - x[j]
yij = y[i] - y[j]
zij = z[i] - z[j]
rij2 = xij*xij + yij*yij + zij*zij
if rij2 < 0.25
total += 1.0
end
end
end
end
``````

This yields:

``````julia> @btime loop_fast_noroot(\$bt)
3.611 μs (0 allocations: 0 bytes)
14.18
``````
7 Likes

Hello! Thank you very much for the explanation, it is very enlightening now that I understand the difference between one and the other examples.

One last question though. In this example

``````function loop_fast_noroot(a::AbstractMatrix)
N, K = size(a)
@assert K == 3
x = view(a, :, 1)
y = view(a, :, 2)
z = view(a, :, 3)
total = 0.0
for i = 1:N-1
@inbounds @fastmath for j = (i + 1):N
xij = x[i] - x[j]
yij = y[i] - y[j]
zij = z[i] - z[j]
rij2 = xij*xij + yij*yij + zij*zij
if rij2 < 0.25
total += 1.0
end
end
end
you are allocating `x`, `y` and `z`, correct? So why is it that when doing the benchmark it says `0 allocations`?