Looping over rows and columns: what am I doing wrong?

Following some of the Performance Tips in the main Julia documentation I was a trying the following code, timing it with BenchmarkTools.jl:

using Random
using BenchmarkTools

b = rand(3, 100)

function loop_rows(a::AbstractArray)
    N = size(a, 2)
    total = 0.0

    x = view(a, 1, :)
    y = view(a, 2, :)
    z = view(a, 3, :)

    for i = 1:N-1
        for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij = hypot(xij, yij, zij)
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total /= N
end

function loop_cols(a::AbstractArray)
    N = size(a, 2)
    total = 0.0

    for i = 1:N-1
        for j = (i + 1):N
            xij = view(a, :, i) - view(a, :, j)
            rij = sqrt(sum(abs2.(xij)))
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total /= N
end

@btime loop_rows($b)
# 14.265 μs (1 allocation: 16 bytes)
@btime loop_cols($b)
# 462.785 μs (19801 allocations: 1.51 MiB)

As I understand the array layout in Julia, the second function should perform faster and be more efficient as I am looping over the columns of the array, whereas the first function is using views of the rows from the array.

So, two questions:

  • Why is the first one faster even though I’m using the rows?
  • Why does the second one has a lot more allocations even though I’m using views?

First of all:

                energy += 1.0
            end
        end
    end
    return total /= N

I’ll assume energy is supposed to be total and move on.

Your second example contains:

    xij = view(a, :, i) - view(a, :, j)
    rij = sqrt(sum(abs2.(xij)))

You’re allocating two new vectors: xij and abs2.(xij).
The second is easy to skip. Instead of sqrt(sum(abs2.(xij))), you can:

  • sqrt(sum(abs2, xij))
  • sqrt(xij' * xij)
  • norm(xij) (you would need using LinearAlgebra)

The first (allocation of xij) is harder. One strategy is to just replace the views, sum, sqrt, etc with loops:

function loop_loops(a::AbstractMatrix)
    K, N = size(a)
    total = 0.0

    for i = 1:N-1
        for j = (i + 1):N
            rij = 0
            for k in 1:K
                xijk = a[k, i] - a[k, j]
                rij += abs2(xijk)
            end
            rij = sqrt(rij)
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total / N
end

Another strategy would be to use StaticArrays, and reinterpret you matrix as a vector of SVector{3,Float64}.

But, to return the the focus of the thread, you aren’t actually changing between column and row order in your examples. In both, you are accessing them in column major order, but only 3 elements – which is not enough for SIMD.
If you replaced the views with slices that copied the data, then you would be doing that.

To truly benefit, you need more than 3 elements. The easiest way to do that is to transpose your inputs:

bt = copy(b'); # Copy is necessary, otherwise underlying data is not rearranged!
function loop_fast(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij = sqrt(xij*xij + yij*yij + zij*zij)
            if rij < 0.5
                 total += 1.0
            end
        end
    end
    return total / N
end

Now…

julia> @btime loop_rows($b)
  21.113 μs (0 allocations: 0 bytes)
14.18

julia> @btime loop_cols($b)
  397.699 μs (19800 allocations: 1.51 MiB)
14.18

julia> @btime loop_fast($bt)
  12.066 μs (0 allocations: 0 bytes)
14.18

Note that the @inbounds @fastmath only helps when the column-major loop is long (like the 100). When the column-major loop is only 3 elements, it’ll have negligible impact:

julia> function loop_rows(a::AbstractArray)
           N = size(a, 2)
           total = 0.0

           x = view(a, 1, :)
           y = view(a, 2, :)
           z = view(a, 3, :)

           for i = 1:N-1
               @inbounds @fastmath for j = (i + 1):N
                   xij = x[i] - x[j]
                   yij = y[i] - y[j]
                   zij = z[i] - z[j]
                   rij = sqrt(xij*xij + yij*yij + zij*zij)
                   if rij < 0.5
                       total += 1.0
                   end
               end
           end
           return total /= N
       end
loop_rows (generic function with 1 method)

julia> @btime loop_rows($b)
  21.225 μs (0 allocations: 0 bytes)
14.18

Also worth pointing out that square roots are slow:

function loop_fast_noroot(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij2 = xij*xij + yij*yij + zij*zij
            if rij2 < 0.25
                 total += 1.0
            end
        end
    end
    return total / N
end

This yields:

julia> @btime loop_fast_noroot($bt)
  3.611 μs (0 allocations: 0 bytes)
14.18
7 Likes

Hello! Thank you very much for the explanation, it is very enlightening now that I understand the difference between one and the other examples.

One last question though. In this example

function loop_fast_noroot(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij2 = xij*xij + yij*yij + zij*zij
            if rij2 < 0.25
                 total += 1.0
            end
        end
    end
    return total / N
end

you are allocating x, y and z, correct? So why is it that when doing the benchmark it says 0 allocations?

IIUC views are non-allocating if the compiler knows that they are only used within that function and not returned. So this would fall under that category.

2 Likes

Oh I see, thanks a lot for pointing that out, I have a lot left to learn.