Looping over rows and columns: what am I doing wrong?

edwinb-ai · February 17, 2020, 3:26pm

Following some of the Performance Tips in the main Julia documentation I was a trying the following code, timing it with BenchmarkTools.jl:

using Random
using BenchmarkTools

b = rand(3, 100)

function loop_rows(a::AbstractArray)
    N = size(a, 2)
    total = 0.0

    x = view(a, 1, :)
    y = view(a, 2, :)
    z = view(a, 3, :)

    for i = 1:N-1
        for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij = hypot(xij, yij, zij)
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total /= N
end

function loop_cols(a::AbstractArray)
    N = size(a, 2)
    total = 0.0

    for i = 1:N-1
        for j = (i + 1):N
            xij = view(a, :, i) - view(a, :, j)
            rij = sqrt(sum(abs2.(xij)))
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total /= N
end

@btime loop_rows($b)
# 14.265 μs (1 allocation: 16 bytes)
@btime loop_cols($b)
# 462.785 μs (19801 allocations: 1.51 MiB)

As I understand the array layout in Julia, the second function should perform faster and be more efficient as I am looping over the columns of the array, whereas the first function is using views of the rows from the array.

So, two questions:

Why is the first one faster even though I’m using the rows?
Why does the second one has a lot more allocations even though I’m using views?

Elrod · February 17, 2020, 4:31pm

First of all:

                energy += 1.0
            end
        end
    end
    return total /= N

I’ll assume energy is supposed to be total and move on.

Your second example contains:

    xij = view(a, :, i) - view(a, :, j)
    rij = sqrt(sum(abs2.(xij)))

You’re allocating two new vectors: xij and abs2.(xij).
The second is easy to skip. Instead of sqrt(sum(abs2.(xij))), you can:

sqrt(sum(abs2, xij))
sqrt(xij' * xij)
norm(xij) (you would need using LinearAlgebra)

The first (allocation of xij) is harder. One strategy is to just replace the views, sum, sqrt, etc with loops:

function loop_loops(a::AbstractMatrix)
    K, N = size(a)
    total = 0.0

    for i = 1:N-1
        for j = (i + 1):N
            rij = 0
            for k in 1:K
                xijk = a[k, i] - a[k, j]
                rij += abs2(xijk)
            end
            rij = sqrt(rij)
            if rij < 0.5
                total += 1.0
            end
        end
    end
    return total / N
end

Another strategy would be to use StaticArrays, and reinterpret you matrix as a vector of SVector{3,Float64}.

But, to return the the focus of the thread, you aren’t actually changing between column and row order in your examples. In both, you are accessing them in column major order, but only 3 elements – which is not enough for SIMD.
If you replaced the views with slices that copied the data, then you would be doing that.

To truly benefit, you need more than 3 elements. The easiest way to do that is to transpose your inputs:

bt = copy(b'); # Copy is necessary, otherwise underlying data is not rearranged!
function loop_fast(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij = sqrt(xij*xij + yij*yij + zij*zij)
            if rij < 0.5
                 total += 1.0
            end
        end
    end
    return total / N
end

Now…

julia> @btime loop_rows($b)
  21.113 μs (0 allocations: 0 bytes)
14.18

julia> @btime loop_cols($b)
  397.699 μs (19800 allocations: 1.51 MiB)
14.18

julia> @btime loop_fast($bt)
  12.066 μs (0 allocations: 0 bytes)
14.18

Note that the @inbounds @fastmath only helps when the column-major loop is long (like the 100). When the column-major loop is only 3 elements, it’ll have negligible impact:

julia> function loop_rows(a::AbstractArray)
           N = size(a, 2)
           total = 0.0

           x = view(a, 1, :)
           y = view(a, 2, :)
           z = view(a, 3, :)

           for i = 1:N-1
               @inbounds @fastmath for j = (i + 1):N
                   xij = x[i] - x[j]
                   yij = y[i] - y[j]
                   zij = z[i] - z[j]
                   rij = sqrt(xij*xij + yij*yij + zij*zij)
                   if rij < 0.5
                       total += 1.0
                   end
               end
           end
           return total /= N
       end
loop_rows (generic function with 1 method)

julia> @btime loop_rows($b)
  21.225 μs (0 allocations: 0 bytes)
14.18

Also worth pointing out that square roots are slow:

function loop_fast_noroot(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij2 = xij*xij + yij*yij + zij*zij
            if rij2 < 0.25
                 total += 1.0
            end
        end
    end
    return total / N
end

This yields:

julia> @btime loop_fast_noroot($bt)
  3.611 μs (0 allocations: 0 bytes)
14.18

edwinb-ai · February 17, 2020, 5:07pm

Hello! Thank you very much for the explanation, it is very enlightening now that I understand the difference between one and the other examples.

One last question though. In this example

function loop_fast_noroot(a::AbstractMatrix)
    N, K = size(a)
    @assert K == 3
    x = view(a, :, 1)
    y = view(a, :, 2)
    z = view(a, :, 3)
    total = 0.0
    for i = 1:N-1
        @inbounds @fastmath for j = (i + 1):N
            xij = x[i] - x[j]
            yij = y[i] - y[j]
            zij = z[i] - z[j]
            rij2 = xij*xij + yij*yij + zij*zij
            if rij2 < 0.25
                 total += 1.0
            end
        end
    end
    return total / N
end

you are allocating x, y and z, correct? So why is it that when doing the benchmark it says 0 allocations?

tbeason · February 17, 2020, 6:18pm

IIUC views are non-allocating if the compiler knows that they are only used within that function and not returned. So this would fall under that category.

edwinb-ai · February 17, 2020, 8:42pm

Oh I see, thanks a lot for pointing that out, I have a lot left to learn.

Topic		Replies	Views
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1005	June 21, 2021
Puzzled by the difference in performance when summing an array in different ways Performance array , memory-allocation	4	876	July 19, 2019
Array views becoming dominant source of memory allocation Performance	31	5171	November 14, 2018
No significant performance difference of row- vs column-wise loops New to Julia memory-allocation , loops	2	1509	July 15, 2019
Why is there memory allocation and time difference in @views and view? Performance memory-allocation	9	485	March 29, 2021

Looping over rows and columns: what am I doing wrong?

Related topics