Column-wise calculations in arrays are slower

I just read the excellent article on improving the performance of numeric calculations in loops on the Julia blog, fast-numeric.

In the article we can read:

Julia arrays are stored in column-major order, which means that the rows of a column are contiguous, but the columns of a row are generally not. It is therefore generally more efficient to access data column-by-column than row-by-row.
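To see that layout concretely, here is a small sketch (using only Base functions) showing that `vec` and `strides` reflect column-major storage:

```julia
# Column-major layout: vec(a) walks down each column in turn.
a = [1 2 3;
     4 5 6]        # 2×3 matrix

m = vec(a)         # elements in memory order: [1, 4, 2, 5, 3, 6]

# strides gives the step (in elements) between consecutive
# row and column indices: moving down a column is a step of 1,
# moving along a row jumps a whole column (2 elements here).
s = strides(a)     # (1, 2)
```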

The stated principle seems sound to me; however, my benchmarks don't confirm it.

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;
# Natural implementation (not column-wise)
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i,j]
        end
        r[i] = s
    end
end
# Column-wise implementation
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        r[i] = a[i,1]
    end

    for j = 2:n, i = 1:m
        r[i] += a[i,j]
    end
end

I get the following results:
(screenshot of @btime results)

The conclusion is:
Using column-wise calculation loops is slower (about 40% in this case) and uses more allocations (25% more in this case).

Therefore, column-wise calculations should be discouraged.

I am surprised by my own conclusion, am I doing anything wrong?

You should avoid untyped global variables and enclose your code blocks in functions in order to get meaningful timings. See Performance Tips · The Julia Language and Manual · BenchmarkTools.jl
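As a sketch of what that advice means in practice (the names `b` and `colsum` are illustrative, not from the thread): an untyped global forces dynamic dispatch on every access, and the two usual remedies are a `const` global or a function argument.

```julia
# Untyped global: the compiler cannot infer the type of `a`
# inside a benchmarked block, so every access is boxed.
a = rand(1_000, 1_000)

# Remedy 1: a `const` global has a fixed type the compiler can use.
const b = rand(1_000, 1_000)

# Remedy 2 (preferred): pass the data as a function argument,
# so the function is compiled for the concrete array type.
function colsum(a)
    m, n = size(a)
    r = zeros(n)
    for j in 1:n, i in 1:m
        r[j] += a[i, j]
    end
    return r
end
```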

Your methods are dissimilar. If I make them symmetrical:

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
end
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s = 0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
end

I get:

  1.178 ms (2 allocations: 7.63 MiB)
  89.509 ms (3980985 allocations: 76.04 MiB)
  76.960 ms (3980985 allocations: 76.04 MiB)

And putting things in functions as @John_Gibson suggests:

using BenchmarkTools

function docolumns(a)
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s = 0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
    return nothing
end
function dorows(a)
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
    return nothing
end

a = rand(1_000, 1_000)

@btime dorows($a)
@btime docolumns($a)

gives:

  990.200 μs (1 allocation: 7.94 KiB)
  922.700 μs (1 allocation: 7.94 KiB)

(Have I done this right?)

Edit: Changed the function calls to use $a instead of a and to update timings accordingly.

When you say that I should avoid “untyped global variables”, are you referring to “a”?
Concerning the enclosing of code blocks in functions, you are perfectly right.

Your code block is very clear, and so is the performance improvement.

Yes. The typical benchmark structure is:

using BenchmarkTools

a = randn(8,8)

function foo(a)
    # do something
end

@benchmark foo($a)
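As an illustration of why the `$` matters (the variable `x` here is just an example, not from the thread): without interpolation, BenchmarkTools treats the name as a global and pays the dynamic-lookup cost inside the timed loop.

```julia
using BenchmarkTools

x = rand(1000)

# Without `$`, `x` is resolved as an untyped global on every
# evaluation, so the lookup overhead pollutes the measurement.
@btime sum(x);

# With `$`, the value is interpolated into the benchmark
# expression, so only `sum` itself is timed.
@btime sum($x);
```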

Finally,

a = rand(10_000, 10_000)

gives

  987.501 ms (2 allocations: 78.17 KiB)
  105.369 ms (2 allocations: 78.17 KiB)

for rows and columns respectively.
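As a cross-check (this sketch is my own addition, not from the thread), Base's `sum` with `dims` computes the same row and column sums; on a large matrix the column-wise reduction should again be the faster of the two.

```julia
using BenchmarkTools

a = rand(1_000, 1_000)

# Row sums (reduce across each row) and column sums (reduce down
# each column); both are implemented with column-major traversal.
rowsums = sum(a, dims=2)   # 1000×1 matrix
colsums = sum(a, dims=1)   # 1×1000 matrix

@btime sum($a, dims=2);
@btime sum($a, dims=1);
```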
