Column-wise calculations in arrays are slower

vsoler · December 19, 2023, 12:38pm

I just read the excellent article on improving calculation performance in loops on the Julia blog fast-numeric.

In the article we can read …

Julia arrays are stored in column-major order, which means that the rows of a column are contiguous, but the columns of a row are generally not. It is therefore generally more efficient to access data column-by-column than row-by-row.

The stated principle seems correct to me. However, benchmarking it doesn’t confirm it.

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;

# Natural implementation (not column-wise)
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.
        for j = 1:n
            s += a[i,j]
        end
        r[i] = s
    end   
end

# Column-wise implementation
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        r[i] = a[i,1]
    end

    for j = 2:n, i = 1:m
        r[i] += a[i,j]
    end
end

I get the following results:

The conclusion is:
Using column-wise calculation loops is slower (some 40% in this case) and uses more allocations (25% in this case).

Therefore, column-wise calculations shouls be discouraged.

I am surprised by my own conclusion, am I doing anything wrong?

John_Gibson · December 19, 2023, 12:58pm

You should avoid untyped global variables and enclose your code blocks in functions in order to get meaningful timings. See Performance Tips · The Julia Language and Manual · BenchmarkTools.jl

TimG · December 19, 2023, 1:04pm

Your methods are dissimilar. If I make them symmetrical:

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
end
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s=0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
end

I get:

1.178 ms (2 allocations: 7.63 MiB)
  89.509 ms (3980985 allocations: 76.04 MiB)
  76.960 ms (3980985 allocations: 76.04 MiB)

TimG · December 19, 2023, 1:14pm

And putting things in functions as @John_Gibson suggests:

using BenchmarkTools

function docolumns(a)
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s = 0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
    return nothing
end
function dorows(a)
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
    return nothing
end

a = rand(1_000, 1_000)

@btime dorows($a)
@btime docolumns($a)

gives:

  990.200 μs (1 allocation: 7.94 KiB)
  922.700 μs (1 allocation: 7.94 KiB)

(Have I done this right?)

Edit: Changed the function calls to use $a instead of a and to update timings accordingly.

vsoler · December 19, 2023, 1:46pm

When you say that I should avoid “untyped global variables”, are you referring to “a” ?
Concerning the enclosing of code blocks in functions, you are perfectly right

vsoler · December 19, 2023, 1:48pm

Your code block is very clear, and the improvement of performance as well.

John_Gibson · December 19, 2023, 1:50pm

Yes. Typical benchmark structure is

using BenchmarkTools

a = randn(8,8)

function foo(a)
    # do something
end

@benchmark foo($a)

TimG · December 19, 2023, 2:12pm

Finally,

a = rand(10_000, 10_000)

gives

  987.501 ms (2 allocations: 78.17 KiB)
  105.369 ms (2 allocations: 78.17 KiB)

for rows and columns respectively.

Topic		Replies	Views
No significant performance difference of row- vs column-wise loops New to Julia memory-allocation , loops	2	1509	July 15, 2019
The columns and the rows - puzzled about performance Performance question	8	910	October 14, 2021
Looping over rows and columns: what am I doing wrong? Performance	4	1115	February 17, 2020
Performance difference when accessing square matrix rows-first or cols-first Performance	14	1733	April 13, 2021
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1005	June 21, 2021

Column-wise calculations in arrays are slower

Related topics