Column-wise calculations in arrays are slower

I just read the excellent article on improving calculation performance in loops on the Julia blog fast-numeric.

In the article we can read …

Julia arrays are stored in column-major order, which means that the rows of a column are contiguous, but the columns of a row are generally not. It is therefore generally more efficient to access data column-by-column than row-by-row.

The stated principle seems correct to me. However, benchmarking it doesn’t confirm it.

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;
# Natural implementation (not column-wise)
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.
        for j = 1:n
            s += a[i,j]
        end
        r[i] = s
    end   
end
# Column-wise implementation
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        r[i] = a[i,1]
    end

    for j = 2:n, i = 1:m
        r[i] += a[i,j]
    end
end

I get the following results:

The conclusion is:
Using column-wise calculation loops is slower (some 40% in this case) and uses more allocations (25% in this case).

Therefore, column-wise calculations shouls be discouraged.

I am surprised by my own conclusion, am I doing anything wrong?

You should avoid untyped global variables and enclose your code blocks in functions in order to get meaningful timings. See Performance Tips · The Julia Language and Manual · BenchmarkTools.jl

Your methods are dissimilar. If I make them symmetrical:

using BenchmarkTools

@btime begin
    global a
    a = rand(1_000, 1_000);
end;
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
end
@btime begin
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s=0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
end

I get:

1.178 ms (2 allocations: 7.63 MiB)
  89.509 ms (3980985 allocations: 76.04 MiB)
  76.960 ms (3980985 allocations: 76.04 MiB)

And putting things in functions as @John_Gibson suggests:

using BenchmarkTools

function docolumns(a)
    m, n = size(a)
    r = Vector{Float64}(undef, n)
    for j = 1:n
        s = 0.0
        for i = 1:m
            s += a[i, j]
        end
        r[j] = s
    end
    return nothing
end
function dorows(a)
    m, n = size(a)
    r = Vector{Float64}(undef, m)
    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i, j]
        end
        r[i] = s
    end
    return nothing
end

a = rand(1_000, 1_000)

@btime dorows($a)
@btime docolumns($a)

gives:

  990.200 μs (1 allocation: 7.94 KiB)
  922.700 μs (1 allocation: 7.94 KiB)

(Have I done this right?)

Edit: Changed the function calls to use $a instead of a and to update timings accordingly.

When you say that I should avoid “untyped global variables”, are you referring to “a” ?
Concerning the enclosing of code blocks in functions, you are perfectly right

Your code block is very clear, and the improvement of performance as well.

Yes. Typical benchmark structure is

using BenchmarkTools

a = randn(8,8)

function foo(a)
    # do something
end

@benchmark foo($a)

Finally,

a = rand(10_000, 10_000)

gives

  987.501 ms (2 allocations: 78.17 KiB)
  105.369 ms (2 allocations: 78.17 KiB)

for rows and columns respectively.

Your 1000x1000 array uses 8MB memory and easily fits in in the L2 cache of most modern CPUs. So there shouldn’t be much difference between row-wise and column-wise algorithms. It might be more meaningful to try 20_000 x 20_000.

The OP’s code sums the rows using a row-wise and column-wise iteration pattern to see their performance difference, but your code computes a different function for each iteration pattern, which I think makes the measurements less comparable.

The IMO correct code would be:

function sumrows_rowwise(a)
    m, n = size(a)
    r = Vector{Float64}(undef, m)

    for i = 1:m
        s = 0.0
        for j = 1:n
            s += a[i,j]
        end
        r[i] = s
    end

    return nothing
end

function sumrows_colwise(a)
    m, n = size(a)
    r = Vector{Float64}(undef, m)

    for i = 1:m
        r[i] = a[i, 1]
    end

    for j = 2:n, i = 1:m
        r[i] += a[i, j]
    end

    return nothing
end

The benchmark would then simply be:

using BenchmarkTools

# ... the functions ...

a = rand(10_000, 10_000)

@btime sumrows_rowwise($a)
@btime sumrows_colwise($a)

Which on my laptop produces:

  372.644 ms (3 allocations: 78.20 KiB)
   29.115 ms (3 allocations: 78.20 KiB)

I was unable to replicate getting only 2 allocation for 10_000 by 10_000, but that might just be due to the different Julia versions being used.

This does not seem right to me. @TimG’s implementation seems more correct, in terms of making the two implementations equivalent.

Furthermore,

r[i] += a[i, j]

is likely to be slower than

s += a[i,j]

since the former requires 2 array reads and one write per iteration, while the latter requires a single lookup only, no writes. Quite pissibly, that also means three bounds checks per iteration, as well as the cost of lookup itself.

Are you trolling me? Cause I’m honestly having a hard time believing you wrote that in good faith.

for j = 1:n
    s = 0.0
    for i = 1:m
        s += a[i, j]
    end
    r[j] = s
end

and

for i = 1:m
    s = 0.0
    for j = 1:n
        s += a[i, j]
    end
    r[i] = s
end

don’t even compute the same thing, so I have no idea how someone would consider that to be “more correct”.

Besides that, sumrows_colwise() is faster on my machine than docolumns(), which I found very surprising. I also annotated all expressions that index with @inbounds and the delta was quite small on a 10_000 by 10_000 array, which shouldn’t be surprising, since bounds checking is very cheep when compared to memory access, especially after the branch predictor has warmed up on your loop. The delta is still there though.

They are equivalent if you transpose your storage scheme.

(One difficulty with the comparison in this thread is that the problem has been formulated in a way that is intrinsically asymmetrical between rows and columns, making the benchmark somewhat biased if you aren’t allowed to change your storage. But in most real applications you do get to choose your storage scheme.)

It all depends on precisely what effect you want to measure, and reasonable people can disagree about what comparison is most interesting here.

The results would be the same, yes, though I’d still argue the functions don’t compute the same thing, but if you turn your row into columns and visa versa then it does indeed achieves the same thing.

I wouldn’t really call it biased and needlessly transposing your matrix just to take the row sums along the columns seems to be very wasteful, at least, for the matrix sizes I tend to work with. It wouldn’t be too bad, but it would be visible in the energy bill, which makes very few people happy… well the energy providers I guess :sweat_smile:

(This text seemed a bit to confrontational, so I edited it out, sorry.)

But back to the actual topic of the thread.

What I find very interesting to observe is that sumrows_colwise() is faster than docolumns() even on a powerful node, would you have any idea why this is?

(You wouldn’t transpose your matrix, you would change how the data is stored from the beginning, depending on how it is to be used. Of course, you might have a situation in which the matrix is used for various things, some of which favor one format and some of which favor the other, and then it’s less obvious which to choose.)

Of course, I was mostly talking about the general case where no single layout is ideal for all situations.

(Assuming you can call matrices that don’t fit on a single node with 256 GiB normal/general.)

Accusing other posters of trolling is completely unacceptable.