Performance difference when accessing square matrix rows-first or cols-first

I’m trying to show a few Julia basics to friends, especially performance results, so I’m creating code snippets to illustrate them.

In this example, I have two functions that create a matrix of zeros and then set each element to 1, one element at a time. One function traverses rows-first (matrix_0_1_rows) and the other goes columns-first (matrix_0_1_cols).

I know Julia prefers handling arrays column by column (Performance Tips · The Julia Language), but in this situation I expected both versions to have the same runtime, since I’m iterating over the same number of rows and columns (square matrices).

That said, my output is as follows, where the left number is n for the square (n-by-n) matrix and the rest is the @btime output. The source code is at the bottom of this post.

Questions: why aren’t the times the same? Why do the results change so drastically at the n = 1000 mark?
Thanks in advance :slight_smile:
Extra info: Intel CPU, Julia v1.6 on Linux

Rows:
1 33.365 ns (1 allocation: 96 bytes)
10 117.317 ns (1 allocation: 896 bytes)
20 679.895 ns (1 allocation: 3.25 KiB)
50 3.413 μs (2 allocations: 19.64 KiB)
100 14.185 μs (2 allocations: 78.20 KiB)
1000 3.147 ms (2 allocations: 7.63 MiB)
10000 1.301 s (2 allocations: 762.94 MiB)
Cols:
1 33.798 ns (1 allocation: 96 bytes)
10 137.325 ns (1 allocation: 896 bytes)
20 741.205 ns (1 allocation: 3.25 KiB)
50 3.325 μs (2 allocations: 19.64 KiB)
100 15.582 μs (2 allocations: 78.20 KiB)
1000 1.702 ms (2 allocations: 7.63 MiB)
10000 376.680 ms (2 allocations: 762.94 MiB)

using BenchmarkTools

function matrix_0_1_rows(n::Int)
    m = zeros(n,n)

    for i in 1:n
        for j in 1:n
            m[i,j] = 1
        end
    end
end

function matrix_0_1_cols(n::Int)
    m = zeros(n,n)

    for j in 1:n
        for i in 1:n
            m[i,j] = 1
        end
    end

end

const sizes = [1, 10, 20, 50, 100, 1000, 10000]
println("Rows: ")
for n in sizes
    print("$n\t\t")
    @btime matrix_0_1_rows($n)
end
println("Cols: ")
for n in sizes
    print("$n\t\t")
    @btime matrix_0_1_cols($n)
end

It doesn’t matter whether the matrices are square. If you loop by columns then you are accessing memory consecutively (due to Julia’s column-major order), whereas if you loop by rows then you are accessing memory in non-consecutive order. Computer memory is faster for consecutive access because of cache lines.
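
A minimal check of the layout at the REPL (not in the original reply): strides reports how many elements apart neighboring indices are in memory along each dimension.

julia> M = zeros(4, 4);

julia> strides(M)  # stepping i by 1 moves 1 element in memory; stepping j by 1 jumps 4 elements
(1, 4)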

4 Likes

Your explanation doesn’t account for the data (and code) I provided. When I test row versus column access elsewhere, it’s obvious that accessing by column is faster, with the times at least an order of magnitude apart.

In this case, they’re similar up to 100 x 100; at 1000 x 1000, accessing column by column takes roughly half the time of row by row (and the gap widens further in the 10000 x 10000 case).

How can the previous results be explained, then?

My explanation (which is the same as the one in the Julia manual) is perfectly consistent with this – accessing by column is faster because it corresponds to contiguous memory access, and the difference becomes more important as the matrix gets larger and memory bandwidth dominates the runtime.

The timing comparison is even clearer if you pre-allocate the matrix, so that you aren’t timing the allocation itself. When you do that, you’ll see that accessing by column is much faster even for relatively small matrices (unless the matrix is so tiny that memory bandwidth ceases to be an issue). For example, I get:

m = 1
  7.042 ns (0 allocations: 0 bytes)
  7.023 ns (0 allocations: 0 bytes)
m = 10
  56.272 ns (0 allocations: 0 bytes)
  53.523 ns (0 allocations: 0 bytes)
m = 20
  187.415 ns (0 allocations: 0 bytes)
  69.617 ns (0 allocations: 0 bytes)
m = 50
  837.184 ns (0 allocations: 0 bytes)
  289.270 ns (0 allocations: 0 bytes)
m = 100
  8.792 μs (0 allocations: 0 bytes)
  1.211 μs (0 allocations: 0 bytes)
m = 1000
  1.443 ms (0 allocations: 0 bytes)
  256.103 μs (0 allocations: 0 bytes)
m = 10000
  931.414 ms (0 allocations: 0 bytes)
  62.150 ms (0 allocations: 0 bytes)

from

function matrix_0_1_rows(M)
    m, n = size(M)
    for i in 1:m
        for j in 1:n
            M[i,j] = 1
        end
    end
end

function matrix_0_1_cols(M)
    m, n = size(M)
    for j in 1:n
        for i in 1:m
            M[i,j] = 1
        end
    end
end

for m in [1, 10, 20, 50, 100, 1000, 10000]
    @show m
    M = zeros(m,m)
    @btime matrix_0_1_rows($M)
    @btime matrix_0_1_cols($M)
end
1 Like

In particular, realize that the matrix

julia> M =  [0  4   8  12
             1  5   9  13
             2  6  10  14
             3  7  11  15]

is stored in memory as a “1d” sequence of values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] because of Julia’s column-major order. So when you loop over it by rows you are actually jumping discontinuously in memory (from 0 to 4 to 8 to 12, etc.).
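
A quick way to confirm that, using the same M (not part of the original reply):

julia> vec(M) == 0:15   # the underlying storage really is 0, 1, 2, ..., 15
true

julia> Tuple(LinearIndices(M)[1, :])   # the first row sits at storage positions 1, 5, 9, 13
(1, 5, 9, 13)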

3 Likes

Your own benchmarks show that in the n = 100 case, for example, column access is roughly 7 times faster. From what I understand, in my version the bulk of the time at that size was the matrix allocation, which is why the row vs. column difference only became apparent at higher dimensions.

In your situation (pre-allocated matrix) the difference is much more evident.

Thanks a bunch for clarifying and taking your time!

It’s also informative to plot the timings (for my version which pre-allocates the matrix) as a function of the matrix size m. You can see that

  1. The time per element at first goes down with increasing m, as things like the function-call overhead become less important, and then eventually goes up with increasing m as the memory bandwidth becomes more important (as less and less of the matrix fits into the caches).

  2. The per-row and per-column access times match for very small sizes (basically where a whole column of the matrix fits in a cache line?).

  3. Above a threshold size, suddenly the by-column access becomes faster, and stays faster (by about a factor of 8, which presumably comes from the cache-line length?) thereafter.

I’m actually a bit puzzled by the latter effect — I would have expected the by-row access to get slower, not the by-column access to get faster, once the matrix size gets big enough.

Code:

m = unique!(round.(Int, exp10.(range(0,log10(3000),length=50))))
t_row = Float64[]
t_col = Float64[]
for m in m
    @show m
    M = zeros(m,m)
    push!(t_row, @belapsed(matrix_0_1_rows($M)) / m^2)
    push!(t_col, @belapsed(matrix_0_1_cols($M)) / m^2)
end

using PyPlot
title("matrix access by rows/cols")
loglog(m, t_row / 1e-9, "r.-")
loglog(m, t_col / 1e-9, "b*-")
xlabel(L"matrix size $m$")
ylabel(L"time (ns) / $m^2$")
legend(["by rows", "by cols"])
6 Likes

(The upshot here is that there is a simple principle — consecutive access, also known as spatial locality, is faster. But the details of precisely how much faster can be hard to unravel, because computer architectures are so complicated.)

2 Likes

That sudden drop between 10 and 20 is curious. It feels like the curve just gets scaled down by some factor, since its shape stays pretty similar.

For the sake of comparison, I ran your code on my i7-9750H with 16 GB of RAM (and some amount of cache whose specifics I don’t know).

My guess follows your line of thought: once that point is reached, there must be some memory trick built into Julia Base for accessing matrices column by column optimally (like how dividing by a power of 2 simply means shifting bits, whereas other divisions need further calculation).

[Plot: by-rows vs. by-cols timings on the i7-9750H]
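
(Not from the thread: on Linux, the cache specifics mentioned above can be read from sysfs without leaving Julia. A minimal sketch, assuming the standard /sys/devices/system/cpu/cpu0/cache/ layout:)

for i in 0:3   # index0..index3 usually cover L1d, L1i, L2, L3
    dir = "/sys/devices/system/cpu/cpu0/cache/index$i"
    isdir(dir) || continue
    level = strip(read(joinpath(dir, "level"), String))
    kind  = strip(read(joinpath(dir, "type"),  String))
    sz    = strip(read(joinpath(dir, "size"),  String))
    println("L$level $kind cache: $sz")
end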

LLVM is getting smarter. Even though the code doesn’t have @inbounds

julia> function matrix_0_1_cols(M)
           m, n = size(M)
           for j in 1:n
               for i in 1:m
                   M[i,j] = 1
               end
           end
       end
matrix_0_1_cols (generic function with 1 method)

julia> code_llvm(matrix_0_1_cols, (Matrix{Float64},), debuginfo=:none)

reveals:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %16 = add i64 %14, %index
  %17 = getelementptr inbounds double, double* %15, i64 %16
  %18 = bitcast double* %17 to <8 x double>*
  store <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <8 x double>* %18, align 8
  %19 = getelementptr inbounds double, double* %17, i64 8
  %20 = bitcast double* %19 to <8 x double>*
  store <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <8 x double>* %20, align 8
  %21 = getelementptr inbounds double, double* %17, i64 16
  %22 = bitcast double* %21 to <8 x double>*
  store <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <8 x double>* %22, align 8
  %23 = getelementptr inbounds double, double* %17, i64 24
  %24 = bitcast double* %23 to <8 x double>*
  store <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, <8 x double>* %24, align 8
  %index.next = add i64 %index, 32
  %25 = icmp eq i64 %index.next, %n.vec
  br i1 %25, label %middle.block, label %vector.body

Once m >= 4W, where W is the SIMD vector width, the col version will speed up thanks to SIMD.
Linear indexing is much faster here for the same reason: its scalar remainder is only length(M) % (4W) elements in total, which is much smaller than the n * (m % (4W)) leftovers of the column-by-column loops.

If your CPU has a 32 KiB L1d cache (which is fairly common), then up to a 64x64 Matrix{Float64} could fit in this cache, meaning if we didn’t have vectorization, I’d expect fairly similar performance below this size.
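
A quick back-of-the-envelope check of both points (not in the original reply), assuming W = 8 doubles per register to match the <8 x double> stores above (on a 256-bit AVX2 machine W would be 4):

julia> W = 8; step = 4W;   # elements handled per unrolled SIMD iteration

julia> m = n = 100;

julia> n * (m % step)   # scalar leftovers with the column-by-column loops
400

julia> (m * n) % step   # scalar leftovers with a single linear-index loop
16

julia> 64^2 * sizeof(Float64) == 32 * 2^10   # a 64x64 Float64 matrix exactly fills a 32 KiB L1d
true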

3 Likes

There are drops for m = 16, 32, 64, which are powers of 2.
It seems like a cache interaction effect.

It’s because of SIMD, as I said above.
LLVM vectorizes the code and unrolls by 4.

If your computer has 256-bit vectors (4 doubles each), that means it processes batches of 16 elements using SIMD instructions. Those batches of 16 are fast, and the leftovers are slow.
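
For example, assuming the loop only vectorizes full batches of 4W = 16 elements per column in the 256-bit case (a sketch, not from the original reply), the split for a few of the plotted sizes is:

julia> batch = 16;   # 4 unrolled iterations x 4 doubles per 256-bit register

julia> for m in (10, 16, 20, 32, 50)
           println("m = $m: ", m ÷ batch, " SIMD batch(es) + ", m % batch, " scalar leftovers per column")
       end
m = 10: 0 SIMD batch(es) + 10 scalar leftovers per column
m = 16: 1 SIMD batch(es) + 0 scalar leftovers per column
m = 20: 1 SIMD batch(es) + 4 scalar leftovers per column
m = 32: 2 SIMD batch(es) + 0 scalar leftovers per column
m = 50: 3 SIMD batch(es) + 2 scalar leftovers per column

That also lines up with the drops at the powers of 2 mentioned above: under this assumption those sizes have no scalar leftovers at all.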

1 Like

@Elrod: Sorry, I got lost in the “vector.body” code and missed the points you made after it.

Using LinuxPerf:

julia> using LinuxPerf

julia> function matrix_0_1_rows(M)
           m, n = size(M)
           for i in 1:m
               for j in 1:n
                   M[i,j] = 1
               end
           end
       end
matrix_0_1_rows (generic function with 1 method)

julia> function matrix_0_1_cols(M)
           m, n = size(M)
           for j in 1:n
               for i in 1:m
                   M[i,j] = 1
               end
           end
       end
matrix_0_1_cols (generic function with 1 method)

julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf (generic function with 1 method)

julia> M = zeros(64,64);

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(matrix_0_1_rows, 1_000_000, M)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               8.74e+09   50.0%  #  4.3 cycles per ns
┌ instructions             1.81e+10   75.0%  #  2.1 insns per cycle
│ branch-instructions      4.42e+09   75.0%  # 24.4% of instructions
└ branch-misses            1.21e+08   75.0%  #  2.7% of branch instructions
┌ task-clock               2.04e+09  100.0%  #  2.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.95e+08   25.0%  # 103.7% of dcache loads
│ L1-dcache-loads          1.88e+08   25.0%
└ L1-icache-load-misses    7.59e+04   25.0%
┌ dTLB-load-misses         1.36e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.88e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(matrix_0_1_cols, 1_000_000, M)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.76e+09   49.8%  #  4.3 cycles per ns
┌ instructions             2.35e+09   74.9%  #  1.3 insns per cycle
│ branch-instructions      3.97e+08   74.9%  # 16.9% of instructions
└ branch-misses            1.01e+06   74.9%  #  0.3% of branch instructions
┌ task-clock               4.10e+08  100.0%  # 410.2 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    2.93e+08   25.1%  # 149.1% of dcache loads
│ L1-dcache-loads          1.96e+08   25.1%
└ L1-icache-load-misses    1.58e+04   25.1%
┌ dTLB-load-misses         1.89e+02   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               1.96e+08   24.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 2.35e9 / 1.81e10
0.1298342541436464

This code called the function 1_000_000 times and gives us a summary of details like the number of instructions executed.
Note the >= 100% cache-miss rate: I was using 64x64 matrices, and this CPU has a 32 KiB L1d cache (and 64^2*sizeof(Float64) == 32*(2^10)).

Importantly, thanks to the SIMD instructions, the col version required far fewer instructions, only about 0.13 times as many.
However, all those cache misses meant it didn’t achieve nearly as many instructions per clock cycle. It spent a lot of time waiting for memory.

A 32x32 matrix easily fits in cache, so the instructions per clock are much better, but 32 isn’t enough for the SIMD version to stretch its legs (the vectorized inner loop runs only one iteration per column on this computer), so its advantage in instruction count is smaller (repetitions increased to 10 million):

julia> M = zeros(32,32);

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(matrix_0_1_rows, 10_000_000, M)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.47e+10   50.0%  #  4.3 cycles per ns
┌ instructions             5.18e+10   75.0%  #  3.5 insns per cycle
│ branch-instructions      1.22e+10   75.0%  # 23.6% of instructions
└ branch-misses            1.72e+07   75.0%  #  0.1% of branch instructions
┌ task-clock               3.42e+09  100.0%  #  3.4 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    9.43e+04   25.0%  #  0.0% of dcache loads
│ L1-dcache-loads          1.56e+09   25.0%
└ L1-icache-load-misses    1.01e+05   25.0%
┌ dTLB-load-misses         9.92e+02   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.56e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(matrix_0_1_cols, 10_000_000, M)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.01e+09   49.9%  #  4.3 cycles per ns
┌ instructions             1.15e+10   75.0%  #  3.8 insns per cycle
│ branch-instructions      2.00e+09   75.0%  # 17.4% of instructions
└ branch-misses            1.91e+04   75.0%  #  0.0% of branch instructions
┌ task-clock               7.03e+08  100.0%  # 703.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    2.01e+04   25.0%  #  0.0% of dcache loads
│ L1-dcache-loads          1.55e+09   25.0%
└ L1-icache-load-misses    2.87e+04   25.0%
┌ dTLB-load-misses         2.50e+03   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               1.55e+09   24.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

But the relatively short inner loop means there’s a lot of overhead in setting up and entering that loop for every column, compared to using linear indexing:

julia> function matrix_0_1_linear(M)
           for i in eachindex(M)
               M[i] = 1
           end
       end
matrix_0_1_linear (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(matrix_0_1_linear, 10_000_000, M)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.40e+09   49.9%  #  4.3 cycles per ns
┌ instructions             6.56e+09   75.0%  #  2.7 insns per cycle
│ branch-instructions      1.12e+09   75.0%  # 17.1% of instructions
└ branch-misses            8.86e+04   75.0%  #  0.0% of branch instructions
┌ task-clock               5.62e+08  100.0%  # 561.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    2.67e+04   25.0%  #  0.0% of dcache loads
│ L1-dcache-loads          1.33e+09   25.0%
└ L1-icache-load-misses    3.37e+04   25.0%
┌ dTLB-load-misses         4.67e+04   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               1.33e+09   24.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
5 Likes

Thanks for another analysis of this problem. Both the LLVM IR and the pstats output give a lot of insight into the matter.

If the body of the loop weren’t so short (M[i] = 1 is about as short as it gets), regular two-dimensional indexing with M[i, j] wouldn’t be so bad in terms of overhead.

Still, it’s interesting (and reassuring) to learn how cache size can affect these optimizations, and how traversing in the recommended column-first order takes full advantage of them!
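
(Not from the thread: if you want to keep M[i, j]-style two-dimensional indexing but still traverse in memory order, CartesianIndices does that without writing the nested loops by hand. A minimal sketch:)

function matrix_0_1_cartesian(M)
    # CartesianIndices iterates in column-major (memory) order,
    # so this matches the fast "cols" version while keeping 2-d indices.
    for I in CartesianIndices(M)
        M[I] = 1
    end
end

Timing this with @btime should land close to matrix_0_1_cols.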