Understanding major order performance when broadcasting in column vs row operations

I understand that Julia stores arrays in column-major order, meaning that columns are laid out one after another in memory. Thus, elements in adjacent rows of the same column are adjacent in memory.
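
As a quick sanity check of that layout, flattening a small matrix walks memory in storage order:

```julia
M = [1 2;
     3 4]

# vec returns the elements in memory (column-major) order:
# the whole first column, then the second.
vec(M)  # [1, 3, 2, 4]
```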

If that is the case, I am failing to see why the following operation is faster and makes fewer allocations in row form than in column form:

using BenchmarkTools

A = rand(5, 5);
B = rand(5, 5);
C = Array{Float64}(undef, 5, 5);

function modifyC0!(A, B, i)
    C[i, :] .= A[i, :] .* B[i, :]
end
@btime modifyC0!(A, B, 2);
# 333.318 ns (5 allocations: 352 bytes)

function modifyC1!(A, B, i)
    C[:, i] .= A[:, i] .* B[:, i]
end
@btime modifyC1!(A, B, 2);
# 361.809 ns (6 allocations: 368 bytes)

Shouldn’t they be the same in terms of allocations, with modifyC1! marginally better in terms of time?

The situation is as expected when I do not broadcast the final assignment into C. However, in this case it feels like one allocation is missing (since C is 5×5).

function modifyC2!(A, B, i)
    C[i, :] = A[i, :] .* B[i, :]
end
@btime modifyC2!(A, B, 2);
#  487.897 ns (4 allocations: 400 bytes)

function modifyC3!(A, B, i)
    C[:, i] = A[:, i] .* B[:, i]
end
@btime modifyC3!(A, B, 2);
# 446.896 ns (4 allocations: 400 bytes)

Some things:

  1. Use views; otherwise the slices (i.e. A[:,i]) allocate new arrays.

  2. You need to pass C as a parameter; otherwise it is a global variable.

  3. To benchmark properly, you need to interpolate ($) the variables, and since you are mutating C, you may want to initialize it for every sample (in this case that probably does not make any difference).

One example:

julia> @views function modifyC0!(A, B, C, i)
           C[i, :] .= A[i, :] .* B[i, :]
       end
modifyC0! (generic function with 2 methods)

julia> @btime modifyC0!($A, $B, C, 2) setup=(C=Array{Float64}(undef, 5, 5)) evals=1
  60.000 ns (0 allocations: 0 bytes)
5-element view(::Matrix{Float64}, 2, :) with eltype Float64:
 0.00027977390770820865
 0.0528013293219392
 0.6003185444007695
 0.43237270343222706
 0.3425750069418806


Taking those in consideration, probably your benchmarks will change.


Thanks @lmiq, that’s very helpful. I was not aware of interpolation or of the setup keyword. These are very useful. With your changes I can now obtain the results I expected.

I am teaching an introduction to Julia. My intention is to demonstrate the differences in performance when dealing with arrays from different uses of broadcast, views and loop. Any feedback is more than welcome.

I am reproducing all the examples below for future reference. I hope that the example of row/column allocation can demonstrate all the trade-offs in a simple way. I find it amazing that a pure loop with @inbounds is the fastest method.

The only thing that is still missing from my understanding is why = creates an additional allocation and takes longer as opposed to .=. My hypothesis is that with = the result is stored in an intermediate array before being copied into C, whereas with .= we write directly into C during iteration, so no additional allocation or extra pass is necessary. Is that correct?

N  = 100
A = rand(N, N);
B = rand(N, N);

A[:] # vectorize the matrix, letting us see the column-major order

function modifyC0!(A, B, C, i)
    C[i, :] .= A[i, :] .* B[i, :]
end
@btime modifyC0!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N))
# 283.544 ns (2 allocations: 1.75 KiB)

function modifyC1!(A, B, C, i)
    C[:, i] .= A[:, i] .* B[:, i]
end
@btime modifyC1!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N))
# 191.176 ns (2 allocations: 1.75 KiB)

@views function modifyC2!(A, B, C, i)
    C[i, :] .= A[i, :] .* B[i, :]
end
@btime modifyC2!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N))
# 84.840 ns (0 allocations: 0 bytes)

@views function modifyC3!(A, B, C, i)
    C[:, i] .= A[:, i] .* B[:, i]
end
@btime modifyC3!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N));
# 30.224 ns (0 allocations: 0 bytes)

function modifyC4!(A, B, C, i)
    for j in 1:size(C, 1)
        C[j, i] = A[j, i] * B[j, i]
    end
end
@btime modifyC4!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N));
# 84.091 ns (0 allocations: 0 bytes)

function modifyC5!(A, B, C, i)
    @inbounds for j in 1:size(C, 1)
        C[j, i] = A[j, i] * B[j, i]
    end
end
@btime modifyC5!($A, $B, C, 2) setup=(C=Array{Float64}(undef, N, N));
# 15.298 ns (0 allocations: 0 bytes)

And:

  1. Matrices as small as 5×5 won’t really show this. It’s a cache effect, and if the entire matrix fits in one mouthful, then everything after that can be fast.
julia> function modifyC1!(A, B, C, i)
           @views C[:, i] .= A[:, i] .* B[:, i]
       end;

julia> @btime modifyC0!($A, $B, $C, 2);
  22.088 ns (0 allocations: 0 bytes)

julia> @btime modifyC1!($A, $B, $C, 2);
  25.226 ns (0 allocations: 0 bytes)

julia> A = rand(500, 500); B = rand(500,500); C = zeros(500,500);

julia> @btime modifyC0!($A, $B, $C, 2);  # one row
  761.968 ns (0 allocations: 0 bytes)

julia> @btime modifyC1!($A, $B, $C, 2);  # one column
  120.773 ns (0 allocations: 0 bytes)

julia> 761 / 120
6.341666666666667

julia> @btime modifyC0!($A', $B', $C', 2);  # transposed, also contiguous
  127.525 ns (0 allocations: 0 bytes)

julia> @btime modifyC1!($A', $B', $C', 2); 
  761.702 ns (0 allocations: 0 bytes)

julia> A = rand(5000, 5000); B = rand(5000,5000); C = zeros(5000,5000);

julia> @btime modifyC0!($A, $B, $C, 2);  # one row
  83.375 μs (0 allocations: 0 bytes)

julia> @btime modifyC1!($A, $B, $C, 2);  # one column
  1.188 μs (0 allocations: 0 bytes)

julia> 83/1.2
69.16666666666667

Edit – between writing and posting, I see you got it sorted out! Somehow my screen didn’t update.


Perfect! That makes sense. In the code above, I added a global parameter N to adjust the size of the matrix.

I hadn’t thought about matrix transpose. That’s another good idea to show.

Add @views to all functions that use slices. None of these functions that mutate the output should allocate anything.

Also I guess that LoopVectorization can give these loops an additional boost.


It is not the = , it is the slice. Slices in Julia allocate a new array by default. And views solve that.
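
A rough illustration with @allocated (exact byte counts vary across Julia versions, but the relationship holds):

```julia
A = rand(100, 100)

# wrap the measurements in functions so we measure the expression itself,
# not global-scope overhead
slice_bytes(A) = @allocated A[:, 1]        # the slice copies the column
view_bytes(A)  = @allocated view(A, :, 1)  # the view only wraps existing memory

slice_bytes(A); view_bytes(A)   # call once so compilation is excluded
slice_bytes(A)                  # ≥ 800 bytes: 100 Float64s plus the array header
view_bytes(A)                   # 0 on recent Julia versions
```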


It should also show better performance for C1 than C0, even for 5x5:

julia> using LoopVectorization

julia> function modifyC0!(A, B, C, i)
           @views C[i, :] .= A[i, :] .* B[i, :]
       end
modifyC0! (generic function with 1 method)

julia> function modifyC1!(A, B, C, i)
           @views C[:, i] .= A[:, i] .* B[:, i]
       end;

julia> function modifyC0_turbo!(A, B, C, i)
           @turbo @views C[i, :] .= A[i, :] .* B[i, :]
       end
modifyC0_turbo! (generic function with 1 method)

julia> function modifyC1_turbo!(A, B, C, i)
           @turbo @views C[:, i] .= A[:, i] .* B[:, i]
       end;

julia> A = rand(5, 5);

julia> B = rand(5, 5);

julia> C = Array{Float64}(undef, 5, 5);

julia> @btime modifyC0!($A, $B, $C, 1);
  17.038 ns (0 allocations: 0 bytes)

julia> @btime modifyC1!($A, $B, $C, 1);
  16.355 ns (0 allocations: 0 bytes)

julia> @btime modifyC0_turbo!($A, $B, $C, 1);
  11.340 ns (0 allocations: 0 bytes)

julia> @btime modifyC1_turbo!($A, $B, $C, 1);
  9.155 ns (0 allocations: 0 bytes)

Something else to show off with ordering:

using LoopVectorization
function modify!(C, A, B)
    @inbounds for n in indices((C,A,B),2), m in indices((C,A,B),1)
        C[m,n] = A[m,n] * B[m,n]
    end
end
function modify_turbo!(C, A, B)
    @turbo for n in indices((C,A,B),2), m in indices((C,A,B),1)
        C[m,n] = A[m,n] * B[m,n]
    end
end
A = rand(32,32); B = rand(32,32); C = similar(A);
@btime modify!($C,$A,$B)
@btime modify_turbo!($C,$A,$B)

@btime modify!($C',$A',$B')
@btime modify_turbo!($C',$A',$B')

I get

julia> @btime modify!($C,$A,$B)
  121.922 ns (0 allocations: 0 bytes)

julia> @btime modify_turbo!($C,$A,$B)
  53.863 ns (0 allocations: 0 bytes)

julia> @btime modify!($C',$A',$B')
  417.021 ns (0 allocations: 0 bytes)

julia> @btime modify_turbo!($C',$A',$B')
  53.360 ns (0 allocations: 0 bytes)

It’ll swap the loops into whatever order it predicts will be fastest, which is why we don’t see any slowdown when transposing all three arrays.


It’s both.

A slice expression like A[i, :] does indeed create a copy by default, unless you opt-in to views using the @views macro or similar.

However, the A[i, :] .* B[i, :] operation also allocates an array for the result of the .*, unless this elementwise multiplication is fused with the assignment loop using .=. Julia can’t do this by default with = because loop fusion may change the semantics, and proving that it doesn’t is a hard task for a compiler.
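
A minimal sketch of that difference, again using @allocated (exact byte counts are version-dependent):

```julia
A = rand(100, 100); B = rand(100, 100); C = similar(A)

# `=`: the views avoid copying the inputs, but the .* still materializes
# its result in a fresh array before setindex! copies it into C
unfused!(C, A, B, i) = @views C[i, :] = A[i, :] .* B[i, :]

# `.=`: the multiply fuses with the assignment into a single loop that
# writes straight into C, so no temporary is needed
fused!(C, A, B, i) = @views C[i, :] .= A[i, :] .* B[i, :]

unfused!(C, A, B, 1); fused!(C, A, B, 1)  # run once to compile
@allocated(unfused!(C, A, B, 1))  # > 0: one temporary row
@allocated(fused!(C, A, B, 1))    # 0
```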

This shouldn’t be too amazing. Loops are fast in Julia. How do you think broadcast is implemented?

(You really want to go to something like 1000x1000 to investigate cache effects.)


@stevengj thanks for the clarification and for the link. They were very helpful.

I meant amazing in a good way, since that’s exactly what I want to show to the attendees of my introductory workshop. Initially, I was struggling to get these results with BenchmarkTools, despite my expectation that loops were supposed to be as fast as broadcast. When I finally saw that loops were even faster than broadcast, I was impressed. A lot of people coming from Python spend a lot of time thinking about how to vectorize their code, but in Julia sometimes the opposite is true.

On a second note, I was not aware of the LoopVectorization library. In some cases, writing the computation in vectorized form can be more expressive. However, the benchmarks showed that broadcast was inferior in terms of performance, so it is good to know that the @turbo macro largely closes this gap and brings broadcast on par with a pure loop implementation.