Performance of simple broadcasting operations with many arguments

I am puzzled by some microbenchmarks tests I did on very simple broadcasting operations.

julia> using BenchmarkTools
julia> n = 10000; x = rand(n); y = rand(n); z = rand(n); t = rand(n); r = zeros(n);
julia> @belapsed $r .= $x .+ $y
julia> @belapsed $r .= $x .+ $y .+ $z
julia> @belapsed $r .= $x .+ $y .+ $z .+ $t

What is the reason for the drop in performance when I sum over four vectors?

It may be that the processor has a 512 bites cache, thus it can load four 64 bit numbers per cycle. With five involved (including the result) you need two cycles of the processor.

1 Like

Interestingly, LoopVectorization is getting optimal performance here. I think the problem might be that Julia is requiring an inefficient add order.

1 Like

Yes, indeed, LoopVectorization does a very nice job. What do you mean whey you write that “Julia is requiring an inefficient add order”?

@lmiq this is possibile, but then I would expect that the problem disappears when I am using Float32 data: instead, I get something even stranger…

julia> n = 10000; x = rand(Float32, n); y = rand(Float32, n); z = rand(Float32, n); t = rand(Float32, n); r = zeros(Float32, n);
julia> @belapsed $r .= $x .+ $y
julia> @belapsed $r .= $x .+ $y .+ $z
julia> @belapsed $r .= $x .+ $y .+ $z .+ $t

since floating point math isn’t isn’t associative, llvm isn’t allowed to reassociate which can improve performance. loopvectorization on the other hand does look for the best ordering.

FastBroascast.jl also does well:

julia> n = 10000; x = rand(n); y = rand(n); z = rand(n); t = rand(n); r = zeros(n);

julia> using FastBroadcast, LoopVectorization

julia> @btime $r .= $x .+ $y .+ $z .+ $t;
  6.558 μs (0 allocations: 0 bytes)

julia> @btime @.. $r = $x + $y + $z + $t; # FastBroadcast
  2.297 μs (0 allocations: 0 bytes)

julia> @btime @turbo $r .= $x .+ $y .+ $z .+ $t; # LoopVectorization
  2.275 μs (0 allocations: 0 bytes)

The problem is that while FastBroadcast and LoopVectorization are using SIMD instructions, regular broadcasting is not.

There are exponential possibilities in the number of arguments when broadcasting. This causes the compiler to simply give up instead of considering them all.
FastBroadcast handles this by simply prioritizing the most common case – the case where none of the arguments are being broadcast (i.e., none of the arguments have an axis with size of 1) – and guaranteeing efficient code is being generated for that particular case.
Other cases will be slow, but should still be correct.

LoopVectorization handles this via zeroing out strides associated with size-1 axis and SIMD-ing. It currently does not support dynamic size-1 in a contiguous axis. As there is only one axis and that axis is contiguous, this means it’s doing something more or less the same as FastBroadcast.
LoopVectorization can thus make more scenarios fast than FastBroadcast, although be sure to make sure all contiguous axis are either full-size or statically known to be size 1 (e.g., an Adjoint{Float64,Vector{Float64}} is okay, as is rand(1000,1), but rand(1,1000) is not). This restriction can be lifted, but I haven’t made it a priority / it’ll be addressed at some point after the rewrite.

Note that given few enough arguments, base broadcasting can actually be fast, as it’ll try to enumerate the possibilities and generate fast code for each case. This means it should be able to match the others for the common case then, while perhaps even being faster for some others due to having created specialized cases for each of them. Obviously, that doesn’t scale with the number of arguments, as we see here where it simply gives up and generates slow code.


Thank yoou @Oscar_Smith and @Elrod , your explanations clarify a lot of things. Still, I am not entirely sure what is happening here…

Let’s start with SIMD, which as you write is not used by regular broadcasting. Why is that? Is this something related to the number of arguments in the broadcast, or a general “feature”? I am asking because the drop in performance does depend on the number of vectors added.

Regarding your other point, I understand there are many different ways to perform a broadcasting, but if one does not use the associativity of the sum (which Julia does not by default) I can think only of a single sensible way of doing a broadcasting: just use a single loop and perform the full sum of each element inside the loop. Other options would require multiple passes (i.e., multiple loops), and are likely to be slower. To test the performance, I did a few more tests:

n = 10000; x = rand(n); y = rand(n); z = rand(n); t = rand(n); r = zeros(n);

function test4a!(r, x, y, z, t)
    @. r = x + y + z + t

function test4b!(r, x, y, z, t)
    @turbo @. r = x + y + z + t

function test4c!(r, x, y, z, t)
    @inbounds for i ∈ 1:10000
        r[i] = x[i] + y[i] + z[i] + t[i]

function test4d!(r, x, y, z, t)
    @inbounds @simd for i ∈ 1:10000
        r[i] = x[i] + y[i] + z[i] + t[i]

I then measured the performance as usual:

julia> @btime test4a!($r, $x, $y, $z, $t);
  8.319 μs (0 allocations: 0 bytes)

julia> @btime test4b!($r, $x, $y, $z, $t);
  4.787 μs (0 allocations: 0 bytes)

julia> @btime test4c!($r, $x, $y, $z, $t);
  4.897 μs (0 allocations: 0 bytes)

julia> @btime test4d!($r, $x, $y, $z, $t);
  4.913 μs (0 allocations: 0 bytes)

If I have to conclude something, I would say that

  • LoopVectorization is just doing the obvious thing, loop over each element.
  • SIMD does not help at all: either the macro is ignored, or for some reason I do not understand SIMD cannot be used here. Note that the same happens if I define functions test3d! or test2d! that only add three or two vectors: @simd is not improving the performance.
  • The standard broadcasting does something I still do not understand. Is there a way to understand the way the loop is performed in Julia? That clearly would help me (and hopefully other people) in writing efficient code, without being forced to either guess or to test each single line as I am doing here.

There are 2^num_vector ways:
E.g., with two input vectors:

  1. Broadcast none (both full size)
  2. Broadcast 1 (first is of length 1, second is full size)
  3. Broadcast 2 (second is of length 1, first is full size)
  4. Broadcast both (both have length 1)

Add a third and we have 8 combinations. A fourth, and we’re at 16.

Associativity has nothing to do with it. Your @inbounds loop SIMDs just fine, which is why the macro doesn’t help.
The problem is broadcasting, i.e. replicating size==1 axes.

With two many arguments, there are too many combinations of arguments having length 1 vs full length, hence the compiler gives up.

My previous comment describes the approaches that FastBroadcast.jl and LoopVectorization.jl use to avoid this problem.

Thank you again. Reading the Julia documentation I thought that a broadcasting operation would generally fuse and thus that, if the Broadcast.broadcasted method is not rewritten to make an eager evaluation, the evaluation would be always lazy. This also corresponds to the fact that there are no allocations in none of the tests functions I made (I would expect allocations if the computation is not done elementwise, because I would have intermediate results to store). Also, I do not see your point with size==1 axes: I only have vectors, so just one size and different from 1.

So, if you have a minute, could you kindly help me understand how could Julia possibly perform the operation by writing a code with a for loop that simulates the possible order of operation basic broadcasting is doing? Thank you so much!

Order of operations has nothing to do with it, which is why I never brought it up or discussed it in this thread.

Perhaps this makes it more clear, here we write one broadcast statement that corresponds to 8 different loops because we have 3 arguments (2^3 = 8):

w = Vector{Float64}(undef, 4);
for i in (1,length(w))
  x = rand(i)
  for j in (1,length(w))
    y = rand(j)
    for k in (1,length(w))
      z = rand(k)
      @. w = x + y + z

Semantically, the single broadcasting statement represents these 8 different loops. If we add a fourth input argument, it has to be able to represent 16 different loops.
Which of the loops that it actually is is not known at compile time. Only at runtime.

Basically, you might not be using sizes equal to 1, but the compiler does not know that.

Therefore, the compiler has to generate code able to handle all the 2^number_arguments cases.
Because this can be a lot, what is the compiler to do?
There are a lot of possibilities, but broadcasting just gives up if the number is too big.

1 Like

I don’t understand why that works at all, instead of simply throwing a dimension mismatch when the arrays don’t have all the same shape:

julia> w = zeros(2);

julia> @. w = [0.4098911891015198] + [0.25666422432859126, 0.18429878091474228]
2-element Vector{Float64}:

julia> @. w = rand(1) + rand(2)
ERROR: DimensionMismatch("dimensions must match: a has dims (Base.OneTo(1),), b has dims (Base.OneTo(2),), mismatch at 1")

I thought that a broadcasting operation like that would be expanded to something as simple as:

julia> function f!(r,x,y,z,t)
           for i in 1:length(r)
               @inbounds r[i] = x[i] + y[i] + z[i] + t[i]
f! (generic function with 1 method)

which does SIMD.

You’re broadcasting the rand calls. That is, for every element of w, it is calling rand(2) and rand(1).

For the loop f!, you’d need something like length(x) == 1 ? 1 : i inside each getindex instead of just i.

Perhaps that makes it more clear why it is actually impressive that broadcast can compete with loops as often as it does.

1 Like

Ok, but I mean, I never expected that broadcasting to work, for general consistency with the fact that in Julia matrices, vectors and scalars are all different things:

julia> @. [1] + [1,2]
2-element Vector{Int64}:

julia> @. [1,2] + [1,2,3]
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 2 and 3")

I’m possibly interpreting things wrong, but the fact that that one-element vector acts as a scalar in that broadcasting operation sort of goes against the choice of making other one-element arrays throwing errors when inadvertently used as scalars (for ex. [1] * [1,2]).

It can be convenient, particularly for multidimensional arrays, e.g.

using Statistics 
A = rand(10,10,10);
Z = (A .- mean(A, dims=(1,2))) ./ std(A, dims=(1,2))

Broadcasting always replicates size==1 axes across the entire corresponding axis.

It would be nice to have something that doesn’t broadcast though, to make simple loops more convenient to write. Maybe FastBroadcast.jl should fill that role.

1 Like

Thank you, this really clarifies everything! And I do appreciate the speed simple broadcasting is generally able to reach at runtime, considering all the various possibilities. In reality, in same cases I have not even been able to fully match its speed with a for loop (as in #71336)…