Performance of simple broadcasting operations with many arguments

Order of operations has nothing to do with it, which is why I never brought it up or discussed it in this thread.

Perhaps this makes it clearer. Here we write one broadcast statement that corresponds to 8 different loops, because we have 3 arguments (2^3 = 8):

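w = Vector{Float64}(undef, 4);
# Each argument has length 1 or length(w), so the nest visits all 2^3 = 8 combinations.
for i in (1,length(w))
  x = rand(i)
  for j in (1,length(w))
    y = rand(j)
    for k in (1,length(w))
      z = rand(k)
      @. w = x + y + z  # the same statement has to work for every combination
    end
  end
end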

Semantically, the single broadcasting statement represents these 8 different loops. If we add a fourth input argument, it has to be able to represent 16 different loops.
Which of these loops it actually is cannot be known at compile time, only at runtime.

Basically, you might not be using sizes equal to 1, but the compiler does not know that, because the length of an array is not part of its type.
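To make that concrete, here is a small sketch (the variable names are just for illustration). A length-1 vector and a length-4 vector have the same type, so the broadcast statement is compiled once and has to cope with both:

```julia
x1 = rand(1)   # length 1: broadcasting reuses the single element
x4 = rand(4)   # length 4: indexed element by element

# Both are Vector{Float64}; the length is only a runtime value.
same_type = typeof(x1) === typeof(x4)

w = Vector{Float64}(undef, 4)
y = rand(4)
z = rand(4)

# The identical statement is valid for either argument.
@. w = x1 + y + z
@. w = x4 + y + z
```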

Therefore, the compiler has to generate code able to handle all the 2^number_arguments cases.
Because this can be a lot, what is the compiler to do?
There are a lot of possibilities, but broadcasting just gives up if the number is too big.
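As a rough illustration of what "handling all the cases" can look like, here is a hand-written sketch of one loop that decides at runtime, per argument, whether to reuse element 1 or index normally. This is not how Base implements broadcasting; `pick` and `manual_broadcast_add!` are hypothetical names made up for this example, and it only covers vectors.

```julia
# Hypothetical helper, not part of Base: use element 1 for length-1 arguments.
pick(a, i) = length(a) == 1 ? a[1] : a[i]

function manual_broadcast_add!(w, x, y, z)
    # Each argument must either match the destination length or have length 1.
    for a in (x, y, z)
        length(a) in (1, length(w)) || throw(DimensionMismatch("cannot broadcast"))
    end
    # One loop that covers all 2^3 size combinations via runtime checks.
    for i in eachindex(w)
        w[i] = pick(x, i) + pick(y, i) + pick(z, i)
    end
    return w
end

w = Vector{Float64}(undef, 4)
x, y, z = rand(1), rand(4), rand(1)
manual_broadcast_add!(w, x, y, z)
```

The alternative to a single general loop like this would be a separate specialized loop for each of the 8 combinations, which is exactly the number that doubles with every extra argument.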
