Weird performance sensitivity to order of operations in broadcasted expression

I’m seeing a weird performance sensitivity to order of operations.

foo1(x,y) = @. x*(x + y) + y*y
foo2(x,y) = @. (x + y)*x + y*y

For StaticArray inputs, the first function allocates, while the second function does not.

using StaticArrays, BenchmarkTools
x = @SVector [1.0]
y = @SVector [2.0]
julia> @btime foo1($x,$y);
  478.870 ns (25 allocations: 480 bytes)
julia> @btime foo2($x,$y);
  2.949 ns (0 allocations: 0 bytes)

Any explanation? Is this expected behavior?

I see a difference in the outputs of @code_typed (e.g., foo1 contains calls to invoke Base.Broadcast.var and Core._apply_iterate, while foo2 does not), but I’m not sure how to interpret them.


From @mbauman on Slack:

Likely hitting some compiler heuristic threshold for Broadcast.flatten, which is how StaticArrays implements broadcasting. […] Broadcast.flatten is an alternative way to implement broadcasting. It creates lots of anonymous functions. If they don’t inline, you’ll see them as var"5#6"s and the like [in @code_typed].

That’s not really a good answer. A better answer is that this is a performance bug — likely the same as #27988.
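
To make the flatten step a bit more concrete, here is a minimal sketch using plain Vectors and the unexported Base.Broadcast.flatten. It builds by hand the lazy tree that `@.` would produce for foo1 and then flattens it. This only illustrates the mechanism; how StaticArrays calls it internally is not shown here, and the variable names (bc, fbc, nargs, kernel_value, result) are made up for the example.

```julia
using Base.Broadcast: broadcasted, flatten, materialize

x = [1.0]
y = [2.0]

# Lazy tree equivalent to `@. x*(x + y) + y*y` (nothing is computed yet).
bc = broadcasted(+, broadcasted(*, x, broadcasted(+, x, y)), broadcasted(*, y, y))

# flatten collapses the nested tree into a single function applied to a
# flat tuple of leaf arguments; that function is built from nested closures.
fbc = flatten(bc)

nargs = length(fbc.args)                       # leaves in order: x, x, y, y, y
kernel_value = fbc.f(1.0, 1.0, 2.0, 2.0, 2.0)  # scalar kernel, one value per leaf
result = materialize(fbc)                      # same result as foo1(x, y)
```

The closures that make up fbc.f are the anonymous functions mentioned above. Whether they get inlined is up to the compiler, which would be one plausible reason why a small reordering of the expression changes the outcome.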

Another MWE, this time based on SparseArrays. Both foo1 and foo2 allocate more than expected, though fully expanding the product (foo3) avoids this.

using SparseArrays
x = sparsevec([1.0])
y = sparsevec([2.0])
foo1(x,y) = @. x*(x + y) + y*y
foo2(x,y) = @. (x + y)*x + y*y
foo3(x,y) = @. x*x + x*y + y*y
julia> @btime foo1($x,$y);
1.111 μs (47 allocations: 1.09 KiB)
julia> @btime foo2($x,$y);
1.124 μs (29 allocations: 1.20 KiB)
julia> @btime foo3($x,$y);
124.881 ns (2 allocations: 192 bytes)

Neat workaround from @marius311

In more complex cases, a workaround for that issue, which also works in your case here, is to just write the broadcast kernel out by hand:

foo3(x,y) = broadcast(x,y) do x,y
    x*(x + y) + y*y
end
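
As a quick sanity check on that workaround, the sketch below compares the hand-written kernel against the dotted version from the first post on a two-element SVector. The name foo_kernel and the sample values are made up for illustration, and the sketch only checks that the results agree; it makes no claim about timings or allocations on any particular Julia or StaticArrays version.

```julia
using StaticArrays

# Dotted version from the original post.
foo1(x, y) = @. x*(x + y) + y*y

# `foo_kernel` is a hypothetical name for the hand-written kernel variant:
# a single scalar function is broadcast, so there is no nested tree to flatten.
foo_kernel(x, y) = broadcast(x, y) do a, b
    a*(a + b) + b*b
end

x = @SVector [1.0, 3.0]
y = @SVector [2.0, 4.0]

r1 = foo1(x, y)
r2 = foo_kernel(x, y)
```

Both calls should return the same SVector. To compare performance, the same @btime calls as in the first post can be applied to foo_kernel.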