Why does this broadcast operation require specialization for optimal performance?

julia> g1(f, x, y, z) = (y .= f.(x, z); y);

julia> g2(f, x, y, z) = (broadcast!(f, y, x, z); y);

julia> g3(f::F, x, y, z) where {F} = (y .= f.(x, z); y);

julia> x = rand(100); y = zeros(100); z = rand(100);

julia> @btime g1(*, $x, $y, $z);
  231.065 ns (2 allocations: 64 bytes)

julia> @btime g2(*, $x, $y, $z);
  23.848 ns (0 allocations: 0 bytes)

julia> @btime g3(*, $x, $y, $z);
  28.712 ns (0 allocations: 0 bytes)

julia> g1(*, x, y, z) == g2(*, x, y, z) == g3(*, x, y, z)
true

julia> VERSION
v"1.9.0-beta3"

I don’t see any dynamic dispatch in g1:

julia> @report_opt g1(*, x, y, z)
No errors detected

Why does g1 require specializing on the function for optimal performance?


See the Performance Tips. Unlike almost everything else, functions passed as arguments aren’t specialized on by default.

It is confusing though. The Performance Tips say that specialization only happens when the function or type is “used” (which I presume means called) and not when it is merely passed as an argument to a higher-order function. So I would not expect g2 to specialize on f; I would expect g2 to perform the same as g1.

It’s also strange that @report_opt reports no dynamic dispatch in g1 when that’s what a lack of specialization would cause. The allocations and the timing seem to suggest it.
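
For reference, the two patterns that the Performance Tips contrast can be sketched roughly as below. The names `passthrough` and `forced` are hypothetical and chosen only for illustration. The Performance Tips also note that `@code_typed` and friends always show specialized code, even where Julia would not normally specialize the call, which may be part of why an analysis of the concrete call reports nothing.

```julia
# Hypothetical names, for illustration only.

# `f` is only passed through to `map` and never called in this body, so
# Julia's heuristic may skip specializing this method on typeof(f).
passthrough(f, xs) = map(f, xs)

# The type parameter F forces specialization on the function's type,
# the same trick g3 uses.
forced(f::F, xs) where {F} = map(f, xs)

xs = [1.0, 4.0, 9.0]

# Both return the same values; only the compiled specializations may differ.
passthrough(sqrt, xs) == forced(sqrt, xs)
```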


Despite being very familiar with that section of the Performance Tips, I frequently fail to accurately predict when it will and won’t apply. I’m not sure that it is entirely clear or accurate. Next time I catch a case with unexpected behavior I’ll be sure to post about it, but it seems this thread already has an example where it’s either misleading or otherwise requires understanding the nuance of broadcast lowering.


Perhaps I’m starting to understand now. Looking into what g2 is calling, I see

broadcast!(f::Tf, dest, As::Vararg{Any,N}) where {Tf,N} = (materialize!(dest, broadcasted(f, As...)); dest)

so broadcast! forces a specialization on the function before passing it to broadcasted, whereas y .= f.(x, z) lowers to a call to broadcasted directly and simply passes the function along without specializing on it. As for why this makes a difference, I’m not certain, since in either case the function should ultimately end up in a Broadcasted object, which does specialize on it.

If this is indeed the issue, perhaps the specialization should happen in broadcasted instead, which would handle all cases?
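
As a rough sketch of the lowering discussed here: the dot-syntax assignment and a hand-written `materialize!(dest, broadcasted(f, args...))` call should produce the same result. The variable names `y_dot` and `y_manual` are made up for this example, and the exact lowered form may vary between Julia versions, so it is worth checking with `Meta.@lower` locally.

```julia
using Base.Broadcast: broadcasted, materialize!

x = rand(100); z = rand(100)
y_dot = zeros(100); y_manual = zeros(100)
f = *

# Dot syntax, as in g1.
y_dot .= f.(x, z)

# The same operation written out by hand: build the lazy Broadcasted
# object, then copy its elements into the destination.
materialize!(y_manual, broadcasted(f, x, z))

y_dot == y_manual

# To inspect what the dot syntax lowers to on your Julia version:
# Meta.@lower y_dot .= f.(x, z)
```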


There is a lot of @inline-ing going on in broadcast.jl, and I’m not exactly sure whether the chain of function calls is simply inlined to the point where only specializing function calls remain. I suspect so because of g3’s performance: even if g3 itself specializes, if y .= f.(x, z) doesn’t, then I would expect runtime dispatch.

Which makes me speculate this is some inlining into the @btime loop going on? But I don’t know of a way to simply inspect which function calls are inlineable.

The PR “broadcast: disable nospecialize logic for outer method signature” by vtjnash (Pull Request #43200, JuliaLang/julia) would reduce the possibility of similar problems.
The extensive use of @inline does make life worse in many places (a non-inlined materialize! would always be specialized).

Of course, if we remove all @inline then the const propagation of broadcast would be turned off,
which means that y .= 2 .* x and f(x) = 2x; y .= f.(x) might have different performance.

Indeed, in this case, adding call-site inlining seems to remove the disparity:

julia> @btime @inline g1(*, $x, $y, $z);
  21.322 ns (0 allocations: 0 bytes)

julia> @btime g1(*, $x, $y, $z);
  213.411 ns (2 allocations: 64 bytes)