Why does this broadcast operation require specialization for optimal performance?

jishnub · February 13, 2023, 10:12am

julia> g1(f, x, y, z) = (y .= f.(x, z); y);

julia> g2(f, x, y, z) = (broadcast!(f, y, x, z); y);

julia> g3(f::F, x, y, z) where {F} = (y .= f.(x, z); y);

julia> x = rand(100); y = zeros(100); z = rand(100);

julia> @btime g1(*, $x, $y, $z);
  231.065 ns (2 allocations: 64 bytes)

julia> @btime g2(*, $x, $y, $z);
  23.848 ns (0 allocations: 0 bytes)

julia> @btime g3(*, $x, $y, $z);
  28.712 ns (0 allocations: 0 bytes)

julia> g1(*, x, y, z) == g2(*, x, y, z) == g3(*, x, y, z)
true

julia> VERSION
v"1.9.0-beta3"

I don’t see any dynamic dispatch in g1:

julia> @report_opt g1(*, x, y, z)
No errors detected

Why does g1 require specializing on the function for optimal performance?

Oscar_Smith · February 13, 2023, 1:41pm

See the performance tips. Unlike almost everything else, functions passed as arguments aren’t specialized on by default.

Benny · February 13, 2023, 1:50pm

It is confusing though. The performance tips say that specialization only happens when the function or type is “used” (which I presume means called), not when it is merely passed as an argument to a higher-order function. So I would not expect g2 to specialize on f; I would expect g2 to perform the same as g1.

It’s also strange that @report_opt reports no dynamic dispatch in g1, when dynamic dispatch is exactly what a lack of specialization would cause. The allocations and the timings seem to suggest it.
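One way to check what was actually compiled is to look at the method's specializations rather than at inferred code. The Performance Tips note that `@code_typed` and friends always show specialized code, even when Julia would not normally specialize that call; presumably the same holds for `@report_opt`, which analyzes the call with concrete argument types. A minimal sketch, assuming Julia 1.10 or later, where `Base.specializations` is available (on 1.9, the version used in this thread, the manual suggests inspecting `(@which g1(*, x, y, z)).specializations` instead):

```julia
using InteractiveUtils  # provides @which outside the REPL

g1(f, x, y, z) = (y .= f.(x, z); y)                # f is only passed along
g3(f::F, x, y, z) where {F} = (y .= f.(x, z); y)   # forces specialization on f

x = rand(100); y = zeros(100); z = rand(100)

# Call each method once so that it gets compiled.
g1(*, x, y, z)
g3(*, x, y, z)

# List the signatures that were compiled for each method.
# Assumption: Base.specializations requires Julia 1.10+.
for g in (g1, g3)
    m = @which g(*, x, y, z)
    println(g, ":")
    for mi in Base.specializations(m)
        println("  ", mi.specTypes)
    end
end
```

If a signature listed for g1 has `Function` in the first argument slot rather than `typeof(*)`, that instance was not specialized on the function.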

mikmoore · February 13, 2023, 3:39pm

Despite being very familiar with that section of the Performance Tips, I frequently fail to accurately predict when it will and won’t apply. I’m not sure that it is entirely clear or accurate. Next time I catch a case with unexpected behavior I’ll be sure to post about it, but it seems this thread already has an example where it’s either misleading or otherwise requires understanding the nuance of broadcast lowering.

jishnub · February 13, 2023, 4:30pm

Perhaps I’m starting to understand now. Looking into what g2 is calling, I see

broadcast!(f::Tf, dest, As::Vararg{Any,N}) where {Tf,N} = (materialize!(dest, broadcasted(f, As...)); dest)

so broadcast! forces a specialization on the function before passing it to broadcasted, whereas y .= f.(x, z) lowers to a call to broadcasted and simply passes the function to it without specializing on it. As for why this makes a difference, I’m not certain, since in either case the function should ultimately be passed down to Broadcasted, which does specialize on it.

If this is indeed the issue, perhaps the specialization should happen in broadcasted instead, which would handle all cases?
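The two paths can be compared by hand. A minimal sketch, relying only on the broadcast! definition quoted above: `Meta.@lower` prints the lowered form of the dot expression, and the manual calls mirror what broadcast! does. The variable names follow the example at the top of the thread.

```julia
using Base.Broadcast: Broadcasted, broadcasted, materialize!

x = rand(100); y = zeros(100); z = rand(100)
f = *

# Print the lowered form of the fused assignment; it should contain calls to
# Base.broadcasted and Base.materialize!.
lowered = Meta.@lower y .= f.(x, z)
println(lowered)

# The same steps written out by hand, as in the quoted broadcast! definition.
bc = broadcasted(f, x, z)   # lazy wrapper; nothing is computed yet
println(typeof(bc))         # the function's type is a parameter of Broadcasted
materialize!(y, bc)         # runs the fused loop and writes into y
```

Since the type of `bc` carries `typeof(*)` as a parameter, any code that receives `bc` can specialize on the function; the open question in this thread is what happens in the frames before that object is constructed.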

Benny · February 14, 2023, 11:51am

There is a lot of @inline-ing going on in broadcast.jl, and I’m not exactly sure whether the chain of function calls is just inlined to the point where only specializing function calls remain. I do suspect that, because of g3’s performance: even if g3 itself specializes, if y .= f.(x, z) doesn’t, then I would expect runtime dispatch.

This makes me speculate that some inlining into the @btime loop is going on. But I don’t know of a way to simply inspect which function calls are inlineable.
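One possible way to see which calls survive inlining is to request typed, optimized code for the signature the compiler would plausibly use when it does not specialize on f, that is, with `Function` in place of `typeof(*)`. This is a sketch under that assumption, not a confirmed account of what the runtime compiles here; the `Vector{Float64}` argument types are taken from the example at the top of the thread.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

g1(f, x, y, z) = (y .= f.(x, z); y)

V = Vector{Float64}

# Optimized code when the function argument is fully known.
specialized = code_typed(g1, (typeof(*), V, V, V); optimize=true)

# Optimized code when only `Function` is known about the first argument.
widened = code_typed(g1, (Function, V, V, V); optimize=true)

println(first(specialized))
println(first(widened))
```

Calls that remain as non-inlined invokes or dynamic calls in the second listing are candidates for runtime dispatch. Cthulhu.jl offers an interactive way to descend through such calls.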

N5N3 · February 14, 2023, 1:02pm

The pull request “broadcast: disable nospecialize logic for outer method signature” by vtjnash (Pull Request #43200, JuliaLang/julia on GitHub) would reduce the possibility of similar problems.
The extensive use of @inline does make life worse in many places, since a non-inlined materialize! would always be specialized.

Of course, if we removed all the @inline annotations, constant propagation through broadcast would be turned off, which means that y .= 2 .* x and f(x) = 2x; y .= f.(x) might have different performance.

jishnub · February 15, 2023, 3:08pm

Indeed, in this case, adding call-site inlining seems to remove the disparity:

julia> @btime @inline g1(*, $x, $y, $z);
  21.322 ns (0 allocations: 0 bytes)

julia> @btime g1(*, $x, $y, $z);
  213.411 ns (2 allocations: 64 bytes)
