It is confusing, though. The Performance Tips say that specialization only happens when the function or type argument is “used” (which I presume means called) within the method, and not when it is merely passed through to another function. So I would not expect g2 to specialize on f; I would expect it to perform the same as g1.
It’s also strange that @report_opt reports no dynamic dispatch in g1, when dynamic dispatch is exactly what a lack of specialization would cause. The allocations and timings do seem to suggest it.
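For reference, here is a minimal sketch of the three situations that section of the Performance Tips distinguishes, as I understand it. The helper names (passthrough, forced, called) are hypothetical and are not the g1/g2 from this thread; the comments describe the documented heuristic, not a measured result.

```julia
# Hypothetical helpers for illustration; not the g1/g2 from this thread.

# `f` is only passed through to `map` and never called in this method,
# so by the documented heuristic Julia may choose not to specialize on it.
passthrough(f, xs) = map(f, xs)

# The type parameter `F` asks the compiler to specialize on the function type.
forced(f::F, xs) where {F} = map(f, xs)

# `f` is called directly in the method body, which counts as being "used".
function called(f, xs)
    out = similar(xs)
    for i in eachindex(xs)
        out[i] = f(xs[i])
    end
    return out
end

xs = [-1.0, 2.0, -3.0]
println(passthrough(abs, xs))
println(forced(abs, xs))
println(called(abs, xs))
```

All three return the same values; the only intended difference is whether the method is compiled specifically for typeof(f).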
Despite being very familiar with that section of the Performance Tips, I frequently fail to accurately predict when it will and won’t apply. I’m not sure that it is entirely clear or accurate. Next time I catch a case with unexpected behavior I’ll be sure to post about it, but it seems this thread already has an example where it’s either misleading or otherwise requires understanding the nuance of broadcast lowering.
Perhaps I’m starting to understand now. Looking into what g2 is calling, I see
broadcast!(f::Tf, dest, As::Vararg{Any,N}) where {Tf,N} = (materialize!(dest, broadcasted(f, As...)); dest)
so broadcast! forces a specialization on the function before passing it to broadcasted, whereas y .= x .* z lowers to broadcasted, and simply passes the function to it without specializing on it. As for why this makes a difference, I’m not certain, since it should ultimately be passed down to Broadcasted in either case, which does specialize on the function.
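To make the two entry points concrete, here is a small sketch comparing the dot syntax, its hand-written lowered form, and the broadcast! call. The arrays are made-up inputs, and Meta.@lower is only there to display how the fused assignment is lowered on your Julia version.

```julia
x = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]
y1 = similar(x)
y2 = similar(x)
y3 = similar(x)

# Show the lowered form of the fused in-place broadcast.
println(Meta.@lower y1 .= x .* z)

# Dot syntax: goes through broadcasted and materialize! directly.
y1 .= x .* z

# The same thing written out by hand.
Base.materialize!(y2, Base.broadcasted(*, x, z))

# The broadcast! method quoted above, which has `f::Tf` in its signature.
broadcast!(*, y3, x, z)
```

All three paths compute the same result; they differ only in which methods the function argument passes through on the way to Broadcasted.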
If this is indeed the issue, perhaps the specialization should happen in broadcasted instead, which would handle all cases?
There is a lot of @inline in broadcast.jl, and I’m not sure whether the chain of function calls is simply inlined to the point where only the specializing calls remain. I do suspect so because of g3’s performance: even if g3 itself specializes, I would expect runtime dispatch if y .= f.(x, z) did not.
That makes me wonder whether some inlining into the @btime loop is going on. But I don’t know of a simple way to inspect which function calls are inlineable.
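One option that may help is looking at the optimized IR with code_typed (or @code_typed at the REPL), since it shows what is left after inlining. Calls that were inlined no longer appear as separate calls, an invoke statement marks a statically resolved call that was not inlined, and a remaining plain call to a generic function (as opposed to a builtin or intrinsic) usually points to dynamic dispatch. The sketch below uses g3_sketch, a hypothetical stand-in for g3, so the exact output will differ from the functions in this thread.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

# Hypothetical stand-in for g3; the real definition is earlier in the thread.
g3_sketch(f::F, y, x, z) where {F} = (y .= f.(x, z); y)

argtypes = (typeof(*), Vector{Float64}, Vector{Float64}, Vector{Float64})

# Optimized (post-inlining) IR together with the inferred return type.
ci, rettype = only(code_typed(g3_sketch, argtypes; optimize=true))

println(ci)
println("inferred return type: ", rettype)
```

Comparing this output for a specializing and a non-specializing definition should show whether the broadcast machinery was inlined all the way down or whether a dispatch remains.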
Of course, if we removed all the @inline annotations, constant propagation through broadcast would be turned off, which means that y .= 2 .* x and f(x) = 2x; y .= f.(x) might have different performance.