Be less aware of when Julia avoids specializing?

I count roughly 11 Discourse threads in the past 6 months that resolve performance issues by linking to the near-infamous performance tip on when Julia avoids specializing. The purpose of this thread is to revisit this choice of default behavior and consider whether an alternative may be preferable.

To recap: there are three cases where Julia may choose not to specialize on an argument type: Function, Type, and Vararg. This non-specialization can save the compiler the effort of recompiling functions for different argument types that actually have little or no effect on the generated code. This default can be overridden by adding otherwise-unnecessary parametric annotations to those arguments in the method signature; conversely, non-specialization can be forced on other arguments with @nospecialize or similar code patterns. Using an otherwise non-specialized argument explicitly within the function body usually causes specialization (although I recall seeing cases where it doesn’t), and the prevailing advice of “write short functions for clarity and rely on inlining to remove the cost” works against this.
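For concreteness, here is a minimal sketch of the two override directions (hypothetical function names, not taken from the performance tip):

# A function argument that is only passed through is a candidate for
# non-specialization by default; the otherwise-unnecessary type parameter
# forces specialization.
@noinline call_inner(f, x) = f(x)
pass_default(f, x) = call_inner(f, x)              # may compile once for ::Function
pass_forced(f::F, x) where {F} = call_inner(f, x)  # compiles once per concrete f

# Conversely, @nospecialize asks the compiler not to specialize on an argument
# it otherwise would specialize on.
type_name(@nospecialize(x)) = string(typeof(x))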

The current default seems to lead directly to ~2 issues per month on Discourse, although the iceberg probably goes much deeper in its (unnoticed or ignored) effect on members of the community. My speculation is that, “on average,” user code would see performance benefits (even accounting for regression due to extra compilation) if we removed some or all of the three special cases (and re-added the current behavior in a few key snippets that rely on the current default).

For my own code, personally, I’ve overridden almost every place where I’ve noticed this non-specialization in effect, because I only ever use the functions in question with a small number of argument types within a session (so the compiler/memory load is modest) and the non-specialization results in considerable performance losses.

The non-specialization is definitely important, but the places where it is important are often places like Base or package code, where a method might be called with many different argument types within a single application. This code is contributed by (or at least reviewed by) veteran users who can be expected to know the performance tips and who are putting extra effort into this reusable code. With a change to the default, it would become necessary to annotate some sites with @nospecialize for the same reason the current defaults exist, but I suspect that code outside of Base and a small set of packages would require many fewer annotations than we use now.

Going out on a limb, I suspect that “average” performance across existing Julia code (most of which is private and written by individuals with shallower knowledge than Discourse regulars) would be positively affected by a changed default. And I think that “this code seems slow to compile” is a more suitable and less frequent issue to require Discourse help (if a bit more complicated to diagnose and resolve) than the “this code is slow and allocates a ton even though I’ve done everything right except to know this special exception” issues we see currently.

Changing the default in only a subset of these three cases is also worth considering.

12 Likes

+1

Or at least having much more sensible defaults given the current state of Julia precompilation and package extensions. I could see raising the upper bound on how many Vararg arguments get specialized before non-specialization kicks in as an easy improvement. Right now it seems too extreme, imo:

julia> f(x...) = x
f (generic function with 1 method)

julia> f(1, 2)
(1, 2)

julia> Base.specializations(@which f(1, 2))
Base.MethodSpecializations(MethodInstance for f(::Int64, ::Vararg{Int64}))

which is not great when, as you mention, it’s often encouraged to write generic code without worrying about lost performance.
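For comparison, here is a sketch of the parametric-Vararg override with the same check (exact printed MethodInstances may differ across Julia versions):

using InteractiveUtils  # for @which outside the REPL

g(xs::Vararg{Any,N}) where {N} = xs   # fixing the length forces full specialization
g(1, 2)
Base.specializations(@which g(1, 2))  # expect g(::Int64, ::Int64) rather than a Vararg widening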

Also worth noting that the Vararg specialization rules are currently so complicated that they seem to result in unexpected behavior when you combine them with other non-specializing types.

I think simpler rules would generally be better if the specialization behavior is changed.

1 Like

All these heuristics that can lead to slow code in an unexpected place are definitely my #1 gripe in terms of writing composable performant code. Especially noticeable when using higher-order functions liberally.
Would be nice to solve this issue somehow… One partial approach could be to let function authors/users opt into higher specialization/inference limits; see Allow more aggressive inference for some functions · Issue #52239 · JuliaLang/julia · GitHub.

As for this specific linked “performance tip”, I personally still don’t fully understand what the docs mean by “argument is used” in

Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function.

and how a lack of specialization does or doesn’t affect the performance of functions that “just pass through” this argument.

1 Like

Indeed, very strange.

To share one example, I forced some Vararg specialization in CUDA.jl last week and it resulted in an over 100% performance gain for small CUDA kernel launches.

If even very established and optimized packages like CUDA.jl can instantly see improvements of this magnitude, it would be very interesting to see the performance improvements over the entire ecosystem.

1 Like

Unfortunately it didn’t seem to in practice. The linked writing is about functions specifically, but excessive specialization also applies to types and Vararg. Specialization is a double-edged sword: we compile more versions of a method in exchange for optimizing execution. This is worth it if we compile for a fixed number of types and reuse the compiled code in hot loops, but it backfires if we compile for an arbitrarily large number of types, especially if the compiled code is not reused, as in reflection. Varying over functions, Type{T}, and Vararg naturally involves an arbitrarily large number of types, and it turns out their unconditional automatic specialization makes base Julia unusable.
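As a sketch of the Type case (hypothetical helper names): a reflection-style function called with many different types is exactly where non-specialization pays off, while the parametric form compiles a fresh copy per type.

# Default: `t` is only passed along to `fieldnames`, so Julia will typically not
# compile a separate copy of this method per concrete type.
field_count(t::Type) = length(fieldnames(t))

# Parametric form: forces one compiled specialization per `T` it is called with.
field_count_forced(::Type{T}) where {T} = length(fieldnames(T))

field_count(Complex{Float64})  # == 2 either way; only the compilation behavior differs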

It seems to just mean “called” but I’d want an expert on the compiler to clarify that.
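One way to probe that empirically is to compare a method that calls its function argument with one that only forwards it, and look at which MethodInstances actually get compiled (a sketch; printed signatures may vary by Julia version):

using InteractiveUtils  # for @which outside the REPL

@noinline calls_f(f, x) = f(x)    # f is called here, so this specializes per function
forwards_f(f, x) = calls_f(f, x)  # f is only passed through

forwards_f(sin, 1.0)
forwards_f(cos, 1.0)

Base.specializations(@which forwards_f(sin, 1.0))  # typically one widened ::Function instance
Base.specializations(@which calls_f(sin, 1.0))     # typically one instance per function passed in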

That’s also vaguely worded to the point of possibly being misleading, but there are a couple of ways it can play out:

  • Performance is usually hurt if the output of said higher-order function needs to be used in the rest of the code, because runtime dispatches need to handle the abstractly inferred return type. But if that’s not the case, you don’t see a performance issue; in fact, you may benefit from far less compilation over an arbitrarily large number of input functions. We really don’t need to compile foo(f, x) = @noinline bar(f, baz(x)) for every f if foo is only called at top-level; we’d only need to compile the bar doing the real work of calling f.
  • Even in the cases where the return type should be inferred for performance, method inlining can make this moot. map itself doesn’t call the input function; it ends up specializing over it because of propagated inlining of the callees that do call it. This doesn’t cause compilation bloat because the inlined callees aren’t compiled separately.

Taking this into account, I view this as a weird exception that we have to live with because excessive compilation is harder to diagnose and reverse, and manually opting into non-specialization would involve more work.

2 Likes

I don’t think we are in disagreement here that specializing on absolutely everything = bad. The question is really how far to push the specialization now that Julia has improved in other areas.

For functions, it is binary yes/no, but for Vararg it is a sliding scale along multiple axes. I would like to see hard data on where the optimal tradeoff is with current Julia and whether it makes sense to slide the scale further…

2 Likes

The CUDA example had a plain args... argument. If all elements had the same type, you would get a specialization for that type, but if not, it specializes on only the first element and infers the rest as Any. Specializing over heterogeneous elements is justified in hot loops like repeated CUDA kernel calls, but special cases don’t make for general rules, and the general rule prevents severe compilation bloat.

Also worth mentioning that inference over runtime types can happen without specialization, as the docs for @nospecialize versus @nospecializeinfer show. Uncalled input functions, types, and obviously Vararg seem to work more like the latter based on the performance of subsequent code in callers. If there is ever a “what if we specialized on everything by default” benchmark made for updated versions of Julia, it’d be interesting to also test “what if we inferred over everything by default so we at least don’t store more compiled code”.
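For anyone who hasn’t seen the two macros side by side, here is a sketch of that distinction (requires a Julia version that provides Base.@nospecializeinfer, i.e. 1.10+; the comments paraphrase the docstrings):

# @nospecialize: don't generate specialized code for `x`, but inference at the
# call site can still use the concrete argument type, so callers may still see
# a precise return type.
first_item(@nospecialize(x)) = first(x)

# Base.@nospecializeinfer: additionally infer this method using the *declared*
# type of the @nospecialize'd argument, avoiding inference-time specialization too.
Base.@nospecializeinfer function first_item_noinfer(@nospecialize(x))
    return first(x)
end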