Performance issue due to function as an argument

using LinearAlgebra, InteractiveUtils, BenchmarkTools

function process_matrix(f, matrix, vectors)
    N, M, L = size(matrix)
    #coefficients = map(f, vectors)
    #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
    ret = zeros(ComplexF64, L, L)
    for i in 1:L, j in 1:L
        for k in 1:N
            for a in 1:M, b in 1:M
                ret[i, j] = coefficients[k][a, b] * matrix[k, a, i] * matrix[k, b, j]
            end
        end
    end

    return ret
end

function _f(v)
    return ones(Float64, 4, 4) * norm(v)
end

N = 6000
M = 4
L = 12
vectors = [rand(2) for _ in 1:N]
matrix = rand(ComplexF64, N, M, L)

@btime process_matrix(_f, matrix, vectors)

With this script, I get 3.014 s (81778660 allocations: 2.04 GiB) if I uncomment the first comment (defining coefficients) and 22.304 ms (12003 allocations: 2.25 MiB) if I uncomment the second comment.

It seems the two definitions of coefficients should be the same, and both versions do not show type instability from @code_warntype. What is causing the huge performance difference?

EDIT: While this (and the many helpful additions from everyone below) answers the original question, as to what causes the issue, a (in my opinion) nicer and more maintainable way to improve the performance of the code is to add a function barrier, as described further below.

Looks like it’s related to these exceptions that disable specialization of methods to a new argument type signature
https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing

Changing the function definition to this should work:

function process_matrix(f::F, matrix, vectors) where F<:Function

The where F <: Function is optional and could be just where F, but in my opinion it makes it clearer that f should be a function.
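To make the effect concrete, here is a self-contained sketch (toy functions, not the original script) of how the annotation changes specialization:

```julia
# `g` is only passed through to `map`, so by default Julia compiles one
# method instance for `g::Function` (no specialization on `typeof(g)`):
passthrough(g, xs) = map(g, xs)

# The method parameter `G` forces a specialized compile per function type,
# so the return type of `map(g, xs)` can be inferred:
passthrough_spec(g::G, xs) where {G} = map(g, xs)
```

Both return the same result, e.g. `passthrough_spec(abs, [-1, 2]) == [1, 2]`; only the generated code differs.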

PS: Also note that the function arguments in a @btime call should be interpolated, as shown here: Manual · BenchmarkTools.jl
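As a quick illustration of the interpolation point (a minimal sketch with a made-up global x, not the original benchmark):

```julia
using BenchmarkTools

x = rand(1000)  # a non-constant global

# Without `$`, `x` enters the benchmark as an untyped global, which can
# add dispatch overhead and allocations to the measurement:
@btime sum(x)

# With `$`, the value is spliced in, so the call is measured as if `x`
# were a concretely typed local:
@btime sum($x)
```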

6 Likes

@Sevi seems on the right track here. I found a funny alternative: If you simply call f once within process_matrix then the performance difference disappears completely: the call to f forces specialization but later on gets eliminated by the compiler and does not occur in the final code :smile:

function process_matrix(f, matrix, vectors)
    N, M, L = size(matrix)
    f(vectors[1]) # call f once to force specialization - see Sevi's answer
    coefficients = map(f, vectors)
    #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
    ret = zeros(ComplexF64, L, L)
    for i in 1:L, j in 1:L
        for k in 1:N
            for a in 1:M, b in 1:M
                ret[i, j] = coefficients[k][a, b] * matrix[k, a, i] * matrix[k, b, j]
            end
        end
    end

    return ret
end

I was not aware that Julia has these special cases for specialization, so I definitely learned something. However, I still wonder where the allocations/slowdown come from. @code_warntype produces identical results for both versions, but I have the feeling that it actually performs specialization that does not occur during a normal call. If that is the case, should it be considered a bug/an edge case of @code_warntype?

1 Like

@code_warntype at least historically (might have been fixed) forces specialization and so is a liar.
Cthulhu doesn’t do it wrong iirc.

3 Likes

That is new information for me, thanks! I feel this should be mentioned both in the Performance tips for @code_warntype and in the docstring of @code_warntype. Probably also with a reference to the (non-)specialization cases. I’ll open an issue and perhaps a PR to add these warnings+references to the docstring.

Honestly, I find these special cases for specialization not great. It is mostly an optimization in some cases but will bite you if you are not aware (and I guess most people are not aware but could have the idea of abstracting behavior by using function arguments). However, @nospecialize also exists if you specifically want Julia to not specialize on something (or does that do something subtly different?). Wouldn’t it be much cleaner to say: we always specialize; if you don’t want that, use @nospecialize? Instead of “well, we usually specialize, except for Function, Type and Vararg, which do get specialized if the argument is used directly. If you don’t want that, then you need to add a type declaration, which is otherwise only used for dispatch and never for performance.” Seems quite convoluted to me, but it is probably too late to change now.

2 Likes

Here is the issue #51415. I also tried replacing @code_warntype with @descend from Cthulhu.jl, but that got me the same results. I never used it before so I might have overlooked something, but otherwise it seems that Cthulhu.jl also misses the non-specialization.

1 Like

Thank you.

A relevant question: the Julia manual says

As a heuristic, Julia avoids automatically specializing on argument type parameters in three specific cases: Type, Function, and Vararg. Julia will always specialize when the argument is used within the method, but not if the argument is just passed through to another function. This usually has no performance impact at runtime and improves compiler performance.

I don’t understand how this behavior usually has no performance impact. Maybe someone can shed some light on this for me.

If the function is not used, but merely passed along to another function, then there is no value in specializing. And not specializing can save the compiler from producing lots of pointless specializations, hogging memory.

I think the optimization is well worth it, but it would be nice to fix the bug in the code introspection functions.

3 Likes

But with this reasoning there should be no performance difference in this case, no? After all, the function is just passed on to map. I think I don’t understand what the quoted passage from the manual actually means.

@descend also seems to produce identical typed code for both versions, and the types seem to be specialized in both:

Cthulhu output

original process_matrix:

julia> @descend process_matrix(_f, matrix, vectors)
process_matrix(f, matrix, vectors) in Main at REPL[10]:1
 1 function process_matrix(f::typeof(_f), matrix::Array{ComplexF64, 3}, vectors::Vector{Vector{Float64}})::Matrix{ComplexF64}
 2     N::Int64, M::Int64, L::Int64 = size(matrix::Array{ComplexF64, 3})::Tuple{Int64, Int64}::Int64
 3     coefficients::Vector{Matrix{Float64}} = map(f::typeof(_f), vectors::Vector{Vector{Float64}})::Vector{Matrix{Float64}}
 4     #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
 5     ret::Matrix{ComplexF64} = zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)::Matrix{ComplexF64}
 6     for i::Int64 in 1:L, j in 1:L
 7         for k::Int64 in (1:N::Int64)::Int64::Union{Nothing, Tuple{Int64, Int64}}
 8             for a::Int64 in 1:M, b in 1:M
 9                 ret::Matrix{ComplexF64}[i::Int64, j::Int64] = (coefficients::Vector{Matrix{Float64}}[k::Int64]::Matrix{Float64}[a, b::Int64]::Float64 * matrix::Array{ComplexF64, 3}[k::Int64, a::Int64, i::Int64]::ComplexF64 * matrix::Array{ComplexF64, 3}[k::Int64, b::Int64, j::Int64]::ComplexF64)::ComplexF64
10             end
11         end
12     end
13 
14     return ret::Matrix{ComplexF64}
15 end
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [w]arn, [h]ide type-stable statements, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
 • size(matrix)
   size(matrix)
   size(matrix)
   size(matrix::Array{ComplexF64, 3})
   map(f::typeof(_f), vectors::Vector{Vector{Float64}})
   zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)
   %12 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
   %13 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
   %20 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
v  %21 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}

process_matrix2 with forced specialization:

julia> @descend process_matrix2(_f, matrix, vectors)
process_matrix2(f::F, matrix, vectors) where F<:Function in Main at REPL[13]:1
 1 function (process_matrix2(f::typeof(_f)::F, matrix::Array{ComplexF64, 3}, vectors::Vector{Vector{Float64}}) where F<:Function)::Matrix{ComplexF64}
 2     N::Int64, M::Int64, L::Int64 = size(matrix::Array{ComplexF64, 3})::Tuple{Int64, Int64}::Int64
 3     coefficients::Vector{Matrix{Float64}} = map(f::typeof(_f), vectors::Vector{Vector{Float64}})::Vector{Matrix{Float64}}
 4     #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
 5     ret::Matrix{ComplexF64} = zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)::Matrix{ComplexF64}
 6     for i::Int64 in 1:L, j in 1:L
 7         for k::Int64 in (1:N::Int64)::Int64::Union{Nothing, Tuple{Int64, Int64}}
 8             for a::Int64 in 1:M, b in 1:M
 9                 ret::Matrix{ComplexF64}[i::Int64, j::Int64] = (coefficients::Vector{Matrix{Float64}}[k::Int64]::Matrix{Float64}[a, b::Int64]::Float64 * matrix::Array{ComplexF64, 3}[k::Int64, a::Int64, i::Int64]::ComplexF64 * matrix::Array{ComplexF64, 3}[k::Int64, b::Int64, j::Int64]::ComplexF64)::ComplexF64
10             end
11         end
12     end
13 
14     return ret::Matrix{ComplexF64}
15 end
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [w]arn, [h]ide type-stable statements, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
 • size(matrix)
   size(matrix)
   size(matrix)
   size(matrix::Array{ComplexF64, 3})
   map(f::typeof(_f), vectors::Vector{Vector{Float64}})
   zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)
   %12 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
   %13 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
   %20 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
v  %21 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}

I tried to reduce the MWE a bit further to figure out what’s going on, and it looks like the problem with the non-specialized version is that the output type of map(f, vectors) cannot be inferred and hence every access to coefficients creates extra allocations (which makes sense to me).

Smaller MWE

Note that the non-annotated version has ~ 60 x 60 = 3600 allocations more than the annotated one, which is the number of getindex calls for coefficients:

using LinearAlgebra
using BenchmarkTools

vectors = [rand(2) for _ in 1:60]

function _f(v)
    return ones(Float64, 4, 4) * norm(v)
end

function process_matrix_redux(f, vectors)
    coefficients = map(f, vectors)
    N = length(vectors)
    ret = zeros(N, N)
    for i in 1:N, j in 1:N
        ret[i, j] = first(coefficients[i])
    end

    return ret
end

# @btime process_matrix_redux($_f, $vectors)
# 446.167 μs (3724 allocations: 107.47 KiB)

function process_matrix_redux_annotated(f::F, vectors) where F
    coefficients = map(f, vectors)
    N = length(vectors)
    ret = zeros(N, N)
    for i in 1:N, j in 1:N
        ret[i, j] = first(coefficients[i])
    end

    return ret
end

# @btime process_matrix_redux_annotated($_f, $vectors)
# 10.171 μs (123 allocations: 51.20 KiB)

Another minor weirdness: The docs mention (@which process_matrix_redux(_f, vectors)).specializations, which does in fact show that there is only a non-specialized version:

# Right after defining everything
julia> (@which process_matrix_redux(_f, vectors)).specializations
svec(MethodInstance for process_matrix_redux(::Function, ::Vector{Vector{Float64}}), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

however, it only does so when called before calling the benchmark code (I get why this also happens when calling @code_warntype, since it apparently triggers a new specialization, but why does it happen with @btime :sweat_smile: ? )

# After doing `@btime process_matrix_redux($_f, $vectors)`
julia> (@which process_matrix_redux(_f, vectors)).specializations
svec(MethodInstance for process_matrix_redux(::Function, ::Vector{Vector{Float64}}), MethodInstance for process_matrix_redux(::typeof(_f), ::Vector{Vector{Float64}}), nothing, nothing, nothing, nothing, nothing, nothing)

Now there is seemingly a new specialization MethodInstance for process_matrix_redux(::typeof(_f), ::Vector{Vector{Float64}}), but the code that is actually run is still the non-specialized one.

The language is a bit loose; it’s more that the impact is small and vastly outweighed by other performance variations. Looking at specifics, when you don’t specialize on a function f, then you 1) have to dynamically dispatch a higher-order function g(f, args) and 2) cannot infer its return type at compile time, so subsequent code needs to incorporate runtime type checks and dispatches.

In OP’s case, the 2nd part really hurt performance because the subsequent code loops a lot. This can be helped by using a function barrier, in other words putting the subsequent code in a function that can be dispatched to at runtime and compiled for the specific types (no runtime checks and dispatches).
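A generic sketch of that function-barrier pattern (hypothetical names kernel/outer, not the code from this thread):

```julia
# `kernel` is the barrier: when it is called, Julia dispatches on the
# concrete type of `coefficients` and compiles a fully inferred body
# for the hot loop.
function kernel(coefficients)
    s = 0.0
    for c in coefficients
        s += first(c)
    end
    return s
end

function outer(f, vectors)
    coefficients = map(f, vectors)  # return type may be uninferred here
    return kernel(coefficients)     # one dynamic dispatch, then fast code
end
```

This way a possibly uninferred map result costs one dynamic dispatch at the barrier instead of runtime checks on every loop iteration.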

The 1st part can hurt performance if the overall function is called frequently or in a loop, again because of uninferred return types and dispatches. Even that 1 runtime dispatch g(f, args) can add up to a performance loss in some scenarios.

But not all code is loops and not all functions are frequently called, so slowing those down has an insignificant impact on the program’s performance. And it’s worth avoiding the costs of compiling more code (excerpt from a link earlier in the thread):

Generating a new type for every function has potentially serious consequences for compiler resource use when combined with Julia’s “specialize on all arguments by default” design. Indeed, the initial implementation of this design suffered from much longer build and test times, higher memory use, and a system image nearly 2x larger than the baseline. In a naive implementation, the problem is bad enough to make the system nearly unusable. Several significant optimizations were needed to make the design practical.

The link goes on to more precisely explain what “using” an argument means (I really dislike that language in the Performance Tips):

Many functions simply “pass through” an argument to somewhere else, e.g. to another function or to a storage location. Such functions do not need to be specialized for every closure that might be passed in. Fortunately this case is easy to distinguish by simply considering whether a function calls one of its arguments (i.e. the argument appears in “head position” somewhere). Performance-critical higher-order functions like map certainly call their argument function and so will still be specialized as expected.

That last part does not go on to mention that inlining of small specialized functions will often propagate across unspecialized calls and remove the performance cost. In fact, map itself does not call the argument function in its body, it happens in one of several internal functions that gets inlined up several layers of calls.

julia> begin
       foo1(x)=x+1
       bar(f, x) = f(x) # specializes on f, so bar(foo1, x) inlines to x+1
       baz(f, x) = bar(f, x) # not specialized on f, so only inlines to f(x)
       foo1_2(x) = baz(foo1, x) # inlines to bar(foo1, x) to foo1(x) to x+1
       using BenchmarkTools
       end;
julia> @which(bar(foo1, 1)).specializations # specialized on foo1
svec(MethodInstance for bar(::typeof(foo1), ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> @which(baz(foo1, 1)).specializations # not specialized on foo1
svec(MethodInstance for baz(::Function, ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> @btime @noinline foo1($1); # loop has extra cost of function call
  4.319 ns (0 allocations: 0 bytes)
julia> @btime @noinline bar($foo1, $1);
  4.320 ns (0 allocations: 0 bytes)
julia> @btime @noinline baz($foo1, $1); # forces unspecialized call, otherwise inlined
  20.943 ns (0 allocations: 0 bytes)
julia> @btime @noinline foo1_2($1);
  4.319 ns (0 allocations: 0 bytes)

My guess is this reflects the inlining of process_matrix_redux(_f, ...) into the benchmark loop. You can see this in my example above, before doing the benchmarking:

julia> baz(foo1, 1); (@which baz(foo1, 1)).specializations
svec(MethodInstance for baz(::Function, ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)

julia> foo1_2(1); (@which baz(foo1, 1)).specializations
svec(MethodInstance for baz(::Function, ::Int64), MethodInstance for baz(::typeof(foo1), ::Int64), nothing, nothing, nothing, nothing, nothing, nothing)
2 Likes

I fully agree that there are instances where you don’t want specialization. However, I really dislike this implicit behavior, which makes Julia code much harder to reason about.

The current logic contains an exception to the exception, and the fix is very hacky: it uses a totally different part of the language, one that usually has nothing to do with performance optimization.

Julia usually specializes on all arguments except for Function, Type and VarArg, except if that argument is used directly. If you don’t want that then you need to put a type-declaration, which is otherwise only used for dispatch reasons and never for performance reasons.

However, we already have @nospecialize to explicitly forego specialization, and we try to educate users that more type parameters are [not always good](Performance Tips · The Julia Language) due to the increased load on the compiler. I think many people don’t necessarily realize that each function is its own type, but that is quickly explained and understood if the user knows the other fundamentals. So in principle users have all the knowledge to write higher-order functions and then decide whether to add a @nospecialize for optimization. But maybe more importantly, it is much more obvious when someone complains about slow code and we see a lot of different types flying around. So personally I would strongly prefer the rule to be

Julia always specializes. If you don’t want that, use @nospecialize.

@nospecialize expresses the intention quite clearly and there are good signs (long compile times and many many types) when to use it.
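For reference, a minimal sketch of how that opt-out reads in code (the @nospecialize macro is real; log_value is a hypothetical name):

```julia
# The annotation asks the compiler to compile this body once for the
# declared type (here `Any`) rather than once per concrete type of `x`:
function log_value(@nospecialize(x))
    return string("got a ", typeof(x))
end
```

Dispatch is unaffected; only the amount of generated code changes.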

1 Like

I agree it’s not nice design. The problem is: What if that optimization (namely, not specializing in certain situations) has a big impact on compile times?
In other words, what if the consequence of having a more straightforward set of rules is that Julia’s latency becomes much higher, and that it hogs more memory?

It might be worth revisiting now though with native caching… perhaps.

1 Like

I would argue that we should not lose any performance, if you put some well-placed @nospecialize into the existing code that relies on these special cases. The issue is probably that a good portion of Base and/or the compiler may use them, so in practice it might be a lot of work.

Perhaps if we would want to change this, we could find and change existing functions using the optimization programmatically. In theory this should be possible.

Or am I misunderstanding something and @nospecialize is not equivalent to the case where Julia chooses to not specialize?

1 Like

Putting aside that it’s actually fairly difficult to decide when to put @nospecialize on arguments, let alone at such a scale, it’s not a one-to-one translation either:

  1. @nospecialize does compile for the argument’s declared type in the method, whereas you don’t need a declaration for the compiler to see a function and decide not to specialize. In the latter case, you could think of it as the compiler acting as if ::Function were declared, while the method still dispatches for any other argument types, like types.
  2. Automatically specializing on a type’s type would compile for ::DataType, which doesn’t really help type stability in most cases. Method parameters f(t::T) where T or f(t::Type{T}) where T are the only way to force specialization over an input type. (Aside, the latter form is useful if you want method dispatch to constrain the input type, like T<:Integer. The former form would have T=DataType and such, not very useful to constrain; it’s also less preferred because the semantic method variance by T doesn’t match the actual specialization on t).
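A small sketch of the two forms for type arguments (hypothetical names; behavior per the manual's specialization rules):

```julia
# Default heuristic: compiled for `t::DataType`; the particular type is
# only a runtime value, which rarely helps inference:
zero_of_unforced(t) = zero(t)

# A method parameter forces compilation per concrete `T`, and the
# `Type{T}` form lets dispatch constrain it (e.g. `where T<:Integer`):
zero_of(::Type{T}) where {T} = zero(T)
```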
1 Like

That’s very interesting, thanks!

Just to add my takeaway from this (very enlightening!) discussion: I too think the current behavior of the (non-)specialization is a bit too tricky in this kind of situation. But reading through the explanations so far, it seems like it’s impossible to choose a default behavior that catches all performance pitfalls:

  1. either we specialize on function arguments per default, and provide a way to disable it (perhaps with a different macro than @nospecialize, but with similar effect), but then someone writing code using a lot of closures and user-defined functions might run into performance issues
  2. or we don’t specialize by default (current behavior), and provide a way to force it, but then a case like the original post in this thread may happen, where the result of a non-inferred map call is not behind a function barrier

Both scenarios require the author of the code to be aware of the issue and choose the appropriate strategy. It still feels a bit like opting out of too much specialization would be more desirable than opting in to more specialization, but in the end it’s a bit of a coin toss…


And lastly, I feel like a better fix to the original problem would be to add a function barrier, as @Benny mentioned in passing:

function kernel_operation!(ret, coefficients, matrix)
    N, M, L = size(matrix)
    for i in 1:L, j in 1:L
        for k in 1:N
            for a in 1:M, b in 1:M
                ret[i, j] = coefficients[k][a, b] * matrix[k, a, i] * matrix[k, b, j]
            end
        end
    end
end

function process_matrix_with_barrier(f, matrix, vectors)
    N, M, L = size(matrix)
    coefficients = map(f, vectors)
    ret = zeros(ComplexF64, L, L)
    kernel_operation!(ret, coefficients, matrix)
    return ret
end

This achieves the same performance as the specialization of the argument and leads to (in my opinion) more readable code. Just because the underlying issue is a bit obscure, doesn’t mean the solution has to be :slight_smile:

1 Like

I created an issue based on this discussion: It is too hard to reason about when Julia specializes arguments · Issue #51423 · JuliaLang/julia · GitHub

While I’m myself not technically capable of working on the compiler to sort this mess out, I believe perhaps the following would make the situation better:

  • Julia only skips specializing on types and functions when the result of calling the function is not used in the remainder of the method. E.g. in this case:
function foo(f, x)
    map(f, x) # result not used
    f(x) === nothing && return nothing #  result used, but no need to specialize
    1
end

It would not specialize, but if the result of f(x) was used, it would.

  • #32834 is fixed - although this is less critical to fix if the above happens, because despecialization would not lead to type instability
  • @nospecialize gets argument validation.
3 Likes

It makes sense that @nospecialize doesn’t work on an argument with a method parameter because parametric methods are compiled for each set of parameter values (it would work for the other argument though). It’s only a hint to the compiler and can’t force its way through impossible situations.

I disagree on f(x) === nothing not requiring specialization over f; it would help if the return type of f(x) can be inferred to be Nothing (or whatever other type the comparison value has) or a small Union. Otherwise this seems intriguing, but I have doubts this covers enough ground to handle the compilation bloat that resulted in Function/Type/Vararg being treated specially. It would have to be put to the test.

Also, this would result in input types being exceptional in a different way, in that a type is specialized by argument instance rather than its concrete type; specializing on DataType or UnionAll would not be much different from the status quo. This would simplify the situation by excluding functions, but complicate it by having two distinct rules for types and Varargs, which remain unspecialized by default for good reason.

Hard agree. If a method incorrectly assumes specialization, I’d rather it be a nondefault option or a separate lesser-used method. There are packages that don’t have this problem, but that’s not a good enough reason to have flaws in the simpler built-in methods.

2 Likes