But with this reasoning there should be no performance difference in this case, no? After all, the function is just passed on to map. I think I don’t understand what
actually means.
@descend also seems to produce identical typed code for both versions, and the types seem to be specialized in both:
Cthulhu output
original process_matrix:
julia> @descend process_matrix(_f, matrix, vectors)
process_matrix(f, matrix, vectors) in Main at REPL[10]:1
 1 function process_matrix(f::typeof(_f), matrix::Array{ComplexF64, 3}, vectors::Vector{Vector{Float64}})::Matrix{ComplexF64}
 2     N::Int64, M::Int64, L::Int64 = size(matrix::Array{ComplexF64, 3})::Tuple{Int64, Int64}::Int64
 3     coefficients::Vector{Matrix{Float64}} = map(f::typeof(_f), vectors::Vector{Vector{Float64}})::Vector{Matrix{Float64}}
 4     #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
 5     ret::Matrix{ComplexF64} = zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)::Matrix{ComplexF64}
 6     for i::Int64 in 1:L, j in 1:L
 7         for k::Int64 in (1:N::Int64)::Int64::Union{Nothing, Tuple{Int64, Int64}}
 8             for a::Int64 in 1:M, b in 1:M
 9                 ret::Matrix{ComplexF64}[i::Int64, j::Int64] = (coefficients::Vector{Matrix{Float64}}[k::Int64]::Matrix{Float64}[a, b::Int64]::Float64 * matrix::Array{ComplexF64, 3}[k::Int64, a::Int64, i::Int64]::ComplexF64 * matrix::Array{ComplexF64, 3}[k::Int64, b::Int64, j::Int64]::ComplexF64)::ComplexF64
10             end
11         end
12     end
13 
14     return ret::Matrix{ComplexF64}
15 end
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [w]arn, [h]ide type-stable statements, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
 • size(matrix)
   size(matrix)
   size(matrix)
   size(matrix::Array{ComplexF64, 3})
   map(f::typeof(_f), vectors::Vector{Vector{Float64}})
   zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)
   %12 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
   %13 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
   %20 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
v  %21 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
process_matrix2 with forced specialization:
julia> @descend process_matrix2(_f, matrix, vectors)
process_matrix2(f::F, matrix, vectors) where F<:Function in Main at REPL[13]:1
 1 function (process_matrix2(f::typeof(_f)::F, matrix::Array{ComplexF64, 3}, vectors::Vector{Vector{Float64}}) where F<:Function)::Matrix{ComplexF64}
 2     N::Int64, M::Int64, L::Int64 = size(matrix::Array{ComplexF64, 3})::Tuple{Int64, Int64}::Int64
 3     coefficients::Vector{Matrix{Float64}} = map(f::typeof(_f), vectors::Vector{Vector{Float64}})::Vector{Matrix{Float64}}
 4     #coefficients = [ones(Float64, 4, 4) * norm(v) for v in vectors]
 5     ret::Matrix{ComplexF64} = zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)::Matrix{ComplexF64}
 6     for i::Int64 in 1:L, j in 1:L
 7         for k::Int64 in (1:N::Int64)::Int64::Union{Nothing, Tuple{Int64, Int64}}
 8             for a::Int64 in 1:M, b in 1:M
 9                 ret::Matrix{ComplexF64}[i::Int64, j::Int64] = (coefficients::Vector{Matrix{Float64}}[k::Int64]::Matrix{Float64}[a, b::Int64]::Float64 * matrix::Array{ComplexF64, 3}[k::Int64, a::Int64, i::Int64]::ComplexF64 * matrix::Array{ComplexF64, 3}[k::Int64, b::Int64, j::Int64]::ComplexF64)::ComplexF64
10             end
11         end
12     end
13 
14     return ret::Matrix{ComplexF64}
15 end
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [w]arn, [h]ide type-stable statements, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
 • size(matrix)
   size(matrix)
   size(matrix)
   size(matrix::Array{ComplexF64, 3})
   map(f::typeof(_f), vectors::Vector{Vector{Float64}})
   zeros(ComplexF64::Type{ComplexF64}, L::Int64, L::Int64)
   %12 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
   %13 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
   %20 = < constprop > Colon(::Core.Const(1),::Int64)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
v  %21 = < constprop > iterate(::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64]))::Union{Nothing, Tuple{Int64, Int64}}
 
I tried to reduce the MWE a bit further to figure out what’s going on, and it looks like the problem with the non-specialized version is that the output type of map(f, vectors) cannot be inferred and hence every access to coefficients creates extra allocations (which makes sense to me).
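One way to look at this inference gap without Cthulhu is to ask the compiler what it infers for map under the two argument types. The sketch below is only an illustration and is not taken from the original code: it assumes that a non-specialized method sees f as the abstract type Function, and it uses Base.return_types, which is a reflection helper and may print differently across Julia versions.

```julia
using LinearAlgebra

_f(v) = ones(Float64, 4, 4) * norm(v)

# Inference result when the concrete function type is known,
# as in the `f::F ... where F` version.
specialized = Base.return_types(map, (typeof(_f), Vector{Vector{Float64}}))

# Inference result when only the abstract type `Function` is known.
# Assumption: this approximates what the non-specialized method sees.
generic = Base.return_types(map, (Function, Vector{Vector{Float64}}))

println("typeof(_f): ", specialized)
println("Function:   ", generic)
```

If the second result is an abstract type, every coefficients[i] in the loop has to be dispatched dynamically, which would be consistent with the extra allocations.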
Smaller MWE
Note that the non-annotated version has ~ 60 x 60 = 3600 allocations more than the annotated one, which is the number of getindex calls for coefficients:
using LinearAlgebra
using BenchmarkTools
vectors = [rand(2) for _ in 1:60]
function _f(v)
    return ones(Float64, 4, 4) * norm(v)
end
function process_matrix_redux(f, vectors)
    coefficients = map(f, vectors)
    N = length(vectors)
    ret = zeros(N, N)
    for i in 1:N, j in 1:N
        ret[i, j] = first(coefficients[i])
    end
    return ret
end
# 446.167 μs (3724 allocations: 107.47 KiB)
function process_matrix_redux_annotated(f::F, vectors) where F
    coefficients = map(f, vectors)
    N = length(vectors)
    ret = zeros(N, N)
    for i in 1:N, j in 1:N
        ret[i, j] = first(coefficients[i])
    end
    return ret
end
# 10.171 μs (123 allocations: 51.20 KiB)
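A possible alternative to the f::F annotation is a function barrier: keep the map call in the outer function and move the loop into a separate kernel, so that the kernel is compiled for the concrete type of coefficients. This is a minimal sketch under that assumption. The names fill_from_coefficients and process_matrix_redux_barrier are hypothetical, and I have not benchmarked this variant.

```julia
using LinearAlgebra

_f(v) = ones(Float64, 4, 4) * norm(v)

# Inner kernel: dispatched once on the runtime type of `coefficients`,
# so indexing inside the loop works on a concretely typed array.
function fill_from_coefficients(coefficients, N)
    ret = zeros(N, N)
    for i in 1:N, j in 1:N
        ret[i, j] = first(coefficients[i])
    end
    return ret
end

# Outer function: `f` is only passed through to `map`, as in the MWE above.
function process_matrix_redux_barrier(f, vectors)
    coefficients = map(f, vectors)
    return fill_from_coefficients(coefficients, length(vectors))
end

vectors = [rand(2) for _ in 1:60]
result = process_matrix_redux_barrier(_f, vectors)
```

The outer function may still be compiled without specializing on f, but the single dynamic call into the kernel replaces the per-element dispatch inside the loop.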
 
Another minor weirdness: the docs mention (@which process_matrix_redux(_f, vectors)).specializations, which does in fact show that there is only a non-specialized version:
# Right after defining everything
julia> (@which process_matrix_redux(_f, vectors)).specializations
svec(MethodInstance for process_matrix_redux(::Function, ::Vector{Vector{Float64}}), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
However, it only does so when called before running the benchmark code (I get why this also happens when calling @code_warntype, since it apparently triggers a new specialization, but why does it happen with @btime?)
# After doing `@btime process_matrix_redux($_f, $vectors)`
julia> (@which process_matrix_redux(_f, vectors)).specializations
svec(MethodInstance for process_matrix_redux(::Function, ::Vector{Vector{Float64}}), MethodInstance for process_matrix_redux(::typeof(_f), ::Vector{Vector{Float64}}), nothing, nothing, nothing, nothing, nothing, nothing)
Now there is seemingly a new specialization MethodInstance for process_matrix_redux(::typeof(_f), ::Vector{Vector{Float64}}), but the code that is actually run is still the non-specialized one.