Inner function overhead


#1

I have found that using a small nested function to handle repeated code has resulted in unwanted allocations. I tried to simplify the below function:

function compute_distance(grid::Matrix{T}, I::CartesianIndex{2}, Ifirst::CartesianIndex{2}, Ilast::CartesianIndex{2}, h::Real, maxdist::Real) where {T <: AbstractFloat}
    Iv = CartesianIndex(1, 0)
    Ih = CartesianIndex(0, 1)

    Jfirst = max(I-Iv, Ifirst)
    Ufirst = Jfirst == I ? maxdist : grid[Jfirst]
    Jlast = min(I+Iv, Ilast)
    Ulast = Jlast == I ? maxdist : grid[Jlast]
    Uv = min(Ufirst, Ulast)

    Jfirst = max(I-Ih, Ifirst)
    Ufirst = Jfirst == I ? maxdist : grid[Jfirst]
    Jlast = min(I+Ih, Ilast)
    Ulast = Jlast == I ? maxdist : grid[Jlast]
    Uh = min(Ufirst, Ulast)

    if Uh < maxdist && Uv < maxdist
        disc = 2*h^2-(Uh-Uv)^2
        if disc >= 0
            return min(0.5*(Uh+Uv)+0.5*sqrt(disc), maxdist)
        end
    end

    Umin = min(Uv, Uh)
    return min(h+Umin, maxdist)
end

using the function m():

function compute_distance(grid::Matrix{T}, I::CartesianIndex{2}, Ifirst::CartesianIndex{2}, Ilast::CartesianIndex{2}, h::Real, maxdist::Real) where {T <: AbstractFloat}
    Iv = CartesianIndex(1, 0)
    Ih = CartesianIndex(0, 1)

    function m(Ii)
        Jfirst = max(I-Ii, Ifirst)
        Ufirst = Jfirst == I ? maxdist : grid[Jfirst]
        Jlast = min(I+Ii, Ilast)
        Ulast = Jlast == I ? maxdist : grid[Jlast]
        Umin = min(Ufirst, Ulast)
    end
    Uh, Uv = m(Ih), m(Iv)

    if Uh < maxdist && Uv < maxdist
        disc = 2*h^2-(Uh-Uv)^2
        if disc >= 0
            return min(0.5*(Uh+Uv)+0.5*sqrt(disc), maxdist)
        end
    end

    Umin = min(Uv, Uh)
    return min(h+Umin, maxdist)
end

When I benchmark these two versions with

grid = zeros(60,60);
R = CartesianIndices(grid)
Ifirst, Ilast = first(R), last(R)
@btime compute_distance($(grid), CartesianIndex(50, 20), $Ifirst, $Ilast, 0.1, 3.0)

The first version gives

19.555 ns (0 allocations: 0 bytes)

The second version gives

28.444 ns (1 allocation: 16 bytes)

I thought maybe it was the captured variables creating a β€œbox” as described in the performance tips, so I made m() a separate function which accepts Ifirst, Ilast, I, and grid as arguments, but this proved to be even worse:

285.022 ns (18 allocations: 288 bytes)

What’s the cause of this? When can I not trust functions to handle small tasks within my code without worsening performance?


#2

The problem is that the last line in your function m:

Umin = min(Ufirst, Ulast)

Clashes with the outer variable in compute_distance (same variable name Umin):

Umin = min(Uv, Uh)

If you rename any of these variables, or simple replace the first expression with:

return min(Ufirst, Ulast)

The allocation will go away, and the two implementations should perform equally.


Btw, you can use @code_warntype to detect this type of problem:

julia> @code_warntype compute_distance2((grid), CartesianIndex(50, 20), Ifirst, Ilast, 0.1, 3.0)
Body::Any
1 ── %1   = %new(Core.Box)::Core.Box
...

The return type being Any (colored red) and the presence of Core.Box indicates that there’s a problem. You can use a more detailed version of @code_warntype to get a better idea of where the problem occurs (look for the first red Any):

julia> @code_warntype debuginfo=:source compute_distance2((grid), CartesianIndex(50, 20), Ifirst, Ilast, 0.1, 3.0)
...
β”‚    @ inner_function_overhead.jl:49 within 'compute_distance'
β”‚    %163 = (Core.isdefined)(%1, :contents)::Bool
└───        goto #26 if not %163
25 ─        goto #27
26 ─        $(Expr(:throw_undef_if_not, :Umin, false))
27 β”„ %167 = (Core.getfield)(%1, :contents)::Any
β”‚    %168 = (h + %167)::Any

It’s perhaps not super clear, but line 49 corresponds to return min(h+Umin, maxdist), and we can see that Core.isdefined is called for the field Umin which indicates that the compiler is unable to fully deduce what that field is. You might find the output of @code_lowered easier to read.


As for why this happens, it has to do with how scoping works in Julia. If the variable is declared anywhere in the outer scope, and also in an inner scope, it will be available anywhere after the declaration in the inner scope. Not sure if that explanation made sense, but consider these examples:

function test()
    for i = 1:10
        k = i
    end
    println(k)
end

julia> test();
ERROR: UndefVarError: k not defined

And now:

function test()
    for i = 1:10
        k = i
    end
    println(k)
    k = 2
end

julia> test();
10

#3

To me, this second example represents a really weird scoping behavior, I thought this should throw an error too because you requested to print the value of k before declaring k = 2 out of the loop scope. I thought k must be declared before entering the loop to be able to collect its value outside of the loop!