Calling guard function causes millions of allocs while running inner function doesn't

I’ve followed the performance tips to implement a function guard for setting up objects to be accessed in the inner loop of an algorithm, as follows:

function fast_marching!(grid::Matrix{T}, h::Real, maxdist::Real=1) where {T <: AbstractFloat}
    states = zeros(UInt8, size(grid))::Matrix{UInt8}
    fast_marching!(states, grid, h, maxdist)
end

where the last line of the function calls the “real” algorithm that reads and writes to the preallocated states array along with the other inputs. The inner function seems to run fine, with just a few bytes of memory usage (6 allocs):

using BenchmarkTools

grid = ones(200,200)
grid[100,100] = 0
states = zeros(UInt8, size(grid))

@benchmark fast_marching!($states, $grid, 0.025, 4)
BenchmarkTools.Trial: 
  memory estimate:  848 bytes
  allocs estimate:  6
  --------------
  minimum time:     433.333 μs (0.00% GC)
  median time:      467.555 μs (0.00% GC)
  mean time:        489.823 μs (0.00% GC)
  maximum time:     2.914 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

But the outer function seems to run a totally different version of the algorithm:

@benchmark fast_marching!($grid, 0.025, 4)
BenchmarkTools.Trial: 
  memory estimate:  49.10 MiB
  allocs estimate:  1788630
  --------------
  minimum time:     114.365 ms (2.80% GC)
  median time:      118.648 ms (2.74% GC)
  mean time:        118.945 ms (3.07% GC)
  maximum time:     127.768 ms (2.38% GC)
  --------------
  samples:          43
  evals/sample:     1

The only difference between these functions is that the fist one pre-allocates an array before calling the second. The number of allocations suggests that this is happening in the inner loop. Are they compiling differently for some reason? I thought the function guard is supposed to mitigate exactly that problem by compiling the “core” part of the algorithm separately.
(As an aside, should I be using a function guard here? If not, my question about why these differ so drastically still stands).
And here is the code for the actual algorithm being executed:

using DataStructures

function fast_marching!(states::Matrix{UInt8}, grid::Matrix{T}, h::Real, maxdist::Real) where {T <: AbstractFloat}
    L = PriorityQueue{CartesianIndex{2}, T}()
    for i in eachindex(grid)
        if grid[i] <= 0
            states[i] = 2
        else
            grid[i] = maxdist
        end
    end
    R = CartesianIndices(grid)
    Ifirst, Ilast = first(R), last(R)
    Iv = CartesianIndex(1, 0)
    Ih = CartesianIndex(0, 1)
    for I in R
        if states[I] == 0 &&
            (states[max(Ifirst, I-Iv)] == 2 ||
             states[max(Ifirst, I-Ih)] == 2 ||
             states[min(Ilast, I+Ih)] == 2 ||
             states[min(Ilast, I+Iv)] == 2)
            states[I] = 1
            dist = compute_distance(grid, I, Ifirst, Ilast, h, maxdist)
            grid[I] = dist
            enqueue!(L, I, dist)
        end
    end
    #main loop
    while !isempty(L)
        I = dequeue!(L)
        states[I] = 2
        for Ii = (Iv, Ih)
            for J = (min(I+Ii, Ilast), max(I-Ii, Ifirst))
                if states[J] != 2
                    dist = compute_distance(grid, J, Ifirst, Ilast, h, maxdist)
                    if dist < grid[J]
                        grid[J] = dist
                        if states[J] == 1
                            L[J] = dist
                        else
                            states[J] = 1
                            enqueue!(L, J, dist)
                        end
                    end
                end
            end
        end
    end
    return nothing
end

function compute_distance(grid::Matrix{T}, I::CartesianIndex{2}, Ifirst::CartesianIndex{2}, Ilast::CartesianIndex{2}, h::Real, maxdist::Real) where {T <: AbstractFloat}
    Iv = CartesianIndex(1, 0)
    Ih = CartesianIndex(0, 1)

    Uh = Uv = maxdist
    function m(Ii)
        Jfirst = max(I-Ii, Ifirst)
        Ufirst = Jfirst == I ? maxdist : grid[Jfirst]
        Jlast = min(I+Ii, Ilast)
        Ulast = Jlast == I ? maxdist : grid[Jlast]
        Umin = min(Ufirst, Ulast)
    end
    Uh, Uv = m(Ih), m(Iv)

    if Uh < maxdist && Uv < maxdist
        disc = 2*h^2-(Uh-Uv)^2
        if disc >= 0
            return min(0.5*(Uh+Uv)+0.5*sqrt(disc), maxdist)
        end
    end

    Umin = min(Uv, Uh)
    return min(h+Umin, maxdist)
end

Can you post a complete, self-contained example? Otherwise you are unlikely to get meaningful help. That is, post enough such that we can copy paste into the REPL and get going (you test copy-pasting your code into your REPL before hitting submit). This includes generation of suitable test data that exhibits the behavior you have problems with.

That being said:

This entire thing smells of type instability. Try: (1) @noinline fast_marching!, (2) stop mixing types (maxdist = 1 sounds wrong. Don’t you want maxdist = convert(T, 1)?), (3) read @code_warntype fast_marching!(canvas.grid, 0.025, 4), (4) test with both Float32 and Float64 based data.

2 Likes

I updated to a complete example. As for the points you suggested:
Is it really a problem if the default argument is 1? I thought Julia’s type promotion system is smart enough to handle any “lower” numerical types. If I’m ever writing a function that takes real numbers as input, do I have to explicitly convert them all to a common type myself at the start?
I tried @noinline and it didn’t affect the allocations, and neither did testing with Float32 data.
And here is the output of @code_warntype:

@code_warntype fast_marching!(canvas.grid, 0.025f0, 4.0f0)
Body::Nothing
1 ── %1  = (Base.arraysize)(grid, 1)::Int64
│    %2  = (Base.arraysize)(grid, 2)::Int64
│    %3  = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{UInt8,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{UInt8,2}, :(%1), :(%2), :(%2), :(%1)))::Array{UInt8,2}
│    %4  = (Base.arraylen)(%3)::Int64
│    %5  = (Core.lshr_int)(%4, 63)::Int64
│    %6  = (Core.trunc_int)(Core.UInt8, %5)::UInt8
│    %7  = (Core.eq_int)(%6, 0x01)::Bool
└───       goto #3 if not %7
2 ──       invoke Core.throw_inexacterror(:check_top_bit::Symbol, Int64::Any, %4::Int64)
└───       $(Expr(:unreachable))
3 ┄─       goto #4
4 ── %12 = (Core.bitcast)(Core.UInt64, %4)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %17 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, :(%3)))::Ptr{UInt8}
│    %18 = (Base.bitcast)(Ptr{Nothing}, %17)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:memset), Ptr{Nothing}, svec(Ptr{Nothing}, Int32, UInt64), :(:ccall), 3, :(%18), 0, :(%12), :(%12), 0, :(%3)))
└───       goto #9
9 ──       goto #10
10 ─ %22 = invoke Main.fast_marching!(%3::Array{UInt8,2}, _2::Array{Float32,2}, _3::Float32, _4::Float32)::Core.Compiler.Const(nothing, false)
└───       return %22

If you run fast_marching!(states::Matrix{UInt8}, grid::Matrix{T}, h::Real, maxdist::Real) where {T <: AbstractFloat} without clearing states in between (as done in the implicit loop of @benchmark fast_marching!($states, $grid, 0.025, 4)), then you are not doing any work: states is uniformly 0x02 after the first run, nothing gets enqueued, nothing is done. No wonder that this works without any allocations!

You might also need to clear grid in between invocations?

1 Like

I see, so

@benchmark fast_marching!($states .= 0, $(canvas.grid), 0.025f0, 4.0f0)

would be the correct way to benchmark this? In that case, it seems the performance is the same for both versions after all. I wonder if it’s the priority queue’s fault.