Dynamic dispatch in a for loop

Hello,

While trying to optimise my code, Juno’s profiler flagged a portion of my code as doing some dynamic dispatch. Since my original code is fairly long, here is the smallest example I could come up with to illustrate what is going on:

using Profile

function gss(f::Function, a::Float64, b::Float64, tol::Float64 = 1e-5)::Float64
    Gratio = Base.MathConstants.golden
    c = b - (b - a) / Gratio
    d = a + (b - a) / Gratio
    while abs(c - d) > tol
        if f(c) > f(d)
            b = d
        else
            a = c
        end

        c = b - (b - a) / Gratio
        d = a + (b - a) / Gratio
    end
    return (b + a) / 2.0
end

V(x::Float64, y::Float64) = x^2.0 + y^2.0

function solve_gss(a::Float64, b::Float64, V::Function)
    # Define auxiliary function
    aux_f(x::Float64) = x + V(x,a)

    # Find maximum
    aux_x = gss(aux_f, a, b)

    return aux_f(aux_x), aux_x
end

function solve_z(a::Float64, c::Float64, d::Float64, V::Function, gp::Int64 = 100)
    # Create grid for z
    grid_z = range(c, stop = d, length = gp)

    # Iterate over z
    aux_Vs = Vector{Float64}(undef, gp)
    aux_xs = Vector{Float64}(undef, gp)
    for (ind_z, z_td) in enumerate(grid_z)
        aux_Vs[ind_z], aux_xs[ind_z] = solve_gss(a, z_td, V)
    end

    # Find optimal z
    (maxV, maxLoc) = findmax(aux_Vs)

    return maxV, grid_z[maxLoc], aux_xs[maxLoc]
end

@profiler for i = 1:10000
    solve_z(0.0, 0.0, 10.0, V)
end

According to my limited understanding of Juno’s profiler, the issue is on the following line within function solve_z:

 aux_Vs[ind_z], aux_zs[ind_z] = solve_gss(a, z_td, V)

I’m confused because as far as I can see solve_gss is simply returning a Tuple{Float64,Float64} which is then stored in two arrays. I’ve tried specifying the return of solve_gss as a Tuple{Float64,Float64}:

 solve_gss(a::Float64, b::Float64, V::Function)::Tuple{Float64,Float64}

But that does not help either.

Any help is greatly appreciated.

I don’t think it’ll solve your specific problems, but AFAIK it is still true that with a function argument like V::Function, the compiler will not specialize on V. You need to write the ugly V::(T where T) instead to force specialization.

1 Like

I don’t think that’s true. You don’t have to provide concrete types in function signatures to get specialization, and in any case T where T isn’t concrete either, it’s just the same as Any, or leaving out the type entirely.

My guess is that this line inside solve_gss

aux_f(x::Float64) = x + V(x,a)

creates a new function for each iteration, even though the function is actually the same each time. Can you pull the definition of aux_f outside of solve_gss, @drarnau ?

You can also get rid of some allocations by not creating the temporary arrays aux_Vs and aux_zs, and instead keeping track of the optimum value inside the loop.

Another two very minor things is that x^2 is better than x^2.0, you should almost never use the latter, and it’s good style to use a / 2 rather than a / 2.0 (this will not really affect performance in this particular code.)

3 Likes

Thanks for your reply. I’ve implemented your suggestion as:

function solve_gss_ugly(a::Float64, b::Float64, V::(T where T))
    # Define auxiliary function
    aux_f(x::Float64) = x + V(x,a)

    # Find maximum
    aux_x = gss(aux_f, a, b)

    return aux_f(aux_x), aux_x
end

function solve_z_ugly(a::Float64, c::Float64, d::Float64, V::(T where T), gp::Int64 = 100)
    # Create grid for z
    grid_z = range(c, stop = d, length = gp)

    # Iterate over z
    aux_Vs = Vector{Float64}(undef, gp)
    aux_xs = Vector{Float64}(undef, gp)
    for (ind_z, z_td) in enumerate(grid_z)
        aux_Vs[ind_z], aux_xs[ind_z] = solve_gss_ugly(a, z_td, V)
    end

    # Find optimal z
    (maxV, maxLoc) = findmax(aux_Vs)

    return maxV, grid_z[maxLoc], aux_xs[maxLoc]
end

The profiler give me exactly the same as in my original implementation. Execution time and memory allocation look very similar too:

julia> @time solve_z(0.0, 0.0, 10.0, V)
  0.000057 seconds (502 allocations: 11.125 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

julia> @time solve_z_ugly(0.0, 0.0, 10.0, V)
  0.000064 seconds (502 allocations: 11.125 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

Any other idea?

If the function is called in the body, the function will specialize on the function type.

See https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing.

4 Likes

Thank you.

Unfortunately, I cannot pull out the definition of aux_f. In fact, that’s exactly in the nature of the problem I’m solving. I solve a version of the same problem N times. The difference between each problem is that I’m using a slightly different aux_f function and different bounds where to search a maximum. So, my code create many functions that are use just in one iteration. I don’t see a way around it, but maybe there is?

I tried to get rid of the temporary arrays by implementing:

function solve_z_noauxs(a::Float64, c::Float64, d::Float64, V::Function, gp::Int64 = 100)
    # Create grid for z
    grid_z = range(c, stop = d, length = gp)

    # Iterate over z
    max_V::Float64 = -Inf
    opt_x::Float64 = NaN
    opt_z::Float64 = NaN
    for (ind_z, z_td) in enumerate(grid_z)
        temp_V, temp_x = solve_gss(a, z_td, V)

        if temp_V > max_V
            opt_x = temp_x
            opt_z = grid_z[ind_z]
            max_V = temp_V
        end
    end

    return max_V, opt_z, opt_x
end

I get the same profile and slightly more time/allocated memory:

julia> @time solve_z(0.0, 0.0, 10.0, V)
  0.000057 seconds (502 allocations: 11.125 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

julia> @time solve_z_noauxs(0.0, 0.0, 10.0, V)
  0.000067 seconds (600 allocations: 10.938 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

Thank you for the suggestion on using x^2 and so on. My logic was that if I’m computing the power of a Float it’s better to use a Float. Same for dividing.

Thanks. Just to make sure I understand. Are you confirming that in my initial code the function will specialise?

This isn’t completely true, and :: Function is an important exception to the rule. See https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing

3 Likes

I’m probably missing something but I don’t understand why you can’t pull this function out:

aux_f(x::Float64) = x + V(x,a)

to take the a value as an argument instead:

aux_f(x::Float64, a) = x + V(x,a)

so instead of remaking aux_f for each a you just feed the a value into the same aux_f. Youwould also have to edit the calls in gss from f(x) to f(x, a), but you wouldn’t even have to edit the header of gss because in solve_gss you pass the same a into the gss call.

1 Like

I should have said: in your example code you can pull the definition of aux_f out from solve_gss to solve_z, but I wasn’t certain if you could do that in general.

So when you say you cannot do it, do you mean that it’s impossible in your full code, unlike in this example?

Exactly. I know that in this example I can. In my full code, it’s something that I need to do.

Thanks for chipping in. This code is just a toy example of a much bigger code. I’m aware that here I could pull out the function. In my full code I can’t.

Oh I see. It’d be cool if you could make a new example to show why you can’t, but that’s not necessarily related to your question.
I think you could still try lifting the function out in your original example just to see if it changes anything for the profiler and performance, just to check if the function-redefining is the source of what you’re seeing. If not, then we know for sure to look for the problem elsewhere.

1 Like

Good point. When I change V and solve_gss as you suggest:

V(x::Float64) = x^2

function solve_gss_Vout(a::Float64, b::Float64, V::Function)
    # Find maximum
    aux_x = gss(V, a, b)

    return V(aux_x), aux_x
end

function solve_z_Vout(a::Float64, c::Float64, d::Float64, V::Function, gp::Int64 = 100)
    # Create grid for z
    grid_z = range(c, stop = d, length = gp)

    # Iterate over z
    aux_Vs = Vector{Float64}(undef, gp)
    aux_xs = Vector{Float64}(undef, gp)
    for (ind_z, z_td) in enumerate(grid_z)
        aux_Vs[ind_z], aux_xs[ind_z] = solve_gss_Vout(a, z_td, V)
    end

    # Find optimal z
    (maxV, maxLoc) = findmax(aux_Vs)

    return maxV, grid_z[maxLoc], aux_xs[maxLoc]
end

The profiler looks exactly the same as in my original example. Time and memory allocation seem very similar:

julia> @time solve_z(0.0, 0.0, 10.0, V)
  0.000054 seconds (502 allocations: 11.125 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

julia> @time solve_z_Vout(0.0, 0.0, 10.0, V)
  0.000051 seconds (502 allocations: 11.125 KiB)
(99.99963159887011, 10.0, 9.99998157992654)

This suggests that is not about the creation of a new function in each call.

1 Like

I made a version of this that pulled out the function definition (and removed the temporary arrays) which produced zero allocations. Unfortunately, I’m not at my computer now.

The speedup was not dramatic, though, maybe 30%

2 Likes

Great! I look forward to seeing your code. Thanks.

function gss(f::Function, a, b, tol = 1e-5)
    ϕ = MathConstants.golden
    c = b - (b - a) / ϕ
    d = a + (b - a) / ϕ
    while abs(c - d) > tol
        if f(c) > f(d)
            b = d
        else
            a = c
        end
        c = b - (b - a) / ϕ
        d = a + (b - a) / ϕ
    end
    return (b + a) / 2
end

V(x, y) = x^2 + y^2

function solve_gss(a, b, g::Function)
    x = gss(g, a, b)
    return g(x), x
end

function solve_z(a, c, d, f::Function, gp::Int = 100)
    local z, zs
    grid_z = range(c, d; length = gp)
    Vs = -Inf
    for z_ in grid_z
        Vs_, zs_ = solve_gss(a, z_, x->x+f(x, a))
        if Vs_ > Vs
            Vs, z, zs = Vs_, z_, zs_
        end
    end
    return Vs, z, zs
end
julia> @btime solve_z_old(0.0, 0.0, 10.0, V)
  42.033 μs (503 allocations: 11.16 KiB)
(109.99961317879665, 10.0, 9.99998157992654)

julia> @btime solve_z(0.0, 0.0, 10.0, V)
  31.975 μs (0 allocations: 0 bytes)
(109.99961317879665, 10.0, 9.99998157992654)
2 Likes

Thanks. I’ve played around with your code. The reduction in allocation is basically due to the fact that in this example you can define the function that is passed to solve_gss as x->x+f(x, a) which in my full code I can’t do.