How to speed up my code

Hi there,

I am solving a dynamic programming problem (first try in Julia). Previously I solved the same problem in Python. I was expecting Julia solving it much faster than Python, and I don’t get that much improvement. I would like to know if this is normal or otherwise, what I am doing wrong. Below you have the function bellman which I am interested to speed up (It is partially taken from QuantEcon lectures notes).

mutable struct AgentStruct{T <: Float64}
    r::T
    R::T
    y::T
    β::T
    b::T
    b_lim::T
    tau::T
    theta::T
    pi::T
    landa::T
    X::Matrix{T}
    z_vals::Vector{T}
    a_vals::Array{Float64,1}
    u::Function
  end

function ConsumerProblem(;r=0.00,
                         y=1.0,
                         β=0.995,
                         X=[0.5 0.5; 0.0319 0.9681],
                         z_vals=[0.0, 1.0],
                         b=0.0,
                         b_lim=0.0,
                         tau=0.02,
                         theta=0.15,
                         pi=0.2,
                         landa=0.0,
                         grid_max=8,
                         grid_size=100)

    R = 1.0 + r
    a_vals = collect(range(-b_lim, stop=grid_max, length=grid_size))
    u(c::Float64, h::Float64)::Float64 = log(c) + log(1.0-h)

    return AgentStruct(r, R, y, β, b, b_lim, tau, theta, pi, landa, X, z_vals, a_vals, u)
  end

function bellman(cons::AgentStruct,
                V::Array{Float64, 2};
                ret_policy::Bool=false)

    """
    The approximate Bellman operator, which computes and returns the
    updated value function TV (or the V-greedy policy c if
    return_policy is True).
    Inputs
    ----------
    cons : AgentStruct
        An instance of ConsumerProblem that stores primitives
    V : Matrix(Float64)
        A Matrix with dimension
    return_policy : bool, optional(default=False)
        Indicates whether to return the greed policy given V or the
        updated value function TV.  Default is TV.
    Output
    -------
    TV, g_a, g_d : Array(Float64)
                        Returns either the greed policy given V or the updated value
                        function TV.
    """
    # Unpack parameters and grid
    X, β, y, b, b_lim, tau, theta, pi, u = cons.X, cons.β, cons.y, cons.b, cons.b_lim, cons.tau, cons.theta, cons.pi, cons.u
    a_vals, z_vals = cons.a_vals, cons.z_vals

    # Initialize intermediate value functions and operator
    TV = zeros(size(V))
    V_u = ones(length(a_vals))*(-1e9)
    V_e = ones(length(a_vals))*(-1e9)
    V_f = ones(length(a_vals))*(-1e9)
    # Initialize policy functions
    g_a = Array{Int64, 2}(undef, length(a_vals), 3)
    g_d = Array{Int64, 1}(undef, length(a_vals))

    # Fix current a in each iteration and evaluate over the vector of possible a'
    for (i_a, a) in enumerate(a_vals)
        # Compute consumption for each possible employment status
        c_e::Array{Float64,1} = max.(a + (1.0-tau)*(y + b) .- a_vals, 0.0)
        c_u::Array{Float64,1} = max.(a + (1.0-tau)*(theta*y + b) .- a_vals, 0.0)
        c_f::Array{Float64,1} = max.(a + (1.0-tau)*b .- a_vals, 0.0)

        # Get maximums and maximizers for each possible employment status
        V_u[i_a], g_a[i_a,1] = findmax(u.(c_u,0.0) + β*(X[1,2]*V[:,2] + X[1,1]*V[:,1]))
        V_e[i_a], g_a[i_a,2] = findmax(u.(c_e,0.45) + β*(X[2,2]*V[:,2] + X[2,1]*V[:,1]))
        V_f[i_a], g_a[i_a,3] = findmax(u.(c_f,0.0) + β*(X[2,2]*V[:,2] + X[2,1]*V[:,1]))
    end

    # Update the operator
    TV[:,1] = copy(V_u)
    TV[:,2] = max.(V_e, pi*V_u+ (1.0-pi)*V_f)

    # Save the decision rule of accepting the job offer
    g_d[:] = V_e .> pi*V_u + (1.0-pi)*V_f

    if ret_policy
        return TV, g_a, g_d
    else
        return TV
    end
end

Testing it:

V_0 = ones(length(cons.a_vals), length(cons.z_vals))*(-120.0)

using BenchmarkTools

@btime V, g_a, g_d = bellman(cons, V_0, ret_policy=true)

  9.679 ms (363922 allocations: 7.85 MiB)
  12.801 ms (363922 allocations: 7.85 MiB)
  13.033 ms (363922 allocations: 7.85 MiB)

In Python I am getting around 15.5ms-16.3ms

Thank you in advance,

A few quick pointers:

  • Use fill(-1e9, length(a_vals)) instead of ones(…)*(-1e9)
  • You probably shouldn’t need ::Array{Float64,1} if your AgentStruct is defined with fully-typed fields
  • Storing the function u inside your AgentStruct is somewhat of an anti-pattern in Julia. It’s generally much better to find ways to use external- (and multiple-) dispatch to solve your problem. But as a quick fix, instead of u::Function, use a type parameter instead, so that you have an AgentStruct{T <: Float64, F <: Function} and then u::F. Edit: I just realized what I copy-pasted… on that note, you don’t need to parameterize T if it’s just always T <: Float64.
  • Sometimes returning a tuple and sometimes not based upon a boolean flag is a type instability that you’ll want to avoid.
  • You don’t need that copy

More thorough tips can be found at https://docs.julialang.org/en/stable/manual/performance-tips/

8 Likes

Welcome!

A couple of points, in addition to what Matt posted above:

  • Float64 is a concrete type, so the only type T for which T <: Float64 is…Float64 itself (edit: also Union{} but that’s not something you need to worry about here). So your T <: Float64 in your struct definition, while not harmful, is also not doing anything useful. I suspect something like T <: Real is closer to what you would actually want.
  • The @code_warntype tool, mentioned in the https://docs.julialang.org/en/stable/manual/performance-tips/ is very useful for the first steps of optimizing your code. Have you run it on your function and checked its results?
  • In max.(a + (1.0-tau)*(y + b) .- a_vals, 0.0) you’re mixing broadcasted functions (with .) with non-broadcasted functions (without .). That’s fine, but it’s probably sub-optimal, as it will result in extra temporary arrays being allocated. You can . everything or use the @. macro to automate it for a given expression. See More Dots: Syntactic Loop Fusion in Julia for a great writeup of why this is useful
  • You are allocating new arrays c_e etc. in every iteration of your loop. Instead, you can create those arrays once outside of the loop and then overwrite them:
...
c_e = Array{Float64, 1}(undef, whatever_the_size_is)
for (i_a, a) in enumerate(a_vals)
  c_e .= max.(....) # note the use of `.=` to store the result in the *existing* c_e rather than make a new array
  ...
end
6 Likes

Your code is pretty vectorized since you evaluate the vector of possible a' all at once, instead of using an explicit loop. If your Python code is similar it’s not obvious Python would be that inefficient, depending on how much time is spent in the vectorized operation.

I think big gains might come by using an explicit loop and not evaluating irrelevant possible a'. See the algorithm used here https://www.sas.upenn.edu/~jesusfv/comparison_languages.pdf and the part where they discuss using monotonicity and the envelope condition.

You should try what mbauman said and use AgentStruct{T <: Float64, F <: Function} and then u::F .

1 Like

Thank you to everyone!

@mbauman It would be better to not store u inside AgentStructure but define a global u before running the functions?

@rdeits I’ve checked with @code_warntype but it is quite confusing. I will check it again more carefully since I will need to get use to it. I am applying now your tip about creating the arrays outside the loop (I didn’t know about .=)

@aaowens That’s right, my Python code is vectorized and very similar. I want to use monotonicity as a refinement later on, but I wanted to ask first at this stage.

Applying your comments:

  9.280 ms (362421 allocations: 7.56 MiB)
  9.110 ms (362421 allocations: 7.56 MiB)
  9.134 ms (362421 allocations: 7.56 MiB)

Which is slightly better.

If it’s hard to analyze the code in your function, that might be a sign that your function is trying to do too many things. You may want to try splitting up the behavior into more small functions. That should make the overall behavior easier to understand and it means that you can test each piece in isolation. Functions calls are cheap in Julia (and small functions can be inlined by the compiler, making them essentially free).

3 Likes

Here is a slightly faster version.

using Parameters

function bellman(cons::AgentStruct,
                V::Array{Float64, 2};
                ret_policy::Bool=false)

    # Unpack parameters and grid
    @unpack X, β, y, b, b_lim, tau, theta, pi, u, a_vals, z_vals = cons

    # Initialize intermediate value functions and operator
    TV = zeros(size(V))
    V_u = fill(-1e9, length(a_vals))
    V_e = copy(V_u)
    V_f = copy(V_u)
    # Initialize policy functions
    g_a = zeros(Int, length(a_vals), 3)
    g_d = zeros(Int, length(a_vals))
    c_e = similar(a_vals)
    c_u = similar(a_vals)
    c_f = similar(a_vals)
    
    # Fix current a in each iteration and evaluate over the vector of possible a'
    for (i_a, a) in enumerate(a_vals)
        # Compute consumption for each possible employment status
        c_e .= max.((a + (1.0-tau)*(y + b)) .- a_vals, 0.0)
        c_u .= max.((a + (1.0-tau)*(theta*y + b)) .- a_vals, 0.0)
        c_f .= max.((a + (1.0-tau)*b) .- a_vals, 0.0)

        # Get maximums and maximizers for each possible employment status
        @views @. c_u = u(c_u,0.0) + β*(X[1,2]*V[:,2] + X[1,1]*V[:,1])
        @views @. c_e = u(c_e,0.45) + β*(X[2,2]*V[:,2] + X[2,1]*V[:,1])
        @views @. c_f = u(c_f,0.0) + β*(X[2,2]*V[:,2] + X[2,1]*V[:,1])        
        V_u[i_a], g_a[i_a,1] = findmax(c_u)
        V_e[i_a], g_a[i_a,2] = findmax(c_e)
        V_f[i_a], g_a[i_a,3] = findmax(c_f)
    end

    # Update the operator
    @. TV[:,1] = V_u
    @. TV[:,2] = max(V_e, pi*V_u + (1.0-pi)*V_f)

    # Save the decision rule of accepting the job offer
    @. g_d[:] = V_e > pi*V_u + (1.0-pi)*V_f

    if ret_policy
        return TV, g_a, g_d
    else
        return TV
    end
end

Also interpolate the inputs with $ when using @btime, as in @btime bellman($agent, $V);.

2 Likes

Thanks for those tips @mohamed82008! It is not only slightly faster, but much more elegant (@unpack).

1 Like

You will probably get an order of magnitude speedup when you take advantage of monotonicity and just look for the relevant crossing points.

1 Like