Improve performance of Bellman operator

julia> T(V, cost, cost, nω, nk, nc, P, 0.9, grid_Π) ≈ T_new(V, cost, cost, nω, nk, nc, P, 0.9, grid_Π)
true

julia> @btime T($V, $cost, $cost, $nω, $nk, $nc, $P, 0.9, $grid_Π);
  2.173 ms (36027 allocations: 2.52 MiB)

julia> @btime T_new($V, $cost, $cost, $nω, $nk, $nc, $P, 0.9, $grid_Π);
  806.028 μs (7223 allocations: 677.48 KiB)

I’ve only tried to remove the obviously unnecessary allocations. I couldn’t get rid of all of them because I couldn’t figure out what you’re doing in the first loop: Julia doesn’t accept `.=` there due to an additional singleton dimension, and using `LinearAlgebra.mul!` doesn’t work for some reason (I’m not that familiar with it).
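For what it’s worth, here’s a sketch of how `mul!` could be made to fit: writing the product as `V0_slice' * P_col` (a plain vector) instead of `transpose(P_col) * V0_slice` (a 1×n_k row) makes the destination shape match a preallocated vector. The names and sizes below are made up for illustration and only mimic the slices inside the loop:

```julia
using LinearAlgebra

# Toy stand-ins for the slices inside the first loop (names/sizes are made up):
n_ω, n_k = 4, 5
P_col = rand(n_ω)          # plays the role of @view P[i, :, r]
V0_slice = rand(n_ω, n_k)  # plays the role of @view V0[:, :, cg, cv]

res = Vector{Float64}(undef, n_k)      # preallocate once, outside the loop
mul!(res, transpose(V0_slice), P_col)  # res = V0_slice' * P_col, no allocation
```

The original `transpose(P_view) * V0_view` yields a 1×n_k row, which is presumably why `.=` into a vector slice complains about the singleton dimension; transposing the other operand instead gives a plain length-n_k vector.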

There is surely more speed to be had here, but for that I’d have to know what formula exactly this is supposed to calculate.

Your original code reads very MATLAB-y or Python-y due to (in my opinion) excessive explicit vectorization. Consider instead writing a kernel function that does what you need on the smallest required piece of data, and then applying it elementwise from the outside via broadcasting.
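As a toy illustration of that style (a made-up kernel, not your actual Bellman formula): define the operation on scalars, and let the broadcast dot handle the arrays.

```julia
# Scalar kernel: clamp one value into [lo, hi] and rescale to [0, 1]
# (a made-up example of the pattern, not the Bellman update itself)
clampscale(x, lo, hi) = (clamp(x, lo, hi) - lo) / (hi - lo)

A = rand(3, 4)
B = clampscale.(A, 0.2, 0.8)  # the dot applies the scalar kernel elementwise
```

Julia fuses the whole broadcast into a single loop, so this is as fast as a hand-written loop but reads like the math.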

Faster code, with explanations of what I changed
function T_new(V0::Array{Float64, 4}, # I reformatted these to make it easier to read
               cost_g::Matrix{Float64},
               cost_v::Matrix{Float64},
               n_ω::Int64,
               n_k::Int64,
               n_c::Int64,
               P::Array{Float64,3},
               β::Float64,
               grid_Π::Matrix{Float64})
    VP = zeros(n_ω, n_k, n_c, n_c)
    EV_cg_cv = zeros(n_ω, n_k, n_c, n_c, 3)

    for i in 1:n_ω, r in 1:3, cg in 1:n_c, cv in 1:n_c
        P_view = @view P[i,:,r]
        V0_view = @view V0[:,:,cg,cv]

        # this is the cause of the remaining allocations that I couldn't get rid of
        # an explicit loop would work, but I'm not 100% sure what this is supposed to do
        res = transpose(P_view) * V0_view

        EV_cg_cv[i, :, cg, cv, r] = res 
    end

    # No need to allocate `EV` explicitly when you then just overwrite it with the result of `sum`
    sums = sum(EV_cg_cv, dims = (3,4)) # no need for an array for the dims, a tuple suffices
    sums ./= n_c^2
    EV = dropdims(sums, dims = (3,4)) # this is free either way

    for cg in 1:n_c, cv in 1:n_c, k in 1:n_k # type annotations aren't necessary, those variables are Int already anyway
        cost = (0.0, cost_g[cg, k], cost_v[cv, k]) # again a tuple is perfectly fine here
        for i in 1:n_ω
            m = maximum(axes(EV,3)) do idx # avoid allocating intermediates by just iterating over axes
                β * EV[i,k,idx] - cost[idx] # index into the tuple instead of a costly new allocation
            end
            VP[i, k, cg, cv] = m + grid_Π[i, k] # our result is a scalar anyway, no need for views or broadcasting
        end
    end

    return VP
end
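To spell out two of the comments above with small standalone examples (toy sizes and values, chosen only for illustration): `dims` for `sum` can be a tuple, `dropdims` is just a reshape, and `maximum` with a `do`-block folds over indices without building an intermediate array.

```julia
A = rand(2, 3, 4, 5)
s = sum(A; dims = (3, 4))       # size (2, 3, 1, 1); a tuple works for dims
E = dropdims(s; dims = (3, 4))  # size (2, 3); no copy, just a reshape

cost = (0.0, 1.5, 2.0)          # a tuple lives on the stack: no allocation
ev   = (0.3, 0.9, 0.4)
m = maximum(1:3) do idx         # fold over indices; no temporary array
    0.9 * ev[idx] - cost[idx]
end
```

Here `m` is the largest of `0.9*0.3 - 0.0`, `0.9*0.9 - 1.5`, and `0.9*0.4 - 2.0`, computed without ever materializing that 3-element array.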

One additional thing I’d change is not passing in the lengths explicitly: `size(arr, dim)` queries the size of an array along a given dimension, so you don’t have to keep everything consistent manually when you can just read the existing values (it’s cheap to retrieve and stored with each array; we’re not in C or Fortran here!). That also makes your function signature much smaller: `T(V0, cost_g, cost_v, P, β, grid_Π)`.
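A minimal sketch of that, using the shapes your indexing already implies (`grid_Π[i, k]` suggests `grid_Π` is n_ω × n_k, and `V0[:, :, cg, cv]` suggests `V0` is n_ω × n_k × n_c × n_c); the concrete numbers are arbitrary:

```julia
# Recover the sizes from the arrays instead of passing them in:
V0 = rand(4, 5, 3, 3)    # n_ω × n_k × n_c × n_c, toy sizes
grid_Π = rand(4, 5)      # n_ω × n_k

n_ω, n_k = size(grid_Π)  # destructure both at once
n_c = size(V0, 3)
```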

I also haven’t investigated whether your loop order is right (Julia is column-major), so there may be more speed to be found there as well.
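The rule of thumb: with column-major storage the first index should vary fastest, i.e. sit in the innermost loop. A quick self-contained comparison (timings are machine-dependent, so none are claimed here; both functions compute the same sum):

```julia
A = rand(1_000, 1_000)

function sum_fast(A)  # first index innermost → contiguous memory reads
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

function sum_slow(A)  # second index innermost → strided, cache-unfriendly reads
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    s
end
```

Benchmarking both with `@btime` should show the first version clearly ahead on large arrays.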

Those are not needed here (yet?). Applying only standard allocation elimination is surely sufficient at this stage.
